PyTorch 2.11 Release Blog

We are excited to announce the release of PyTorch® 2.11 (release notes)!
The PyTorch 2.11 release features the following changes:
- Differentiable Collectives for Distributed Training
- FlexAttention with a FlashAttention-4 Backend on Hopper and Blackwell GPUs
- MPS (Apple Silicon) Comprehensive Operator Expansion
- RNN/LSTM GPU Export Support
- XPU Graph
This release is composed of 2723 commits from 432 contributors since PyTorch 2.10. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.11. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
On Tuesday, March 31st at 10 am, Andrey Talman and Nikita Shulga will host a live session to walk through what’s new in 2.11, including Differentiable Collectives for Distributed Training, FlexAttention with a FlashAttention-4 backend on Hopper and Blackwell GPUs, MPS expansion, and more, followed by a live Q&A. Register to attend.
API-UNSTABLE Features
Differentiable Collectives for Distributed Training
This release adds differentiability support for functional collectives, so training workflows can backpropagate through collective operations. This is a significant advancement for distributed deep learning research: advanced training techniques can now be implemented without custom autograd functions.
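To make "backpropagating through a collective" concrete, here is a minimal framework-free sketch. It does not use the PyTorch functional collectives API; ranks are simulated as positions in a Python list, and the helper names (all_reduce_sum, forward, backward, finite_difference) and the per-rank loss are hypothetical, chosen only for illustration. It shows the standard rule that the gradient of a sum all-reduce is itself a sum all-reduce of the upstream gradients, and checks that rule against finite differences.

```python
# Framework-free simulation: each "rank" is an index into a Python list.

def all_reduce_sum(per_rank):
    """Every rank receives the sum of all ranks' values."""
    total = sum(per_rank)
    return [total for _ in per_rank]


def forward(xs, weights):
    """Total loss, where rank r contributes w_r * y_r**2 and y_r is the all-reduced value."""
    ys = all_reduce_sum(xs)
    return sum(w * y * y for w, y in zip(weights, ys))


def backward(xs, weights):
    """Gradient of the total loss with respect to each rank's input x_j."""
    ys = all_reduce_sum(xs)
    # Local derivative dL/dy_r computed independently on each rank.
    upstream = [2.0 * w * y for w, y in zip(weights, ys)]
    # The backward pass of a sum all-reduce is a sum all-reduce of upstream gradients.
    return all_reduce_sum(upstream)


def finite_difference(xs, weights, j, eps=1e-6):
    """Central-difference estimate of dL/dx_j, used only to check backward()."""
    up = list(xs)
    up[j] += eps
    down = list(xs)
    down[j] -= eps
    return (forward(up, weights) - forward(down, weights)) / (2 * eps)


xs = [1.0, -2.0, 0.5]        # one input value per simulated rank
weights = [0.3, 1.5, -0.7]   # hypothetical per-rank loss weights
grads = backward(xs, weights)
```

Before differentiable collectives, this backward rule typically had to be written by hand in a custom autograd function; the sketch only illustrates the rule that autograd now has to apply.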
FlexAttention with a FlashAttention-4 Backend on Hopper and Blackwell GPUs
This backend adds support for automatically generating CuTeDSL score/mask modification functions and JIT-instantiating FlashAttention-4 kernels from PyTorch, enabling 1.2× to 3.2× speedups over the existing Triton implementation on compute-bound workloads. This feature is still under active development and may change as it stabilizes; for setup details and current limitations, see the FlexAttention + FlashAttention-4 blog post.
MPS (Apple Silicon) Development Improvements / Operator Expansion
This release adds error reporting from the MPS backend and continues to expand operator coverage: new distribution functions (log_normal, cauchy, geometric), operator migrations (erfcx, and grid_sampler_2d with support for all operation modes), and baddbmm/addbmm extended to integer and complex types.
Asynchronous error reporting enables detection of out-of-bounds access attempts that occur during GPU indexing operations, for example:
    import torch

    x = torch.rand(10, 1, 10, device='mps')
    y = x[:, [1]]
    torch.mps.synchronize()  # will raise index out of bounds error
RNN/LSTM GPU Export Support
RNN modules (LSTM, GRU, etc.) can now be exported on GPUs, and tracing LSTM with dynamic shapes is now supported. This significantly expands the model types that can be deployed using torch.export for production inference. GRU API is unchanged; the new API is LSTM.
ROCm Device-Side Assertions & TopK Optimizations
This release adds support for device-side assertions on ROCm for better debugging, along with significant TopK operator optimizations and radix select improvements that cache data in shared memory. Together these improve both developer experience and performance on AMD GPUs.
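For readers unfamiliar with radix select, the following is a minimal CPU-side sketch of the general technique, not the ROCm kernel. It assumes non-negative integers below 2**32 and an 8-bit digit; the function names are hypothetical. Each pass builds a histogram over one digit and keeps only the candidates in the bucket that contains the k-th largest value. On a GPU, that repeated re-reading of candidates is the kind of traffic that caching in shared memory can reduce.

```python
def radix_select_kth_largest(values, k, bits=32, radix_bits=8):
    """Return the k-th largest value (1-indexed) among non-negative ints < 2**bits."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be in [1, len(values)]")
    candidates = list(values)
    num_buckets = 1 << radix_bits
    mask = num_buckets - 1
    # Process digits from most significant to least significant.
    for shift in range(bits - radix_bits, -1, -radix_bits):
        # Histogram of the current digit over the surviving candidates.
        counts = [0] * num_buckets
        for v in candidates:
            counts[(v >> shift) & mask] += 1
        # Walk buckets from the largest digit down until the k-th element falls in one.
        for digit in range(num_buckets - 1, -1, -1):
            if k <= counts[digit]:
                break
            k -= counts[digit]
        # Keep only the candidates that share the selected digit.
        candidates = [v for v in candidates if (v >> shift) & mask == digit]
    # After the last pass every survivor has identical digits, so all are equal.
    return candidates[0]


def topk(values, k):
    """Return the k largest values in descending order, using radix select for the threshold."""
    threshold = radix_select_kth_largest(values, k)
    greater = [v for v in values if v > threshold]
    # Fill the remaining slots with ties at the threshold value.
    result = greater + [threshold] * (k - len(greater))
    return sorted(result, reverse=True)


sample = [5, 1, 9, 3, 9, 7, 2**31, 0]
top3 = topk(sample, 3)
```

The selection step never fully sorts the input; only the final k results are sorted, which is why this approach scales well when k is small relative to the input.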
XPUGraph support to optimize execution on Intel GPUs
XPUGraph allows users to capture a sequence of XPU operations into a runtime execution graph on Intel GPUs and replay it multiple times. This reduces CPU overhead, such as kernel launch and Python runtime overhead, improving workload performance on Intel GPUs. See API Doc for usage details.
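The capture-and-replay idea can be illustrated with a toy recorder in plain Python. This is not the XPUGraph API: ToyGraph, its methods, and the static buffers are hypothetical, and the static-buffer convention is borrowed from the pattern familiar from CUDA Graphs, on the assumption that the XPU workflow is similar. The point is only that the sequence of steps is decided once and later runs reuse it.

```python
class ToyGraph:
    """Records zero-argument steps once, then replays them in the same order."""

    def __init__(self):
        self._steps = []

    def capture(self, step):
        # Store the step instead of running it; the order of capture is the replay order.
        self._steps.append(step)

    def replay(self):
        for step in self._steps:
            step()


# Static buffers: every replay reads and writes the same storage,
# so new inputs are copied into static_in rather than passed as arguments.
static_in = [0.0, 0.0, 0.0]
static_out = [0.0, 0.0, 0.0]


def scale():
    for i, v in enumerate(static_in):
        static_out[i] = 2.0 * v


def shift():
    for i in range(len(static_out)):
        static_out[i] += 1.0


graph = ToyGraph()
graph.capture(scale)
graph.capture(shift)


def run(batch):
    """Copy a new batch into the static input buffer and replay the recorded steps."""
    static_in[:] = batch
    graph.replay()
    return list(static_out)
```

In a real graph runtime the saving comes from submitting the whole recorded sequence to the device at once rather than launching each kernel from Python; the toy version only mirrors the control flow.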
FP16 Half-Precision GEMM On CPU Via OpenBLAS
Added FP16 half-precision GEMM support via OpenBLAS on CPU, providing faster FP16 inference for CPU-based deployments. This is valuable for edge devices and CPU-only inference scenarios.
Non-Feature Updates
CUDA version
Starting with this release, CUDA 13 is the default version installed on both x86_64 and ARM platforms. Users who need an alternative build can still get the CPU-only build as well as CUDA 12.8 builds from the respective https://download.pytorch.org/whl subfolders.
TorchScript is now Deprecated
TorchScript was deprecated in 2.10. Use torch.export in place of the jit trace and script APIs, and ExecuTorch in place of the embedded runtime. For more details, see this talk from PTC.
2026 Release Cadence
For 2026, the release cadence has increased from quarterly to one release every two months. See the published release schedule.

Facts Only

PyTorch version: 2.11
Differentiable Collectives: introduced for distributed training
FlexAttention FlashAttention-4 backend: available on Hopper and Blackwell GPUs
MPS Development Improvements: error reporting, operator coverage expansion
RNN/LSTM GPU Export Support: added for production inference scenarios
ROCm Device-Side Assertions & TopK Optimizations: improved debugging and performance on AMD GPUs
XPUGraph support: optimizes execution on Intel GPUs
FP16 Half-Precision GEMM On CPU Via OpenBLAS: added for faster FP16 inference on CPUs
CUDA version: CUDA 13 is now the default
TorchScript deprecated: users should use torch.export instead
Release cadence: increased to one release every two months for 2026

Executive Summary

PyTorch, a popular open-source machine learning library based on the Torch project, has released version 2.11 with several new features and improvements. The release focuses on distributed training, attention mechanisms, MPS development for Apple Silicon devices, GPU export support for RNN/LSTM modules, and more.
Key changes include Differentiable Collectives for Distributed Training, which enables backpropagation through collective operations in deep learning research and advanced training techniques. FlexAttention now supports a FlashAttention-4 backend on Hopper and Blackwell GPUs, providing speedups over the existing Triton implementation. MPS (Apple Silicon) Development Improvements have been made with continuous operator coverage expansion and error reporting support. RNN/LSTM GPU Export Support has been added for deploying more model types in production inference scenarios.
Other updates include ROCm Device-Side Assertions & TopK Optimizations, XPUGraph support for Intel GPUs, FP16 Half-Precision GEMM On CPU Via OpenBLAS, and CUDA 13 as the new default CUDA version. TorchScript is now deprecated, and users are encouraged to use torch.export instead. Lastly, the release cadence has been increased to one release every two months for 2026.

Full Take

The PyTorch 2.11 release presents several noteworthy advancements in distributed deep learning research and advanced training techniques, as well as improvements for Apple Silicon devices, Intel GPUs, and AMD GPUs. However, it is important to consider that with any open-source project, the adoption rate of new features may vary depending on user experiences and compatibility with existing workflows.
The Differentiable Collectives feature has the potential to significantly impact distributed deep learning research by enabling backpropagation through collective operations, reducing the need for custom autograd functions. The FlexAttention FlashAttention-4 backend offers speedups over the existing Triton implementation on compute-bound workloads but is still under active development.
As the release cadence increases to one release every two months in 2026, users can expect more frequent updates and improvements from the PyTorch team. However, this may also mean a faster pace of API changes, which could pose challenges for developers who need to maintain compatibility with existing projects or integrations.
In the context of the broader machine learning landscape, these advancements demonstrate continued innovation and evolution within the PyTorch ecosystem. As always, it is essential to approach new releases critically, evaluating potential benefits while considering the implications for maintainability, compatibility, and long-term impact on the wider community.
