Running PyTorch Models on Apple Silicon GPUs with the ExecuTorch MLX Delegate

TL;DR: Introducing the ExecuTorch MLX Delegate
- The new MLX delegate enables optimized, GPU-accelerated inference for PyTorch models on Apple Silicon Macs, using Apple’s MLX framework.
- The delegate seamlessly integrates with the PyTorch 2 export stack and supports a wide range of quantization options (BF16, FP16, FP32, 2/4/8-bit affine, NVFP4).
- It supports various models, including dense transformers (Llama, Qwen, Gemma), sparse Mixture-of-Experts, and speech-to-text models (Whisper, Voxtral, Parakeet) for both offline and real-time transcription.
- Note: The MLX delegate is currently experimental.
Apple Silicon has become a popular platform for running large language models locally. Until now, ExecuTorch users on macOS were limited to CPU-based backends like XNNPACK, or to the AOTI Metal backend. Now we’ve released the MLX delegate, which brings fully optimized GPU-accelerated inference to Apple Silicon Macs through Apple’s MLX framework.
In this post we’ll cover what the MLX delegate is, why we built it as an ExecuTorch backend, and what you can run with it today.
Note: The MLX delegate is currently experimental and under active development. APIs and supported features may change.
What is the MLX Delegate?
The MLX delegate is a new ExecuTorch backend that compiles and runs PyTorch models on Apple Silicon GPUs. You export your model using the standard ExecuTorch pipeline, and the delegate handles the rest: partitioning the graph, serializing it into an optimized format, and dispatching operations to MLX’s Metal GPU kernels at runtime.
From the user’s perspective, the workflow is the same as any other ExecuTorch backend:
- Export your model with `torch.export`
- Lower it with `to_edge_transform_and_lower` using the `MLXPartitioner`
- Run the resulting `.pte` file with the ExecuTorch runtime
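The three steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the delegate's documented recipe. `lower_to_pte` and `run_pte` are hypothetical helper names. The partitioner is passed in as an argument because the import path for `MLXPartitioner` is not given in this post and should be taken from the MLX Delegate README. The sketch also assumes the model takes positional tensor inputs and exposes a `forward` method.

```python
def lower_to_pte(model, example_inputs, partitioner, out_path):
    """Export `model`, lower it with `partitioner`, and write a .pte file.

    Hypothetical helper. `partitioner` is expected to be an MLXPartitioner
    instance (assumption: see the MLX Delegate README for how to import
    and construct it). `example_inputs` must be a tuple of tensors.
    """
    # Imports live inside the function so that defining it does not
    # require torch or executorch to be installed.
    import torch
    from executorch.exir import to_edge_transform_and_lower

    # Step 1: capture the graph with torch.export.
    exported = torch.export.export(model.eval(), example_inputs)

    # Step 2: let the partitioner hand supported subgraphs to the delegate.
    edge = to_edge_transform_and_lower(exported, partitioner=[partitioner])
    program = edge.to_executorch()

    # Serialize the lowered program to disk.
    with open(out_path, "wb") as f:
        f.write(program.buffer)
    return out_path


def run_pte(pte_path, inputs, method_name="forward"):
    """Step 3: load a .pte file and run one method with the Python runtime."""
    from executorch.runtime import Runtime

    runtime = Runtime.get()
    program = runtime.load_program(pte_path)
    method = program.load_method(method_name)
    return method.execute(list(inputs))
```

On a Mac with the delegate installed, calling `lower_to_pte(model, (example_tensor,), partitioner, "model.pte")` and then `run_pte("model.pte", (example_tensor,))` would exercise all three steps. Swapping in a different backend's partitioner should leave `run_pte` unchanged.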
The delegate currently supports around 90 ATen ops, covering the full range of operations needed for transformer inference: quantized matmul, multi-head attention, rotary position embeddings, mixture-of-experts routing, recurrent state-space operations, and more.
Why Build This as an ExecuTorch Delegate?
There are already excellent tools for running models on Apple Silicon, including MLX’s own `mlx-lm`. So why build another one? Three reasons:
Performance. The MLX delegate achieves 3-6x higher throughput on generative AI workloads compared to existing ExecuTorch delegates on macOS. Moving inference to MLX’s optimized Metal kernels makes a meaningful difference for ExecuTorch applications like chat and real-time transcription.
PyTorch 2 integration. The delegate plugs directly into the PyTorch 2 export stack. It uses `torch.export` for graph capture and TorchAO for quantization, the same tools used by every other ExecuTorch backend. If you can export a model with `torch.export`, you can run it on MLX. When new models or quantization techniques land in PyTorch, they become available to the MLX delegate without additional work.
Portable applications. ExecuTorch provides a single runtime API across all backends. An application built against the ExecuTorch C++ or Python runtime can run models exported for MLX, XNNPACK, CoreML, Vulkan, or CUDA without changing application code.
Quantization and Dtype Support
The delegate supports the precision and quantization options you’d expect for on-device inference:
- BF16, FP16, and FP32 for weights and activations
- 2, 4, and 8-bit affine quantization via TorchAO’s `quantize_` API. This uses the same quantization scheme as the XNNPACK and Vulkan backends, which means a single quantized model definition can target multiple backends, and opens the door to fat PTE files that run on whichever backend is available at runtime.
- NVFP4 quantization using NVIDIA’s FP4 data type
- Tied quantized embeddings for models that share weights between the embedding layer and the language model head
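To make "affine quantization" concrete, here is a small pure-Python sketch of the arithmetic for a single group of weights. It is illustrative only and is not the TorchAO or MLX implementation: the function names are hypothetical, and the scale-plus-bias form (`value ≈ code * scale + bias`) is one common way to write the scheme.

```python
def quantize_affine(values, bits=4):
    """Quantize one group of floats to `bits`-bit codes plus (scale, bias)."""
    qmax = (1 << bits) - 1  # 3 for 2-bit, 15 for 4-bit, 255 for 8-bit
    lo, hi = min(values), max(values)
    # One step of the integer grid covers `scale` units of the float range.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_affine(codes, scale, bias):
    """Reconstruct approximate floats: value ~= code * scale + bias."""
    return [q * scale + bias for q in codes]


group = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, scale, bias = quantize_affine(group, bits=4)
restored = dequantize_affine(codes, scale, bias)

# Worst-case round-trip error is about half a quantization step.
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

Fewer bits mean a coarser grid: with `bits=2` the same group is mapped onto only four levels, which is why lower bit widths trade accuracy for memory.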
What Models Can I Run?
We’ve validated the delegate across a range of architectures:
Large Language Models
Dense transformers work out of the box, with support for both full KV caches and sliding window caches:
- Llama 3.2 1B
- Qwen 3 (0.6B, 1.7B, 4B)
- Phi-4 mini (3.8B)
- Gemma 3 (1B, 4B) with sliding window attention
Sparse Mixture-of-Experts models are supported through custom gather operations that efficiently route tokens to the correct experts on the GPU:
- Qwen 3.5 35B-A3B: 256 experts with top-8 routing, combining GatedDeltaNet linear attention layers with full SDPA attention layers
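Top-k expert routing can be illustrated with a short pure-Python sketch. This shows only the routing idea for a single token and is not the delegate's gather-based GPU implementation. The names are hypothetical, and renormalizing the selected weights so they sum to 1 is an assumption that varies between models.

```python
import math


def route_top_k(router_logits, k):
    """Return (expert_indices, weights) for one token."""
    # Numerically stable softmax over the router's per-expert scores.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k most probable experts, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Assumption: renormalize so the kept weights sum to 1.
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]


def moe_forward(x, router_logits, experts, k):
    """Weighted sum of the outputs of the k selected experts only."""
    indices, weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in zip(indices, weights))


# Toy setup: four "experts" that just scale their input.
toy_experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
chosen, weights = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
output = moe_forward(1.0, [0.1, 2.0, -1.0, 1.5], toy_experts, k=2)
```

Only the selected experts are evaluated, which is what keeps a model with hundreds of experts affordable at inference time.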
Speech-to-Text
Offline transcription models process a complete audio recording and return the transcript:
- OpenAI Whisper (tiny through large-v3-turbo)
- NVIDIA Parakeet TDT (0.6B) with word-level timestamps
- Mistral Voxtral (3B)
Real-time streaming transcription processes audio in small chunks as it arrives, enabling live use cases:
- Mistral Voxtral Realtime (4B) with live microphone input, ring buffer KV caches, and sliding window attention
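A ring buffer KV cache keeps memory fixed during streaming by overwriting the oldest entry once the attention window is full. The pure-Python sketch below illustrates the indexing only. `RingKVCache` is a hypothetical name, and the delegate's actual cache operates on GPU tensors rather than Python lists.

```python
class RingKVCache:
    """Fixed-size cache that retains only the most recent `window` entries."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.count = 0  # total number of entries ever written

    def append(self, entry):
        # Wrap around: position N lands in slot N mod window,
        # overwriting the oldest entry once the buffer is full.
        self.slots[self.count % self.window] = entry
        self.count += 1

    def ordered(self):
        """Entries currently held, oldest first."""
        if self.count <= self.window:
            return self.slots[:self.count]
        start = self.count % self.window
        return self.slots[start:] + self.slots[:start]


cache = RingKVCache(window=3)
for position in range(5):
    cache.append(("k%d" % position, "v%d" % position))

recent_keys = [key for key, _ in cache.ordered()]
```

After five appends with a window of three, only positions 2 through 4 remain, so the storage never grows no matter how long the audio stream runs.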
Broader Coverage
Beyond these flagship models, over 30 additional models have been validated through our backend test suites, covering dense transformers, encoder-decoder architectures, and vision models.
Getting Started
Each supported model has a README with detailed export and inference instructions:
- LLMs via HuggingFace: covers Llama, Qwen, and Gemma using optimum-executorch
- LLMs via export_llm: covers Phi-4 and Stories 110M using the Hydra-based pipeline
- Qwen 3.5 MoE: covers the sparse MoE export with `--backend mlx`
- Voxtral Realtime: covers streaming and offline speech-to-text
- Parakeet: covers speech recognition with timestamps
- Whisper: covers OpenAI’s speech recognition models
For an overview of the delegate architecture, supported operations, and development guide, see the MLX Delegate README.
We’d love to hear what models and use cases matter most to you. If you run into issues or have feature requests, please open an issue on the ExecuTorch GitHub repo or join our Discord Channel.

Facts Only

The MLX delegate is a new experimental backend for ExecuTorch.
It enables GPU-accelerated inference for PyTorch models on Apple Silicon Macs.
The delegate uses Apple’s MLX framework and Metal GPU kernels.
It supports quantization options: BF16, FP16, FP32, 2/4/8-bit affine, and NVFP4.
Supported models include Llama, Qwen, Gemma, Whisper, Voxtral, and Parakeet.
The delegate integrates with PyTorch 2’s export stack via `torch.export` and `to_edge_transform_and_lower`.
It currently supports around 90 ATen ops for transformer inference.
Performance improvements of 3-6x are reported over existing ExecuTorch delegates on macOS.
The delegate is compatible with ExecuTorch’s portable runtime API across backends.
Over 30 models have been validated, including dense transformers, MoE, and speech-to-text architectures.
The project is under active development, with APIs and features subject to change.
Documentation includes model-specific export and inference instructions.

Executive Summary

The MLX delegate is a new experimental backend for ExecuTorch that enables GPU-accelerated inference for PyTorch models on Apple Silicon Macs using Apple’s MLX framework. It integrates with PyTorch 2’s export stack, supporting a wide range of quantization options (BF16, FP16, FP32, 2/4/8-bit affine, NVFP4) and models, including dense transformers (Llama, Qwen, Gemma), sparse Mixture-of-Experts, and speech-to-text models (Whisper, Voxtral, Parakeet). The delegate achieves 3-6x higher throughput on generative AI workloads compared to existing ExecuTorch delegates on macOS by leveraging MLX’s optimized Metal GPU kernels. It maintains compatibility with ExecuTorch’s portable runtime API, allowing applications to switch between backends without code changes. While currently experimental, the delegate has been validated across over 30 models, with detailed export and inference instructions provided for flagship architectures. The project is under active development, with APIs and features subject to change.
The release addresses a gap for Apple Silicon users previously limited to CPU-based backends like XNNPACK or the AOTI Metal backend. By plugging into PyTorch 2’s export tools, the MLX delegate ensures seamless adoption of new models and quantization techniques as they become available in PyTorch. The delegate’s support for both offline and real-time transcription models, along with sparse MoE architectures, positions it as a versatile solution for on-device AI applications. However, its experimental status means users should expect evolving APIs and potential limitations in supported operations.

Full Take

The MLX delegate represents a strategic move to bridge PyTorch’s ecosystem with Apple’s optimized hardware, addressing a growing demand for efficient on-device AI. The strongest version of this narrative highlights genuine technical progress: leveraging MLX’s Metal kernels for performance gains, maintaining PyTorch 2 compatibility, and supporting diverse quantization schemes. This aligns with broader industry trends toward edge deployment and hardware-specific optimizations.
However, the experimental status warrants scrutiny. The claim of 3-6x throughput improvement is compelling but lacks comparative benchmarks or reproducibility details in this context. The emphasis on "seamless integration" with PyTorch 2’s export stack assumes stability in a rapidly evolving toolchain—an assumption that may not hold for all users. The pattern of framing this as a "portable" solution while acknowledging backend-specific optimizations (e.g., MLX’s Metal kernels) could risk a motte-and-bailey dynamic: the "portability" claim (motte) is defensible, but the implied performance uniformity (bailey) is not guaranteed.
Root cause: This reflects the tension between standardization (ExecuTorch’s runtime API) and specialization (hardware-specific optimizations). The narrative assumes that PyTorch’s export tools are sufficiently mature to handle diverse backends—a bet that may pay off but isn’t universally validated yet. Implications for human agency include democratizing access to high-performance inference on consumer hardware, but the costs (e.g., debugging experimental software) fall on early adopters.
Bridge questions: How does the MLX delegate’s performance compare to native MLX implementations like `mlx-lm`? What trade-offs exist between portability and optimization in real-world deployments? Would a fat PTE file with multiple backends introduce overhead that negates performance gains?
Counterstrike scan: A coordinated influence campaign might exaggerate performance claims or downplay experimental limitations to drive adoption. This content avoids such tactics, transparently labeling the delegate as experimental and providing concrete model validations. No structural alignment with manipulation patterns is detected.
Patterns detected: none

Sentinel — Human

Confidence

This text displays characteristics of highly technical, internally consistent human-written documentation, focused on explaining a complex, novel technical feature.

Signals Detected

Varied sentence length and complex technical terminology. The tone shifts between instructional and promotional, which is characteristic of human technical writing.

The flow is logical, moving from introduction (what it is) to motivation (why it was built) to implementation (what it supports). This shows intentional structural planning.

The text uses specific, dense technical details and references known industry standards (torch.export, TorchAO, MLX, Metal). The detailed list of supported models and specific operations suggests genuine, internally validated knowledge rather than simple aggregation.

The claims are highly specific (e.g., 3-6x throughput, specific model support, support for NVFP4). While the claims are promotional, the underlying technical scaffolding appears internally consistent, suggesting genuine development context.

Human Indicators

The text contains specific, non-trivial technical details and references to specific software stacks (PyTorch 2 export, MLX, Metal kernels) that require deep domain expertise.

The structure is that of a technical announcement or feature release, which requires a specific, invested authorial voice and process.