PyTorch MLP fusion: profiling nn.Linear into a fused MLP
A Hugging Face walkthrough uses PyTorch profiling to show how fusing nn.Linear layers cuts operator overhead and raises throughput.
TL;DR
- A Hugging Face walkthrough uses PyTorch profiling to show how fusing nn.Linear layers cuts operator overhead and raises throughput.
- PyTorch profiling shows how replacing repeated nn.Linear operations with a fused MLP kernel reduces operator overhead and improves throughput.
- The tutorial records operator counts, CPU and CUDA time, and kernel launch overhead for common MLP patterns.
PyTorch profiling shows how replacing repeated nn.Linear operations with a fused MLP kernel reduces operator overhead and improves throughput. Hugging Face published a step-by-step walkthrough, with working code, that uses torch.profiler microbenchmarks to compare the standard sequence of Linear and activation ops against a single fused implementation.
What the profiling measured
The tutorial records operator counts, CPU and CUDA time, and kernel launch overhead for common MLP patterns. Measurements compare an unfused implementation composed of two nn.Linear calls and an activation to a fused kernel that executes the same math in one pass. The profiling traces highlight three sources of waste in the unfused path: multiple kernel launches, extra memory reads and writes between operators, and per-operator scheduling overhead on the host.
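The kind of trace described above can be reproduced with a few lines of torch.profiler. The following is a minimal sketch, not the walkthrough's own code: the layer sizes, batch size, and iteration counts are assumptions chosen for illustration. It profiles the unfused baseline and collects per-operator call counts.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Unfused baseline: two nn.Linear calls with an activation in between.
# Sizes are illustrative assumptions, not values from the walkthrough.
model = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
x = torch.randn(32, 256)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with torch.no_grad():
    for _ in range(3):  # warm-up so one-time setup cost stays out of the trace
        model(x)
    with profile(activities=activities) as prof:
        for _ in range(10):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # make sure GPU work lands inside the trace

events = prof.key_averages()
print(events.table(sort_by="cpu_time_total", row_limit=10))

# Number of calls recorded per operator name (for example "aten::linear").
op_counts = {e.key: e.count for e in events}
```

Comparing `op_counts` and the table for an unfused and a fused module on the same input is one way to see where the extra launches and host-side time come from.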
Benchmarks in the examples cover a range of batch sizes and hidden dimensions and report improved throughput and lower host-side time for the fused MLP. Gains are largest for workloads with small to moderate batch sizes and many short operators, where kernel-launch and memory-copy overheads dominate. The guide uses torch.profiler to capture both CPU and GPU timelines, and presents flamegraph-style views to make the overhead visible.
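A sweep over batch sizes and hidden dimensions, as described above, can be sketched with torch.utils.benchmark. The shapes, the 4x expansion factor, and the `make_mlp` helper are hypothetical choices for illustration; the walkthrough's actual grid is not given here.

```python
import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark


def make_mlp(hidden: int) -> nn.Module:
    # Hypothetical helper: Linear -> GELU -> Linear with a 4x expansion.
    return nn.Sequential(
        nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
    ).eval()


device = "cuda" if torch.cuda.is_available() else "cpu"
results = {}  # (hidden, batch) -> samples per second

with torch.no_grad():
    for hidden in (64, 256):
        model = make_mlp(hidden).to(device)
        for batch in (1, 8, 64):
            x = torch.randn(batch, hidden, device=device)
            timer = benchmark.Timer(
                stmt="model(x)", globals={"model": model, "x": x}
            )
            # Timer handles warm-up and CUDA synchronization internally.
            measurement = timer.timeit(20)
            results[(hidden, batch)] = batch / measurement.median

for (hidden, batch), throughput in sorted(results.items()):
    print(f"hidden={hidden:4d} batch={batch:3d} {throughput:12.1f} samples/s")
```

Running the same loop with a fused module in place of `make_mlp` gives a per-shape comparison, which is more informative than a single headline number.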
How the fused MLP is implemented
The fused implementation in the walkthrough replaces the Linear -> Activation -> Linear pattern with a single op whose kernel performs both matrix multiplies and the activation in one pass. The code path shown remains compatible with TorchScript and can be invoked as a single module in a training or inference loop.
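To make the single-module, TorchScript-compatible shape of that code path concrete, here is a minimal sketch. The class name `FusedMLP`, the choice of GELU, and the dimensions are assumptions. Note that this version only groups the three ops behind one forward call; it does not by itself produce a single fused kernel, which in the walkthrough comes from a custom implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusedMLP(nn.Module):
    """Linear -> GELU -> Linear behind one module call (hypothetical sketch)."""

    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Same math as the unfused sequence, expressed in one forward pass.
        h = F.gelu(F.linear(x, self.fc1.weight, self.fc1.bias))
        return F.linear(h, self.fc2.weight, self.fc2.bias)


torch.manual_seed(0)
fused = FusedMLP(16, 64, 16)
scripted = torch.jit.script(fused)  # check that the module compiles under TorchScript

# Reference: the unfused sequence sharing the same weights.
reference = nn.Sequential(fused.fc1, nn.GELU(), fused.fc2)
x = torch.randn(4, 16)
max_diff = (scripted(x) - reference(x)).abs().max().item()
print(f"max abs difference vs. unfused reference: {max_diff:.2e}")
```

Checking the output against the unfused reference, as done at the end, is a cheap guard to keep in place when the forward body is later swapped for a real fused kernel.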
On the CPU side, the post demonstrates reducing Python-to-C++ calls by combining operations and using contiguous buffers to cut memory traffic. On CUDA, the fused kernel reduces kernel launches and keeps intermediate results on the device. The post includes a minimal C++/CUDA example and a pure-PyTorch fused variant for environments where a custom CUDA kernel is not available.
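One way a pure-PyTorch variant along these lines could look is sketched below. It is an assumption-laden illustration, not the post's code: the `BufferedMLP` name, the ReLU activation, the 2-D input restriction, and the inference-only design are all choices made here. It combines matmul and bias add in one `torch.addmm` call, writes into a reusable contiguous scratch buffer, and applies the activation in place.

```python
import torch
import torch.nn as nn


class BufferedMLP(nn.Module):
    """Inference-only Linear -> ReLU -> Linear with a reused hidden buffer.

    Hypothetical sketch: expects 2-D input of shape (batch, d_in).
    """

    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)
        self._hidden = None  # scratch buffer, allocated lazily

    @torch.no_grad()  # out= arguments do not support autograd
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.contiguous()
        batch = x.shape[0]
        buf = self._hidden
        if (
            buf is None
            or buf.shape[0] != batch
            or buf.device != x.device
            or buf.dtype != x.dtype
        ):
            buf = x.new_empty(batch, self.fc1.out_features)
            self._hidden = buf
        # bias + x @ W1^T in one call, written straight into the buffer.
        torch.addmm(self.fc1.bias, x, self.fc1.weight.t(), out=buf)
        buf.relu_()  # in-place activation, no extra intermediate tensor
        return torch.addmm(self.fc2.bias, buf, self.fc2.weight.t())


torch.manual_seed(0)
mlp = BufferedMLP(16, 64, 8)
x = torch.randn(4, 16)
out = mlp(x)
with torch.no_grad():
    expected = mlp.fc2(torch.relu(mlp.fc1(x)))
print(torch.allclose(out, expected, atol=1e-5))
```

Because the buffer is reused across calls, this sketch is not safe for concurrent use of one module instance, and it cannot be used for training as written.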
The author surfaces practical constraints: fused kernels can require special handling of data layout and memory formats, and supporting many activation types adds complexity. The tutorial suggests profiling specific model shapes before adopting fusion broadly and provides code snippets to reproduce the profiler traces.
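The constraints above suggest guarding a fused path with explicit checks and falling back to the unfused ops otherwise. The sketch below shows one possible shape for such a guard; the allow-list of activations, the dtype set, and the `can_use_fused` helper are hypothetical and would depend on what a given fused kernel actually supports.

```python
import torch
import torch.nn.functional as F

# Hypothetical allow-list: activations a given fused kernel is assumed to support.
SUPPORTED_ACTIVATIONS = {"relu": F.relu, "gelu": F.gelu}
SUPPORTED_DTYPES = (torch.float32, torch.float16, torch.bfloat16)


def can_use_fused(x: torch.Tensor, activation: str) -> bool:
    """Return True only when input layout, dtype, and activation all qualify."""
    return (
        x.dim() == 2
        and x.is_contiguous()
        and x.dtype in SUPPORTED_DTYPES
        and activation in SUPPORTED_ACTIVATIONS
    )


x = torch.randn(8, 16)
print(can_use_fused(x, "relu"))      # contiguous input, supported activation
print(can_use_fused(x.t(), "relu"))  # transposed view is not contiguous
print(can_use_fused(x, "silu"))      # activation outside the allow-list
```

A caller can either copy with `.contiguous()` to satisfy the layout check or route unsupported cases to the regular nn.Linear sequence; which is cheaper is itself something to profile per shape.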
Why it matters
Fusing common MLP patterns in PyTorch demonstrably reduces operator overhead and can raise throughput, especially for workloads dominated by many small operations. For teams optimizing inference or tight training loops, the profiling workflow and code examples provide a concrete route to measurable gains while preserving TorchScript compatibility.
Primary source
Hugging Face
huggingface.co