LoRA vs other PEFT methods: Hugging Face benchmark results
Hugging Face ran unified PEFT benchmarks and found LoRA competitive but sometimes outperformed on memory or accuracy by alternatives.
TL;DR
- Hugging Face ran unified PEFT benchmarks and found LoRA competitive but sometimes outperformed on memory or accuracy by alternatives.
- Hugging Face ran unified PEFT benchmarks on June 18, 2026, testing parameter-efficient fine-tuning techniques under identical conditions across an LLM math task and an image-generation task.
- The results show LoRA remains competitive but other PEFT methods sometimes beat LoRA on memory use or test accuracy, depending on the task and configuration.
Hugging Face ran unified PEFT benchmarks on June 18, 2026, testing parameter-efficient fine-tuning techniques under identical conditions across an LLM math task and an image-generation task. The results show LoRA remains competitive but other PEFT methods sometimes beat LoRA on memory use or test accuracy, depending on the task and configuration.
What did Hugging Face test and how?
Hugging Face evaluated many PEFT techniques using the PEFT library on two benchmarks: a math benchmark (training on MetaMathQA and evaluating on GSM8K) and an image-generation benchmark (learning a "cat plushy"). Each technique ran with the same base models, datasets, training and evaluation code, and hardware. The PEFT library implements more than 40 distinct PEFT techniques and integrates with Transformers and Diffusers. The benchmarks track test performance plus VRAM usage, forgetting/drift, runtime, and checkpoint size, and they aim to run on consumer hardware.
The benchmarks are designed so users can add experiments by providing a PEFT config and running a script, and Hugging Face exposes results in a Space for up-to-date comparisons.
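As a rough illustration of what "providing a PEFT config" can look like, the sketch below builds a LoRA configuration as plain JSON. The field names follow PEFT's `LoraConfig`, but the values, the target module names, and the idea that an experiment is driven by a single JSON file like this are assumptions for illustration, not the benchmark's actual settings or layout. It also assumes the article's "rank stabilized" LoRA variant corresponds to PEFT's `use_rslora` flag.

```python
import json

# Hypothetical experiment config. Field names mirror PEFT's LoraConfig;
# every value here is illustrative, not taken from the benchmark.
lora_experiment = {
    "peft_type": "LORA",
    "task_type": "CAUSAL_LM",
    "r": 32,                                 # adapter rank (assumed value)
    "lora_alpha": 64,                        # scaling numerator (assumed value)
    "lora_dropout": 0.0,
    "target_modules": ["q_proj", "v_proj"],  # assumed module names
    "use_rslora": True,                      # assumed mapping to "rank stabilized" LoRA
}

# Serialize the config the way it might be stored next to an experiment.
config_text = json.dumps(lora_experiment, indent=2)
print(config_text)
```

Flipping `use_rslora` to `False` in a sketch like this would describe a vanilla LoRA run with otherwise identical settings, which is the kind of one-variable comparison the benchmarks are built for.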
How did LoRA compare to other PEFT techniques?
LoRA dominates usage statistics but is not universally best in the benchmarks. Of 20,834 Hugging Face model cards that mention exactly one PEFT technique, 20,509 mention LoRA (98.4%). A separate sample of 10,000 image-generation checkpoints found 7,111 LoRAs, reported as a 95.0% share; because 7,111 of 10,000 is 71.1%, that percentage presumably uses a smaller denominator, such as only the checkpoints identified as PEFT adapters. A GitHub code-snippet search found that 71.3% of peft import results refer to LoRA. In the math benchmark, LoRA with rank-stabilized initialization achieved 53.2% test accuracy at 22.6 GB peak VRAM, Lily reached 54.9% at 25.6 GB, and BEFT scored 32.9% at 20.2 GB. Vanilla LoRA achieved 48.1% at 22.5 GB in the same setup.
Those numbers illustrate tradeoffs along a Pareto frontier: LoRA (stabilized) sits on the frontier for test accuracy versus memory, but other techniques occupy different points. On the image-generation task, LoRA produced a DINO similarity of 0.697 at 9.97 GB peak VRAM while OFT achieved 0.708 at 9.01 GB, meaning OFT strictly dominated LoRA on those two metrics for that task. The benchmarks also compare runtime, forgetting, and checkpoint size, and the leaderboard can change when you switch which metric you prioritize.
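To make "strictly dominated" and "Pareto frontier" concrete, here is a minimal sketch that applies the usual two-metric dominance test to the figures quoted above. The `Result` type and function names are hypothetical helpers written for this example; they are not part of the PEFT library or the benchmark code.

```python
from typing import NamedTuple


class Result(NamedTuple):
    method: str
    score: float    # test accuracy or similarity; higher is better
    vram_gb: float  # peak VRAM; lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on both metrics and better on one."""
    no_worse = a.score >= b.score and a.vram_gb <= b.vram_gb
    strictly_better = a.score > b.score or a.vram_gb < b.vram_gb
    return no_worse and strictly_better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep every result that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Figures as quoted in this article.
math_results = [
    Result("LoRA (rank-stabilized)", 53.2, 22.6),
    Result("LoRA (vanilla)", 48.1, 22.5),
    Result("BEFT", 32.9, 20.2),
    Result("Lily", 54.9, 25.6),
]
image_results = [
    Result("LoRA", 0.697, 9.97),
    Result("OFT", 0.708, 9.01),
]

math_frontier = [r.method for r in pareto_frontier(math_results)]
image_frontier = [r.method for r in pareto_frontier(image_results)]
print("math frontier:", math_frontier)
print("image frontier:", image_frontier)
```

Restricted to the four math results quoted here, no method dominates another, so all four stay on this reduced frontier: each one either uses less memory or scores higher than the others. The full leaderboard covers more methods and may look different. On the image task, OFT beats LoRA on both metrics, so only OFT survives.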
Why it matters
LoRA's popularity is partly self-reinforcing: high visibility, tutorials, and tool support make it a default choice. The benchmarks show that defaulting to LoRA can leave tradeoffs on the table, because alternatives can use less memory, score slightly higher, or, as with OFT on the image task, do both. Teams constrained by consumer-grade GPU memory may prefer memory-efficient methods like BEFT or LoRA-FA, though in the math benchmark BEFT's lower VRAM use came with much lower accuracy (32.9%). Those prioritizing peak accuracy might accept higher VRAM for methods such as Lily.
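One way to turn that tradeoff into a decision rule is to pick the most accurate method whose peak VRAM fits a given budget. The sketch below does this with the math-benchmark figures quoted in this article; the helper name and the budget values are hypothetical, and peak VRAM measured in the benchmark setup will not transfer exactly to other models or batch sizes.

```python
from typing import Optional

# (test accuracy %, peak VRAM in GB) as quoted in this article.
MATH_RESULTS = {
    "LoRA (rank-stabilized)": (53.2, 22.6),
    "LoRA (vanilla)": (48.1, 22.5),
    "BEFT": (32.9, 20.2),
    "Lily": (54.9, 25.6),
}


def best_under_budget(results: dict, vram_budget_gb: float) -> Optional[str]:
    """Return the highest-accuracy method whose peak VRAM fits the budget."""
    fitting = {
        name: accuracy
        for name, (accuracy, vram_gb) in results.items()
        if vram_gb <= vram_budget_gb
    }
    if not fitting:
        return None  # nothing fits; the budget is below every measured peak
    return max(fitting, key=fitting.get)


# Hypothetical budgets, chosen only to show how the pick changes.
picks = {budget: best_under_budget(MATH_RESULTS, budget) for budget in (16.0, 21.0, 24.0, 26.0)}
for budget, method in picks.items():
    print(f"{budget:.0f} GB budget -> {method}")
```

With these figures, a tight budget leaves only BEFT, a mid-range budget favors rank-stabilized LoRA, and only a budget above Lily's 25.6 GB peak makes Lily the pick.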
Benchmarks conducted under identical conditions reduce comparison noise that plagues paper-to-paper claims, so these results give practitioners a practical basis to choose a PEFT method for a specific dataset and hardware profile rather than relying on single-paper claims.
What to watch
Check the Hugging Face Space for updated benchmark results and community-contributed experiments, since the PEFT library makes it straightforward to add new configs and runs. The next meaningful signals will be community submissions that shift the Pareto frontier for specific tasks or that show a technique consistently improving across multiple benchmarks under fair hyper-parameter tuning.
| Method | Test accuracy (%) | Peak VRAM (GB) |
|---|---|---|
| LoRA (rank-stabilized initialization) | 53.2 | 22.6 |
| Vanilla LoRA | 48.1 | 22.5 |
| BEFT | 32.9 | 20.2 |
| Lily | 54.9 | 25.6 |
Written by The Brieftide · Source: Hugging Face