Open Source AI · July 3, 2026 · 5 min read

Hawk: NPU kernel generation raises accuracy to 80.0%

Hawk is a training-free framework that lifts NPU kernel generation accuracy from 49.4% to 80.0% and delivers up to a 2.2x runtime speedup.

The Brieftide · July 3, 2026

Hawk, a training-free framework for Neural Processing Unit kernel generation from Junyi Wen and nine co-authors, was submitted to arXiv on 2 Jul 2026. Evaluated on real-world NPU workloads, Hawk raises generation accuracy from 49.4% to 80.0% and achieves up to a 2.2x execution speedup over state-of-the-art baselines.

What is Hawk and how does it work?

Hawk is a training-free system that injects hardware-aware knowledge into kernel generation through three core modules: a Run-Time Knowledge Synthesis Module, a Bottleneck-Aware Knowledge Retrieval Module, and an Effect-Driven Knowledge Distillation Module. The first couples error context with executable semantics via a Triple-Part Executable Knowledge Representation, the second projects queries into orthogonal syntactic and hardware-aligned semantic spaces using a 2D-Retrieval paradigm, and the third uses LLM-driven semantic arbitration to prune errors and consolidate redundancies based on empirical execution feedback.
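The two-axis retrieval idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `KnowledgeEntry` fields, the `hardware_tags` labels, the Jaccard scoring, and the `hw_weight` parameter are all hypothetical stand-ins, and the brief does not describe the paper's actual three-part schema or scoring functions. The sketch only shows how a query might be scored separately on a syntactic axis and a hardware axis before the two scores are combined.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeEntry:
    # Illustrative fields only; the paper's actual three-part schema
    # is not described in this brief.
    error_context: str
    fix_description: str
    hardware_tags: frozenset = field(default_factory=frozenset)


def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when either is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query_text, query_tags, entries, k=1, hw_weight=0.5):
    """Rank entries on two independent axes and return the top k."""
    query_tokens = query_text.lower().split()

    def score(entry):
        # Axis 1: surface (syntactic) similarity of the error text.
        syntactic = jaccard(query_tokens, entry.error_context.lower().split())
        # Axis 2: overlap of hardware-level labels (assumed to be available).
        hardware = jaccard(query_tags, entry.hardware_tags)
        return (1 - hw_weight) * syntactic + hw_weight * hardware

    return sorted(entries, key=score, reverse=True)[:k]


entries = [
    KnowledgeEntry(
        "unaligned buffer access in tile copy",
        "align tiles to the block size",
        frozenset({"memory_alignment", "local_buffer"}),
    ),
    KnowledgeEntry(
        "unaligned buffer access in string parser",
        "pad the input",
        frozenset({"host_cpu"}),
    ),
]

best = retrieve(
    "unaligned buffer access in tile copy loop",
    {"memory_alignment"},
    entries,
)[0]
print(best.fix_description)
```

Both entries share much of the query's wording, so the hardware axis is what separates the entry about tile alignment from the one about a host-side parser. A real system would presumably use learned embeddings rather than token overlap.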

The paper frames these components as an explicit way to supply the hardware-specific priors that large language models lack. The authors argue that naively transplanting code from similar kernels often passes compilation but then crashes at runtime or performs poorly, because the transplanted code violates implicit hardware constraints and strict memory hierarchies. Hawk embeds executable knowledge and execution feedback to avoid those failure modes.
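The distinction between "compiles" and "runs correctly" can be made concrete with a small sketch of feedback-based pruning. The paper describes LLM-driven semantic arbitration; the code below swaps in a much simpler success-rate threshold purely for illustration. The `Outcome` record, the `prune_by_effect` function, and the `min_success_rate` parameter are hypothetical names, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    entry_id: str   # which knowledge entry was used for this generation
    compiled: bool  # the generated kernel compiled
    ran_ok: bool    # the kernel also executed and produced correct output


def prune_by_effect(outcomes, min_success_rate=0.5):
    """Keep entry ids whose runtime success rate meets the threshold."""
    totals, successes = {}, {}
    for o in outcomes:
        totals[o.entry_id] = totals.get(o.entry_id, 0) + 1
        # Compiling alone does not count: the run must also succeed.
        if o.compiled and o.ran_ok:
            successes[o.entry_id] = successes.get(o.entry_id, 0) + 1
    return sorted(
        eid for eid, n in totals.items()
        if successes.get(eid, 0) / n >= min_success_rate
    )


outcomes = [
    Outcome("a", compiled=True, ran_ok=True),
    Outcome("a", compiled=True, ran_ok=True),
    Outcome("b", compiled=True, ran_ok=False),
    Outcome("b", compiled=True, ran_ok=False),
    Outcome("b", compiled=True, ran_ok=True),
]
kept = prune_by_effect(outcomes)
print(kept)
```

Entry "b" compiles every time but succeeds at runtime only once in three attempts, so it is dropped. A filter that looked only at compilation would have kept it.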

How much better is Hawk than prior methods?

In the authors' evaluations on real-world NPU workloads, Hawk elevates generation accuracy from 49.4% to 80.0% and achieves up to a 2.2x execution speedup compared with state-of-the-art baselines. Those two figures are the paper's headline results and represent both correctness of generated kernels and runtime performance gains.

The accuracy jump means fewer faulty or nonfunctional kernels after generation, while the 2.2x figure is the best-case execution speedup ("up to") relative to state-of-the-art baselines. The paper emphasizes that these gains come from explicitly encoding hardware constraints and using empirical execution feedback to iteratively distill correct patterns, not from additional model training.
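For readers who want the headline numbers in other forms, the arithmetic is short. The sketch below uses only the three figures reported above. It assumes that accuracy is the share of generated kernels that are correct, so that the failure rate is 100 minus accuracy, and it treats 2.2x as a best case. The function name is illustrative.

```python
def summarize(baseline_acc=49.4, hawk_acc=80.0, max_speedup=2.2):
    """Derive secondary figures from the reported headline numbers."""
    point_gain = hawk_acc - baseline_acc
    relative_gain = point_gain / baseline_acc * 100
    # Assumes failure rate = 100 - accuracy.
    failure_reduction = point_gain / (100 - baseline_acc) * 100
    # A 2.2x speedup means runtime shrinks to 1/2.2 of the baseline.
    max_runtime_cut = (1 - 1 / max_speedup) * 100
    return {
        "point_gain": round(point_gain, 1),                # 30.6 percentage points
        "relative_gain_pct": round(relative_gain, 1),      # 61.9
        "failure_reduction_pct": round(failure_reduction, 1),  # 60.5
        "max_runtime_cut_pct": round(max_runtime_cut, 1),  # 54.5
    }


summary = summarize()
print(summary)
```

In other words, accuracy rises by 30.6 percentage points (about 62% relative to the baseline), the share of failed generations falls from 50.6% to 20.0%, and in the best case a kernel runs in a little under half the baseline time.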

Why does Hawk target NPUs and why does that matter?

Large language models can automate code generation, but they lack hardware-specific priors that matter for specialized accelerators. The paper states that LLMs "fail catastrophically on NPUs" when they ignore implicit hardware constraints. That failure can mean runtime crashes or severe performance regression despite producing code that compiles. Hawk aims to close that gap by making hardware constraints part of the retrieval and synthesis process.

For teams building kernels for NPUs, that matters because it reduces the manual engineering needed to respect strict memory hierarchies and obscure device constraints. Faster, more accurate kernel generation shortens development cycles and can improve deployed performance by avoiding runtime faults and inefficient implementations.

What did the authors evaluate and where can readers look?

The paper presents extensive evaluations on real-world NPU workloads to back the accuracy and speedup claims. The arXiv entry lists code, data and media toggles and links such as Hugging Face and DagsHub under the paper's Code, Data and Media Associated with this Article section, indicating where supplementary materials may be found. The paper is available via DOI https://doi.org/10.48550/arXiv.2607.01590.

What to watch next

Look for independent reproductions of the 49.4% to 80.0% accuracy improvement and the up-to-2.2x execution speedup on additional NPU models and workloads, and for any public releases of the code and datasets linked from the paper's Code, Data and Media section. Those steps will confirm whether Hawk's hardware-aware knowledge approach generalizes beyond the authors' test set.

Hawk versus Baseline on key metrics

Item	Baseline	Hawk
Generation accuracy	49.4%	80.0%
Execution speed (relative)	1x	up to 2.2x

Written by The Brieftide · Source: arXiv
