ITNet: Integral transform that subsumes convolution, attention
ITNet, submitted to arXiv on 17 Jun 2026, presents a learnable kernel (an MLP) that can reproduce convolution.
TL;DR
- ITNet, submitted to arXiv on 17 Jun 2026, presents a learnable kernel (an MLP) that can reproduce convolution.
- ITNet, introduced in a paper submitted to arXiv on 17 Jun 2026, centers on a single learnable kernel implemented as a small neural network (an MLP) that depends jointly on positions and features.
- ITNet is a unified architecture built around a learnable kernel that models pairwise interactions; that kernel is implemented as an MLP and depends on both positions and features.
ITNet, introduced in a paper submitted to arXiv on 17 Jun 2026, centers on a single learnable kernel implemented as a small neural network (an MLP) that depends jointly on positions and features. The authors describe ITNet as a "learnable integral transform" and a universal approximator of continuous operators, and they report that a single ITNet matches or exceeds specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQA v2 and NLVR2.
What is ITNet and how does it work?
ITNet is a unified architecture built around a learnable kernel that models pairwise interactions. The kernel is implemented as an MLP and depends on both positions and features. To make the integral transform computationally efficient and scalable, the paper describes three implementation techniques: tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization.
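A rough way to picture such an operator is a dense sum over all pairs of inputs, weighted by an MLP that sees both positions and both feature vectors. The sketch below is an illustration only, not the paper's implementation: the scalar kernel output, the tanh activations, the 1/n normalization, and the names `init_mlp`, `mlp` and `integral_transform` are all assumptions made here.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a small MLP; `sizes` lists the layer widths."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, z):
    """Apply the MLP: tanh on hidden layers, linear output layer."""
    for W, b in params[:-1]:
        z = np.tanh(z @ W + b)
    W, b = params[-1]
    return z @ W + b

def integral_transform(params, pos, feat):
    """Assumed form: y_i = (1/n) * sum_j k(p_i, p_j, f_i, f_j) * f_j."""
    n, dp = pos.shape
    df = feat.shape[1]
    # Build the input [p_i, p_j, f_i, f_j] for every (i, j) pair.
    pairs = np.concatenate([
        np.broadcast_to(pos[:, None, :], (n, n, dp)),
        np.broadcast_to(pos[None, :, :], (n, n, dp)),
        np.broadcast_to(feat[:, None, :], (n, n, df)),
        np.broadcast_to(feat[None, :, :], (n, n, df)),
    ], axis=-1)
    k = mlp(params, pairs)[..., 0]   # (n, n) scalar kernel values
    return k @ feat / n              # discretized integral over sources j

rng = np.random.default_rng(0)
pos = rng.normal(size=(8, 2))        # 8 points with 2-D positions
feat = rng.normal(size=(8, 4))       # 4 feature channels per point
params = init_mlp(rng, [2 * 2 + 2 * 4, 16, 1])
out = integral_transform(params, pos, feat)   # shape (8, 4)
```

The dense version costs O(n^2) kernel evaluations, which is the cost the efficiency techniques named in the paper are meant to reduce.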
The core idea replaces separate inductive biases with one parameterized operator. The kernel models pairwise interactions, so the model can adapt its behavior from data rather than hard-wiring locality, sequential memory, or content-dependent pairwise interaction. To scale ITNet, the authors introduce tiled kernel fusion to combine computations across tiles, importance-weighted Monte Carlo integration to estimate the integral efficiently, and learned low-rank factorization to reduce the parameter and compute footprint.
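The Monte Carlo idea can be shown with a generic importance-sampling estimator: sample a subset of source positions from a proposal distribution and reweight each term so the estimate stays unbiased. This is a textbook sketch, not the paper's sampler. The random stand-in kernel matrix, the feature-norm proposal, and the function names are assumptions for illustration.

```python
import numpy as np

def exact_transform(kernel, feat):
    """Dense reference: y_i = (1/n) * sum_j kernel[i, j] * feat[j]."""
    return kernel @ feat / kernel.shape[1]

def mc_transform(kernel, feat, proposal, num_samples, rng):
    """Importance-weighted Monte Carlo estimate of exact_transform.

    Draws source indices j ~ proposal and scales each sampled term by
    1 / (n * proposal[j]), which keeps the estimator unbiased.
    """
    n = feat.shape[0]
    idx = rng.choice(n, size=num_samples, p=proposal)
    weights = 1.0 / (n * proposal[idx])            # (num_samples,)
    # A real system would evaluate the kernel only at the sampled columns;
    # here the full matrix is precomputed to allow a comparison.
    return (kernel[:, idx] * weights) @ feat[idx] / num_samples

rng = np.random.default_rng(0)
n, d = 16, 4
feat = rng.normal(size=(n, d))
kernel = rng.normal(size=(n, n))   # stand-in for MLP kernel values (assumption)

# Hypothetical proposal: favor sources with larger feature norm.
norms = np.linalg.norm(feat, axis=1) + 1e-3
proposal = norms / norms.sum()

estimate = mc_transform(kernel, feat, proposal, num_samples=100_000, rng=rng)
max_error = np.abs(estimate - exact_transform(kernel, feat)).max()
```

With far fewer samples than sources the estimate becomes noisy but cheaper, which is the usual trade-off behind this kind of estimator.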
How does ITNet subsume convolution, attention and recurrence?
According to the paper, convolution, self-attention (including multi-head), and autoregressive recurrence (including LSTM, GRU, S4, and Mamba) arise as special cases of the ITNet operator under appropriate parameterizations. The authors claim that, by choosing kernel parameter settings and factorization strategies, ITNet can recover the mathematical forms of those architectures, meaning one learned interaction mechanism can reproduce the behaviors of the three architectural families from data.
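The general principle behind the convolution and attention cases is standard and can be checked numerically: a kernel that depends only on the position offset yields a convolution, and a kernel that is a row-normalized exponential of feature dot products yields softmax attention. The sketch below hand-sets those kernels rather than learning them, and it does not reproduce the paper's parameterization; `kernel_transform`, `conv_kernel` and `attention_kernel` are names invented for this example.

```python
import numpy as np

def kernel_transform(kernel, values):
    """Generic discrete transform: y_i = sum_j kernel[i, j] * values[j]."""
    return kernel @ values

def conv_kernel(n, taps):
    """Position-only kernel: k(i, j) = taps[i - j + r] for |i - j| <= r."""
    taps = np.asarray(taps, dtype=float)
    r = len(taps) // 2
    offsets = np.arange(n)[:, None] - np.arange(n)[None, :]   # i - j
    inside = np.abs(offsets) <= r
    picked = taps[np.clip(offsets + r, 0, len(taps) - 1)]
    return np.where(inside, picked, 0.0)

def attention_kernel(queries, keys):
    """Feature-only kernel: row-wise softmax of scaled dot products."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Convolution case: compare against NumPy's zero-padded 1-D convolution.
x = rng.normal(size=10)
taps = np.array([1.0, 2.0, 3.0])
conv_out = kernel_transform(conv_kernel(len(x), taps), x)
conv_ref = np.convolve(x, taps, mode="same")

# Attention case: single head, 6 tokens, 4-dimensional queries/keys/values.
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
attn = attention_kernel(q, k)
attn_out = kernel_transform(attn, v)
```

In both cases the transform itself is unchanged; only the kernel differs. That is the sense in which a single kernel family can contain both mechanisms.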
The paper also frames ITNet as a universal approximator of continuous operators, positioning the approach as a mathematically general class that contains those existing mechanisms. The authors trained a single ITNet architecture with a shared operator and lightweight modality-specific encoders, and they report that it matches or exceeds specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQA v2 and NLVR2.
Why it matters
Unifying convolution, attention and recurrence into a single learnable operator reduces architectural fragmentation: instead of designing separate blocks for locality, sequence memory, or pairwise content-dependent interactions, one mechanism can adapt to each role from data. That matters for model design and for research into which inductive biases are necessary and which can be learned. The paper also includes concrete engineering steps for efficiency (tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization), which target the usual scalability objections to integral-operator approaches.
What to watch
Watch for the paper's linked code and demos. The arXiv entry lists toggles for Links to Code and Demos, including Hugging Face and Replicate, and a public release there would make replication and performance checks possible. The arXiv page also records a DOI issued via DataCite that is pending registration. Completed DOI registration and public code will be concrete signals that others can benchmark ITNet against existing convolution-, attention- and recurrence-based models.
Written by The Brieftide · Source: arXiv