Benchmarks & Evals · June 26, 2026 · 4 min read

Selective Parametric Consolidation: EVAF, loop-drift benchmarks

Haoliang Han's paper (25 Jun 2026) introduces EVAF, a surprise- and valence-gated LoRA consolidation mechanism that performs only 2–3 parametric writes per 200 events.

The Brieftide · June 26, 2026

TL;DR

01 Haoliang Han's paper (25 Jun 2026) introduces EVAF, a surprise- and valence-gated LoRA consolidation mechanism that performs only 2–3 parametric writes per 200 events.
02 The paper introduces the loop-drift protocol and evaluates EVAF across GPT-2, TinyLlama and Mistral-7B.
03 The protocol isolates whether experiences continue to shape behavior after the working context is removed.

Haoliang Han submitted "Memory Depth, Not Memory Access" on 25 Jun 2026, framing a new distinction for long-running language agents: durable, goal-conditioned tendencies stored parametrically rather than mere retrieval. The paper introduces the loop-drift protocol and evaluates EVAF, a surprise- and valence-gated LoRA consolidation mechanism, across GPT-2, TinyLlama and Mistral-7B.

What is memory depth and how was it tested?

Memory depth is durable, goal-conditioned behavior encoded into a small parametric store. The paper tests it with a controlled stress test, the loop-drift protocol, in which the working context is unloaded while the retrieval index remains intact. Agents must persist goal-conditioned behavior under long-loop interference even though retrieval remains available, which separates durable behavioral change (depth) from on-demand factual recall (access).

The protocol isolates whether experiences continue to shape behavior after the working context is removed. Public Memora event streams served as an external diagnostic within the probe and exposed stale-memory invalidation as an unresolved boundary.
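The depth-versus-access distinction can be illustrated with a toy probe. This is a minimal sketch, not the paper's protocol: `ToyAgent`, its three fields, and `run_probe` are hypothetical names, and a plain dictionary of goal weights stands in for the small parametric store. The point is only that unloading the working context leaves retrieval untouched, so any goal that survives must come from what was consolidated.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyAgent:
    """Hypothetical agent with three memory tiers (illustrative only)."""
    working_context: list[str] = field(default_factory=list)       # cleared by the probe
    retrieval_index: dict[str, str] = field(default_factory=dict)  # stays intact
    goal_weights: dict[str, float] = field(default_factory=dict)   # stand-in for a parametric store

    def observe(self, key: str, fact: str, goal: str, consolidate: bool) -> None:
        # Every event is visible in context and indexed for retrieval.
        self.working_context.append(fact)
        self.retrieval_index[key] = fact
        # Only consolidated events change the durable store.
        if consolidate:
            self.goal_weights[goal] = self.goal_weights.get(goal, 0.0) + 1.0

    def unload_context(self) -> None:
        self.working_context.clear()

    def recall(self, key: str) -> Optional[str]:
        # Access: on-demand factual recall from the retrieval index.
        return self.retrieval_index.get(key)

    def persistent_goal(self) -> Optional[str]:
        # Depth: the goal the agent still leans toward with no context loaded.
        if not self.goal_weights:
            return None
        return max(self.goal_weights, key=self.goal_weights.get)


def run_probe(consolidate: bool) -> tuple[Optional[str], Optional[str]]:
    agent = ToyAgent()
    agent.observe("deadline", "report due Friday", goal="finish_report",
                  consolidate=consolidate)
    agent.unload_context()
    return agent.recall("deadline"), agent.persistent_goal()


access_only = run_probe(consolidate=False)
with_depth = run_probe(consolidate=True)
```

In this toy, both agents still recall the fact after the unload, but only the consolidating one retains a goal, which is the gap the probe is designed to expose.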

How does EVAF perform compared with retrieval access?

EVAF produced stronger goal persistence and post-unload recovery, with measured scores between 0.812 and 0.904, while retrieval proved strongest on shallow factual recall, with short-fact accuracy between 0.956 and 0.973. EVAF achieved those persistence gains with only 2 to 3 parametric writes per 200 events.
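For scale, 2 to 3 writes per 200 events is a write rate of 1% to 1.5%. The sketch below shows one way a surprise- and valence-gated decision could produce such sparse writes. The gating rule, the thresholds, and the names `should_write` and `write_rate` are assumptions for illustration; the brief does not give EVAF's actual rule, and the event values are hand-made rather than taken from the paper.

```python
def should_write(surprise: float, valence: float,
                 surprise_min: float = 0.8, valence_min: float = 0.5) -> bool:
    """Assumed gate: write only when an event is both surprising and
    strongly valenced (positive or negative). Thresholds are hypothetical."""
    return surprise >= surprise_min and abs(valence) >= valence_min


def write_rate(writes: int, events: int) -> float:
    """Fraction of events that triggered a parametric write."""
    return writes / events


# Hand-made stream of 200 (surprise, valence) pairs: 197 routine events
# and 3 that are surprising and strongly valenced.
events = [(0.1, 0.0)] * 197 + [(0.9, 0.8), (0.95, -0.7), (0.85, 0.9)]

writes = sum(should_write(s, v) for s, v in events)
rate = write_rate(writes, len(events))
```

With these made-up values the gate fires 3 times in 200 events, the upper end of the reported range.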

The paper reports that selective consolidation factorizes into two controllable dimensions: selection and actuation. Mechanism controls with matched random gates isolate the effect of selection beyond that of sparse writing alone. Fixed-inner controls across GPT-2, TinyLlama, and Mistral-7B indicate that inner-loop write strength depends on the base model. A Mistral-7B matched-gate inversion revealed asymmetric coupling between selection and actuation when actuation is miscalibrated.
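A matched random gate can be sketched as a control that writes exactly as often as the selective gate but at randomly chosen events, so any remaining difference is attributable to which events were chosen rather than to how rarely writes happen. This reading of "matched" (same number of writes) is an assumption, and `matched_random_gate` is a hypothetical helper, not code from the paper.

```python
import random


def matched_random_gate(selective_mask: list[bool], seed: int = 0) -> list[bool]:
    """Return a mask of the same length and the same number of writes as
    `selective_mask`, with write positions drawn uniformly at random."""
    n = len(selective_mask)
    k = sum(selective_mask)                # write budget to match
    rng = random.Random(seed)              # seeded for reproducible controls
    chosen = set(rng.sample(range(n), k))  # k distinct event positions
    return [i in chosen for i in range(n)]


# Example: a selective gate that fired on 3 of 200 events.
selective = [False] * 200
for position in (10, 57, 140):
    selective[position] = True

control = matched_random_gate(selective, seed=0)
```

Both masks have 3 writes in 200 events, so comparing agents trained under each holds sparsity fixed and varies only the selection.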

Why it matters

Memory depth reframes the memory problem for long-running agents: retrieval keeps facts available but does not decide which experiences should durably change behavior after context unload. The paper demonstrates a concrete, low-bandwidth consolidation method (EVAF) that writes only 2 to 3 times per 200 events yet materially improves goal persistence. That suggests systems that combine retrieval for factual recall with selective parametric consolidation for durable tendencies could behave more consistently over long tasks.

The diagnostic using Public Memora event streams also flags a practical limitation: stale-memory invalidation remains unresolved, meaning parametric consolidation can introduce its own maintenance costs and failure modes even as it supplies complementary depth.

What to watch

Follow work that tests EVAF-style consolidation on broader event streams and task families, and any methods that address stale-memory invalidation on Public Memora-style traces. Also watch for experiments that quantify inner-loop write strength on models beyond GPT-2, TinyLlama and Mistral-7B to see whether the model-dependence observed here generalizes.

Summary of key source facts: the paper was submitted 25 Jun 2026; EVAF is described as a surprise- and valence-gated LoRA consolidation mechanism; retrieval short-fact accuracy was reported as 0.956–0.973; EVAF goal persistence and post-unload recovery were reported as 0.812–0.904; EVAF used 2–3 parametric writes per 200 events; experiments included GPT-2, TinyLlama and Mistral-7B; Public Memora event streams served as an external diagnostic exposing stale-memory invalidation.

Selected loop-drift probe results (as reported)

Item	Retrieval	EVAF
Short-fact accuracy	0.956–0.973	n/a
Goal persistence / post-unload recovery	weaker (not quantified)	0.812–0.904
Parametric writes per 200 events	n/a	2–3
Models evaluated	n/a	GPT-2, TinyLlama, Mistral-7B

Written by The Brieftide · Source: arXiv
