CUGA by IBM: 24 single-file agent apps on a lightweight harness
Open-source CUGA handles planning, execution, state and guardrails so you only write a tool list and a prompt.
TL;DR
- Open-source CUGA handles planning, execution, state and guardrails so you only write a tool list and a prompt.
- The project is installable via pip (pip install cuga) and the hosted gallery includes live demos developers can inspect and clone.
- The harness carries planning and reflection responsibilities so a smaller open-weight model can function where it normally would not; hosted examples run on gpt-oss-120b rather than a frontier API.
IBM's open-source CUGA agent harness, introduced June 23, 2026, ships with two dozen single-file example apps and a lightweight FastAPI harness so developers can focus on tools and prompts rather than orchestration. The project is installable via pip (pip install cuga) and the hosted gallery includes live demos developers can inspect and clone.
What does CUGA provide and how is it configured?
CUGA supplies the orchestration every agentic app otherwise rebuilds: planning, an execution loop, tool-call adapters, long-horizon variable tracking, reflection and self-correction, and state plumbing. You configure a CugaAgent with four arguments (model, tools, special_instructions, cuga_folder) and pick reasoning modes — Fast, Balanced, and Accurate — plus a code-execution sandbox (local, Docker/Podman, or cloud) to trade latency for accuracy.
The harness carries planning and reflection responsibilities so a smaller open-weight model can function where it normally would not; hosted examples run on gpt-oss-120b rather than a frontier API. CUGA has topped agent benchmarks including AppWorld (#1 from 07/25 to 02/26) and WebArena (#1 from 02/25 to 09/25), results the project credits to harness-level machinery rather than per-app tuning.
How do the example apps work and what’s included?
The repository ships cuga-apps: two dozen small, working apps, each a single FastAPI file that wraps one CugaAgent, so anyone who knows FastAPI can read every line. Each app defines a tool list and a prompt; the harness handles invoke(...) and everything below that call.
A representative app, the IBM Cloud advisor, shows the pattern: a make_agent factory builds CugaAgent(model=create_llm(...), tools=_make_tools(), special_instructions=_SYSTEM, cuga_folder=str(_DIR / ".cuga")). The create_llm factory reads environment variables (LLM_PROVIDER, LLM_MODEL) so the app code does not hardcode which model is used. Tools mix inline functions (for app-specific APIs) with shared MCP tools; the project exposes 7 public MCP servers hosting 36 tools on IBM Code Engine, which apps can borrow without you hosting the servers yourself.
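The environment-driven factory described above can be sketched in plain Python. This is a minimal illustration, not CUGA's actual create_llm: the LLMConfig dataclass, the optional env parameter, and the default values are assumptions made for the example. Only the variable names LLM_PROVIDER and LLM_MODEL come from the article.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LLMConfig:
    """Hypothetical stand-in for whatever object the real factory returns."""
    provider: str
    model: str


def create_llm(env: Optional[Mapping[str, str]] = None) -> LLMConfig:
    """Pick provider and model from the environment, not from app code.

    The defaults below are illustrative assumptions, not CUGA's defaults.
    """
    source = os.environ if env is None else env
    provider = source.get("LLM_PROVIDER", "example-provider")
    model = source.get("LLM_MODEL", "gpt-oss-120b")
    return LLMConfig(provider=provider, model=model)


if __name__ == "__main__":
    # Passing a mapping explicitly keeps the demo independent of the real shell.
    print(create_llm({"LLM_PROVIDER": "demo", "LLM_MODEL": "demo-model"}))
```

Because the app only ever calls the factory, swapping models becomes a deployment change (two environment variables) rather than a code change.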
The repository groups apps by family (research, productivity, doc/media RAG, ops, enterprise examples) and tags readiness (ship-ready, for-later, exploratory). The live gallery and an MCP Tool Explorer let you try web search, Wikipedia/arXiv lookups, geocoding, weather, and more before cloning.
How does CUGA keep agents within boundaries?
CUGA embeds governance into the runtime with a policy system you attach to the same agent object. The harness offers six policy types: Intent Guard, Tool Approval, Tool Guide, Playbook, Output Formatter, and CustomPolicy. Intent Guards can refuse requests outright; Tool Approval can pause for a human before a risky tool runs. The project's example Intent Guard blocks a destructive git operation by keyword and returns "Blocked: destructive git flags are not permitted."
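The keyword-style guard can be illustrated with a few lines of plain Python. This is a sketch of the idea only and does not use CUGA's policy classes: the function name intent_guard and the flag list are hypothetical, while the refusal message is the one quoted in the article.

```python
from typing import Optional

# Hypothetical keyword list; the article only says the guard matches by keyword.
DESTRUCTIVE_GIT_FLAGS = ("--force", "--hard")

BLOCK_MESSAGE = "Blocked: destructive git flags are not permitted."


def intent_guard(request: str) -> Optional[str]:
    """Return a refusal message if the request names a blocked git flag, else None."""
    lowered = request.lower()
    if "git" in lowered and any(flag in lowered for flag in DESTRUCTIVE_GIT_FLAGS):
        return BLOCK_MESSAGE
    return None


if __name__ == "__main__":
    print(intent_guard("Run git push --force on main"))
    print(intent_guard("Show me git status"))
```

Returning None for allowed requests lets the caller treat any non-None value as a final answer and stop before planning begins.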
Timing matters: an Intent Guard runs before tool selection, Tool Approval runs after code generation and inspects the tools that code requests, and Output Formatter runs on the final message. Policies match semantically using a sqlite-vec store, so they trigger on meaning rather than only on exact keywords.
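The ordering of the three hooks can be sketched as a single function. This is a hedged illustration of the sequence described above, not CUGA's runtime: every name here (run_with_policies, plan_tools, approve, and so on) is hypothetical, and returning a refusal without passing it through the formatter is an assumption.

```python
from typing import Callable, List, Optional, Set


def run_with_policies(
    request: str,
    intent_guard: Callable[[str], Optional[str]],
    plan_tools: Callable[[str], List[str]],
    needs_approval: Set[str],
    approve: Callable[[str], bool],
    execute: Callable[[str, List[str]], str],
    format_output: Callable[[str], str],
) -> str:
    # 1. Intent Guard: runs before any tool is selected.
    refusal = intent_guard(request)
    if refusal is not None:
        return refusal

    # Planning / code generation decides which tools are requested.
    tools = plan_tools(request)

    # 2. Tool Approval: inspects the requested tools after generation,
    #    pausing for a decision on any tool flagged as risky.
    for tool in tools:
        if tool in needs_approval and not approve(tool):
            return f"Stopped: '{tool}' was not approved."

    result = execute(request, tools)

    # 3. Output Formatter: applied to the final message only.
    return format_output(result)
```

In this sketch a blocked request never reaches planning, and an unapproved tool never reaches execution, which is the practical consequence of where each hook sits.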
Why it matters
CUGA shifts time spent from plumbing to product: teams no longer rebuild planning, state tracking, tool adapters, streaming state to UI, or reflection steps for each new agent. That lets smaller or open models like gpt-oss-120b power production agents because the harness shoulders much of the cognitive load. For organizations, that reduces repeated engineering work and centralizes governance where policies can be applied consistently at runtime.
What to watch
See whether teams move cuga-apps from the gallery into governed production, especially the examples that run “sovereign and governed in production without a rewrite.” Track adoption of the runtime policy hooks (Intent Guard and Tool Approval) and whether the public MCP servers (7 servers, 36 tools) become a shared dependency in enterprise deployments.
Written by The Brieftide · Source: Hugging Face