Stripe agentic compliance on Amazon Bedrock: 26% faster reviews
Stripe built ReAct agents on Amazon Bedrock that cut review handling time by 26% while keeping humans in control and audit trails intact.
TL;DR
- Stripe built ReAct agents on Amazon Bedrock that cut review handling time by 26% while keeping humans in control and audit trails intact.
- The system supports compliance reviews at scale for a company that processes $1.4 trillion in annual payment volume across 50 countries and serves millions of companies.
- Stripe designed a three-part architecture: a review interface and orchestrator, a dedicated agent service, and an LLM Proxy that mediates access to foundation models on Amazon Bedrock.
Stripe built a production-grade AI agent system on Amazon Bedrock that reduced review handling time by 26 percent, kept final decisions with human reviewers, and earned helpfulness ratings above 96 percent. The system supports compliance reviews at scale for a company that processes $1.4 trillion in annual payment volume across 50 countries and serves millions of companies.
How did Stripe architect its agentic system?
Stripe designed a three-part architecture: a review interface and orchestrator, a dedicated agent service, and an LLM Proxy that mediates access to foundation models on Amazon Bedrock. The orchestrator runs the review flow, the agent service hosts ReAct agent logic and stateful multi-turn execution, and the LLM Proxy provides a single API to models plus safeguards such as model fallbacks and monitoring.
Stripe rejected running agents on a traditional ML inference engine because agentic workloads are mostly network-bound, can take an indeterminate time to finish, and require flexible schemas and state. Instead, the company built a dedicated agent service. It started as a stateless, synchronous endpoint and now supports stateful agents, and it grew from a few agents at launch to well over 100 agents in less than a year.
How does the ReAct agent framework work in Stripe’s reviews?
Stripe uses a ReAct cycle in which agents alternate between Thought, Tool call, and Observation, so the agent must process each tool output as an explicit observation before continuing. Injecting tool outputs this way grounds agent reasoning in data, prevents hallucinations, maintains context coherence, and creates an auditable trace of tool invocation, observation, and reasoning.
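The Thought, Tool call, Observation cycle can be illustrated with a minimal sketch. This is not Stripe's implementation: `run_react`, `Trace`, the scripted policy, and the `lookup_country` tool are all hypothetical names, and a hard-coded function stands in for the model call so the loop runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical contract: a policy reads the trace so far and returns
# (thought, tool_name, tool_input). A tool_name of None means the agent is
# finished and tool_input carries the final answer.
Step = Tuple[str, Optional[str], str]
Policy = Callable[[List[Tuple[str, str]]], Step]


@dataclass
class Trace:
    """Append-only record of every thought, tool call, and observation."""
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, kind: str, text: str) -> None:
        self.entries.append((kind, text))


def run_react(policy: Policy, tools: Dict[str, Callable[[str], str]],
              question: str, max_turns: int = 5) -> Trace:
    trace = Trace()
    trace.add("question", question)
    for _ in range(max_turns):
        thought, tool_name, tool_input = policy(trace.entries)
        trace.add("thought", thought)
        if tool_name is None:
            trace.add("answer", tool_input)
            return trace
        trace.add("tool_call", f"{tool_name}({tool_input})")
        # The tool output is written back as an explicit observation, so the
        # next thought is produced with that observation in view.
        trace.add("observation", tools[tool_name](tool_input))
    trace.add("answer", "max turns reached; escalate to a human reviewer")
    return trace


# A scripted stand-in for the model (assumption: a real agent would call an LLM).
def scripted_policy(history: List[Tuple[str, str]]) -> Step:
    observations = [text for kind, text in history if kind == "observation"]
    if not observations:
        return ("I need the registered country.", "lookup_country", "acct_123")
    return (f"The registry reports {observations[-1]}.", None, observations[-1])


tools = {"lookup_country": lambda account_id: "IE"}
trace = run_react(scripted_policy, tools, "Where is the business registered?")
for kind, text in trace.entries:
    print(f"{kind}: {text}")
```

The trace doubles as the audit record: each entry shows which tool was invoked, what it returned, and the reasoning that followed.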
To make complex reviews tractable, Stripe decomposes long investigations into composable sub-tasks arranged as a directed acyclic graph. Each sub-task is quality tested and runs only on vetted questions. Agents fetch research and relevant signals through tool calls, and the agents' responses are provided as supplementary information to human reviewers, who must ultimately answer each sub-task. This preserves oversight and accountability while delivering efficiency gains.
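A sub-task graph of this kind can be sketched with Python's standard `graphlib` module. The sub-task names and the `agent_research` and `human_answer` functions are hypothetical placeholders; the point is only that each sub-task runs after its dependencies and that the human answer, not the agent output, is what flows downstream.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set

# Hypothetical review: each sub-task maps to the sub-tasks it depends on.
subtasks: Dict[str, Set[str]] = {
    "verify_business_identity": set(),
    "check_website_content": set(),
    "assess_business_model": {"verify_business_identity", "check_website_content"},
    "final_risk_summary": {"assess_business_model"},
}


def agent_research(task: str, context: Dict[str, str]) -> str:
    # Placeholder for an agent run; returns supplementary research only.
    return f"research for {task} given {sorted(context)}"


def human_answer(task: str, research: str) -> str:
    # Placeholder for the reviewer, who reads the research and decides.
    return f"reviewed:{task}"


def run_review(graph: Dict[str, Set[str]]) -> Dict[str, str]:
    answers: Dict[str, str] = {}
    # static_order() yields every sub-task after all of its dependencies.
    for task in TopologicalSorter(graph).static_order():
        context = {dep: answers[dep] for dep in graph[task]}
        research = agent_research(task, context)
        answers[task] = human_answer(task, research)
    return answers


answers = run_review(subtasks)
print(list(answers))
```

Because `TopologicalSorter` raises `CycleError` on a cyclic graph, a misconfigured review would fail before any sub-task runs.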
Prompt caching on Amazon Bedrock reduced input-token cost, because Stripe pays only for the new observations and thoughts appended at each turn. Decomposing tasks also limits prompt length and prevents excessive turns on a single prompt.
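A rough calculation shows why caching matters in a multi-turn loop, where every turn resends the whole conversation so far. The token counts and the `cached_read_rate` discount below are made-up illustration values, not Bedrock prices, and the sketch ignores any extra charge for writing to the cache.

```python
from typing import Tuple


def input_token_cost(prefix_tokens: int, new_tokens_per_turn: int,
                     turns: int, cached_read_rate: float) -> Tuple[float, float]:
    """Return (without_cache, with_cache) cost in full-price token units.

    cached_read_rate is a hypothetical fraction of the full price charged for
    tokens served from the cache; 0.0 models "pay only for new tokens".
    """
    without_cache = 0.0
    with_cache = 0.0
    context = prefix_tokens  # tokens sent on the current turn
    for turn in range(turns):
        without_cache += context  # the full context is billed every turn
        if turn == 0:
            with_cache += context  # nothing is cached on the first turn
        else:
            cached = context - new_tokens_per_turn
            with_cache += cached * cached_read_rate + new_tokens_per_turn
        context += new_tokens_per_turn  # thoughts and observations appended
    return without_cache, with_cache


without_cache, with_cache = input_token_cost(
    prefix_tokens=2000, new_tokens_per_turn=500, turns=4, cached_read_rate=0.1
)
print(f"without cache: {without_cache:.0f}, with cache: {with_cache:.0f}")
```

With these illustrative numbers the uncached run bills 11,000 full-price token units against roughly 4,250 with caching, and the gap widens as turns accumulate. That is also why capping turns per prompt through decomposition helps.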
What infrastructure decisions helped manage scale and reliability?
Stripe inserted an LLM Proxy microservice between agents and Amazon Bedrock to prevent noisy-neighbor effects, enforce authentication, monitor usage, and enable model fallbacks. The proxy gives teams a single endpoint that can switch model types by argument and apply capabilities like prompt caching and tool calling uniformly.
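The fallback and monitoring role of such a proxy can be sketched in a few lines. Everything here is hypothetical: the `LLMProxy` class, the `ModelUnavailable` exception, and the model names are illustrative, and plain functions stand in for calls to models hosted on Amazon Bedrock.

```python
from typing import Callable, Dict, List


class ModelUnavailable(Exception):
    """Raised by a backend that cannot serve a request (hypothetical)."""


class LLMProxy:
    """Single entry point that routes by model name, falls back, and counts usage."""

    def __init__(self, backends: Dict[str, Callable[[str], str]],
                 fallbacks: Dict[str, List[str]]) -> None:
        self.backends = backends
        self.fallbacks = fallbacks
        self.usage: Dict[str, int] = {}  # successful calls per "caller:model"

    def complete(self, caller: str, model: str, prompt: str) -> str:
        # Try the requested model first, then its configured fallbacks in order.
        for candidate in [model, *self.fallbacks.get(model, [])]:
            try:
                reply = self.backends[candidate](prompt)
            except ModelUnavailable:
                continue
            key = f"{caller}:{candidate}"
            self.usage[key] = self.usage.get(key, 0) + 1
            return reply
        raise ModelUnavailable(f"no backend available for {model}")


# Stand-in backends (assumption: real ones would call a hosted model).
def primary(prompt: str) -> str:
    raise ModelUnavailable("throttled")


def secondary(prompt: str) -> str:
    return f"secondary answered: {prompt}"


proxy = LLMProxy(
    backends={"model-a": primary, "model-b": secondary},
    fallbacks={"model-a": ["model-b"]},
)
reply = proxy.complete("reviews-team", "model-a", "summarize account history")
print(reply)
print(proxy.usage)
```

Keeping per-caller usage in the proxy is one way to spot a noisy neighbor. A production version would presumably add authentication and rate limits at the same choke point.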
Human reviewers drive the final decision. The system treats agent outputs as pre-fetched research and pipes human-reviewed answers as context for deeper questions via the orchestrator. That design preserves an immutable audit trail and supports configurable approval workflows and multi-layered checkpoints.
Why it matters
This approach shows how agentic AI can scale judgment-heavy compliance work without removing human accountability. By cutting review handling time by 26 percent and achieving helpfulness ratings above 96 percent, Stripe demonstrates a way to reduce repetitive analyst work while still meeting regulators’ needs for auditability and traceability. Analysts previously spent up to 80 percent of their time collecting fragmented documentation.
The system also targets broader compliance burdens: Stripe links its method to addressing a $206 billion global compliance burden, and to operational outcomes such as identifying 95 percent of card-testing attacks in real time and reducing unnecessary customer friction by 20 percent.
What to watch
Watch for adoption signals, such as other large payments platforms adopting dedicated agent services and LLM proxy layers, and for metrics showing agent counts beyond Stripe’s “well over 100 agents” figure. Also watch for confirmation that prompt caching and sub-task decomposition remain the main levers for cost and token control.
Written by The Brieftide · Source: AWS Machine Learning