Amazon SageMaker Async Inference adds inline request payloads
InvokeEndpointAsync now accepts a Body parameter up to 128,000 bytes, letting callers skip S3 uploads for small async inferences.
TL;DR
- InvokeEndpointAsync now accepts a Body parameter up to 128,000 bytes, letting callers skip S3 uploads for small async inferences.
- For payloads up to 128,000 bytes, this removes the S3 PUT step and the extra network round-trip that the previous workflow required.
- Body carries raw bytes, is capped at 128,000 bytes, and is mutually exclusive with InputLocation.
Amazon SageMaker AI Async Inference now accepts inline request payloads: the InvokeEndpointAsync API supports a new Body parameter that carries raw bytes up to 128,000 bytes so callers can skip uploading input data to Amazon S3 before each invocation. For payloads up to 128,000 bytes, this removes the S3 PUT step and the extra network round-trip that the previous workflow required.
What changed?
The InvokeEndpointAsync API now accepts a Body parameter that carries raw bytes, capped at 128,000 bytes. Body and InputLocation are mutually exclusive. When you include Body in the request, the payload is sent inline and no S3 upload is required; if you set both Body and InputLocation, the API rejects the request with a synchronous ValidationError. Output behavior is unchanged: the endpoint still writes results to the configured S3 OutputLocation.
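The two rules above (the size cap and mutual exclusivity) can be mirrored on the client before a request is sent. The following is a minimal sketch, not AWS code: the helper name validate_async_input and the constant MAX_INLINE_BYTES are hypothetical, and the check treats 128,000 bytes as an inclusive limit, which is an assumption based on the "up to" wording.

```python
MAX_INLINE_BYTES = 128_000  # cap stated in the announcement; assumed inclusive


def validate_async_input(body=None, input_location=None):
    """Hypothetical pre-flight check mirroring the documented request rules."""
    if body is not None and input_location is not None:
        # The API is described as rejecting this combination synchronously.
        raise ValueError("Body and InputLocation are mutually exclusive")
    if body is None and input_location is None:
        raise ValueError("provide either Body or InputLocation")
    if body is not None:
        if not isinstance(body, (bytes, bytearray)):
            # Body carries raw bytes, so encode strings before calling.
            raise TypeError("Body must be bytes")
        if len(body) > MAX_INLINE_BYTES:
            raise ValueError(
                f"Body is {len(body)} bytes; the inline limit is {MAX_INLINE_BYTES}"
            )
```

Failing locally saves a network call, but the service remains the authority on what it accepts.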
The new parameter is designed to work with existing async endpoints; no model or container changes are expected. The feature is available in 31 commercial AWS Regions, listed here by airport code (BOM, PDX, YUL, IAD, CMH, SFO, LHR, ICN, SYD, HKG, YYC, GRU, QRO, DUB, CDG, FRA, ZRH, ARN, ZAZ, NRT, KIX, SIN, CGK, MEL, KUL, BKK, HYD, TPE, CPT, MXP, TLV).
How does the new inline payload flow differ from the old one?
Before this launch, every async invocation required two steps: upload the input payload to an Amazon S3 bucket, then call InvokeEndpointAsync passing the S3 object URI as InputLocation. Now you can make a single API call by passing the request bytes in Body, avoiding the S3 PUT and the extra client-side plumbing.
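A rough side-by-side of the two flows might look like the sketch below. It assumes Boto3-style clients are passed in (an S3 client and a sagemaker-runtime client) and that the SDK exposes the new parameter under the name Body, as the announcement describes. The function names, the async-input/ key prefix, and the default content type are illustrative choices, not part of the announcement.

```python
import uuid


def invoke_via_s3(s3_client, runtime_client, endpoint_name, payload, bucket,
                  content_type="application/json"):
    """Previous flow: upload the input to S3, then reference it by URI."""
    key = f"async-input/{uuid.uuid4()}"  # hypothetical collision-free key scheme
    s3_client.put_object(Bucket=bucket, Key=key, Body=payload)
    return runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType=content_type,
    )


def invoke_inline(runtime_client, endpoint_name, payload,
                  content_type="application/json"):
    """New flow: a single call carrying the payload bytes inline.

    Assumes the installed SDK accepts Body on invoke_endpoint_async.
    """
    return runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        Body=payload,
        ContentType=content_type,
    )
```

With real clients these would typically be created as boto3.client("s3") and boto3.client("sagemaker-runtime"); passing them in keeps the sketch easy to exercise with stand-ins.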
The practical differences in client code are straightforward. The prior approach required an S3 client, an input bucket, IAM s3:PutObject permission, a naming scheme to avoid key collisions, and a cleanup strategy for stale input objects. The inline flow drops all of those requirements: no S3 client, no input bucket, no IAM grants on the input path, and no stale-object cleanup. In both cases the response from InvokeEndpointAsync contains an OutputLocation; you can poll that location or rely on notifications to obtain the inference result.
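Polling the OutputLocation could be sketched as follows. The helper names are hypothetical, and fetch stands in for whatever function reads the S3 object (for example a thin wrapper around an S3 GetObject call that returns None while the object does not yet exist). The attempt count and delay are arbitrary defaults.

```python
import time
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def wait_for_result(fetch, output_location, attempts=10, delay_s=1.0):
    """Call fetch(bucket, key) until it returns something other than None.

    fetch is a caller-supplied (hypothetical) reader for the output object.
    """
    bucket, key = split_s3_uri(output_location)
    for _ in range(attempts):
        result = fetch(bucket, key)
        if result is not None:
            return result
        time.sleep(delay_s)
    raise TimeoutError(f"no result at {output_location} after {attempts} attempts")
```

Notification-based delivery avoids the polling loop entirely; the sketch covers only the polling option.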
Which approach to use comes down to payload size and retention needs. Use Body for payloads that fit within 128,000 bytes, such as JSON prompts and structured data. Use InputLocation for payloads larger than 128,000 bytes, such as images, audio, or multi-megabyte documents. For mixed workloads, branch on size. If you need to retain input data in S3 for audit or replay, continue using InputLocation.
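For mixed workloads, the branching rule described above can be reduced to a small decision function. This is a sketch under stated assumptions: choose_input_mode and retain_input are hypothetical names, and the 128,000-byte limit is treated as inclusive.

```python
import json

MAX_INLINE_BYTES = 128_000  # cap stated in the announcement; assumed inclusive


def choose_input_mode(payload: bytes, retain_input: bool = False) -> str:
    """Return which request parameter to use: 'Body' or 'InputLocation'."""
    if retain_input:
        # Inputs that must persist for audit or replay still go through S3.
        return "InputLocation"
    if len(payload) <= MAX_INLINE_BYTES:
        return "Body"
    return "InputLocation"


# Example: a small JSON prompt goes inline, a 2 MB blob goes through S3.
small = json.dumps({"prompt": "Summarize this ticket."}).encode("utf-8")
large = bytes(2_000_000)
modes = (choose_input_mode(small), choose_input_mode(large))
```

Note that the size is measured on the encoded bytes, not on the length of the original string.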
What are the concrete benefits?
The announcement lists five customer-facing benefits: reduced latency, by removing one network round-trip and one S3 PUT per request; simpler architecture, because callers can avoid provisioning input buckets and related IAM patterns; fewer error paths, because the request becomes a single API call that either enqueues or does not; lower cost, by removing the per-request S3 PUT charge; and immediate validation feedback, with synchronous errors for size or mutual-exclusivity violations.
The feature preserves backward compatibility. Existing InputLocation workflows continue to work unchanged and both inline and S3 inputs are processed identically once the request is accepted.
Why it matters
For teams that run asynchronous workloads with small payloads and need processing latency measured in seconds or minutes rather than real time, removing the mandatory S3 upload eliminates a recurring friction point: extra latency, extra permissions, and extra operational work to manage input objects. The change simplifies client-side code and reduces per-request cost and failure modes without forcing any changes to endpoints, model containers, or output configuration.
What to watch
Adoption will hinge on how quickly SDKs and existing clients pick up the Body parameter. AWS recommends updating the AWS SDK for Python (Boto3) and testing InvokeEndpointAsync with Body; the walkthrough suggests running pip install --upgrade boto3 and checking the installed version with pip show boto3. Also watch for edge cases in cross-account or audit workflows where teams still need InputLocation to persist inputs in S3.
Getting started, per the announcement: you need an existing Amazon SageMaker AI Async Inference endpoint, the IAM permission sagemaker:InvokeEndpointAsync, the latest AWS SDK (Boto3) installed, and an S3 output bucket configured for the endpoint. The feature is available today. The announcement also reminds users that SageMaker async inference endpoints incur instance-hour charges and that S3 buckets incur storage and request charges, and it recommends cleanup steps to avoid ongoing charges.
Written by The Brieftide · Source: AWS Machine Learning