AI Infrastructure · 4 min read

Amazon SageMaker Async Inference adds inline request payloads

InvokeEndpointAsync now accepts a Body parameter up to 128,000 bytes, letting callers skip S3 uploads for small async inferences.

The Brieftide

TL;DR

  • 01 InvokeEndpointAsync now accepts a Body parameter up to 128,000 bytes, letting callers skip S3 uploads for small async inferences.
  • 02 For payloads up to 128,000 bytes, this removes the S3 PUT step and the extra network round-trip that the previous workflow required.
  • 03 Body carries raw bytes capped at 128,000 bytes, and Body and InputLocation are mutually exclusive.

Amazon SageMaker AI Async Inference now accepts inline request payloads: the InvokeEndpointAsync API supports a new Body parameter that carries up to 128,000 bytes of raw data, so callers can skip uploading input data to Amazon S3 before each invocation. For payloads up to 128,000 bytes, this removes the S3 PUT step and the extra network round-trip that the previous workflow required.

What changed?

The InvokeEndpointAsync API now accepts a Body parameter that carries raw bytes, capped at 128,000 bytes. Body and InputLocation are mutually exclusive. When you include Body in the request, the payload is sent inline and no S3 upload is required; if you set both Body and InputLocation, the API rejects the request with a synchronous ValidationError. Output behavior is unchanged: the endpoint still writes results to the configured S3 OutputLocation.
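The rules above can be mirrored on the client before a request is sent. The following is a minimal sketch, assuming the Boto3 keyword is named Body as in the announcement; build_async_request and MAX_INLINE_BYTES are hypothetical names, and requiring exactly one of the two inputs is this sketch's own assumption.

```python
MAX_INLINE_BYTES = 128_000  # inline cap quoted in the announcement


def build_async_request(endpoint_name, body=None, input_location=None,
                        content_type="application/json"):
    """Return keyword arguments for invoke_endpoint_async.

    Fails fast on the two rules described above: Body and InputLocation
    are mutually exclusive, and Body is capped at 128,000 bytes.
    """
    # Assumption: a request needs exactly one input source.
    if (body is None) == (input_location is None):
        raise ValueError("provide exactly one of body or input_location")

    request = {"EndpointName": endpoint_name, "ContentType": content_type}
    if body is not None:
        if len(body) > MAX_INLINE_BYTES:
            raise ValueError(
                f"inline body is {len(body)} bytes; the cap is {MAX_INLINE_BYTES}"
            )
        request["Body"] = body  # raw bytes, sent inline
    else:
        request["InputLocation"] = input_location  # S3 object URI
    return request
```

With a recent Boto3 installed, the result could be passed as runtime.invoke_endpoint_async(**build_async_request("my-endpoint", body=payload)), where runtime is a sagemaker-runtime client and my-endpoint is a placeholder name. The service still performs its own validation; the local check only saves a round-trip for requests that would be rejected anyway.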

The new parameter is designed to work with existing async endpoints; no model or container changes are expected. The feature is available in 31 commercial AWS Regions, listed here by airport code (BOM, PDX, YUL, IAD, CMH, SFO, LHR, ICN, SYD, HKG, YYC, GRU, QRO, DUB, CDG, FRA, ZRH, ARN, ZAZ, NRT, KIX, SIN, CGK, MEL, KUL, BKK, HYD, TPE, CPT, MXP, TLV).

How does the new inline payload flow differ from the old one?

Before this launch, every async invocation required two steps: upload the input payload to an Amazon S3 bucket, then call InvokeEndpointAsync passing the S3 object URI as InputLocation. Now you can make a single API call by passing the request bytes in Body, avoiding the S3 PUT and the extra client-side plumbing.
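The two flows can be put side by side in code. This is a sketch rather than AWS's reference implementation: s3 and runtime stand for Boto3 s3 and sagemaker-runtime clients passed in by the caller, the Body keyword on invoke_endpoint_async is taken from the announcement, and the function names and the async-inputs/ key prefix are hypothetical.

```python
import uuid


def invoke_via_s3(s3, runtime, endpoint_name, bucket, payload,
                  content_type="application/json"):
    """Previous workflow: upload the input, then reference it by S3 URI."""
    # Hypothetical naming scheme; a random UUID avoids key collisions.
    key = f"async-inputs/{uuid.uuid4()}"
    s3.put_object(Bucket=bucket, Key=key, Body=payload)  # step 1: S3 PUT
    return runtime.invoke_endpoint_async(                # step 2: enqueue
        EndpointName=endpoint_name,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType=content_type,
    )


def invoke_inline(runtime, endpoint_name, payload,
                  content_type="application/json"):
    """New workflow: a single call with the payload carried in Body."""
    return runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        Body=payload,
        ContentType=content_type,
    )
```

The inline version needs no S3 client and no input bucket, which is where the savings described in this section come from.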

The practical differences in client code are straightforward. The prior approach required an S3 client, an input bucket, IAM s3:PutObject permission, a naming scheme to avoid key collisions, and a cleanup strategy for stale input objects. The inline flow drops all of those requirements: no S3 client, no input bucket, no IAM grants on the input path, and no stale-object cleanup. In both cases the response from InvokeEndpointAsync contains an OutputLocation; you can poll that location or subscribe to notifications to obtain the inference result.
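Polling the OutputLocation could look roughly like the sketch below. It assumes s3 is a Boto3 S3 client with read access to the output bucket; split_s3_uri and wait_for_output are hypothetical helpers, the timeout and interval defaults are arbitrary, and the sketch only waits for a successful result without inspecting any failure path.

```python
import time


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"not an S3 object URI: {uri!r}")
    return bucket, key


def wait_for_output(s3, output_location, timeout_s=300.0, interval_s=5.0,
                    sleep=time.sleep):
    """Poll until the result object exists, then return its bytes."""
    bucket, key = split_s3_uri(output_location)
    deadline = time.monotonic() + timeout_s
    while True:
        # Listing by prefix avoids exception handling for a missing key.
        listing = s3.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
        found = [o for o in listing.get("Contents", []) if o["Key"] == key]
        if found:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result at {output_location}")
        sleep(interval_s)
```

For production use, notifications are usually a better fit than a polling loop; the loop is shown only because it is the simplest way to illustrate that the output side is identical for both input modes.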

The choice between the two inputs comes down to payload size and retention needs. Use Body for payloads that fit within 128,000 bytes, such as JSON prompts and structured data. Use InputLocation for payloads larger than 128,000 bytes, such as images, audio, or multi-megabyte documents. For mixed workloads, branch on payload size. If you need to retain input data in S3 for audit or replay, continue using InputLocation.
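The branching rule for mixed workloads is small enough to sketch directly. The names choose_input_mode and retain_input are hypothetical; the 128,000-byte threshold is the one quoted in the announcement and is treated here as inclusive.

```python
MAX_INLINE_BYTES = 128_000  # inline cap quoted in the announcement


def choose_input_mode(payload: bytes, retain_input: bool = False) -> str:
    """Return which request parameter should carry this payload.

    'Body' for payloads that fit inline, 'InputLocation' for larger
    payloads or when the input must be kept in S3 for audit or replay.
    """
    if retain_input or len(payload) > MAX_INLINE_BYTES:
        return "InputLocation"
    return "Body"


# The cap applies to encoded bytes, not characters: 70,000 two-byte
# characters are 140,000 bytes once encoded as UTF-8.
text = "é" * 70_000
mode_for_text = choose_input_mode(text.encode("utf-8"))
```

Measuring the encoded payload matters for JSON prompts containing non-ASCII text, where the byte length can be well above the character count.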

What are the concrete benefits?

The announcement lists five customer-facing benefits: reduced latency by removing one network round-trip and one S3 PUT per request; simpler architecture because callers can avoid provisioning input buckets and related IAM patterns; fewer error paths because the request becomes a single API call that is either enqueued or rejected; lower cost by removing the per-request S3 PUT charge; and immediate validation feedback with synchronous errors for size or mutual-exclusivity violations.

The feature preserves backward compatibility. Existing InputLocation workflows continue to work unchanged and both inline and S3 inputs are processed identically once the request is accepted.

Why it matters

For teams that run asynchronous workloads with small payloads and need processing latency measured in seconds or minutes rather than real time, removing the mandatory S3 upload eliminates a recurring friction point: extra latency, extra permissions, and extra operational work to manage input objects. The change simplifies client-side code and reduces per-request cost and failure modes without forcing any changes to endpoints, model containers, or output configuration.

What to watch

Adoption will hinge on how quickly SDKs and existing clients pick up the Body parameter. AWS recommends updating the AWS SDK for Python (Boto3) and testing InvokeEndpointAsync with Body; the walkthrough suggests running pip install --upgrade boto3 and then confirming the installed version with pip show boto3. Also watch for edge cases in cross-account or audit workflows where teams still need InputLocation to persist inputs in S3.
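Beyond checking the version number, a client can ask the installed SDK whether its service model knows about the new parameter. The sketch below relies on botocore's service-model introspection (client.meta.service_model); supports_inline_body is a hypothetical helper name, and the member name Body is assumed from the announcement.

```python
def supports_inline_body(runtime_client) -> bool:
    """Report whether the installed SDK models Body on InvokeEndpointAsync.

    runtime_client is expected to be a Boto3 'sagemaker-runtime' client.
    """
    service_model = runtime_client.meta.service_model
    operation = service_model.operation_model("InvokeEndpointAsync")
    input_shape = operation.input_shape
    # Older SDK builds would list InputLocation but not Body here.
    return input_shape is not None and "Body" in input_shape.members
```

A caller might run this once at startup, for example with boto3.client("sagemaker-runtime", region_name="us-east-1"), and fall back to the S3 upload path when it returns False.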

To get started, the announcement lists four prerequisites: an existing Amazon SageMaker AI Async Inference endpoint, the IAM permission sagemaker:InvokeEndpointAsync, the latest AWS SDK (Boto3), and an S3 output bucket configured for the endpoint. The feature is available today. The announcement also reminds users that SageMaker async inference endpoints incur instance-hour charges and S3 buckets incur storage and request charges, and it recommends cleanup steps to avoid ongoing charges.

Figure: Async inference request flows before and after inline Body. Previous workflow: the client application PUTs the input object to the Amazon S3 input bucket, then calls the InvokeEndpointAsync API with InputLocation. New workflow: the client application calls the InvokeEndpointAsync API with Body (inline payload). In both workflows the API enqueues the request, and the SageMaker AI async endpoint writes inference output to the configured S3 OutputLocation.
Written by The Brieftide · Source: AWS Machine Learning
