SageMaker detailed metrics and Insights on CloudWatch
Turn on over 100 detailed observability metrics for SageMaker endpoints and view them in a built-in CloudWatch Insights dashboard.
TL;DR
- Turn on over 100 detailed observability metrics for SageMaker endpoints and view them in a built-in CloudWatch Insights dashboard.
- New endpoint configurations enable detailed observability by default, and metrics begin flowing within 2 minutes after the endpoint reaches InService.
- SageMaker emits native OpenTelemetry metrics to CloudWatch, and the SageMaker Insights dashboard surfaces Performance, Capacity, and Reliability views for endpoints and inference components.
Amazon SageMaker now emits over 100 detailed inference metrics and visualizes them in a built-in SageMaker Insights dashboard inside Amazon CloudWatch, giving teams deeper signals for generative AI endpoints. New endpoint configurations enable detailed observability by default, and metrics begin flowing within 2 minutes after the endpoint reaches InService.
What is included in SageMaker detailed metrics and Insights?
SageMaker emits native OpenTelemetry metrics to CloudWatch and the SageMaker Insights dashboard surfaces Performance, Capacity, and Reliability views for endpoints and inference components. The Performance tab plots token-level measures such as Time to First Token (TTFT) and Inter-Token Latency (ITL), plus model and overhead latency breakdowns, engine and request pressure, and token throughput. The Capacity tab shows GPU, CPU, and memory utilization, and the Reliability tab shows Availability Zone distribution, scaling events, cold start anatomy, and insufficient capacity errors.
The dashboard supports both single-model endpoints (SME) and multi-model Inference Component (IC) endpoints, and it automatically shows IC-specific panels when inference components are detected.
How do you enable detailed observability for endpoints?
For new endpoint configurations, detailed observability is turned on by default: the EnableDetailedObservability flag defaults to true and MetricsPublishFrequencyInSeconds defaults to 60 (you can set it below 60 seconds for faster publishing). For existing endpoints, you must opt in by creating a new endpoint configuration whose MetricsConfig enables detailed observability and then updating the endpoint to use it. After an endpoint reaches InService, OpenTelemetry-format metrics begin flowing to CloudWatch within 2 minutes.
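The opt-in flow for an existing endpoint can be sketched in Python. This is a minimal sketch under stated assumptions, not a confirmed API shape: the field names MetricsConfig, EnableDetailedObservability, and MetricsPublishFrequencyInSeconds come from the text above, but placing MetricsConfig at the top level of the CreateEndpointConfig request is an assumption. The helper name, the config, model, and endpoint names, and the 30-second frequency are all hypothetical.

```python
import json


def build_endpoint_config_request(config_name, model_name, instance_type,
                                  publish_frequency_seconds=60):
    """Build a CreateEndpointConfig-style request with detailed observability on.

    Assumption: MetricsConfig sits at the top level of the request. Check the
    current SageMaker API reference before sending this to AWS.
    """
    if publish_frequency_seconds <= 0:
        raise ValueError("publish_frequency_seconds must be positive")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
            }
        ],
        "MetricsConfig": {
            "EnableDetailedObservability": True,
            "MetricsPublishFrequencyInSeconds": publish_frequency_seconds,
        },
    }


request = build_endpoint_config_request(
    config_name="my-endpoint-config-v2",  # hypothetical name
    model_name="my-model",                # hypothetical name
    instance_type="ml.g6.4xlarge",
    publish_frequency_seconds=30,         # assumed sub-minute value
)
print(json.dumps(request["MetricsConfig"], indent=2))

# Sending the request needs AWS credentials and a boto3 version whose service
# model includes these fields, so the calls are left commented out:
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_endpoint_config(**request)
# sm.update_endpoint(EndpointName="my-endpoint",  # hypothetical endpoint name
#                    EndpointConfigName=request["EndpointConfigName"])
```

Building the request as a plain dictionary keeps the sketch runnable offline and makes the assumed field placement easy to correct against the API reference before any call reaches AWS.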
The SageMaker console offers a guided three-step wizard when you choose Enable detailed observability: learn about the metrics, turn on OpenTelemetry enrichment, and select which endpoints to opt in. Note that classic CloudWatch metrics (Invocations, ModelLatency, OverheadLatency) require OTel enrichment to appear in the SageMaker Insights dashboard and be queryable with PromQL. Enable OTel metric enrichment and Resource tags for telemetry from the CloudWatch Console Settings as a one-time, account- and Region-level change.
Prerequisites called out in the documentation include an AWS account with at least one SageMaker real-time inference endpoint; the IAM permissions sagemaker:CreateEndpointConfig, sagemaker:UpdateEndpoint, and cloudwatch:GetMetricData; and a vLLM or SGLang container framework to emit token-level metrics like TTFT and ITL. GPU instances additionally receive per-accelerator utilization metrics.
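The three IAM permissions named above could be collected into a policy document along these lines. This is a sketch only: the helper name is hypothetical, and the single Allow statement with Resource "*" is an assumption made for brevity rather than a recommendation from the documentation.

```python
import json

# Actions listed as prerequisites in the text above.
REQUIRED_ACTIONS = [
    "sagemaker:CreateEndpointConfig",
    "sagemaker:UpdateEndpoint",
    "cloudwatch:GetMetricData",
]


def build_policy(actions, resource="*"):
    """Return an IAM policy document allowing the given actions.

    Assumption: one Allow statement with Resource "*". Scope it down to
    specific ARNs where the action supports resource-level permissions.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": list(actions), "Resource": resource}
        ],
    }


policy = build_policy(REQUIRED_ACTIONS)
print(json.dumps(policy, indent=2))
```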
How does the Insights dashboard help debug latency and capacity issues?
The dashboard front-loads the signals you need to triage spikes: token streaming panels plot TTFT and ITL with P50/P99 toggles, a Latency breakdown panel separates Model Latency from Overhead Latency, and the Engine and request pressure panel shows KV cache and engine queue metrics. The Traffic distribution panel lets you filter by Availability Zone to spot routing or placement issues, and the Token throughput panel reports tokens per second broken down by input/output and by inference engine (SGLang, vLLM, DJL). The documentation gives a concrete example: if an ml.g6.4xlarge instance shows 150 output tokens per second while the model's benchmark is 500, that points to a resource constraint, a configuration issue, or KV cache pressure.
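The arithmetic behind that throughput example can be sketched as a small check. The function names and the 0.5 alert threshold are hypothetical choices for illustration; only the 150 and 500 tokens-per-second figures come from the text above.

```python
def throughput_ratio(observed_tps, benchmark_tps):
    """Return observed output throughput as a fraction of the benchmark."""
    if benchmark_tps <= 0:
        raise ValueError("benchmark_tps must be positive")
    return observed_tps / benchmark_tps


def is_underperforming(observed_tps, benchmark_tps, threshold=0.5):
    """Flag an instance whose throughput falls below threshold * benchmark.

    Assumption: 0.5 is an arbitrary illustrative threshold, not a documented one.
    """
    return throughput_ratio(observed_tps, benchmark_tps) < threshold


ratio = throughput_ratio(150, 500)
print(f"Observed throughput is {ratio:.0%} of the benchmark")
print("Investigate:", is_underperforming(150, 500))
```

At 150 against 500, the instance delivers 30% of the benchmark, which is the kind of gap that warrants a look at the Capacity tab and the KV cache panels.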
You can drill from fleet-level views down to instance, inference component copy, or endpoint detail. Color-coded hexagons and a side-by-side instance performance table surface outliers for TTFT, output TPS, concurrent requests, and KV cache utilization.
Why it matters
Operational teams deploy dozens of models across hundreds of GPU instances, and aggregate metrics are no longer sufficient to pinpoint root causes quickly. Token-level latency, KV cache pressure, per-accelerator GPU utilization, and AZ traffic distribution are the exact signals SREs and MLOps engineers need to decide whether a P99 latency spike stems from model queuing, cache exhaustion, a placement imbalance, or an autoscaling delay. Built-in PromQL support in CloudWatch also reduces the need for custom Grafana or Prometheus setups.
What to watch
Watch whether teams adopt inference component (IC) endpoints for multi-model hosting and per-component scaling, and whether they lower MetricsPublishFrequencyInSeconds below the 60-second default for near real-time alerting. Also track adoption of the CloudWatch PromQL endpoint to integrate SageMaker metrics with Grafana or Datadog.
Written by The Brieftide · Source: AWS Machine Learning