Gemma 4 on Amazon Bedrock: 31B, 26B-A4B and E2B models
Amazon Bedrock adds three Apache 2.0 Gemma 4 instruction-tuned variants with multimodal input, native function calling and long context.
TL;DR
- 01Amazon Bedrock adds three Apache 2.0 Gemma 4 instruction-tuned variants with multimodal input, native function calling and long context.
- 02Amazon Bedrock now offers the Gemma 4 family from Google DeepMind, released under the Apache 2.0 license.
- 03The instruction-tuned family includes three variants: Gemma 4 31B, Gemma 4 26B-A4B and Gemma 4 E2B, and Bedrock exposes them through the bedrock-mantle endpoint.
Amazon Bedrock now offers the Gemma 4 family from Google DeepMind, released under the Apache 2.0 license. The instruction-tuned family includes three variants: Gemma 4 31B, Gemma 4 26B-A4B and Gemma 4 E2B, and Bedrock exposes them through the bedrock-mantle endpoint.
What Amazon Bedrock is delivering
The Gemma 4 family on Amazon Bedrock is presented as open-weight models designed for "intelligence-per-parameter" across deployment scenarios. Bedrock provides a fully managed service so customers can run inference on AWS-operated infrastructure; prompts and completions are not used to train models and content is not shared with third parties. Independent benchmarking cited in the announcement shows an Intelligence Index of 39 for Gemma 4 31B from Artificial Analysis, compared with a median of 15 in the 4B–40B open-weights class.
Bedrock exposes the models via the bedrock-mantle endpoint (https://bedrock-mantle.{region}.api.aws/openai/v1), which supports the Chat Completions and Responses APIs and uses the same interface as the OpenAI Python and TypeScript SDKs. Bedrock also supports API keys and short-term bearer tokens; short-term API keys expire automatically (maximum 12 hours), and the aws-bedrock-token-generator package can create bearer tokens from native AWS credentials.
Technical specifics and the three variants
All three Gemma 4 variants support text and image input, built-in reasoning mode, and native function calling. They share a hybrid attention design that interleaves local and global attention to enable long contexts while keeping memory small. Key specifications called out in the announcement:
- Gemma 4 31B: model ID google.gemma-4-31b, dense architecture, 30.7B total parameters, 256K token context window.
- Gemma 4 26B-A4B: model ID google.gemma-4-26b-a4b, mixture-of-experts architecture, 25.2B total parameters with 3.8B active per token, 256K token context window; the MoE design yields roughly 4B-class cost and latency with larger-model knowledge capacity.
- Gemma 4 E2B: model ID google.gemma-4-e2b, dense (PLE) architecture, 5.1B total with 2.3B effective parameters, 128K token context window; recommended for latency-sensitive and low-cost multimodal workloads (set reasoning_effort=high for this variant).
The announcement recommends choosing 31B for reasoning- or coding-heavy single-dense-model use cases, 26B-A4B when cost-sensitive high-throughput workloads need knowledge breadth, and E2B for the fastest, lowest-cost multimodal classification tasks.
Bedrock accepts images either as inline base64-encoded data URLs or as s3:// URLs; arbitrary public https:// image URLs are not supported. The same Chat Completions API works for vision tasks by including image data alongside text messages.
Access control guidance in the post recommends attaching the AWS managed policy AmazonBedrockMantleInferenceAccess to grant read and inference-creation access on Mantle. For broader management needs, use AmazonBedrockMantleFullAccess. The announcement names two specific IAM actions: bedrock-mantle:CreateInference and bedrock-mantle:CallWithBearerToken.
Why it matters
Making Gemma 4 available as open-weight models on a fully managed Bedrock endpoint reduces operational friction for teams that want to evaluate, fine-tune or run sensitive workloads on open models while keeping inference inside AWS infrastructure. The combination of long context windows (up to 256K), MoE cost characteristics on the 26B-A4B variant, and multimodal input creates concrete choices for teams balancing latency, cost and capability.
What to watch
Watch for customer benchmarks comparing latency and cost across the three variants on real workloads, and for guidance or artifacts showing fine-tuning workflows on proprietary data. Also monitor support updates in the Amazon Bedrock model catalog and any published Gemma 4 model card details referenced by Bedrock.
Written by The Brieftide · Source: AWS Machine Learning
The Brieftide Daily · 06:00
Briefs like this one, in your inbox every morning.
Continue reading
More in Foundation ModelsNVIDIA BioNeMo recipes: LoRA fine-tunes ESM2-3B, Evo2-1B
LoRA adapters in NVIDIA BioNeMo Recipes fine-tune ESM2-3B and Evo2-1B using ~1% of parameters.
small models: Thousand Token Wood multi-model finance game
Thousand Token Wood v2 runs four labs' small models, gpt-oss-20b, MiniCPM3-4B, Nemotron-Mini-4B and a fine-tuned Qwen 0.5B.
OpenAI Frontier Governance Framework: Licensing, Audits, Safety
Sets safety testing, staged access licensing, independent audits and regulatory alignment for high-capability models.
GridSFM (Microsoft Research) predicts AC power flow in ms
Microsoft Research's small foundation model GridSFM estimates AC optimal power flow in milliseconds for real-time grid operations.