Multimodal AI · June 22, 2026 · 5 min read

Amazon Nova embeddings beat Cohere for Vexcel aerial search

Amazon Nova Multimodal Embeddings, evaluated on Vexcel imagery via Amazon Bedrock.

The Brieftide · June 22, 2026

TL;DR

01 Amazon Nova Multimodal Embeddings, evaluated on Vexcel imagery via Amazon Bedrock.
02 The test ran roughly 100 distinct configurations over Grant Park in Chicago using OpenStreetMap as automated ground truth and two benchmark queries: "swimming pools" and "roads".
03 The evaluation compared which embedding model, fusion strategy, captioning approach, and search method works best for multi-view aerial imagery.

Amazon Nova Multimodal Embeddings produced the highest F1 scores on both benchmark queries when the AWS Generative AI Innovation Center and Vexcel evaluated embedding models, fusion strategies, captioning, and search methods for searchable aerial imagery. The test ran roughly 100 distinct configurations over Grant Park in Chicago, using OpenStreetMap as automated ground truth and two benchmark queries: "swimming pools" and "roads".

What did the evaluation test?

The evaluation asked which embedding model, fusion strategy, captioning approach, and search method work best for multi-view aerial imagery. The team tested three embedding models: Amazon Nova Multimodal Embeddings, Amazon Titan Multimodal Embeddings G1, and Cohere Embed v4. Fusion options included per-view embeddings, late fusion (average and max-pool), LLM-weighted attention fusion, and Cohere’s multi-image batch encoding. Captioning trials used Amazon Nova 2 Lite and Anthropic’s Claude. The framework ran about 100 distinct configurations across these axes, scoring results with precision, recall, and F1 against OpenStreetMap data pulled via the Overpass API.
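As a rough illustration of the scoring step, the sketch below computes precision, recall, and F1 for one query. It assumes matching happens at the tile level (a retrieved tile counts as correct if OpenStreetMap marks the feature inside it); the source does not spell out the matching rule, and the function name and tile IDs are hypothetical.

```python
def precision_recall_f1(retrieved, ground_truth):
    """Score retrieved tile IDs against ground-truth tile IDs.

    Assumption: tile-level matching. The evaluation's exact matching
    rule is not described in the source.
    """
    retrieved, ground_truth = set(retrieved), set(ground_truth)
    true_positives = len(retrieved & ground_truth)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 tiles returned for "swimming pools",
# 3 tiles contain a pool according to the ground truth, 2 overlap.
retrieved_tiles = {"t1", "t2", "t3", "t4"}
truth_tiles = {"t2", "t3", "t5"}
precision, recall, f1 = precision_recall_f1(retrieved_tiles, truth_tiles)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so a configuration only scores well if it both finds most true features and avoids flooding results with false matches.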

Beyond model names, the tests examined operational choices: whether to index LLM-generated captions alongside embeddings, whether to extract up to 25 keyword tags from captions for metadata-filtered k-nearest neighbor lookup, and which search strategy suits discrete object detection ("swimming pools") versus distributed infrastructure detection ("roads"). The evaluation explicitly asked which combination minimizes per-feature training and lets users "index once, then query using natural language."
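The idea of metadata-filtered k-nearest neighbor lookup can be sketched in a few lines: keep only tiles whose caption-derived tags contain a required keyword, then rank the survivors by embedding similarity. This is an in-memory illustration with made-up two-dimensional vectors, not the team's OpenSearch configuration; the record layout and function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filtered_knn(query_vec, required_tag, index, k=2):
    """Pre-filter by tag, then return the k most similar tile IDs."""
    candidates = [doc for doc in index if required_tag in doc["tags"]]
    ranked = sorted(
        candidates,
        key=lambda doc: cosine(query_vec, doc["embedding"]),
        reverse=True,
    )
    return [doc["tile_id"] for doc in ranked[:k]]


# Hypothetical index records; real embeddings have far more dimensions.
index = [
    {"tile_id": "a", "tags": {"pool", "house"}, "embedding": [0.9, 0.1]},
    {"tile_id": "b", "tags": {"road"}, "embedding": [1.0, 0.0]},
    {"tile_id": "c", "tags": {"pool"}, "embedding": [0.2, 0.8]},
]
result = filtered_knn([1.0, 0.0], "pool", index, k=2)
print(result)
```

Tile "b" is the closest vector to the query but is excluded because it lacks the required tag, which shows both the benefit of the pre-filter (fewer irrelevant candidates) and its risk (a missing tag hides a relevant tile).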

How is the system built and how does it handle multi-view tiles?

The system uses a five-stage pipeline: area-of-interest (AOI) selection, imagery ingest, embed and index, search, and evaluate. It treats each geographic tile as up to seven complementary views: an orthophoto, four oblique views (north, south, east, west), a Digital Surface Model (DSM), and a Digital Terrain Model (DTM). This multi-view design motivated the fusion experiments, because an individual view can miss details that other views capture.
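Two of the fusion strategies named in the evaluation, average and max-pool late fusion, combine per-view embeddings into one vector per tile. The sketch below shows the arithmetic on toy three-dimensional vectors; it assumes missing views are simply skipped, which the source does not confirm, and the function names are illustrative.

```python
def average_fusion(view_embeddings):
    """Element-wise mean across the available view embeddings."""
    views = [v for v in view_embeddings if v is not None]
    if not views:
        raise ValueError("at least one view embedding is required")
    return [sum(dim) / len(views) for dim in zip(*views)]


def max_pool_fusion(view_embeddings):
    """Element-wise maximum across the available view embeddings."""
    views = [v for v in view_embeddings if v is not None]
    if not views:
        raise ValueError("at least one view embedding is required")
    return [max(dim) for dim in zip(*views)]


# Toy example: an orthophoto and one oblique view are present,
# a third view is missing (None) and is skipped by assumption.
tile_views = [
    [1.0, 0.0, 2.0],  # orthophoto
    [3.0, 4.0, 0.0],  # oblique, north
    None,             # missing view
]
fused_mean = average_fusion(tile_views)
fused_max = max_pool_fusion(tile_views)
print(fused_mean, fused_max)
```

Averaging smooths the views into a consensus vector, while max-pooling keeps the strongest signal in each dimension from any single view; both yield one vector per tile, so the index stays smaller than with per-view embeddings.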

Ingest pulls tiles from Vexcel’s API at a configurable zoom level, with rate limiting set to 100 requests per second and S3 caching for reproducibility. Embedding and captioning run on Amazon Bedrock models; embeddings and optional captions are indexed into Amazon OpenSearch Serverless or S3 Vectors. At query time, a natural-language query is embedded with the same model used for indexing and matched against the index; the system dynamically enables search methods based on which fields the index contains. The modular interfaces let teams swap models or fusion strategies without code changes, which made the roughly 100-configuration sweep practical.
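One way to picture "dynamically enables search methods based on which fields exist" is a small capability check over the index schema. The field names (`embedding`, `caption`, `tags`) and method names below are hypothetical placeholders, not identifiers from the actual system.

```python
def available_search_methods(index_fields):
    """Return the search methods an index can support.

    Field and method names are illustrative assumptions only.
    """
    fields = set(index_fields)
    methods = []
    if "embedding" in fields:
        methods.append("vector_knn")
    if "caption" in fields:
        methods.append("caption_text")
    if {"embedding", "tags"} <= fields:
        methods.append("tag_filtered_knn")
    return methods


# An index built without captioning only supports plain vector search.
embeddings_only = available_search_methods(["tile_id", "embedding"])
# An index with captions and extracted tags unlocks the other methods.
full_index = available_search_methods(["tile_id", "embedding", "caption", "tags"])
print(embeddings_only, full_index)
```

Checking capabilities at query time rather than hard-coding them lets the same search code run against every configuration in a sweep, whether or not a given run generated captions or tags.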

Why it matters

Embedding-driven semantic search removes the need to train a bespoke computer vision model for every new feature, a costly cycle of labeling and retraining across industries such as insurance, real estate, government, infrastructure, and agriculture. Vexcel collects high-resolution imagery across 45+ countries and territories, and turning billions of pixels into actionable answers requires a repeatable, model-agnostic approach. By showing a single embedding family delivered the best F1 performance on two different task types, the work points to a practical path: index once across multi-view tiles and use natural-language queries to surface diverse features without per-feature model engineering.

What to watch

Vexcel has evolved this work into Vexcel Intelligence, a searchable imagery product currently in preview; the next signals to watch are broader benchmarks beyond Grant Park and whether labeled evaluations reproduce the same Amazon Nova advantage at larger scale. Also watch for how captioning plus metadata-filtered k-NN performs in production: the team tested caption extraction and up to 25 tags per tile as a pre-filter, but the tradeoffs between index cost and retrieval accuracy will determine real-world adoption.

Embedding models tested and outcomes

Model	Type	Outcome
Amazon Nova Multimodal Embeddings	Embedding model (Bedrock)	Delivered the highest F1 across both benchmark queries
Amazon Titan Multimodal Embeddings G1	Embedding model (Bedrock)	Evaluated as a comparator
Cohere Embed v4	Embedding model (Bedrock)	Evaluated as a comparator

Written by The Brieftide · Source: AWS Machine Learning
