Amazon Nova embeddings beat Cohere for Vexcel aerial search
Amazon Nova Multimodal Embeddings, evaluated on Vexcel imagery via Amazon Bedrock.
TL;DR
- Amazon Nova Multimodal Embeddings, evaluated on Vexcel imagery via Amazon Bedrock.
- The test ran roughly 100 distinct configurations over Grant Park in Chicago using OpenStreetMap as automated ground truth and two benchmark queries: "swimming pools" and "roads".
- The evaluation compared which embedding model, fusion strategy, captioning approach, and search method works best for multi-view aerial imagery.
Amazon Nova Multimodal Embeddings produced the highest F1 scores across two benchmark queries when AWS Generative AI Innovation Center and Vexcel evaluated embedding models, fusion strategies, captioning, and search methods for searchable aerial imagery. The test ran roughly 100 distinct configurations over Grant Park in Chicago using OpenStreetMap as automated ground truth and two benchmark queries: "swimming pools" and "roads".
What did the evaluation test?
The evaluation compared which embedding model, fusion strategy, captioning approach, and search method works best for multi-view aerial imagery. The team tested three embedding models: Amazon Nova Multimodal Embeddings, Amazon Titan Multimodal Embeddings G1, and Cohere Embed v4. Fusion options included per-view embeddings, late fusion (average and max-pool), LLM-weighted attention fusion, and Cohere’s multi-image batch encoding. Captioning trials used Amazon Nova 2 Lite and Anthropic’s Claude. The framework ran about 100 distinct configurations across these axes, scoring results with precision, recall, and F1 against OpenStreetMap data pulled via the Overpass API.
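As a rough illustration of the scoring step, the sketch below computes precision, recall, and F1 for one query by comparing a set of retrieved tiles against a set of ground-truth tiles. The function name and tile IDs are hypothetical, and treating the comparison as set overlap on tile IDs is an assumption; the source does not describe how results were matched to OpenStreetMap features.

```python
def precision_recall_f1(predicted, truth):
    """Score retrieved tile IDs against ground-truth tile IDs (set overlap)."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical tile IDs: what a "swimming pools" query returned versus
# the tiles that ground truth marks as containing a pool.
retrieved = {"t1", "t2", "t3", "t4"}
ground_truth = {"t2", "t3", "t5"}

p, r, f1 = precision_recall_f1(retrieved, ground_truth)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.50 recall=0.67 f1=0.57
```

Because F1 penalizes both missed tiles and false hits, it gives a single number for ranking many configurations against each other.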
Beyond model names, the tests examined operational choices: whether to index LLM-generated captions alongside embeddings, whether to extract up to 25 keyword tags from captions for metadata-filtered k-nearest neighbor lookup, and which search strategy suits discrete object detection ("swimming pools") versus distributed infrastructure detection ("roads"). The evaluation explicitly asked which combination minimizes per-feature training and lets users "index once, then query using natural language."
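To make the metadata-filtered lookup concrete, here is a minimal sketch that first keeps only tiles carrying a required tag and then ranks the survivors by cosine similarity. It is a brute-force, in-memory stand-in and not the vector store the team used; the record layout, the function names, and the sample vectors are all assumptions for illustration. Only the cap of 25 tags comes from the article.

```python
import math

MAX_TAGS = 25  # tag cap per tile, as reported in the article


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_knn(query_vec, required_tag, index, k=3):
    """Pre-filter by tag, then return the k most similar tile IDs."""
    candidates = [doc for doc in index if required_tag in doc["tags"][:MAX_TAGS]]
    ranked = sorted(
        candidates,
        key=lambda doc: cosine(query_vec, doc["embedding"]),
        reverse=True,
    )
    return [doc["tile_id"] for doc in ranked[:k]]


# Hypothetical index records with toy two-dimensional embeddings.
index = [
    {"tile_id": "a", "tags": ["pool", "house"], "embedding": [0.9, 0.1]},
    {"tile_id": "b", "tags": ["road", "parking"], "embedding": [1.0, 0.0]},
    {"tile_id": "c", "tags": ["pool", "park"], "embedding": [0.2, 0.8]},
]

result = filtered_knn([1.0, 0.0], "pool", index, k=2)
print(result)  # prints: ['a', 'c']
```

Tile "b" is the closest vector to the query but is dropped by the tag filter, which shows the tradeoff: the pre-filter shrinks the candidate set, but a missing or wrong tag can hide a relevant tile.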
How is the system built and how does it handle multi-view tiles?
The system uses a five-stage pipeline: area of interest (AOI) selection, imagery ingest, embed and index, search, and evaluate. It treats each geographic tile as a set of up to seven complementary views: an orthophoto, four oblique views (north, south, east, west), a Digital Surface Model (DSM), and a Digital Terrain Model (DTM). This multi-view design motivated the fusion experiments, because individual views miss details that other views capture.
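The two late-fusion variants named in the evaluation, average and max-pool, can be sketched as element-wise reductions over a tile's per-view embeddings. The view keys, function names, and three-dimensional toy vectors below are assumptions for illustration; real embeddings have far more dimensions, and the source does not say whether vectors were normalized before or after fusion.

```python
# Hypothetical keys for the seven views a tile may provide.
VIEW_NAMES = ("ortho", "north", "south", "east", "west", "dsm", "dtm")


def _vectors(view_embeddings):
    """Collect the available view vectors and check they share one dimension."""
    vectors = [view_embeddings[name] for name in VIEW_NAMES if name in view_embeddings]
    if not vectors:
        raise ValueError("tile has no view embeddings")
    if len({len(v) for v in vectors}) != 1:
        raise ValueError("view embeddings must have the same dimension")
    return vectors


def mean_fusion(view_embeddings):
    """Average each dimension across the available views."""
    vectors = _vectors(view_embeddings)
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def max_pool_fusion(view_embeddings):
    """Keep the largest value per dimension across the available views."""
    vectors = _vectors(view_embeddings)
    return [max(column) for column in zip(*vectors)]


# A tile with only three of the seven views present.
tile = {
    "ortho": [0.2, 0.8, 0.0],
    "north": [0.6, 0.4, 0.2],
    "dsm": [0.1, 0.0, 0.7],
}

print(mean_fusion(tile))
print(max_pool_fusion(tile))  # prints: [0.6, 0.8, 0.7]
```

Both functions produce one vector per tile, so the index holds a single entry per tile instead of up to seven. Averaging smooths the views together, while max-pooling preserves a strong signal that appears in only one view.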
Ingest pulls tiles from Vexcel’s API at a configurable zoom, with rate limiting set to 100 requests per second and S3 caching for reproducibility. Embedding and captioning run on Amazon Bedrock models; embeddings and optional captions are indexed into Amazon OpenSearch Serverless or S3 Vectors. At query time, natural-language queries are embedded with the same model and matched against indexes; the system dynamically enables search methods based on which fields exist. The modular interfaces let teams swap models or fusion strategies without code changes, enabling the ~100 configuration tests.
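One way to picture search methods being enabled by the fields an index contains is a small registry that maps each method to the fields it requires. This is a sketch under stated assumptions: the method names, field names, and registry shape are hypothetical and are not taken from the team's framework.

```python
# Hypothetical registry: search method name -> index fields it needs.
SEARCH_METHODS = {
    "vector_knn": {"embedding"},
    "caption_text": {"caption"},
    "tag_filtered_knn": {"embedding", "tags"},
}


def enabled_search_methods(index_fields):
    """Return the methods whose required fields are all present in the index."""
    fields = set(index_fields)
    return sorted(
        name for name, required in SEARCH_METHODS.items() if required <= fields
    )


print(enabled_search_methods(["embedding"]))
# prints: ['vector_knn']
print(enabled_search_methods(["embedding", "caption", "tags"]))
# prints: ['caption_text', 'tag_filtered_knn', 'vector_knn']
```

With this arrangement, an index built without captions simply exposes fewer search methods, and the same query code runs unchanged across configurations.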
Why it matters
Embedding-driven semantic search removes the need to train a bespoke computer vision model for every new feature, a costly cycle of labeling and retraining across industries such as insurance, real estate, government, infrastructure, and agriculture. Vexcel collects high-resolution imagery across 45+ countries and territories, and turning billions of pixels into actionable answers requires a repeatable, model-agnostic approach. By showing a single embedding family delivered the best F1 performance on two different task types, the work points to a practical path: index once across multi-view tiles and use natural-language queries to surface diverse features without per-feature model engineering.
What to watch
Vexcel has evolved this work into Vexcel Intelligence, a searchable imagery product currently in preview; the next signals to watch are broader benchmarks beyond Grant Park and whether labeled evaluations reproduce the same Amazon Nova advantage at larger scale. Also watch for how captioning plus metadata-filtered k-NN performs in production: the team tested caption extraction and up to 25 tags per tile as a pre-filter, but the tradeoffs between index cost and retrieval accuracy will determine real-world adoption.
| Item | Type | Result |
|---|---|---|
| Amazon Nova Multimodal Embeddings | Embedding model (Bedrock) | Delivered the highest F1 across both benchmark queries |
| Amazon Titan Multimodal Embeddings G1 | Embedding model (Bedrock) | Evaluated as a comparator |
| Cohere Embed v4 | Embedding model (Bedrock) | Evaluated as a comparator |
Written by The Brieftide · Source: AWS Machine Learning