Multimodal AI · 5 min read

Amazon Nova embeddings beat Cohere for Vexcel aerial search

Amazon Nova Multimodal Embeddings, evaluated on Vexcel imagery via Amazon Bedrock.

The Brieftide

TL;DR

  • 01 Amazon Nova Multimodal Embeddings produced the highest F1 scores across two benchmark queries in an evaluation on Vexcel imagery via Amazon Bedrock.
  • 02 The test ran roughly 100 distinct configurations over Grant Park in Chicago using OpenStreetMap as automated ground truth and two benchmark queries: "swimming pools" and "roads".
  • 03 The evaluation asked which embedding model, fusion strategy, captioning approach, and search method works best for multi-view aerial imagery.

Amazon Nova Multimodal Embeddings produced the highest F1 scores on both benchmark queries when the AWS Generative AI Innovation Center and Vexcel evaluated embedding models, fusion strategies, captioning, and search methods for searchable aerial imagery. The test ran roughly 100 distinct configurations over Grant Park in Chicago, using OpenStreetMap as automated ground truth and two benchmark queries: "swimming pools" and "roads".

What did the evaluation test?

The evaluation asked which embedding model, fusion strategy, captioning approach, and search method works best for multi-view aerial imagery. The team tested three embedding models: Amazon Nova Multimodal Embeddings, Amazon Titan Multimodal Embeddings G1, and Cohere Embed v4. Fusion options included per-view embeddings, late fusion (average and max-pool), LLM-weighted attention fusion, and Cohere’s multi-image batch encoding. Captioning trials used Amazon Nova 2 Lite and Anthropic’s Claude. The framework ran about 100 distinct configurations across these axes, scoring results with precision, recall, and F1 against OpenStreetMap data pulled via the Overpass API.
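
The scoring step can be illustrated with a minimal sketch. It assumes results and ground truth are both reduced to sets of tile IDs, which is an assumption made here for illustration; the source does not describe how OpenStreetMap features were matched to tiles. The names `Scores`, `score_retrieval`, and the tile IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    precision: float
    recall: float
    f1: float


def score_retrieval(retrieved: set[str], relevant: set[str]) -> Scores:
    """Score a set of retrieved tile IDs against ground-truth tile IDs."""
    true_positives = len(retrieved & relevant)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Hypothetical example: four tiles returned for "swimming pools",
# three tiles that actually contain a pool according to ground truth.
retrieved_tiles = {"t1", "t2", "t3", "t4"}
ground_truth_tiles = {"t2", "t3", "t5"}
example = score_retrieval(retrieved_tiles, ground_truth_tiles)
print(example)
```

In this toy case two of the four returned tiles are correct, so precision is 0.5, recall is 2/3, and F1 is 4/7 (about 0.57). Running the same function over every configuration gives a single comparable number per query.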

Beyond model names, the tests examined operational choices: whether to index LLM-generated captions alongside embeddings, whether to extract up to 25 keyword tags from captions for metadata-filtered k-nearest neighbor lookup, and which search strategy suits discrete object detection ("swimming pools") versus distributed infrastructure detection ("roads"). The evaluation explicitly asked which combination minimizes per-feature training and lets users "index once, then query using natural language."
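
As a rough sketch of metadata-filtered k-nearest neighbor lookup, the snippet below first keeps only tiles whose tags contain a required keyword and then ranks the survivors by cosine similarity. The in-memory index, the document fields (`tile_id`, `embedding`, `tags`), and the function names are hypothetical; the evaluated system used Amazon OpenSearch Serverless or S3 Vectors rather than a Python list.

```python
import math

MAX_TAGS = 25  # the source mentions up to 25 keyword tags per caption


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filtered_knn(query_vec, required_tag, index, k=2):
    """Pre-filter by tag, then return the k most similar tile IDs."""
    candidates = [doc for doc in index if required_tag in doc["tags"][:MAX_TAGS]]
    ranked = sorted(
        candidates,
        key=lambda doc: cosine(query_vec, doc["embedding"]),
        reverse=True,
    )
    return [doc["tile_id"] for doc in ranked[:k]]


# Toy index with two-dimensional embeddings for readability.
index = [
    {"tile_id": "a", "embedding": [1.0, 0.0], "tags": ["pool", "house"]},
    {"tile_id": "b", "embedding": [0.9, 0.1], "tags": ["road"]},
    {"tile_id": "c", "embedding": [0.0, 1.0], "tags": ["pool"]},
]
hits = filtered_knn([1.0, 0.0], "pool", index, k=2)
print(hits)
```

Tile `b` is close to the query vector but is dropped by the tag filter, which shows the tradeoff: the pre-filter narrows the candidate set, but a missing tag can hide an otherwise good match.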

How is the system built and how does it handle multi-view tiles?

The system uses a five-stage pipeline: area of interest (AOI) selection, imagery ingest, embed and index, search, and evaluate. It treats each geographic tile as a set of complementary perspectives. Each tile yields up to seven images: an orthophoto, four oblique views (north, south, east, west), a Digital Surface Model (DSM), and a Digital Terrain Model (DTM). This multi-view design motivated the fusion experiments, because an individual view can miss details that other views capture.
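
The two late-fusion variants named in the evaluation, average and max-pool, can be sketched as element-wise operations over the per-view embeddings of one tile. This is an illustrative sketch only: the view names, the three-dimensional vectors, and the function names are hypothetical, and the source does not say whether the fused vector was normalized afterwards.

```python
def average_fusion(view_embeddings: list[list[float]]) -> list[float]:
    """Element-wise mean across the per-view embeddings of one tile."""
    if not view_embeddings:
        raise ValueError("at least one view embedding is required")
    count = len(view_embeddings)
    return [sum(dim) / count for dim in zip(*view_embeddings)]


def max_pool_fusion(view_embeddings: list[list[float]]) -> list[float]:
    """Element-wise maximum across the per-view embeddings of one tile."""
    if not view_embeddings:
        raise ValueError("at least one view embedding is required")
    return [max(dim) for dim in zip(*view_embeddings)]


# A tile may have fewer than seven views; fusion uses whatever is present.
tile_views = {
    "ortho": [0.2, 0.8, 0.0],
    "oblique_north": [0.6, 0.4, 0.2],
    "dsm": [0.1, 0.0, 0.7],
}
vectors = list(tile_views.values())
fused_avg = average_fusion(vectors)
fused_max = max_pool_fusion(vectors)
print(fused_avg, fused_max)
```

Averaging smooths out a signal that appears in only one view, while max-pooling preserves the strongest activation per dimension from any view. That difference is one plausible reason to test both.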

Ingest pulls tiles from Vexcel’s API at a configurable zoom level, with rate limiting set to 100 requests per second and S3 caching for reproducibility. Embedding and captioning run on Amazon Bedrock models; embeddings and optional captions are indexed into Amazon OpenSearch Serverless or S3 Vectors. At query time, a natural-language query is embedded with the same model and matched against the index, and the system dynamically enables search methods based on which fields exist. The modular interfaces let teams swap models or fusion strategies without code changes, which is what made the roughly 100 configuration tests practical.
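
One way to picture "enabling search methods based on which fields exist" is a lookup from each method to the index fields it needs. The sketch below is an assumption about how such a check might look, not the team's implementation; the method names and field names (`embedding`, `caption`, `tags`) are hypothetical.

```python
# Hypothetical mapping: each search method lists the index fields it requires.
SEARCH_METHOD_REQUIREMENTS = {
    "vector_knn": {"embedding"},
    "caption_text": {"caption"},
    "tag_filtered_knn": {"embedding", "tags"},
}


def enabled_search_methods(index_fields: set[str]) -> list[str]:
    """Return the methods whose required fields are all present in the index."""
    return [
        name
        for name, required in SEARCH_METHOD_REQUIREMENTS.items()
        if required <= index_fields  # subset test
    ]


# An index built without captioning only supports plain vector search.
print(enabled_search_methods({"embedding"}))
# An index with captions and tags supports every method.
print(enabled_search_methods({"embedding", "caption", "tags"}))
```

Keeping the requirements in data rather than in branching code fits the article's point about swapping configurations without code changes: adding a method means adding one entry to the mapping.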

Why it matters

Embedding-driven semantic search removes the need to train a bespoke computer vision model for every new feature, a costly cycle of labeling and retraining that recurs across industries such as insurance, real estate, government, infrastructure, and agriculture. Vexcel collects high-resolution imagery across 45+ countries and territories, and turning billions of pixels into actionable answers requires a repeatable, model-agnostic approach. By showing that a single embedding family delivered the best F1 performance on two different task types, the work points to a practical path: index once across multi-view tiles, then use natural-language queries to surface diverse features without per-feature model engineering.

What to watch

Vexcel has evolved this work into Vexcel Intelligence, a searchable imagery product currently in preview; the next signals to watch are broader benchmarks beyond Grant Park and whether labeled evaluations reproduce the same Amazon Nova advantage at larger scale. Also watch for how captioning plus metadata-filtered k-NN performs in production: the team tested caption extraction and up to 25 tags per tile as a pre-filter, but the tradeoffs between index cost and retrieval accuracy will determine real-world adoption.

Embedding models tested and outcomes
Item | Type | Outcome
Amazon Nova Multimodal Embeddings | Embedding model (Bedrock) | Delivered the highest F1 across both benchmark queries
Amazon Titan Multimodal Embeddings G1 | Embedding model (Bedrock) | Evaluated as a comparator
Cohere Embed v4 | Embedding model (Bedrock) | Evaluated as a comparator

Written by The Brieftide · Source: AWS Machine Learning
