Sentence Transformers Now Does Multimodal Embedding and Reranking (v5.4)

Sentence Transformers has been my go-to library for embedding and reranker models for years. It’s simple, well-documented, and just works. The v5.4 update adds something I’ve been waiting for: proper multimodal support. Now you can encode and compare text, images, audio, and video using the same API you already know.

What’s New?

Multimodal embedding models map inputs from different modalities (text, images, audio, video) into a shared embedding space. That means you can compare a text query against image documents, find video clips matching a description, or build RAG pipelines that work across modalities. The reranker (Cross Encoder) models get the same treatment: they can now score relevance between mixed-modality pairs.

This isn’t just a toy feature. Visual document retrieval, cross-modal search, and multimodal RAG are real use cases that have been awkward to implement until now. You had to glue together separate models for each modality, or use heavyweight multimodal models that weren’t designed for retrieval. Sentence Transformers makes it feel natural.

Installation

You’ll need extra dependencies depending on which modalities you want to use:

# Image support
pip install -U "sentence-transformers[image]"
# Audio support
pip install -U "sentence-transformers[audio]"
# Video support
pip install -U "sentence-transformers[video]"
# Or combine extras as needed
pip install -U "sentence-transformers[image,video,train]"

A word of caution: VLM-based models like Qwen3-VL-2B need a GPU with at least 8 GB of VRAM. The 8B variants want closer to 20 GB. If you don’t have a local GPU, Google Colab or a cloud GPU service will work. On CPU, these models will be painfully slow. Stick with text-only or CLIP models if you’re CPU-bound.
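
A back-of-the-envelope check makes those VRAM numbers plausible. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts the weights only; activations, image tokens, and framework overhead come on top, which is presumably where the rest of the headroom goes. The function name is my own, not part of any library.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


# Assumption: fp16/bf16 weights, i.e. 2 bytes per parameter.
for size in (2, 8):
    print(f"{size}B parameters: ~{weight_memory_gb(size):.0f} GB of weights")
```

So roughly 4 GB of weights for a 2B model and 16 GB for an 8B model, before anything else is loaded.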

Multimodal Embedding Models

Loading a multimodal model is identical to loading a text-only model:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

The model auto-detects which modalities it supports. No extra configuration needed. If you want to control image resolution or model precision, you can pass extra keyword arguments to the constructor (precision, for instance, goes through model_kwargs), but for most use cases the defaults just work.

Encoding Images

model.encode() now accepts images alongside text. You can pass URLs, local file paths, or PIL Image objects:

img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])
print(img_embeddings.shape)

Cross-Modal Similarity

Since the model maps everything into the same embedding space, you can compute similarities between text and image embeddings directly:

text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])

similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)

As expected, “A green car parked in front of a yellow building” matches the car image (0.51), and “A bee on a pink flower” matches the bee image (0.67). The hard negatives get lower scores.

You’ll notice the scores aren’t close to 1.0. That’s the modality gap: embeddings from different modalities tend to cluster in separate regions of the space. Cross-modal similarities are typically lower than within-modal ones (e.g., text-to-text), but the relative ordering is preserved, so retrieval still works.
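
To make the modality gap concrete, here is a toy sketch in plain Python. The vectors are synthetic, not model outputs: the first two dimensions stand for "content" (car vs. bee) and the third is a made-up modality offset that pushes text and image embeddings apart. The helper name is my own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Synthetic embeddings: [car-ness, bee-ness, modality offset]
text_embeddings = [[1.0, 0.0, 0.5],    # text about a car
                   [0.0, 1.0, 0.5]]    # text about a bee
image_embeddings = [[1.0, 0.0, -0.5],  # image of a car
                    [0.0, 1.0, -0.5]]  # image of a bee

# Rows are texts, columns are images.
cross_modal = [[cosine(t, i) for i in image_embeddings] for t in text_embeddings]
best_image = [row.index(max(row)) for row in cross_modal]

print([[round(s, 2) for s in row] for row in cross_modal])
print(best_image)  # index of the best-matching image for each text
```

In this toy setup the matching pairs score 0.6 rather than 1.0 because of the offset, yet each text still picks the right image. That is the same pattern as in the real scores above: lower absolute values, correct ordering.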

Encoding Queries and Documents

For retrieval tasks, use encode_query() and encode_document() instead of encode(). Many retrieval models prepend different instruction prompts depending on whether the input is a query or a document, and these methods handle that automatically.

query_emb = model.encode_query("Find images of cars")
doc_emb = model.encode_document("https://example.com/car.jpg")
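
Conceptually, the difference between the two methods comes down to which prompt gets attached to the input before encoding. The snippet below is only an illustration of that idea for text inputs: the prompt strings and the apply_prompt helper are hypothetical, and real models ship their own prompts (if any) in their configuration.

```python
# Hypothetical prompt table, for illustration only.
PROMPTS = {
    "query": "Instruct: Retrieve images that match the query.\nQuery: ",
    "document": "",
}


def apply_prompt(text: str, role: str) -> str:
    """Prepend the role-specific prompt, mimicking the idea behind
    encode_query() / encode_document() for text inputs."""
    return PROMPTS.get(role, "") + text


print(repr(apply_prompt("Find images of cars", "query")))
print(repr(apply_prompt("A green car parked outside", "document")))
```

The practical takeaway: if you encode queries with plain encode(), a model that expects a query prompt may see inputs that differ from what it was trained on.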

Multimodal Reranker Models

Reranker models score the relevance of a query-document pair. With multimodal rerankers, the document can be an image, a video, or a text-image combination.

from sentence_transformers import CrossEncoder

model = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")

scores = model.predict([
    ("A bee on a flower", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"),
    ("A car on the road", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"),
])
print(scores)

You can also use model.rank(), which takes a single query and a list of documents, scores every pair, and returns the results sorted by score. This is the method I’d recommend for production use.
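
If it helps to see what a ranking call boils down to, here is a model-free sketch: score every (query, document) pair, then sort by score. The rank_sketch function and the word-overlap scorer are stand-ins I made up so the example runs without downloading anything; a real reranker would supply the scores. The result format (dicts with corpus_id and score) mirrors what I understand CrossEncoder.rank() to return.

```python
def rank_sketch(query, documents, score_fn, top_k=None):
    """Score each (query, document) pair and return results sorted best-first."""
    results = [
        {"corpus_id": idx, "score": score_fn(query, doc)}
        for idx, doc in enumerate(documents)
    ]
    results.sort(key=lambda r: r["score"], reverse=True)
    return results if top_k is None else results[:top_k]


def word_overlap(query: str, document: str) -> float:
    """Toy relevance score: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)


documents = [
    "a car on the road",
    "a bee on a pink flower",
    "a wasp on a table",
]
ranking = rank_sketch("a bee on a flower", documents, word_overlap)
for entry in ranking:
    print(entry["corpus_id"], entry["score"], documents[entry["corpus_id"]])
```

The corpus_id values index back into the original document list, which is handy when the documents are images rather than strings.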

Retrieve and Rerank

The classic two-stage pipeline works with multimodal models too. First, retrieve candidates using a multimodal embedding model (fast, approximate). Then, rerank the top-k with a multimodal reranker (slower but more accurate).

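from sentence_transformers import CrossEncoder, SentenceTransformer

query = "A bee on a pink flower"
docs = [doc_url1, doc_url2, ...]  # placeholders: your image URLs, file paths, or PIL images

# Stage 1: Retrieve the top-k candidates by embedding similarity
bi_encoder = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
query_emb = bi_encoder.encode_query(query)
doc_embs = bi_encoder.encode_document(docs)
similarities = bi_encoder.similarity(query_emb, doc_embs)[0]  # one score per document
top_idx = similarities.topk(min(20, len(docs))).indices.tolist()
candidates = [docs[i] for i in top_idx]

# Stage 2: Rerank the candidates with the cross-encoder
cross_encoder = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")
scores = cross_encoder.predict([(query, doc) for doc in candidates])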
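
The control flow of the two stages can also be sketched without any model. Everything below is a toy: the 2-d embeddings are synthetic, top_k_by_similarity is a hypothetical helper (not a library function), and the reranker scores are hard-coded stand-ins for what a cross-encoder would return.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def top_k_by_similarity(query_emb, doc_embs, k):
    """Indices of the k documents most similar to the query, best first."""
    order = sorted(
        range(len(doc_embs)),
        key=lambda i: cosine(query_emb, doc_embs[i]),
        reverse=True,
    )
    return order[:k]


def retrieve_and_rerank(query_emb, doc_embs, rerank_score, k):
    """Stage 1 narrows the corpus to k candidates; stage 2 reorders only those."""
    candidates = top_k_by_similarity(query_emb, doc_embs, k)
    return sorted(candidates, key=rerank_score, reverse=True)


# Synthetic embeddings and made-up reranker scores, keyed by document index.
query_emb = [1.0, 0.0]
doc_embs = [[0.9, 0.1], [0.8, 0.6], [0.0, 1.0], [1.0, 0.5]]
fake_reranker_scores = {0: 0.2, 1: 0.9, 2: 0.99, 3: 0.5}

candidates = top_k_by_similarity(query_emb, doc_embs, k=3)
final_order = retrieve_and_rerank(query_emb, doc_embs, fake_reranker_scores.get, k=3)
print(candidates)
print(final_order)
```

Note that document 2 never reaches the reranker, even though its stand-in reranker score is the highest. The second stage can only reorder what the first stage retrieves, so k is a recall knob worth tuning.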

Supported Models

The release supports several model families:

Qwen3-VL-Embedding and Qwen3-VL-Reranker: Strong all-around performers, 2B and 8B variants.
CLIP-based models: Lighter weight, good for image-text tasks.
SigLIP: Another solid option for image-text embedding.
Audio models: Whisper-based for speech and audio.
Video models: Based on VideoMAE and similar architectures.

Check the Hugging Face Hub for the full list. New models are being added regularly.

What I Think

This update is genuinely useful. The API is clean, the documentation is solid (which is rare for multimodal features), and the model selection is good. The modality gap is a real limitation if you’re expecting near-1.0 similarity scores, but that’s a fundamental property of multimodal embeddings, not a bug in Sentence Transformers.

If you’re building a multimodal RAG system or doing cross-modal search, this is worth trying. The two-stage retrieve-and-rerank pipeline works well, and the library handles the complexity so you don’t have to.

One thing I’d like to see: better CPU support for the VLM models. I know it’s a hardware limitation, but not everyone has a GPU with 20 GB of VRAM. The CLIP-based models are fine on CPU, but the Qwen models are basically unusable without a GPU.

For training your own multimodal models, there’s a companion blog post on the Hugging Face blog. I haven’t tried it yet, but if it’s as well-done as the inference side, it’ll be worth a read.

Give it a spin and let me know what you build.