Fine-Tuning Multimodal Embedding Models: A Hands-On Walkthrough with Sentence Transformers

I’ve been using Sentence Transformers for years now, mostly for text-only embedding and reranker work. But the library’s recent push into multimodal territory—handling images, audio, and video alongside text—has been genuinely impressive. Tom Aarsen’s earlier post covered the basics of using these multimodal models. This one goes further: how to actually train or fine-tune them on your own data.

The example that caught my eye is Visual Document Retrieval (VDR). You know, the kind of task where someone asks “What was the company’s Q3 revenue?” and the system needs to find the right page from a pile of document screenshots, charts, tables and all. It’s a very different skill from matching product photos to descriptions, and it’s exactly where off-the-shelf models fall short.

Why bother fine-tuning?

General-purpose multimodal embedding models like Qwen/Qwen3-VL-Embedding-2B are trained on everything from image-text pairs to visual QA to document understanding. That breadth is useful, but it also means the model isn’t optimized for any single task. Fine-tuning on domain-specific data lets the model learn the patterns that matter for your use case.

The numbers speak for themselves. On the VDR evaluation, the base Qwen model scored an NDCG@10 of 0.888. After fine-tuning on document retrieval data, that jumped to 0.947. That’s not just a marginal improvement—it outperformed every other multimodal model I tested, including ones up to four times larger. Worth the effort, I’d say.

The training components

The training pipeline for multimodal models uses the same SentenceTransformerTrainer you’d use for text-only models. The components are familiar: model, dataset, loss function, training arguments, evaluator, and trainer. The main difference is that your dataset now includes images (or other modalities) alongside text, and the model’s processor handles the image preprocessing automatically.

Model

You have two options. The straightforward one is to fine-tune an existing multimodal embedding model. You can pass processor_kwargs and model_kwargs to control things like image resolution (higher max_pixels means better quality but more memory) and attention implementation:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
    processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
)
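
To get a feel for why max_pixels drives memory, here is a rough sketch of how a pixel budget translates into a visual token count. It is only loosely modelled on how Qwen-style processors bound resolution: the function names are hypothetical, and the patch size of 28 is an assumption borrowed from the `28 * 28` in the snippet above, not a confirmed value for this model.

```python
import math


def fit_to_pixel_budget(width, height, min_pixels=28 * 28, max_pixels=600 * 600, patch=28):
    """Scale (width, height) so the area lands within the pixel budget.

    Sides are snapped to multiples of `patch` (an assumed patch size), so the
    aspect ratio is only approximately preserved.
    """
    area = width * height
    scale = 1.0
    if area > max_pixels:
        scale = math.sqrt(max_pixels / area)
    elif area < min_pixels:
        scale = math.sqrt(min_pixels / area)
    # Round down when shrinking and up when enlarging so the budget is respected.
    snap = math.floor if scale <= 1.0 else math.ceil
    new_width = max(patch, snap(width * scale / patch) * patch)
    new_height = max(patch, snap(height * scale / patch) * patch)
    return new_width, new_height


def visual_tokens(width, height, patch=28):
    """Approximate token count: one token per patch-sized tile (an assumption)."""
    return (width // patch) * (height // patch)


page_width, page_height = 1240, 1754  # an A4 page rendered at roughly 150 DPI
for budget in (300 * 300, 600 * 600, 1200 * 1200):
    w, h = fit_to_pixel_budget(page_width, page_height, max_pixels=budget)
    print(f"max_pixels={budget}: resized to {w}x{h}, ~{visual_tokens(w, h)} visual tokens")
```

Under these assumptions the token count grows roughly linearly with max_pixels, which is why doubling the resolution on each side can quadruple the sequence length the model has to attend over.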

The other option is to start from a fresh Vision-Language Model (VLM) checkpoint that hasn’t been trained for embeddings yet. Sentence Transformers will try to detect the architecture and set up the right pooling and forward methods automatically. If it doesn’t get it perfect, you can tweak the sentence_bert_config.json file manually:

model = SentenceTransformer("Qwen/Qwen3-VL-2B")

Either way, you can check what modalities the model supports with print(model.modalities) or print(model.supports("image")).

Dataset

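For VDR, the dataset format is straightforward. Each training example needs a query (text), a positive document (image), and optionally one or more negative documents (images). In practice that means a Hugging Face datasets.Dataset with columns such as "query", "positive", and "negative". The losses read the columns in order (anchor, then positive, then negatives), so the column order matters more than the exact names.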

I used a custom dataset of document screenshots paired with text queries. The key was making sure the queries were realistic—things people actually search for in documents, not just random captions. The negatives were sampled from other documents in the corpus, which is the standard approach.
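
As a minimal sketch of the negative sampling described above, the snippet below builds (query, positive, negative) rows by drawing each negative at random from the other documents in the corpus. The file names and the `build_triplets` helper are made up for illustration; in a real dataset the positive and negative columns would hold images rather than path strings.

```python
import random


def build_triplets(pairs, seed=0):
    """Turn (query, positive_doc) pairs into rows with one random negative each.

    The negative is sampled from every other document in the corpus, so it is
    never the row's own positive.
    """
    rng = random.Random(seed)
    corpus = sorted({doc for _, doc in pairs})
    rows = []
    for query, positive in pairs:
        candidates = [doc for doc in corpus if doc != positive]
        negative = rng.choice(candidates)
        # Key order mirrors the (anchor, positive, negative) layout of the dataset.
        rows.append({"query": query, "positive": positive, "negative": negative})
    return rows


# Hypothetical query/page pairs; paths stand in for page screenshots.
pairs = [
    ("What was the company's Q3 revenue?", "report_2023_p14.png"),
    ("How many employees joined in 2022?", "report_2023_p31.png"),
    ("Which region had the highest churn?", "deck_sales_p07.png"),
]

for row in build_triplets(pairs):
    print(row)
```

Random negatives are the simplest option. Nothing here checks whether a sampled page happens to answer the query too, so a real pipeline may want an extra filter for false negatives.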

Loss Function

The loss function that worked best here is CachedMultipleNegativesRankingLoss. It’s a variant of the classic MultipleNegativesRankingLoss that uses gradient caching: the batch is pushed through the model in small mini-batches, so memory use stays roughly flat while the effective batch size (and with it the number of in-batch negatives) can grow. This matters when you’re dealing with images, because each forward pass is more expensive than with text alone.
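
For intuition, here is a plain-Python sketch of the in-batch negatives objective that sits underneath this loss. It is not the library’s implementation and it leaves out the caching entirely. It only shows the idea: each query should score its own document higher than every other document in the batch. The scale of 20 is an assumption on my part about a typical default.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where the i-th document is the target for the i-th query.

    All other documents in the batch act as negatives.
    """
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cosine(query, doc) for doc in doc_embs]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own document
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the wrong document

print("aligned loss:", in_batch_ranking_loss(queries, aligned_docs))
print("swapped loss:", in_batch_ranking_loss(queries, swapped_docs))
```

A bigger batch means more negatives per query, which is why being able to raise the batch size without running out of memory tends to help retrieval quality.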

I also used MatryoshkaLoss to train a Matryoshka embedding model. This lets you truncate the embedding dimension at inference time (e.g., from 2048 down to 256) with minimal performance loss. Handy for production deployments where you need to balance speed and accuracy.
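
Truncation itself is simple. As a sketch (the helper name is mine, not the library’s): keep the first few dimensions and re-normalize so cosine similarity still behaves.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalize; return the zero vector unchanged
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]  # toy 3-dimensional embedding
short = truncate_and_normalize(full, 2)
print(short)
```

Matryoshka training is what makes those leading dimensions carry most of the signal. Truncating a model that wasn’t trained this way usually costs a lot more accuracy.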

Training Arguments

Standard stuff here: learning rate, batch size, number of epochs, warmup steps. I used a learning rate of 2e-5 with a linear schedule and 10% warmup. Batch size was limited by GPU memory—images eat up VRAM fast, even with Flash Attention enabled.
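
To make the schedule concrete, here is a small sketch of a linear schedule with 10% warmup and a 2e-5 peak. The function is a hand-rolled illustration rather than the trainer’s own scheduler, and the 1,000-step run length is just an example.

```python
def linear_warmup_lr(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear ramp up to the peak, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # example run length, not taken from the actual training
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, total_steps))
```

The warmup keeps the first updates small while the optimizer statistics settle, which tends to matter more when you start from a large pretrained checkpoint.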

Evaluator

I set up an evaluator on held-out examples from the same VDR dataset to track NDCG@10 during training. This is critical for multimodal training because you can’t always rely on the loss curve alone: the model might overfit to image artifacts or layout patterns that don’t generalize.
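
If NDCG@10 is unfamiliar, this sketch shows how it is computed for a single query with binary relevance. The function names and document IDs are illustrative, and a real evaluator averages the score over all queries.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k positions (ranks start at 1)."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """NDCG@k for one query, treating each relevant document as gain 1."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_doc_ids]
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# The right page ranked first versus ranked second.
print(ndcg_at_k(["page_a", "page_b", "page_c"], {"page_a"}))
print(ndcg_at_k(["page_b", "page_a", "page_c"], {"page_a"}))
```

The logarithmic discount is what makes the metric care about position: dropping the correct page from first to second place costs much more than dropping it from ninth to tenth.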

Trainer

The SentenceTransformerTrainer ties everything together. It handles the training loop, evaluation, checkpointing, and logging. The API is consistent with text-only training, which is a nice touch.

Results

The fine-tuned model (tomaarsen/Qwen3-VL-Embedding-2B-vdr) achieved an NDCG@10 of 0.947, compared to the base model’s 0.888. That’s an absolute gain of 0.059 (roughly 6.6% relative), which is substantial for a retrieval task.

What I found interesting is the comparison against larger models. Some VDR-specific models up to 8B parameters still couldn’t match the performance of this 2B model after fine-tuning. It’s a reminder that model size isn’t everything—domain-specific fine-tuning can close the gap more effectively than just throwing more parameters at the problem.

The Matryoshka dimensions also held up well. Even at 256 dimensions, the fine-tuned model retained most of its accuracy, which makes deployment much cheaper.

Training multimodal reranker models

The same approach works for multimodal reranker models, which take a query and a candidate document (as text and/or image) and output a relevance score. The training pipeline is almost identical, just with a reranker (cross-encoder) architecture and a loss suited to scoring pairs, such as BinaryCrossEntropyLoss.

I haven’t played with this as much, but the principles are the same: start from a pretrained VLM, prepare a dataset with query-document pairs and relevance labels, and fine-tune with the appropriate loss.
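
To illustrate what “relevance labels” buys you, here is a sketch of a pointwise objective for a reranker, assuming binary labels and a binary cross-entropy loss on the raw score. That is one common choice, not something I have confirmed is used for multimodal rerankers here, and the scores below are made up.

```python
import math


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy for one raw score and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# (raw reranker score, relevance label) for a few hypothetical query-page pairs
scored_pairs = [(4.2, 1.0), (-3.1, 0.0), (1.5, 0.0)]

losses = [bce_with_logits(score, label) for score, label in scored_pairs]
print("per-pair losses:", losses)
print("batch loss:", sum(losses) / len(losses))
```

The third pair, an irrelevant page with a positive score, contributes most of the loss, and that is the signal that pushes the reranker to lower its score.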

My take

This is a solid addition to the Sentence Transformers ecosystem. The fact that the training pipeline is nearly identical to text-only training lowers the barrier for anyone who’s already familiar with the library. The performance gains from fine-tuning are real, and the VDR example is a good demonstration of what’s possible.

If you’re working on any task that involves matching text to images—document retrieval, visual search, multimodal RAG—this is worth your time. The code is straightforward, the results are measurable, and the library handles the messy parts (processor setup, modality detection, caching) for you.

One thing I’d like to see: better documentation for custom dataset formats. The VDR example is clear, but not everyone’s data looks like document screenshots. More examples with different modalities (audio, video) and data structures would help.

But overall, this is a well-executed feature. If you’ve been sitting on a multimodal dataset wondering how to fine-tune an embedding model, now you have your answer.
