Running Transformers.js in a Chrome Extension: What I Learned Building With Gemma 4

Hugging Face recently shipped a demo browser extension powered by Transformers.js and Gemma 4 E2B. It’s meant to help users navigate the web with local AI. I spent some time digging into the source and the write-up they published, and there’s a lot of practical stuff here that’s worth unpacking — especially if you’ve ever tried to squeeze a local model into a Chrome extension under Manifest V3.

Let’s get one thing out of the way: running AI models in a browser extension is still kind of janky. But this project shows a clean way to do it. The architecture choices are sensible, and the messaging patterns are worth stealing.

Who should care

If you’re a developer who wants to add local AI features to a Chrome extension — things like text generation, semantic search, or page summarization — without shipping everything to a server, this is for you. The constraints of Manifest V3 make this harder than it should be, but the approach here works.

The end result is a side panel chat UI, a background service worker that hosts the models, and a content script that can extract page data and highlight elements. All inference runs locally on the user’s machine.

The architecture: three runtimes, one brain

Manifest V3 forces you to split your extension into separate execution contexts: the background service worker, the side panel (or popup), and content scripts that run on web pages. Each has different capabilities and limitations. The key insight here is to keep the heavy lifting in the background and treat everything else as thin clients.

Background service worker

This is where the models live. Text generation via Gemma 4, embeddings via MiniLM — both run here. The background worker is the control plane: it handles model initialization, inference, tool execution, and maintains the conversation history. It’s also the single point of contact for all messaging.

One thing that tripped me up when I first read this: MV3 service workers can be suspended and restarted at any time. That means your model state has to be recoverable. The project handles this by checking what’s already cached and re-initializing when needed. It’s not elegant, but it works.
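
Here’s a minimal sketch of that recover-on-demand idea. It is not taken from the project; lazyAsync is a hypothetical helper and the stub factory stands in for a real model load. The point is that a restarted worker starts with empty module state, so every entry point asks for the model through a memoized initializer instead of assuming it is already there.

```typescript
// Hypothetical helper: memoize an async initializer so concurrent callers
// share one in-flight load, and a failed load can be retried on the next call.
function lazyAsync<T>(factory: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending === null) {
      pending = factory().catch((err) => {
        pending = null; // forget the failure so the next call starts over
        throw err;
      });
    }
    return pending;
  };
}

// Stand-in for a real model load. After a worker restart this module is
// evaluated again, so the first call simply runs the factory once more.
let loadCount = 0;
const getGenerator = lazyAsync(async () => {
  loadCount += 1;
  return { generate: (prompt: string) => `echo: ${prompt}` };
});
```

Every message handler would then call getGenerator() before doing any work, which makes "model not loaded yet" and "worker just woke up" the same code path.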

Side panel

The side panel is the UI layer. It’s built with React and handles chat input/output, streaming updates, and setup controls. It doesn’t touch models directly — it sends typed messages to the background and waits for responses. This keeps the UI responsive and avoids duplicate model loads.

Content script

The content script is the page bridge. It can access the DOM, extract page data, and highlight elements. But it can’t communicate with models directly. Everything goes through the background worker via messages.

This split isn’t just good practice: it follows from Chrome’s extension model. Content scripts only get a small subset of the extension APIs (mainly messaging and storage), and service workers have no DOM. So you have to wire them together with messages.

The messaging contract: keep it typed

Messaging is the backbone of any MV3 extension, and this project does it right. All messages are typed through enums in a shared types file. The pattern is straightforward:

Side panel sends actions to background: check models, initialize, generate text, clear messages.
Background sends updates back: download progress, message list updates.
Background also sends commands to content script: extract page data, highlight elements, clear highlights.

The rule is simple: the background is the single coordinator. Side panel and content script are specialized workers that request actions and render results.
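
A typed contract along these lines might look like the sketch below. The write-up says the project uses enums; I’m using const objects here as a close equivalent. CHECK_MODELS, INITIALIZE_MODELS, and AGENT_GENERATE_TEXT come from the write-up, while AGENT_CLEAR, the prompt field, and both helper functions are my own assumptions.

```typescript
// Names marked "source" appear in the write-up; the others are hypothetical.
const BackgroundTask = {
  CHECK_MODELS: "CHECK_MODELS", // source
  INITIALIZE_MODELS: "INITIALIZE_MODELS", // source
  AGENT_GENERATE_TEXT: "AGENT_GENERATE_TEXT", // source
  AGENT_CLEAR: "AGENT_CLEAR", // hypothetical
} as const;

// Discriminated union: the `type` field decides which payload is allowed.
type PanelRequest =
  | { type: typeof BackgroundTask.CHECK_MODELS }
  | { type: typeof BackgroundTask.INITIALIZE_MODELS }
  | { type: typeof BackgroundTask.AGENT_GENERATE_TEXT; prompt: string }
  | { type: typeof BackgroundTask.AGENT_CLEAR };

// Runtime guard: messages arrive as untyped data, so validate before trusting.
function isPanelRequest(value: unknown): value is PanelRequest {
  if (typeof value !== "object" || value === null) return false;
  const msg = value as { type?: unknown; prompt?: unknown };
  switch (msg.type) {
    case BackgroundTask.CHECK_MODELS:
    case BackgroundTask.INITIALIZE_MODELS:
    case BackgroundTask.AGENT_CLEAR:
      return true;
    case BackgroundTask.AGENT_GENERATE_TEXT:
      return typeof msg.prompt === "string";
    default:
      return false;
  }
}

// Exhaustive switch: forgetting to handle a new request type becomes a
// compile-time error because of the `never` assignment.
function describeRequest(req: PanelRequest): string {
  switch (req.type) {
    case BackgroundTask.CHECK_MODELS:
      return "check cached models";
    case BackgroundTask.INITIALIZE_MODELS:
      return "download and initialize models";
    case BackgroundTask.AGENT_GENERATE_TEXT:
      return `generate a reply to: ${req.prompt}`;
    case BackgroundTask.AGENT_CLEAR:
      return "clear the conversation";
    default: {
      const unreachable: never = req;
      return unreachable;
    }
  }
}
```

The guard matters more than it looks: the type system only protects the sender, so the receiver still has to check what actually arrived.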

A typical flow looks like this: the side panel sends AGENT_GENERATE_TEXT. The background appends the message to the conversation history, runs inference (possibly with tool calls), then emits MESSAGES_UPDATE. The side panel re-renders from the updated message list.

This avoids the mess of multiple components trying to manage state independently. Everything flows through the background.
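
To make the single-coordinator flow concrete, here’s a small sketch under my own assumptions. createCoordinator, generate, and emit are hypothetical names: generate stands in for model inference (tool calls are left out), and emit stands in for whatever sends MESSAGES_UPDATE to the side panel.

```typescript
type Role = "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

type Update = { type: "MESSAGES_UPDATE"; messages: ChatMessage[] };

// Hypothetical coordinator: it owns the history and is the only writer.
function createCoordinator(
  generate: (history: readonly ChatMessage[]) => Promise<string>, // stand-in for inference
  emit: (update: Update) => void, // stand-in for messaging the side panel
) {
  const history: ChatMessage[] = [];

  return {
    // Mirrors the flow described above: append, infer, then emit the full list.
    async handleGenerateText(prompt: string): Promise<void> {
      history.push({ role: "user", content: prompt });
      const reply = await generate(history);
      history.push({ role: "assistant", content: reply });
      // Send a copy so receivers cannot mutate the coordinator's state.
      emit({ type: "MESSAGES_UPDATE", messages: [...history] });
    },
  };
}
```

Because the panel always receives the complete message list, it can render from that snapshot and never has to reconcile its own copy of the conversation.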

Model loading: what runs where and why

This extension uses two models: Gemma 4 for text generation (quantized to q4f16) and MiniLM for embeddings (fp32). The split is intentional — Gemma handles reasoning and tool decisions, while MiniLM generates vector embeddings for semantic similarity search.
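
The semantic search half is simple once you have embeddings. The sketch below assumes each chunk of page text has already been embedded as a plain number array; the function names and the EmbeddedChunk shape are mine, not the project’s.

```typescript
interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of vector lengths.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / Math.sqrt(normA * normB);
}

// Score every chunk against the query and keep the best topK, highest first.
function rankBySimilarity(
  query: readonly number[],
  chunks: readonly EmbeddedChunk[],
  topK: number,
): Array<{ text: string; score: number }> {
  return chunks
    .map((chunk) => ({ text: chunk.text, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

A linear scan like this is fine for the few hundred chunks a single page produces; you’d only need an index for much larger collections.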

Both models run in the background service worker. This gives a single model host for all tabs and sessions, avoids duplicate memory usage, and keeps the side panel responsive. Because models are loaded from the extension origin (chrome-extension://) rather than per-website origins, artifacts are cached once for the entire extension install rather than being duplicated across sites.

The model lifecycle is explicit: CHECK_MODELS inspects what’s already cached and estimates remaining download size. INITIALIZE_MODELS downloads and initializes models, emitting DOWNLOAD_PROGRESS to the UI. Long-lived instances are reused after setup.
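
The bookkeeping behind CHECK_MODELS and DOWNLOAD_PROGRESS can be sketched as two pure functions. This is my guess at the shape of the problem, not the project’s code: ModelFile, its fields, and both function names are assumptions, and the file sizes would have to come from somewhere like model metadata.

```typescript
interface ModelFile {
  name: string;
  sizeBytes: number;
  cached: boolean;
}

// What a CHECK_MODELS-style step could report: bytes still to download.
function remainingDownloadBytes(files: readonly ModelFile[]): number {
  return files
    .filter((file) => !file.cached)
    .reduce((sum, file) => sum + file.sizeBytes, 0);
}

// Combine per-file byte counts into one 0..1 fraction for a progress bar.
// Only uncached files count, so already cached files do not inflate the total.
function overallProgress(
  files: readonly ModelFile[],
  loadedBytes: ReadonlyMap<string, number>,
): number {
  const pending = files.filter((file) => !file.cached);
  const total = pending.reduce((sum, file) => sum + file.sizeBytes, 0);
  if (total === 0) return 1; // nothing left to download
  const loaded = pending.reduce(
    // Clamp each file so a misreported count cannot push progress past 100%.
    (sum, file) => sum + Math.min(loadedBytes.get(file.name) ?? 0, file.sizeBytes),
    0,
  );
  return loaded / total;
}
```

Weighting by bytes rather than by file count keeps the bar honest when one weights file dwarfs a handful of small config files.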

One practical detail: the text generation pipeline uses consistent KV caching via a new DynamicCache class. This speeds up repeated inference significantly, especially for chat-style interactions where context builds over multiple turns.
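
To see why a persistent cache helps chat in particular: each turn’s prompt usually starts with the entire previous conversation, so the keys and values for that shared prefix can be reused. The sketch below is purely conceptual and is not the DynamicCache API; both helpers are hypothetical and the token IDs are made up.

```typescript
// Count how many leading tokens two sequences have in common.
function sharedPrefixLength(previous: readonly number[], next: readonly number[]): number {
  const limit = Math.min(previous.length, next.length);
  let i = 0;
  while (i < limit && previous[i] === next[i]) i++;
  return i;
}

// With a reusable cache, only the tokens after the shared prefix need a
// fresh forward pass; everything before it is already cached.
function tokensToProcess(
  cachedTokens: readonly number[],
  promptTokens: readonly number[],
): { reused: number; fresh: number } {
  const reused = sharedPrefixLength(cachedTokens, promptTokens);
  return { reused, fresh: promptTokens.length - reused };
}
```

This is also why "consistent" matters: if the prompt for turn N+1 doesn’t reproduce turn N’s tokens exactly, the shared prefix shrinks and most of the benefit goes with it.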

What I’d do differently

This architecture is solid, but there are a few things I’d question if I were building from scratch.

First, keeping the full conversation history in the background worker’s memory is convenient, but it’s fragile on two counts: long conversations can balloon quickly, and anything held only in memory is lost when the service worker is suspended. I’d probably offload history to chrome.storage.session (which survives worker restarts but is cleared when the browser session ends) or IndexedDB (which persists on disk), and only keep the active context in memory.
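
The "only keep the active context in memory" half could look something like this. It’s a sketch of my suggestion, not anything in the project: selectActiveContext is a hypothetical helper, and the character budget is a crude stand-in for a real token budget.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Walk backwards from the newest message and keep whatever fits the budget.
// The full history would live in storage; this is only what the model sees.
function selectActiveContext(history: readonly ChatMessage[], maxChars: number): ChatMessage[] {
  const selected: ChatMessage[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = history[i].content.length;
    if (used + cost > maxChars) break; // older messages no longer fit
    selected.unshift(history[i]); // keep chronological order
    used += cost;
  }
  return selected;
}
```

A real version would probably pin the system prompt and count tokens with the model’s tokenizer, but the shape stays the same.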

Second, the content script is fairly thin in this design — it basically just extracts DOM content and applies highlights. If you need more complex page interactions (form filling, dynamic content monitoring, cross-origin requests), you’ll need to extend the messaging contract significantly. The pattern scales, but the complexity adds up.

Third, model download progress is emitted to the UI, which is good UX. But there’s no mention of handling download failures or retries gracefully. In my experience, users will close the extension mid-download, and you need to handle partial downloads without corrupting the cache.
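
For the retry half of that, a generic wrapper with exponential backoff is usually enough. Everything here is hypothetical (withRetry, RetryOptions, backoffDelayMs), and it deliberately says nothing about cache integrity: guarding against half-written files is a separate problem this sketch doesn’t solve.

```typescript
// Delay doubles with each attempt (attempt 0 waits baseMs), capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

interface RetryOptions {
  maxAttempts: number;
  baseMs: number;
  maxMs: number;
  sleep: (ms: number) => Promise<void>; // injectable so tests need no real timers
}

// Run the task until it succeeds or the attempts run out, waiting between tries.
async function withRetry<T>(
  task: (attempt: number) => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < options.maxAttempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      const isLastAttempt = attempt === options.maxAttempts - 1;
      if (!isLastAttempt) {
        await options.sleep(backoffDelayMs(attempt, options.baseMs, options.maxMs));
      }
    }
  }
  throw lastError;
}

// Timer-based sleep for real use.
const realSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));
```

In the extension you’d wrap each file download in withRetry with realSleep, and surface the final failure to the side panel instead of leaving the progress bar stuck.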

Why this matters

Running models locally in a browser extension isn’t just a technical novelty. It means no data leaves the user’s machine, no API keys to manage, and no server costs. For privacy-sensitive applications — personal assistants, document analysis, offline tools — this is a genuinely useful pattern.

The tradeoff is performance and model size. Gemma 4 E2B is a relatively small model, but even quantized it takes time to load and run on consumer hardware. You’re not replacing cloud-hosted GPT-4 with this approach. But for focused tasks like page summarization, question answering, or semantic search, it’s more than adequate.

The takeaway

This project is a well-architected reference for anyone trying to run Transformers.js in a Chrome extension. The messaging patterns are clean, the model lifecycle is explicit, and the separation of concerns between runtimes is exactly right for MV3.

If you’re building something similar, start by copying the messaging contract and the model initialization flow. The UI layer is secondary — you can swap React for Vue or vanilla JS without changing the core architecture.

The full source is on GitHub under nico-martin/gemma4-browser-extension, and the extension itself is on the Chrome Web Store. Worth a look if you’re wrestling with MV3 and local AI.