WAXAL: A Massive Open Speech Dataset for 27 African Languages

Google Research just dropped WAXAL, and honestly, it’s about time someone put serious weight behind African language speech tech. The team has been quietly working on this since 2021, and the result is a massive open dataset covering 27 Sub-Saharan African languages spoken by over 100 million people across 26+ countries.

Let me cut through the corporate speak: this is 1,846 hours of transcribed natural speech for automatic speech recognition (ASR) and 565 hours of high-fidelity recordings for text-to-speech (TTS), all released under the Creative Commons Attribution 4.0 license (CC-BY-4.0). That’s not just generous; it’s genuinely useful for anyone building voice interfaces for these languages.

The data gap problem

Voice assistants and transcription tools have become second nature for English, Mandarin, and Spanish speakers. But if you speak Yoruba, Hausa, or Swahili? Good luck. Sub-Saharan Africa alone has over 2,000 distinct languages, and most speech datasets treat them like an afterthought. WAXAL doesn’t fix everything overnight, but it’s a solid start.

What I find interesting is how they collected the ASR data. Instead of having people read boring scripts — which always sounds stilted and unnatural — they showed participants images from Google’s Open Images and asked them to describe what they saw in their native language. This image-prompted elicitation method captures real linguistic variation: tonal nuances, code-switching, the way people actually talk. That’s smarter than most corpus collection I’ve seen.

The TTS side is even more collaborative. Local community members worked in pairs, drafting scripts of 10,000 to 20,000 words, alternating between reading and recording. Some participants even used project funding to build custom studio boxes for professional-grade acoustics. That’s the kind of grassroots investment that makes a dataset actually usable.

What’s actually in there

The ASR corpus covers spontaneous, unscripted speech across 50+ topics. The TTS dataset is phonetically balanced, segmented, and matched with script text for accuracy. Both are permissively licensed: CC-BY-4.0 allows commercial use as long as you credit the source, so you can build products without legal headaches.

Now, 27 languages is impressive, but let’s be real: there are thousands more. The team says they intend for WAXAL to evolve and expand. I hope they mean it. One release doesn’t solve the digital divide, but it’s a hell of a foundation.

If you’re working on African language speech technology, this is the resource you’ve been waiting for. The paper is linked on their site, and the dataset is available now. Go build something useful.
