Who Really Controls WAXAL? The AI Sovereignty Question Google’s African Dataset Doesn’t Answer

Google’s WAXAL dataset is the largest African language AI corpus in existence. But the question of who controls it — the IP, the annotations, the governance model — remains largely unanswered. BETAR investigates.

Google’s 21-language African speech dataset gave partner universities ownership of their data. What it didn’t give them was control of the infrastructure it runs on — and that distinction matters enormously.

When Google Research Africa released WAXAL on 2 February 2026, the coverage was immediate and largely celebratory. An open-source dataset covering 21 Sub-Saharan African languages. More than 11,000 hours of speech data. Nearly two million individual recordings collected from thousands of volunteers across the continent. A resource that doesn’t exist anywhere else at this scale. The numbers are real and they are impressive.

But the story that followed — about African AI sovereignty, about local institutions finally holding the keys to their own linguistic data — elides a more complicated question. Who controls the data infrastructure the dataset runs on? Who decides what gets built with it, and where that computation happens? And does holding a data ownership agreement with a technology company constitute sovereignty in any meaningful sense, when the computational layer remains almost entirely outside the continent?

WAXAL is genuinely significant. It is also, depending on how you read it, either a milestone in African AI self-determination or a carefully constructed pipeline that positions Google as the indispensable intermediary in a market it has just opened up. Both readings deserve serious examination.


What WAXAL actually is, and who built it

WAXAL — released under an open license and available on Hugging Face — covers 21 Sub-Saharan African languages with approximately 1,250 hours of transcribed natural speech and over 180 hours of high-quality, single-speaker studio recordings. The project took three years to assemble, and the institutional architecture behind it is genuinely African-led in its data collection phase.

Makerere University’s AI Lab in Uganda and the University of Ghana together led data collection across 13 languages, with 7,000-plus volunteers contributing in Ghana alone. Digital Umuganda in Rwanda, a community-driven open data organisation, coordinated five additional languages. The African Institute for Mathematical Sciences (AIMS) contributed multilingual material slated for future releases. Google provided the technical framework, the platform, and, critically, the global distribution channel via Hugging Face.

“For AI to have a real impact in Africa, it must speak our languages and understand our contexts,” said Joyce Nakatumba-Nabende of Makerere University, whose lab contributed significantly to the Luganda, Acholi, and related Ugandan language datasets.

On the data ownership question, Google has been explicit. Partner institutions retain ownership of the data they collected. The framework was designed, according to Google’s announcement, so that the intellectual property and cultural content of each language dataset remain with the African institution that organised its collection — not with Google.

That is a meaningful departure from how extractive data partnerships have historically operated. It is also, under the terms of WAXAL’s own licence, unenforceable in the way that matters most.

WAXAL is released under Creative Commons CC-BY-4.0 — a licence that explicitly permits commercial use by any party, including Google itself. If Google were to train proprietary models on the WAXAL data, the “ownership” retained by African partner institutions provides no mechanism to restrict that use. CC-BY-4.0 requires attribution, not permission. The licence that makes WAXAL maximally open to the world is also the licence that makes African institutional ownership claims unactionable against any entity that wishes to build commercial products on top of the data.

Legal analysts at African technology law institutions have noted that any meaningful enforceability would require bilateral data licensing agreements between Google and individual partner institutions — agreements that, as of this article’s publication, are not in the public documentation. BETAR.africa has sought comment from Dr. Nelly C. Rotich, a Research Fellow at the Centre for Intellectual Property and Information Technology Law (CIPIT) at Strathmore University, Nairobi — whose research focuses on digital trade, data rights, and governance across African markets — on whether such bilateral agreements are known to exist and what enforcement mechanisms they could provide. Her response is pending.


Ownership without infrastructure is a partial sovereignty

Data ownership and computational sovereignty are not the same thing. The distinction is what makes the WAXAL debate harder than most of its coverage has acknowledged.

African institutions own the data. They do not own the compute infrastructure on which that data becomes useful. As of 2024, more than 90 percent of AI model training for African languages occurs on non-African infrastructure — servers in the United States and Europe, running on cloud platforms operated by the same technology companies that sponsor these research partnerships. Africa captures less than one percent of global cloud computing revenue, and the continent’s combined research compute capacity is a fraction of what would be needed to independently train competitive large language models.

The African Union has estimated that building sufficient compute capacity for Africa to independently train frontier-class models would require approximately $500 million in investment over five years. That investment does not yet exist at anything close to that scale. In its absence, African institutions with data rights are, in practice, dependent on non-African infrastructure to convert those rights into working AI systems.

The pipeline problem is structural: data is collected by African institutions, processed and trained on non-African compute, distributed through US-headquartered platforms such as Hugging Face, and then made available to African developers, most of whom access it via cloud APIs operated by Google. The circle closes where it opened.

This is not a critique unique to WAXAL, nor is it an argument that WAXAL should not exist. It is an argument that the ownership framework, while genuinely progressive within its own terms, does not resolve the underlying infrastructure dependency that defines African AI development in 2026.


What African AI labs are building with WAXAL access

The question of what African institutions are actually building with WAXAL access is separate from — and more productive than — the sovereignty debate in isolation. Here the picture is more encouraging, though uneven.

Makerere University’s AI Lab had been developing speech recognition models for Luganda and other Ugandan languages for several years before WAXAL, making it one of the few African institutions with genuine end-to-end capability: data collection, model training, and deployment for local applications. WAXAL’s data significantly extends the training corpus available for those languages. The lab’s research has fed into agricultural advisory tools and healthcare communication applications targeting low-literacy users in rural Uganda, exactly the kind of grounded, locally relevant AI application that the data was ostensibly collected to enable.

Digital Umuganda’s Kinyarwanda language work in Rwanda represents a similar model: community-sourced data, open research outputs, and a deliberate orientation toward building tools for Rwandan users rather than publishing academic papers for non-African audiences. The organisation has explicitly framed its work as preserving linguistic sovereignty, not simply contributing to a global research commons.

The concern raised by critics is not that these institutions lack capability or ambition. It is that the most resource-intensive layer of AI development — training large models at scale — remains financially and computationally beyond what most African academic institutions can sustain without external infrastructure partnerships. WAXAL gives them better data. It does not give them the compute to use it fully autonomously.


The quality gap and what it reveals

A technical detail in the WAXAL release illustrates the broader tension. The Yoruba language dataset, representing one of West Africa’s most widely spoken languages with approximately 40 million speakers, was transcribed without diacritics. In Yoruba, diacritical marks are not optional punctuation; they are phonemically significant, carrying tonal information that changes word meaning. A dataset whose transcripts omit them is a compromised foundation for any high-accuracy Yoruba speech recognition system.
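To make the stakes concrete, here is a minimal Python sketch of what happens when tone marks are dropped from Yoruba text. It is an illustration only, not a description of how WAXAL transcripts were produced: the helper name `strip_diacritics` is hypothetical, the three-word sample is a commonly cited tonal minimal set chosen for this example, and the English glosses are approximate.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove all combining marks (tone marks, under-dots) from text.

    Hypothetical helper for illustration: decompose to NFD so each accent
    becomes a separate combining code point, then drop those code points.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Illustrative sample (an assumption of this sketch, not WAXAL data):
# three Yoruba words that differ only in tone. Glosses are approximate.
words = {
    "igba": "two hundred",
    "igbá": "calabash",
    "ìgbà": "time, period",
}

# Group the glosses by the diacritic-free spelling.
collapsed = {}
for word, gloss in words.items():
    collapsed.setdefault(strip_diacritics(word), []).append(gloss)

for plain, glosses in collapsed.items():
    print(plain, "<-", glosses)
```

Once the marks are removed, all three words share a single spelling, so a model trained on such transcripts has no written signal to tell them apart and must rely on context alone.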

The omission may reflect practical constraints in data collection at scale. It also reflects the challenge of building research infrastructure across diverse institutional partners with varying linguistic expertise and quality control capacity. The problem is not that the data exists — imperfect data with a clear quality roadmap is still valuable — but that it reveals how much technical capacity-building still needs to happen within African institutions before data ownership translates into high-quality AI outputs.

Google has acknowledged six additional languages in its pipeline, targeting 27 total by mid-2026 and over 50 African languages within the broader WAXAL roadmap. The direction is right. The governance of that expansion — who decides which languages, which quality standards, which partner institutions receive resources — is a question worth watching closely.


What AI sovereignty actually requires

The institutions at the centre of WAXAL are not naive about these constraints. Makerere’s AI Lab has been explicit about the need for African compute infrastructure, not just African data. Digital Umuganda has structured itself as a community institution specifically to avoid the dependency trap that has undermined earlier African AI research initiatives. AIMS has a continental mandate that pushes against the fragmentation of African AI development into country-by-country bilateral deals with large technology companies.

What they need — and what WAXAL, for all its genuine value, does not provide — is the institutional and financial architecture to move computation onto African infrastructure. That means African data centres, African cloud platforms, African research compute networks, and African AI governance frameworks that can negotiate with Google and others from a position of infrastructure parity rather than dependence.

The UNDP’s Africa bureau has argued publicly that “building the infrastructure” is the precondition for “owning the future.” The AU’s $500 million compute estimate is a planning figure, not a funded commitment. Individual African governments have not yet moved at the speed or scale required, and the window for doing so is limited. Language AI is moving fast. The models that define how African languages are represented in AI systems will likely be built in the next three to five years. Whether they are built primarily on African data, on African infrastructure, by African institutions, is a political and financial question as much as a technical one.

WAXAL is a step in the right direction. It is not, by itself, AI sovereignty. And conflating the two does a disservice to the African institutions working hardest to achieve the latter.


Note: BETAR.africa sought comment from Makerere University AI Lab and the University of Ghana’s AI research teams on their current model-building work and infrastructure arrangements. Responses are pending and will be incorporated into the final version. Dr. Nelly C. Rotich (CIPIT, Strathmore University) has been approached for comment on the CC-BY-4.0 ownership enforceability question; her response is pending. Academic sourcing confirmed: KNUST (Kwame Nkrumah University of Science and Technology) is not a WAXAL partner — the confirmed Ghana institutional partner is the University of Ghana. Sources: Google Research Africa blog (February 2026), WAXAL academic paper (arXiv:2602.02734), Hugging Face WaxalNLP dataset (CC-BY-4.0), Rest of World, TechCabal, TechAfrica News, UNDP Africa, African Union Digital Transformation Strategy documentation.
