Open AI infrastructure

We build what African language AI runs on.

DeepSahel AI builds open infrastructure that takes African languages from raw speech and text data to reusable datasets, validated models, and open benchmarks — lowering the cost of entry for every team that comes after.

Development pipeline

01 · Raw speech & text
Community recordings, documents, local knowledge
02 · Consent, metadata & cleaning
Structured collection, rights capture, quality control
03 · Native-speaker validation
Human-in-the-loop review, ASR listening, translation audit
04 · Models, benchmarks & open release
Training recipes, evaluation, open licensing, Hugging Face & GitHub
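As a rough illustration of how the four stages gate one another, here is a minimal Python sketch. It is not DeepSahel's actual code: the `Item` type, the stage functions, and the `approve` callback (a stand-in for a native-speaker decision) are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Item:
    """One unit of language data moving through the pipeline (hypothetical)."""
    text: str
    consent_given: bool = False
    cleaned: bool = False
    validated: bool = False


# A stage returns the updated item, or None to drop it from the release.
Stage = Callable[[Item], Optional[Item]]


def consent_and_clean(item: Item) -> Optional[Item]:
    # Stage 02: drop anything without recorded consent, then normalize whitespace.
    if not item.consent_given:
        return None
    return replace(item, text=" ".join(item.text.split()), cleaned=True)


def make_validator(approve: Callable[[str], bool]) -> Stage:
    # Stage 03: `approve` is an assumed stand-in for a native-speaker review.
    def validate(item: Item) -> Optional[Item]:
        return replace(item, validated=True) if approve(item.text) else None
    return validate


def run_pipeline(items: Iterable[Item], stages: list[Stage]) -> list[Item]:
    # Stage 04 input: only items that survive every earlier stage are released.
    released = []
    for item in items:
        current: Optional[Item] = item
        for stage in stages:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            released.append(current)
    return released


# Stage 01: placeholder raw items, not real language data.
raw = [
    Item("  example   utterance ", consent_given=True),
    Item("no consent on file", consent_given=False),
    Item("", consent_given=True),
]
stages = [consent_and_clean, make_validator(lambda text: len(text) > 0)]
released = run_pipeline(raw, stages)
print([item.text for item in released])
```

Only the first placeholder item reaches release: the second has no consent on record and the third is rejected by the stand-in reviewer. Keeping each stage as a plain function is one way to let a team swap in its own cleaning or review step without touching the rest.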

Communities have the data.
The tooling doesn't exist.

2,000+

Languages spoken across the African continent

~30

Languages with any meaningful AI tooling today

Talent and compute are part of the picture. But even teams that have both still face a missing layer between raw language resources and usable models — data workflows, tokenization, evaluation, native-speaker review, documentation. DeepSahel makes that layer reusable.

Bottlenecks we address

  • Data collection & consent workflows
  • Transcription & translation pipelines
  • Tokenizer & preprocessing recipes
  • Model selection & training configuration
  • Native-speaker evaluation interfaces
  • Benchmark & evaluation design
  • Dataset & model card documentation
  • Open licensing & reproducible release

Reusable infrastructure, not one-off experiments.

Four integrated components that move a language project from raw data to open release.

DS · Data

Data Workflows

Speech and text collection pipelines, consent capture, metadata schemas, cleaning, normalization, enrichment, and dataset versioning built for low-resource language realities.
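To make "consent capture, metadata schemas, and dataset versioning" concrete, here is a small Python sketch of what a per-recording record might look like. The field names, the allowed consent scopes, and the hash-based version string are assumptions made for illustration, not DeepSahel's published schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

# Assumed consent scopes for this sketch; a real project defines its own.
ALLOWED_SCOPES = {"research", "open-release"}


@dataclass(frozen=True)
class RecordingMetadata:
    """Hypothetical metadata record for one speech recording."""
    recording_id: str
    language_code: str      # ISO 639-3, e.g. "kik" (Gĩkũyũ) or "swa" (Kiswahili)
    speaker_id: str         # pseudonymous identifier, never a real name
    consent_scope: str      # what the speaker agreed to
    license: str
    duration_seconds: float


def problems(meta: RecordingMetadata) -> list[str]:
    """Return human-readable schema problems; an empty list means the record passes."""
    issues = []
    if meta.consent_scope not in ALLOWED_SCOPES:
        issues.append(f"unknown consent scope: {meta.consent_scope!r}")
    if len(meta.language_code) != 3 or not meta.language_code.isalpha():
        issues.append(f"language code is not ISO 639-3 shaped: {meta.language_code!r}")
    if meta.duration_seconds <= 0:
        issues.append("duration must be positive")
    return issues


def dataset_version(records: list[RecordingMetadata]) -> str:
    """Content hash of the metadata, independent of record order."""
    rows = sorted((asdict(r) for r in records), key=lambda row: row["recording_id"])
    payload = json.dumps(rows, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


good = RecordingMetadata("rec-001", "kik", "spk-17", "open-release", "CC-BY-4.0", 12.5)
bad = RecordingMetadata("rec-002", "kikuyu", "spk-18", "unspecified", "CC-BY-4.0", 0.0)

print(problems(good))
print(problems(bad))
print(dataset_version([good, bad]) == dataset_version([bad, good]))
```

Deriving the version string from the content rather than from a manually bumped number means two teams holding the same records can confirm they are working from the same dataset.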

DS · Eval

Native-Speaker Evaluation

Human-in-the-loop review interfaces for ASR listening, translation adequacy scoring, terminology validation, and naturalness rating — at every stage of the model cycle.
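One simple way to turn reviewer input into a decision is to average scores per item and send anything with too few reviewers or too much disagreement back for another pass. The Python sketch below assumes a 1 to 5 adequacy scale and made-up thresholds; the function name, the tuple layout, and the sample ratings are illustrative, not DeepSahel's interface or data.

```python
from collections import defaultdict
from statistics import mean


def summarize(ratings, min_reviewers=2, max_spread=1):
    """Aggregate (item_id, reviewer_id, score) tuples into per-item summaries.

    An item is flagged for re-review when fewer than `min_reviewers` people
    rated it, or when the highest and lowest scores differ by more than
    `max_spread`. Both thresholds are assumptions for this sketch.
    """
    by_item = defaultdict(list)
    for item_id, _reviewer_id, score in ratings:
        by_item[item_id].append(score)

    summary = {}
    for item_id, scores in by_item.items():
        spread = max(scores) - min(scores)
        summary[item_id] = {
            "mean": mean(scores),
            "n": len(scores),
            "needs_review": len(scores) < min_reviewers or spread > max_spread,
        }
    return summary


# Placeholder ratings on an assumed 1 to 5 adequacy scale.
ratings = [
    ("s1", "r1", 5), ("s1", "r2", 4),   # reviewers agree
    ("s2", "r1", 2), ("s2", "r2", 5),   # reviewers disagree
    ("s3", "r1", 4),                    # only one reviewer
]
summary = summarize(ratings)
for item_id, row in summary.items():
    print(item_id, row)
```

Here `s1` passes, while `s2` (wide disagreement) and `s3` (a single reviewer) are flagged. The max-minus-min spread is a deliberately crude disagreement measure; a project with more reviewers per item might prefer a formal agreement statistic.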

DS · Train

Training Infrastructure

Guided workflows for ASR, translation, tokenizers, embeddings, and efficient foundation-model adaptation. Training recipes built to be reproduced and extended.

DS · Commons

Open Release Layer

Every dataset, model, benchmark, and recipe published openly on GitHub and Hugging Face with full documentation — designed for reuse, not just citation.

Current languages

Gĩkũyũ (Kikuyu) · ~8M speakers · ASR + Translation models published
Kiswahili (Swahili) · ~200M speakers · Multilingual + code-switched evaluation
+ More · Infrastructure designed for reuse across all underrepresented African languages

Early models and open project spaces.

Published before this grant. Infrastructure already underway.

Foundational infrastructure for public-good applications.

DeepSahel builds what makes downstream applications possible — not the applications themselves.

  • Health education
  • Disease-awareness communication
  • Agriculture & food security
  • Mother-tongue education
  • Public information access
  • Civic services
  • Cultural preservation
  • Oral knowledge documentation
  • Speech AI · ASR
  • Machine translation
  • Accessibility

Research, engineering, and evaluation leadership.

Kelvin Ngeno

AI Engineer · Project Lead & Model Training / Platform Engineering Lead

Imani Ndolo

ML Researcher & Applied Mathematician · Model Optimization and Evaluation Lead

Deep Sahel Artificial Intelligence CLG

Kenya · Non-profit CLG

Work with us or reuse our infrastructure

If you're building with African languages — as a researcher, developer, community organization, or downstream builder — we want to hear from you. All infrastructure will be open. Our goal is to reduce duplication across the ecosystem, not add to it.

Email us