Open AI infrastructure

We build what African language AI runs on.

DeepSahel AI builds open infrastructure that takes African languages from raw speech and text data to reusable datasets, validated models, and open benchmarks — lowering the cost of entry for every team that comes after.


Development pipeline

Raw speech & text: Community recordings, documents, local knowledge

Consent, metadata & cleaning: Structured collection, rights capture, quality control

Native-speaker validation: Human-in-the-loop review, ASR listening, translation audit

Models, benchmarks & open release: Training recipes, evaluation, open licensing, Hugging Face & GitHub

The problem

Communities have the data.
The tooling doesn't exist.

2,000+ languages spoken across the African continent

~30 with any meaningful AI tooling today

Talent and compute are part of the picture. But even teams that have both still face a missing layer between raw language resources and usable models — data workflows, tokenization, evaluation, native-speaker review, documentation. DeepSahel makes that layer reusable.

Bottlenecks we address

Data collection & consent workflows
Transcription & translation pipelines
Tokenizer & preprocessing recipes
Model selection & training configuration
Native-speaker evaluation interfaces
Benchmark & evaluation design
Dataset & model card documentation
Open licensing & reproducible release

What we build

Reusable infrastructure, not one-off experiments.

Four integrated components that move a language project from raw data to open release.

DS · Data

Data Workflows

Speech and text collection pipelines, consent capture, metadata schemas, cleaning, normalization, enrichment, and dataset versioning built for low-resource language realities.
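As a rough illustration of what consent capture and a metadata schema could look like in practice, here is a minimal Python sketch. Every name in it (`RecordingRecord`, `releasable`, the field names, the sample IDs) is a hypothetical chosen for this example and is not taken from DeepSahel's published code; the language codes are assumed to follow ISO 639-3.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class RecordingRecord:
    """Hypothetical metadata for one speech recording (illustrative only)."""
    recording_id: str
    language_code: str       # assumed ISO 639-3, e.g. "kik" for Gĩkũyũ
    speaker_id: str          # pseudonymous ID, never a real name
    consent_given: bool      # explicit consent captured at collection time
    license: str             # e.g. "CC-BY-4.0"; empty string means unknown
    duration_seconds: float


def releasable(records):
    """Keep only records with consent, a stated license, and non-zero audio."""
    return [
        r for r in records
        if r.consent_given and r.license and r.duration_seconds > 0
    ]


# Made-up sample records, used only to exercise the filter.
records = [
    RecordingRecord("rec-001", "kik", "spk-01", True, "CC-BY-4.0", 4.2),
    RecordingRecord("rec-002", "kik", "spk-02", False, "CC-BY-4.0", 3.1),
    RecordingRecord("rec-003", "swh", "spk-03", True, "", 5.0),
]

kept = releasable(records)

# Serialize the releasable subset as a JSON manifest for a versioned dataset.
manifest = json.dumps([asdict(r) for r in kept], ensure_ascii=False)
```

Treating consent and licensing as required fields on every record, rather than as a separate spreadsheet, is one way to make rights capture checkable at each later stage of a pipeline.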

DS · Eval

Native-Speaker Evaluation

Human-in-the-loop review interfaces for ASR listening, translation adequacy scoring, terminology validation, and naturalness rating — at every stage of the model cycle.
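One way adequacy or naturalness ratings from several native-speaker reviewers might be aggregated is sketched below. It is an assumption-laden example, not DeepSahel's interface: the function name `summarize_ratings`, the 1 to 5 rating scale, the `max_spread` threshold, and the sample IDs are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_ratings(ratings, max_spread=1):
    """Group (item_id, reviewer_id, score) triples by item.

    Returns {item_id: {"mean": float, "n": int, "needs_review": bool}}.
    An item is flagged when its highest and lowest scores differ by more
    than max_spread, i.e. the reviewers clearly disagree.
    """
    by_item = defaultdict(list)
    for item_id, _reviewer_id, score in ratings:
        by_item[item_id].append(score)
    return {
        item_id: {
            "mean": mean(scores),
            "n": len(scores),
            "needs_review": max(scores) - min(scores) > max_spread,
        }
        for item_id, scores in by_item.items()
    }


# Made-up ratings on an assumed 1-5 adequacy scale.
ratings = [
    ("sent-1", "rev-a", 5), ("sent-1", "rev-b", 4),
    ("sent-2", "rev-a", 2), ("sent-2", "rev-b", 5),
]
summary = summarize_ratings(ratings)
```

Flagging disagreement instead of only averaging keeps contested items visible, so they can go back to reviewers rather than being hidden inside a mean score.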

DS · Train

Training Infrastructure

Guided workflows for ASR, translation, tokenizers, embeddings, and efficient foundation-model adaptation. Training recipes built to be reproduced and extended.

DS · Commons

Open Release Layer

Every dataset, model, benchmark, and recipe published openly on GitHub and Hugging Face with full documentation — designed for reuse, not just citation.

Current languages

Gĩkũyũ (Kikuyu) · ~8M speakers · ASR + Translation models published

Kiswahili (Swahili) · ~200M speakers · Multilingual + code-switched evaluation

+ More: Infrastructure designed for reuse across all underrepresented African languages

Existing work

Early models and open project spaces.

Published before this grant. Infrastructure already underway.

GitHub Organization: Deep-Sahel (github.com/Deep-Sahel)

Hugging Face Organization: DeepSahelAI (huggingface.co/DeepSahelAI)

Model · Translation: DeepSahel Kikuyu Translation (DeepSahelAI/DeepSahel-Kikuyu-Translation)

Model · ASR: DeepSahel Kikuyu ASR (DeepSahelAI/DeepSahel-Kikuyu-ASR)

Impact pathways

Foundational infrastructure for public-good applications.

DeepSahel builds what makes downstream applications possible — not the applications themselves.

Health education
Disease-awareness communication
Agriculture & food security
Mother-tongue education
Public information access
Civic services
Cultural preservation
Oral knowledge documentation
Speech AI · ASR
Machine translation
Accessibility

Founding team

Research, engineering, and evaluation leadership.

Kelvin Ngeno

AI Engineer · Project Lead & Model Training / Platform Engineering Lead


Imani Ndolo

ML Researcher & Applied Mathematician · Model Optimization and Evaluation Lead


Contact

Deep Sahel Artificial Intelligence CLG

Kenya · Non-profit CLG

Email: deepsahel@protonmail.com

GitHub: github.com/Deep-Sahel

Hugging Face: huggingface.co/DeepSahelAI

Web: deepsahel.co.ke

Work with us or reuse our infrastructure

If you're building with African languages — as a researcher, developer, community organization, or downstream builder — we want to hear from you. All infrastructure will be open. Our goal is to reduce duplication across the ecosystem, not add to it.

