Voice Services Backend (Ajrasakha Chatbot)

Problem Statement

India’s linguistic diversity makes voice-based chatbot interaction challenging. Most voice-enabled systems are built primarily for English and require users to select a language manually, which makes for a poor user experience. Generic speech-to-text models struggle with Indian accents, regional phonetics, and code-mixed speech (e.g., Hinglish), leading to inaccurate transcriptions. Raw transcriptions also often lack punctuation, which reduces readability and hurts downstream NLP performance. A scalable, low-latency, multilingual voice processing system is required to enable seamless voice interaction for Indian users without manual language selection.

Objective

To provide a high-performance, multilingual voice-to-text backend for the Ajrasakha Chatbot that automatically detects spoken language, accurately transcribes Indian and English speech, and returns readable, punctuated text in real time.

Work Done So Far

Built a GPU-accelerated voice processing microservice.
Implemented automatic language detection using a hybrid Whisper + MMS-LID pipeline.
Enabled multilingual transcription: Whisper for English and IndicConformer for 22+ Indian languages.
Integrated AI-based punctuation restoration using Cadence.
Exposed an OpenAI-compatible transcription API.
Containerized the service using Docker with NVIDIA runtime.
Integrated the service with the Ajrasakha chatbot voice interface.
Achieved low-latency, production-ready inference.
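
The hybrid detection and model routing listed above can be illustrated with a minimal sketch. The document does not describe how the Whisper and MMS-LID outputs are actually combined, so the agreement-then-confidence rule, the 0.5 threshold, the English fallback, and the names LangGuess, combine_guesses, and select_model are all assumptions made for illustration. The sketch also assumes both detectors' language codes have already been normalized to one scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LangGuess:
    """One detector's guess: a normalized language code and a confidence in [0, 1]."""
    code: str
    confidence: float


def combine_guesses(whisper: LangGuess, mms_lid: LangGuess,
                    min_confidence: float = 0.5) -> str:
    """Merge two detector outputs into one language code (hypothetical rule)."""
    # When both detectors agree, accept the shared answer.
    if whisper.code == mms_lid.code:
        return whisper.code
    # Otherwise take the more confident detector, if it clears the threshold.
    best = max(whisper, mms_lid, key=lambda guess: guess.confidence)
    if best.confidence >= min_confidence:
        return best.code
    # Neither detector is confident: fall back to English (an assumed default).
    return "en"


def select_model(language_code: str) -> str:
    """Route English to Whisper and every other language to IndicConformer."""
    return "whisper" if language_code == "en" else "indicconformer"


if __name__ == "__main__":
    language = combine_guesses(LangGuess("hi", 0.62), LangGuess("hi", 0.91))
    print(language, select_model(language))
```

Keeping the combination rule in one small function would make it easy to swap in a different policy, for example one that always trusts MMS-LID for Indian languages.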

Key Features

Automatic language detection (no manual selection).
Support for 22+ Indian languages and English.
Robust handling of code-mixed speech.
GPU-accelerated, low-latency transcription.
AI-based punctuation for improved readability.
OpenAI-compatible API for easy integration.
Production-ready containerized deployment.
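
To make the OpenAI-compatible API concrete, here is a hedged client-side sketch that uses only the Python standard library. It assumes the service mirrors OpenAI's transcription route (POST /v1/audio/transcriptions with multipart fields "file" and "model", returning JSON with a "text" key). The base URL, the model name "whisper-1", and the helper names are placeholders, not confirmed details of this service.

```python
import json
import urllib.request
import uuid


def build_transcription_request(base_url: str, audio: bytes, filename: str,
                                model: str) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying the audio file and model name."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",  # assumed path
        data=head + audio + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(base_url: str, audio: bytes, filename: str,
               model: str = "whisper-1") -> str:
    """Send the request and return the transcribed text from the JSON response."""
    request = build_transcription_request(base_url, audio, filename, model)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["text"]
```

With a running instance, a call such as transcribe("http://localhost:8000", audio_bytes, "clip.wav") would return the punctuated text. No language parameter is sent because the backend detects the language itself.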

System Flow

User Voice Input → Language Detection → Model Selection (Whisper / IndicConformer) → Speech-to-Text → Punctuation Restoration → Formatted Text Returned to Ajrasakha Chat Interface
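
The flow above can be sketched as a single function that chains the stages. This is an illustrative outline, not the service's actual code: run_pipeline and its parameters are hypothetical, and each stage is passed in as a plain callable so that stubs can stand in for the real Whisper, IndicConformer, and Cadence models.

```python
from typing import Callable, Dict

# A transcriber takes raw audio bytes and a language code and returns text.
Transcriber = Callable[[bytes, str], str]


def run_pipeline(
    audio: bytes,
    detect_language: Callable[[bytes], str],
    transcribers: Dict[str, Transcriber],
    restore_punctuation: Callable[[str, str], str],
) -> Dict[str, str]:
    """Run detection, model selection, transcription, and punctuation in order."""
    language = detect_language(audio)
    # Model selection: English goes to Whisper, other languages to IndicConformer.
    model_name = "whisper" if language == "en" else "indicconformer"
    raw_text = transcribers[model_name](audio, language)
    text = restore_punctuation(raw_text, language)
    return {"language": language, "model": model_name, "text": text}


if __name__ == "__main__":
    # Stub stages stand in for the real models.
    result = run_pipeline(
        b"\x00\x01",
        detect_language=lambda audio: "hi",
        transcribers={
            "whisper": lambda audio, lang: "hello how are you",
            "indicconformer": lambda audio, lang: "namaste aap kaise hain",
        },
        restore_punctuation=lambda text, lang: text.capitalize() + ".",
    )
    print(result)
```

Passing the stages in as arguments keeps the orchestration testable without a GPU, since each model can be replaced by a stub.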

Technologies Used

Speech Models: OpenAI Whisper, AI4Bharat IndicConformer
Language Detection: Whisper + MMS-LID-126
Text Processing: AI4Bharat Cadence (punctuation)
Backend: FastAPI, Python
Deployment: Docker, NVIDIA GPU Runtime
Orchestration: Supervisord

Future Scope

Improved language detection accuracy for similar languages and dialects.
Expansion to additional Indian languages (Tulu, Bhojpuri, Rajasthani, etc.).
Real-time streaming transcription via WebSockets.
Speaker diarization for multi-speaker audio.
Domain-specific vocabulary injection (agriculture-focused terms).
Voice-based command detection for chatbot actions.
Latency and inference optimization.
Edge or on-device transcription support.

Impact

This backend enables inclusive, voice-first access to the Ajrasakha Chatbot for Indian users across languages. By removing language barriers and improving transcription quality, it enhances accessibility, usability, and trust, especially for users who are more comfortable speaking than typing, while providing a scalable foundation for multilingual conversational AI.
