Design a Voice Assistant System
Design a machine learning system that understands spoken commands and responds (like Alexa, Siri, or Google Assistant).
Requirements
Functional:
- Wake word detection ("Hey Assistant")
- Speech-to-text transcription
- Intent recognition and slot filling
- Execute commands (smart home, queries, timers)
- Generate spoken responses
Non-functional:
- Wake word: < 1% false accept, < 5% false reject
- End-to-end latency: < 2 seconds
- Work in noisy environments (e.g., 50 dB of ambient noise)
- Support multiple languages and accents
- Run partially on-device for privacy
Metrics
Component Metrics
| Component | Metric | Target |
|---|---|---|
| Wake word | False accept rate | < 1% |
| Wake word | False reject rate | < 5% |
| ASR | Word error rate (WER) | < 8% |
| NLU | Intent accuracy | > 95% |
| NLU | Slot F1 | > 90% |
| TTS | Mean opinion score (MOS) | > 4.0/5.0 |
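The ASR target above is a word error rate (WER): substitutions plus deletions plus insertions, divided by the number of reference words. A minimal, purely illustrative sketch of the computation via word-level edit distance:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("set a timer for five minutes",
                      "set the timer for five minutes"))  # 1 substitution / 6 words ~= 0.17
```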
System Metrics
| Metric | Description | Target |
|---|---|---|
| End-to-end latency | Time from speech end to response start | < 2s |
| Request success rate | Successfully completed requests | > 90% |
| User retention | Returning daily active users | Baseline + 5% |
| NPS | User satisfaction | > 50 |
Architecture
Wake Word Detection
The system listens constantly for the wake phrase. This must run on-device for privacy and latency.
Model Architecture
Requirements:
- Tiny model (runs on microcontroller)
- Low power (always listening)
- High precision (few false wakes)
- Personalized (learn user's voice)
Training Data
| Data Source | Description |
|---|---|
| Positive | Thousands of wake word recordings |
| Negative | General speech, music, TV |
| Hard negatives | Similar-sounding phrases |
| Noise augmentation | Various acoustic conditions |
Train with class imbalance handling (false accepts are worse than false rejects).
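A minimal sketch of what such a detector might look like, assuming PyTorch, log-mel input features (40 mels over roughly one second of audio), and an asymmetric loss that penalizes false accepts more heavily. Layer sizes and the pos_weight value are illustrative assumptions, not a production configuration:

```python
import torch
import torch.nn as nn

class WakeWordNet(nn.Module):
    """Tiny CNN over log-mel features (assumed: 40 mels x 100 frames ~ 1 s of audio).
    Small enough, after quantization, to target a low-power always-on DSP."""
    def __init__(self, n_mels: int = 40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global pooling -> fixed-size embedding
        )
        self.fc = nn.Linear(32, 1)              # single logit: wake word vs. everything else

    def forward(self, x):                       # x: (batch, 1, n_mels, frames)
        return self.fc(self.conv(x).flatten(1))

model = WakeWordNet()
# pos_weight < 1 down-weights the positive (wake word) term, so mistakes on
# negative clips (false accepts) cost relatively more during training.
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.25]))

features = torch.randn(8, 1, 40, 100)           # dummy log-mel batch
labels = torch.zeros(8, 1)
labels[0] = 1.0                                 # mostly negatives, as in real data
loss_fn(model(features), labels).backward()
```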
Speech Recognition (ASR)
Convert audio to text.
End-to-End Architecture
Conformer: the current state of the art. It combines self-attention (global context) with convolution (local patterns).
Streaming vs Non-Streaming
| Mode | Description | Latency | Accuracy |
|---|---|---|---|
| Non-streaming | Process full utterance | Higher | Better |
| Streaming | Process chunks as they arrive | Lower | Worse |
| Hybrid | Stream with look-ahead | Medium | Medium |
Recommendation: Streaming for voice assistants. Users expect immediate feedback. Use limited look-ahead (300ms) to improve accuracy.
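A sketch of the chunking a streaming recognizer would consume, assuming 16 kHz audio, 640 ms chunks, and 300 ms of look-ahead (all illustrative values); the decoder call at the bottom is a hypothetical API:

```python
from typing import Iterator, List

SAMPLE_RATE = 16_000          # assumed; typical for far-field ASR
CHUNK_MS, LOOKAHEAD_MS = 640, 300

def stream_chunks(samples: List[float]) -> Iterator[List[float]]:
    """Yield fixed-size chunks plus a short look-ahead of future audio.
    The recognizer emits partial text per chunk and may revise it once
    the look-ahead context arrives."""
    chunk = int(SAMPLE_RATE * CHUNK_MS / 1000)
    lookahead = int(SAMPLE_RATE * LOOKAHEAD_MS / 1000)
    for start in range(0, len(samples), chunk):
        yield samples[start : start + chunk + lookahead]

# Usage: feed chunks to a streaming decoder as they arrive.
# for chunk in stream_chunks(mic_buffer):
#     partial_text = decoder.accept_audio(chunk)   # hypothetical decoder API
```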
Handling Challenges
| Challenge | Solution |
|---|---|
| Background noise | Noise-robust features, multi-condition training |
| Accents | Accent-specific models or adaptation |
| Rare words | External language model, spell correction |
| Code-switching | Multilingual model |
| Disfluencies | Training data with "uh", "um", etc. |
Natural Language Understanding (NLU)
Once you have text, understand what the user wants.
Intent Classification
Map utterances to intents:
| Utterance | Intent |
|---|---|
| "Set a timer for 5 minutes" | SetTimer |
| "What's the weather?" | GetWeather |
| "Play some jazz" | PlayMusic |
| "Turn off the lights" | SmartHome |
Slot Filling
Extract key parameters:
"Set a timer for 5 minutes"
Intent: SetTimer
Slots:
- duration: "5 minutes"
- duration_value: 5
- duration_unit: minutes
Joint Model Architecture
Joint training: intent classification from the [CLS] token, slot filling from the per-token representations.
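A minimal sketch of that joint model in PyTorch. Vocabulary size, dimensions, and label counts are placeholder assumptions, and a production system would initialize the encoder from a pretrained model such as BERT:

```python
import torch
import torch.nn as nn

class JointNLU(nn.Module):
    """Shared encoder with two heads: sentence-level intent from the first
    ([CLS]-style) position, per-token slot tags from every position."""
    def __init__(self, vocab_size=30_000, dim=256, n_intents=50, n_slot_tags=40):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.intent_head = nn.Linear(dim, n_intents)
        self.slot_head = nn.Linear(dim, n_slot_tags)    # BIO tag per token

    def forward(self, token_ids):                       # (batch, seq_len)
        hidden = self.encoder(self.embed(token_ids))    # (batch, seq_len, dim)
        intent_logits = self.intent_head(hidden[:, 0])  # first token summarizes the utterance
        slot_logits = self.slot_head(hidden)            # one tag distribution per token
        return intent_logits, slot_logits

# The two losses share the encoder and regularize each other.
model = JointNLU()
tokens = torch.randint(0, 30_000, (4, 12))
intent_logits, slot_logits = model(tokens)
intent_loss = nn.functional.cross_entropy(intent_logits, torch.randint(0, 50, (4,)))
slot_loss = nn.functional.cross_entropy(slot_logits.flatten(0, 1),
                                        torch.randint(0, 40, (4 * 12,)))
loss = intent_loss + slot_loss
```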
Handling Ambiguity
User: "Play Taylor Swift"
Could mean:
- Play songs by Taylor Swift
- Play the song "Taylor Swift" (if it exists)
- Play Taylor Swift playlist
Solutions:
- Confidence thresholds (ask for clarification if uncertain; see the sketch below)
- Dialogue context (previous queries help disambiguate)
- User preferences (this user usually means artist)
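A sketch combining the first and third ideas, a confidence threshold plus a per-user prior; the threshold and prior values are assumptions to tune against clarification rate vs. error rate:

```python
CLARIFY_THRESHOLD = 0.6   # assumed; tune against clarification-rate vs. error-rate

def resolve_entity(candidates, user_prior=None, threshold=CLARIFY_THRESHOLD):
    """candidates: list of (interpretation, model_score). Blend in a per-user
    prior, execute the top interpretation if confident enough, otherwise ask."""
    scored = []
    for interp, score in candidates:
        prior = (user_prior or {}).get(interp, 1.0)
        scored.append((interp, score * prior))
    total = sum(s for _, s in scored) or 1.0
    best, best_score = max(scored, key=lambda x: x[1])
    if best_score / total >= threshold:
        return {"action": "execute", "interpretation": best}
    return {"action": "clarify",
            "prompt": f"Did you mean {', or '.join(i for i, _ in scored)}?"}

print(resolve_entity([("artist:Taylor Swift", 0.55), ("playlist:Taylor Swift", 0.45)],
                     user_prior={"artist:Taylor Swift": 1.5}))
```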
Dialogue Management
Track conversation state across turns.
State Tracking
Context Carryover
User: "What's the weather in Seattle?"
Assistant: "It's 55F and cloudy in Seattle."
User: "What about tomorrow?" <- needs context
Context: {location: Seattle, date: today}
Interpret: "What's the weather in Seattle tomorrow?"
Maintain context slots across turns. Expire after timeout or topic change.
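A minimal sketch of that carryover logic; the 120-second timeout and the domain-change rule are assumptions:

```python
import time

CONTEXT_TTL_SECONDS = 120          # assumed timeout before carried-over context expires

class DialogueContext:
    """Carry slots (location, date, ...) across turns; drop them after a
    timeout or when the topic (domain) changes."""
    def __init__(self):
        self.slots, self.domain, self.updated_at = {}, None, 0.0

    def update(self, domain: str, slots: dict):
        if domain != self.domain or time.time() - self.updated_at > CONTEXT_TTL_SECONDS:
            self.slots = {}                       # topic change or stale context: start fresh
        self.domain, self.updated_at = domain, time.time()
        self.slots.update(slots)

    def resolve(self, slots: dict) -> dict:
        """Fill slots missing from the new turn using carried-over context."""
        return {**self.slots, **slots}

ctx = DialogueContext()
ctx.update("weather", {"location": "Seattle", "date": "today"})
print(ctx.resolve({"date": "tomorrow"}))   # {'location': 'Seattle', 'date': 'tomorrow'}
```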
Execution Layer
Route intents to skills that fulfill them.
Skill Architecture
Each skill (see the sketch below):
- Validates required slots
- Calls external APIs if needed
- Generates response text
- Returns structured response
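A sketch of that contract, using a timer skill as the example; set_device_timer stands in for whatever real integration the skill would call:

```python
from dataclasses import dataclass, field

@dataclass
class SkillResponse:
    speech_text: str                              # what TTS will say
    success: bool = True
    data: dict = field(default_factory=dict)      # structured payload for screens/apps

class TimerSkill:
    """Minimal skill: validate required slots, call the integration, return a response.
    set_device_timer is a hypothetical placeholder for the real device API."""
    required_slots = ("duration_value", "duration_unit")

    def handle(self, slots: dict) -> SkillResponse:
        missing = [s for s in self.required_slots if s not in slots]
        if missing:
            # Re-prompt for missing slots instead of failing silently.
            return SkillResponse("For how long should I set the timer?", success=False)
        # set_device_timer(slots["duration_value"], slots["duration_unit"])  # hypothetical call
        return SkillResponse(
            f"Timer set for {slots['duration_value']} {slots['duration_unit']}.")

SKILL_REGISTRY = {"SetTimer": TimerSkill()}       # intent name -> skill instance
print(SKILL_REGISTRY["SetTimer"].handle({"duration_value": 5, "duration_unit": "minutes"}))
```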
Response Generation
| Approach | When to Use |
|---|---|
| Template | Simple, predictable responses |
| Retrieval | FAQ-style responses |
| Neural generation | Complex, conversational responses |
For voice assistants, templates work well. They are predictable and fast.
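A minimal template renderer; a few paraphrases per intent keep responses cheap and predictable without sounding identical every time (template wording here is illustrative):

```python
import random

TEMPLATES = {
    "GetWeather": ["It's {temp}F and {condition} in {location}.",
                   "Right now in {location}: {temp}F and {condition}."],
    "SetTimer": ["Timer set for {duration}.",
                 "OK, {duration}, starting now."],
}

def render(intent: str, **slots) -> str:
    # Pick one paraphrase and fill in the slot values.
    return random.choice(TEMPLATES[intent]).format(**slots)

print(render("GetWeather", temp=55, condition="cloudy", location="Seattle"))
```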
Text-to-Speech (TTS)
Convert response text to natural speech.
Modern TTS Pipeline
Text normalization: Convert abbreviations, numbers, dates to spoken form.
Acoustic model: Text -> mel spectrogram. FastSpeech2 is fast and high quality.
Vocoder: Mel spectrogram -> waveform. HiFi-GAN is fast and sounds natural.
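A toy illustration of the text normalization step; a real front-end verbalizes whole numbers ("55" becomes "fifty five"), dates, currencies, ordinals, and locale-specific rules rather than the digit-by-digit expansion shown here:

```python
# Toy normalizer: expands digits and a few abbreviations into spoken words.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "min": "minutes"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize(text: str) -> str:
    words = []
    for token in text.lower().split():
        if token.isdigit():
            words.extend(DIGITS[int(d)] for d in token)   # "5" -> "five", "55" -> "five five"
        else:
            words.append(ABBREVIATIONS.get(token, token))
    return " ".join(words)

print(normalize("Set a timer for 5 min"))   # "set a timer for five minutes"
```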
Voice Customization
| Feature | Implementation |
|---|---|
| Multiple voices | Train separate models per voice |
| Prosody control | Adjust pitch, speed, emotion |
| SSML support | Markup for pronunciation, pauses |
| Celebrity voices | Voice cloning with consent |
On-Device vs Cloud
Privacy vs Capability Trade-off
Keep on-device:
- Wake word detection (always listening)
- Simple commands that don't need cloud
- Audio encryption (encrypt before anything leaves the device)
Send to cloud:
- Complex speech recognition
- Knowledge queries
- Skill execution
Latency Optimization
End-to-End Budget: 2 seconds
| Component | Budget | Optimization |
|---|---|---|
| Audio capture | 200ms | Streaming, endpoint detection |
| Network | 100ms | Edge servers, compression |
| ASR | 500ms | Streaming models, GPU inference |
| NLU | 50ms | Distilled models |
| Skill execution | 500ms | Caching, async calls |
| TTS | 400ms | Pre-synthesized common phrases |
| Response playback | 250ms | Streaming audio |
Streaming Architecture
Don't wait for each component to finish:
NLU can start processing partial transcripts. Skills can pre-fetch data based on partial understanding.
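A sketch of that overlap using asyncio: a cheap NLU check runs on each partial transcript, and the skill call starts before the user finishes speaking. The simulated delays and the keyword-matching "NLU" are stand-ins:

```python
import asyncio

async def transcript_stream():
    """Stand-in for streaming ASR: yields growing partial transcripts."""
    partials = ["what's the", "what's the weather",
                "what's the weather in seattle",
                "what's the weather in seattle today"]
    for partial in partials:
        await asyncio.sleep(0.2)                  # simulated audio arriving
        yield partial

async def fetch_weather(city: str) -> str:
    await asyncio.sleep(0.5)                      # simulated skill/API latency
    return f"55F and cloudy in {city.title()}"

async def run_pipeline():
    prefetch_task = None
    async for partial in transcript_stream():
        # Cheap NLU on the partial text (keyword matching, purely illustrative).
        if "weather" in partial and "seattle" in partial and prefetch_task is None:
            # Skill latency now overlaps the tail of the utterance and endpointing.
            prefetch_task = asyncio.create_task(fetch_weather("seattle"))
    forecast = (await prefetch_task) if prefetch_task else None
    print("Response ready:", forecast)

asyncio.run(run_pipeline())
```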
Monitoring
Quality Metrics
| Metric | How to Measure |
|---|---|
| ASR accuracy | Sample and human-transcribe |
| NLU accuracy | Sample and human-label |
| End-to-end success | Did user get what they wanted? |
| Latency percentiles | Instrument each component |
User Feedback Signals
| Signal | Indicates |
|---|---|
| Explicit feedback | "That's wrong" / "Thank you" |
| Repetition | User repeats the command |
| Reformulation | User rephrases the command |
| Abandonment | User gives up mid-request |
| Barge-in | User interrupts the response |
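Repetition and reformulation can be mined automatically from consecutive requests. A heuristic sketch using string similarity; the 30-second window and the similarity thresholds are assumptions:

```python
from difflib import SequenceMatcher

def classify_followup(prev_query: str, next_query: str, gap_seconds: float) -> str:
    """Implicit-feedback heuristic: a near-identical query soon after the
    previous one suggests the first response failed."""
    if gap_seconds > 30:
        return "unrelated"
    similarity = SequenceMatcher(None, prev_query.lower(), next_query.lower()).ratio()
    if similarity > 0.9:
        return "repetition"        # user said the same thing again
    if similarity > 0.5:
        return "reformulation"     # user rephrased the request
    return "unrelated"

print(classify_followup("play some jazz", "play jazz music", gap_seconds=8))  # reformulation
```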
Reference
| Topic | Description |
|---|---|
| Multiple languages | Multilingual ASR model or language detection followed by language-specific model. NLU can be multilingual or separate. |
| Personalization | Speaker identification for multi-user devices. Learn preferences (music, news sources). Adapt to user's speech patterns. |
| Noisy environments | Multi-microphone arrays with beamforming. Noise-robust ASR training. Confidence thresholds that ask for repetition. |
| Mistake handling | Allow corrections ("No, I said Seattle"). Implicit feedback from repetitions. Easy escalation to screen or app. |
| Privacy vs accuracy | On-device is private but less capable. Trade-off based on command type. |
| Latency vs quality | Faster responses vs better understanding. Streaming enables both. |