Design a Voice Assistant System
Design a machine learning system that understands spoken commands and responds (like Alexa, Siri, or Google Assistant).
Requirements
Functional:
- Wake word detection ("Hey Assistant")
- Speech-to-text transcription
- Intent recognition and slot filling
- Execute commands (smart home, queries, timers)
- Generate spoken responses
Non-functional:
- Wake word: < 1% false accept, < 5% false reject
- End-to-end latency: < 2 seconds
- Work in noisy environments (e.g., 50 dB of ambient noise)
- Support multiple languages and accents
- Run partially on-device for privacy
Metrics
Component Metrics
| Component | Metric | Target |
|---|---|---|
| Wake word | False accept rate | < 1% |
| Wake word | False reject rate | < 5% |
| ASR | Word error rate (WER) | < 8% |
| NLU | Intent accuracy | > 95% |
| NLU | Slot F1 | > 90% |
| TTS | Mean opinion score (MOS) | > 4.0/5.0 |
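The ASR target above is a word error rate (WER): substitutions plus deletions plus insertions, divided by the number of reference words. A minimal, purely illustrative sketch of the computation via word-level edit distance:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("set a timer for five minutes",
                      "set the timer for five minutes"))  # 1 substitution / 6 words ~= 0.17
```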
System Metrics
| Metric | Description | Target |
|---|---|---|
| End-to-end latency | Time from speech end to response start | < 2s |
| Request success rate | Successfully completed requests | > 90% |
| User retention | Returning daily active users | Baseline + 5% |
| NPS | User satisfaction | > 50 |
Architecture
Wake Word Detection
The system listens constantly for the wake phrase. This must run on-device for privacy and latency.
Model Architecture
Requirements:
- Tiny model (runs on microcontroller)
- Low power (always listening)
- High precision (few false wakes)
- Personalized (learn user's voice)
Training Data
| Data Source | Description |
|---|---|
| Positive | Thousands of wake word recordings |
| Negative | General speech, music, TV |
| Hard negatives | Similar-sounding phrases |
| Noise augmentation | Various acoustic conditions |
Train with class imbalance handling (false accepts are worse than false rejects).
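A minimal sketch of what such a detector might look like, assuming PyTorch, log-mel input features (40 mels over roughly one second of audio), and an asymmetric loss that penalizes false accepts more heavily. Layer sizes and the pos_weight value are illustrative assumptions, not a production configuration:

```python
import torch
import torch.nn as nn

class WakeWordNet(nn.Module):
    """Tiny CNN over log-mel features (assumed: 40 mels x 100 frames ~ 1 s of audio).
    Small enough, after quantization, to target a low-power always-on DSP."""
    def __init__(self, n_mels: int = 40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global pooling -> fixed-size embedding
        )
        self.fc = nn.Linear(32, 1)              # single logit: wake word vs. everything else

    def forward(self, x):                       # x: (batch, 1, n_mels, frames)
        return self.fc(self.conv(x).flatten(1))

model = WakeWordNet()
# pos_weight < 1 down-weights the positive (wake word) term, so mistakes on
# negative clips (false accepts) cost relatively more during training.
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.25]))

features = torch.randn(8, 1, 40, 100)           # dummy log-mel batch
labels = torch.zeros(8, 1)
labels[0] = 1.0                                 # mostly negatives, as in real data
loss_fn(model(features), labels).backward()
```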
Speech Recognition (ASR)
Convert audio to text.
End-to-End Architecture
Conformer: the current state of the art. It combines self-attention (global context) with convolution (local patterns).
Streaming vs Non-Streaming
| Mode | Description | Latency | Accuracy |
|---|---|---|---|
| Non-streaming | Process full utterance | Higher | Better |
| Streaming | Process chunks as they arrive | Lower | Worse |
| Hybrid | Stream with look-ahead | Medium | Medium |
Recommendation: Streaming for voice assistants. Users expect immediate feedback. Use limited look-ahead (300ms) to improve accuracy.
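A sketch of the chunking a streaming recognizer would consume, assuming 16 kHz audio, 640 ms chunks, and 300 ms of look-ahead (all illustrative values); the decoder call at the bottom is a hypothetical API:

```python
from typing import Iterator, List

SAMPLE_RATE = 16_000          # assumed; typical for far-field ASR
CHUNK_MS, LOOKAHEAD_MS = 640, 300

def stream_chunks(samples: List[float]) -> Iterator[List[float]]:
    """Yield fixed-size chunks plus a short look-ahead of future audio.
    The recognizer emits partial text per chunk and may revise it once
    the look-ahead context arrives."""
    chunk = int(SAMPLE_RATE * CHUNK_MS / 1000)
    lookahead = int(SAMPLE_RATE * LOOKAHEAD_MS / 1000)
    for start in range(0, len(samples), chunk):
        yield samples[start : start + chunk + lookahead]

# Usage: feed chunks to a streaming decoder as they arrive.
# for chunk in stream_chunks(mic_buffer):
#     partial_text = decoder.accept_audio(chunk)   # hypothetical decoder API
```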
Handling Challenges
| Challenge | Solution |
|---|---|
| Background noise | Noise-robust features, multi-condition training |
| Accents | Accent-specific models or adaptation |
| Rare words | External language model, spell correction |
| Code-switching | Multilingual model |
| Disfluencies | Training data with "uh", "um", etc. |
Natural Language Understanding (NLU)
Once you have text, understand what the user wants.
Intent Classification
Map utterances to intents:
| Utterance | Intent |
|---|---|
| "Set a timer for 5 minutes" | SetTimer |
| "What's the weather?" | GetWeather |
| "Play some jazz" | PlayMusic |
| "Turn off the lights" | SmartHome |
Slot Filling
Extract key parameters:
"Set a timer for 5 minutes"
Intent: SetTimer
Slots:
- duration: "5 minutes"
- duration_value: 5
- duration_unit: minutes
Joint Model Architecture
Joint training: intent classification from the [CLS] token, slot filling from the per-token representations.
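A minimal sketch of that joint model in PyTorch. Vocabulary size, dimensions, and label counts are placeholder assumptions, and a production system would initialize the encoder from a pretrained model such as BERT:

```python
import torch
import torch.nn as nn

class JointNLU(nn.Module):
    """Shared encoder with two heads: sentence-level intent from the first
    ([CLS]-style) position, per-token slot tags from every position."""
    def __init__(self, vocab_size=30_000, dim=256, n_intents=50, n_slot_tags=40):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.intent_head = nn.Linear(dim, n_intents)
        self.slot_head = nn.Linear(dim, n_slot_tags)    # BIO tag per token

    def forward(self, token_ids):                       # (batch, seq_len)
        hidden = self.encoder(self.embed(token_ids))    # (batch, seq_len, dim)
        intent_logits = self.intent_head(hidden[:, 0])  # first token summarizes the utterance
        slot_logits = self.slot_head(hidden)            # one tag distribution per token
        return intent_logits, slot_logits

# The two losses share the encoder and regularize each other.
model = JointNLU()
tokens = torch.randint(0, 30_000, (4, 12))
intent_logits, slot_logits = model(tokens)
intent_loss = nn.functional.cross_entropy(intent_logits, torch.randint(0, 50, (4,)))
slot_loss = nn.functional.cross_entropy(slot_logits.flatten(0, 1),
                                        torch.randint(0, 40, (4 * 12,)))
loss = intent_loss + slot_loss
```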
Handling Ambiguity
User: "Play Taylor Swift"
Could mean:
- Play songs by Taylor Swift
- Play the song "Taylor Swift" (if it exists)
- Play Taylor Swift playlist
Solutions:
- Confidence thresholds (ask for clarification if uncertain; see the sketch below)
- Dialogue context (previous queries help disambiguate)
- User preferences (this user usually means artist)
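A sketch combining the first and third ideas, a confidence threshold plus a per-user prior; the threshold and prior values are assumptions to tune against clarification rate vs. error rate:

```python
CLARIFY_THRESHOLD = 0.6   # assumed; tune against clarification-rate vs. error-rate

def resolve_entity(candidates, user_prior=None, threshold=CLARIFY_THRESHOLD):
    """candidates: list of (interpretation, model_score). Blend in a per-user
    prior, execute the top interpretation if confident enough, otherwise ask."""
    scored = []
    for interp, score in candidates:
        prior = (user_prior or {}).get(interp, 1.0)
        scored.append((interp, score * prior))
    total = sum(s for _, s in scored) or 1.0
    best, best_score = max(scored, key=lambda x: x[1])
    if best_score / total >= threshold:
        return {"action": "execute", "interpretation": best}
    return {"action": "clarify",
            "prompt": f"Did you mean {', or '.join(i for i, _ in scored)}?"}

print(resolve_entity([("artist:Taylor Swift", 0.55), ("playlist:Taylor Swift", 0.45)],
                     user_prior={"artist:Taylor Swift": 1.5}))
```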
Dialogue Management
Track conversation state across turns.
State Tracking
Context Carryover
User: "What's the weather in Seattle?"
Assistant: "It's 55F and cloudy in Seattle."
User: "What about tomorrow?" <- needs context
Context: {location: Seattle, date: today}
Interpret: "What's the weather in Seattle tomorrow?"
Maintain context slots across turns. Expire after timeout or topic change.
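A minimal sketch of that carryover logic; the 120-second timeout and the domain-change rule are assumptions:

```python
import time

CONTEXT_TTL_SECONDS = 120          # assumed timeout before carried-over context expires

class DialogueContext:
    """Carry slots (location, date, ...) across turns; drop them after a
    timeout or when the topic (domain) changes."""
    def __init__(self):
        self.slots, self.domain, self.updated_at = {}, None, 0.0

    def update(self, domain: str, slots: dict):
        if domain != self.domain or time.time() - self.updated_at > CONTEXT_TTL_SECONDS:
            self.slots = {}                       # topic change or stale context: start fresh
        self.domain, self.updated_at = domain, time.time()
        self.slots.update(slots)

    def resolve(self, slots: dict) -> dict:
        """Fill slots missing from the new turn using carried-over context."""
        return {**self.slots, **slots}

ctx = DialogueContext()
ctx.update("weather", {"location": "Seattle", "date": "today"})
print(ctx.resolve({"date": "tomorrow"}))   # {'location': 'Seattle', 'date': 'tomorrow'}
```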
Execution Layer
Route intents to skills that fulfill them.
Skill Architecture
Each skill (see the sketch below):
- Validates required slots
- Calls external APIs if needed
- Generates response text
- Returns structured response
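A sketch of that contract, using a timer skill as the example; set_device_timer stands in for whatever real integration the skill would call:

```python
from dataclasses import dataclass, field

@dataclass
class SkillResponse:
    speech_text: str                              # what TTS will say
    success: bool = True
    data: dict = field(default_factory=dict)      # structured payload for screens/apps

class TimerSkill:
    """Minimal skill: validate required slots, call the integration, return a response.
    set_device_timer is a hypothetical placeholder for the real device API."""
    required_slots = ("duration_value", "duration_unit")

    def handle(self, slots: dict) -> SkillResponse:
        missing = [s for s in self.required_slots if s not in slots]
        if missing:
            # Re-prompt for missing slots instead of failing silently.
            return SkillResponse("For how long should I set the timer?", success=False)
        # set_device_timer(slots["duration_value"], slots["duration_unit"])  # hypothetical call
        return SkillResponse(
            f"Timer set for {slots['duration_value']} {slots['duration_unit']}.")

SKILL_REGISTRY = {"SetTimer": TimerSkill()}       # intent name -> skill instance
print(SKILL_REGISTRY["SetTimer"].handle({"duration_value": 5, "duration_unit": "minutes"}))
```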
Response Generation
| Approach | When to Use |
|---|---|
| Template | Simple, predictable responses |
| Retrieval | FAQ-style responses |
| Neural generation | Complex, conversational responses |
For voice assistants, templates work well. They are predictable and fast.
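A minimal template renderer; a few paraphrases per intent keep responses cheap and predictable without sounding identical every time (template wording here is illustrative):

```python
import random

TEMPLATES = {
    "GetWeather": ["It's {temp}F and {condition} in {location}.",
                   "Right now in {location}: {temp}F and {condition}."],
    "SetTimer": ["Timer set for {duration}.",
                 "OK, {duration}, starting now."],
}

def render(intent: str, **slots) -> str:
    # Pick one paraphrase and fill in the slot values.
    return random.choice(TEMPLATES[intent]).format(**slots)

print(render("GetWeather", temp=55, condition="cloudy", location="Seattle"))
```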
Text-to-Speech (TTS)
Convert response text to natural speech.
Modern TTS Pipeline
Text normalization: Convert abbreviations, numbers, dates to spoken form.
Acoustic model: Text -> mel spectrogram. FastSpeech2 is fast and high quality.
Vocoder: Mel spectrogram -> waveform. HiFi-GAN is fast and sounds natural.
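A toy illustration of the text normalization step; a real front-end verbalizes whole numbers ("55" becomes "fifty five"), dates, currencies, ordinals, and locale-specific rules rather than the digit-by-digit expansion shown here:

```python
# Toy normalizer: expands digits and a few abbreviations into spoken words.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "min": "minutes"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize(text: str) -> str:
    words = []
    for token in text.lower().split():
        if token.isdigit():
            words.extend(DIGITS[int(d)] for d in token)   # "5" -> "five", "55" -> "five five"
        else:
            words.append(ABBREVIATIONS.get(token, token))
    return " ".join(words)

print(normalize("Set a timer for 5 min"))   # "set a timer for five minutes"
```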
Voice Customization
| Feature | Implementation |
|---|---|
| Multiple voices | Train separate models per voice |
| Prosody control | Adjust pitch, speed, emotion |
| SSML support | Markup for pronunciation, pauses |
| Celebrity voices | Voice cloning with consent |
On-Device vs Cloud
Privacy vs Capability Trade-off
Keep on-device:
- Wake word detection (always listening)
- Simple commands that don't need cloud
- Audio encryption (encrypt before anything leaves the device)
Send to cloud:
- Complex speech recognition
- Knowledge queries
- Skill execution
Latency Optimization
End-to-End Budget: 2 seconds
| Component | Budget | Optimization |
|---|---|---|
| Audio capture | 200ms | Streaming, endpoint detection |
| Network | 100ms | Edge servers, compression |
| ASR | 500ms | Streaming models, GPU inference |
| NLU | 50ms | Distilled models |
| Skill execution | 500ms | Caching, async calls |
| TTS | 400ms | Pre-synthesized common phrases |
| Response playback | 250ms | Streaming audio |
Streaming Architecture
Don't wait for each component to finish:
NLU can start processing partial transcripts. Skills can pre-fetch data based on partial understanding.
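A sketch of that overlap using asyncio: a cheap NLU check runs on each partial transcript, and the skill call starts before the user finishes speaking. The simulated delays and the keyword-matching "NLU" are stand-ins:

```python
import asyncio

async def transcript_stream():
    """Stand-in for streaming ASR: yields growing partial transcripts."""
    partials = ["what's the", "what's the weather",
                "what's the weather in seattle",
                "what's the weather in seattle today"]
    for partial in partials:
        await asyncio.sleep(0.2)                  # simulated audio arriving
        yield partial

async def fetch_weather(city: str) -> str:
    await asyncio.sleep(0.5)                      # simulated skill/API latency
    return f"55F and cloudy in {city.title()}"

async def run_pipeline():
    prefetch_task = None
    async for partial in transcript_stream():
        # Cheap NLU on the partial text (keyword matching, purely illustrative).
        if "weather" in partial and "seattle" in partial and prefetch_task is None:
            # Skill latency now overlaps the tail of the utterance and endpointing.
            prefetch_task = asyncio.create_task(fetch_weather("seattle"))
    forecast = (await prefetch_task) if prefetch_task else None
    print("Response ready:", forecast)

asyncio.run(run_pipeline())
```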
Monitoring
Quality Metrics
| Metric | How to Measure |
|---|---|
| ASR accuracy | Sample and human-transcribe |
| NLU accuracy | Sample and human-label |
| End-to-end success | Did user get what they wanted? |
| Latency percentiles | Instrument each component |
User Feedback Signals
| Signal | Indicates |
|---|---|
| Explicit feedback | "That's wrong" / "Thank you" |
| Repetition | User repeats the command |
| Reformulation | User rephrases the command |
| Abandonment | User gives up mid-request |
| Barge-in | User interrupts the response |
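Repetition and reformulation can be mined automatically from consecutive requests. A heuristic sketch using string similarity; the 30-second window and the similarity thresholds are assumptions:

```python
from difflib import SequenceMatcher

def classify_followup(prev_query: str, next_query: str, gap_seconds: float) -> str:
    """Implicit-feedback heuristic: a near-identical query soon after the
    previous one suggests the first response failed."""
    if gap_seconds > 30:
        return "unrelated"
    similarity = SequenceMatcher(None, prev_query.lower(), next_query.lower()).ratio()
    if similarity > 0.9:
        return "repetition"        # user said the same thing again
    if similarity > 0.5:
        return "reformulation"     # user rephrased the request
    return "unrelated"

print(classify_followup("play some jazz", "play jazz music", gap_seconds=8))  # reformulation
```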
Reference
| Topic | Description |
|---|---|
| Multiple languages | Multilingual ASR model or language detection followed by language-specific model. NLU can be multilingual or separate. |
| Personalization | Speaker identification for multi-user devices. Learn preferences (music, news sources). Adapt to user's speech patterns. |
| Noisy environments | Multi-microphone arrays with beamforming. Noise-robust ASR training. Confidence thresholds that ask for repetition. |
| Mistake handling | Allow corrections ("No, I said Seattle"). Implicit feedback from repetitions. Easy escalation to screen or app. |
| Privacy vs accuracy | On-device is private but less capable. Trade-off based on command type. |
| Latency vs quality | Faster responses vs better understanding. Streaming enables both. |