
Design a Voice Assistant System

Design a machine learning system that understands spoken commands and responds (like Alexa, Siri, or Google Assistant).

Requirements

Functional:

  • Wake word detection ("Hey Assistant")
  • Speech-to-text transcription
  • Intent recognition and slot filling
  • Execute commands (smart home, queries, timers)
  • Generate spoken responses

Non-functional:

  • Wake word: < 1% false accept, < 5% false reject
  • End-to-end latency: < 2 seconds
  • Work in noisy environments (e.g., ~50 dB of background noise)
  • Support multiple languages and accents
  • Run partially on-device for privacy

Metrics

Component Metrics

| Component | Metric                    | Target    |
|-----------|---------------------------|-----------|
| Wake word | False accept rate         | < 1%      |
| Wake word | False reject rate         | < 5%      |
| ASR       | Word error rate (WER)     | < 8%      |
| NLU       | Intent accuracy           | > 95%     |
| NLU       | Slot F1                   | > 90%     |
| TTS       | Mean opinion score (MOS)  | > 4.0/5.0 |

System Metrics

| Metric               | Description                             | Target        |
|----------------------|-----------------------------------------|---------------|
| End-to-end latency   | Time from speech end to response start  | < 2 s         |
| Request success rate | Successfully completed requests         | > 90%         |
| User retention       | Daily active users                      | Baseline + 5% |
| NPS                  | User satisfaction                       | > 50          |

Architecture

[Diagram: end-to-end pipeline — wake word detection → ASR → NLU → dialogue management → skill execution → TTS]

Wake Word Detection

The system listens constantly for the wake phrase. This must run on-device for privacy and latency.

Model Architecture

[Diagram: wake word model architecture]

Requirements:

  • Tiny model (runs on microcontroller)
  • Low power (always listening)
  • High precision (few false wakes)
  • Personalized (learn user's voice)

Training Data

| Data source        | Purpose                            |
|--------------------|------------------------------------|
| Positive           | Thousands of wake word recordings  |
| Negative           | General speech, music, TV          |
| Hard negatives     | Similar-sounding phrases           |
| Noise augmentation | Various acoustic conditions        |

Train with class-imbalance handling: weight the loss so that false accepts (waking unprompted, a privacy problem) cost more than false rejects (the user simply repeats the phrase).
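A minimal PyTorch sketch of that weighting, assuming mel-spectrogram input; the layer sizes and the 0.3 weight are illustrative assumptions, not production values.

```python
import torch
import torch.nn as nn

# Tiny depthwise-separable CNN over mel features -- small enough to
# run on a microcontroller after quantization. Layer sizes are
# illustrative assumptions.
class WakeWordNet(nn.Module):
    def __init__(self, n_mels=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, groups=16, padding=1),  # depthwise
            nn.Conv2d(16, 32, kernel_size=1),                        # pointwise
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, 1),  # single logit: wake word present?
        )

    def forward(self, x):  # x: (batch, 1, n_mels, time)
        return self.net(x).squeeze(-1)

model = WakeWordNet()
# pos_weight < 1 downweights the positive class, so the model needs
# stronger evidence before firing -- fewer false accepts at the cost
# of more false rejects. The 0.3 value is an assumption to tune
# against the FA/FR targets above.
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(0.3))

features = torch.randn(8, 1, 40, 100)  # batch of mel spectrograms
labels = torch.randint(0, 2, (8,)).float()
loss = loss_fn(model(features), labels)
loss.backward()
```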

Speech Recognition (ASR)

Convert audio to text.

End-to-End Architecture

[Diagram: end-to-end ASR architecture]

Conformer: Current state-of-the-art. Combines self-attention (global context) with convolution (local patterns).

Streaming vs Non-Streaming

| Mode          | Description                   | Latency | Accuracy |
|---------------|-------------------------------|---------|----------|
| Non-streaming | Process full utterance        | Higher  | Better   |
| Streaming     | Process chunks as they arrive | Lower   | Worse    |
| Hybrid        | Stream with look-ahead        | Medium  | Medium   |

Recommendation: Streaming for voice assistants. Users expect immediate feedback. Use limited look-ahead (300ms) to improve accuracy.
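A sketch of the chunking logic; `encoder.transcribe_chunk` stands in for a hypothetical streaming ASR API, and the 200 ms chunk size is an assumption.

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_MS = 200       # emit a partial transcript every 200 ms (assumption)
LOOKAHEAD_MS = 300   # right-context from the recommendation above

CHUNK = SAMPLE_RATE * CHUNK_MS // 1000
LOOKAHEAD = SAMPLE_RATE * LOOKAHEAD_MS // 1000

def stream_transcribe(audio: np.ndarray, encoder) -> str:
    """Feed fixed-size chunks plus look-ahead to a streaming encoder.

    `encoder` is any object exposing transcribe_chunk(samples, is_final)
    -- a hypothetical interface, not a real library call.
    """
    text = ""
    for start in range(0, len(audio), CHUNK):
        # Include future samples as look-ahead; the encoder only
        # commits output for the chunk itself.
        end = min(start + CHUNK + LOOKAHEAD, len(audio))
        is_final = start + CHUNK >= len(audio)
        text = encoder.transcribe_chunk(audio[start:end], is_final=is_final)
    return text
```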

Handling Challenges

| Challenge        | Solution                                         |
|------------------|--------------------------------------------------|
| Background noise | Noise-robust features, multi-condition training  |
| Accents          | Accent-specific models or adaptation             |
| Rare words       | External language model, spell correction        |
| Code-switching   | Multilingual model                               |
| Disfluencies     | Training data with "uh", "um", etc.              |

Natural Language Understanding (NLU)

Once you have text, understand what the user wants.

Intent Classification

Map utterances to intents:

| Utterance                    | Intent    |
|------------------------------|-----------|
| "Set a timer for 5 minutes"  | SetTimer  |
| "What's the weather?"        | GetWeather|
| "Play some jazz"             | PlayMusic |
| "Turn off the lights"        | SmartHome |

Slot Filling

Extract key parameters:

"Set a timer for 5 minutes"
Intent: SetTimer
Slots:
- duration: "5 minutes"
- duration_value: 5
- duration_unit: minutes

Joint Model Architecture

[Diagram: joint intent and slot model]

Joint training: intent classification from [CLS] token, slot filling from token representations.
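A minimal sketch of that joint head, assuming a Hugging Face BERT encoder; the intent and slot label counts are illustrative.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class JointIntentSlotModel(nn.Module):
    def __init__(self, n_intents: int, n_slot_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        hidden = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden, n_intents)    # from [CLS]
        self.slot_head = nn.Linear(hidden, n_slot_labels)  # per token

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state                 # (batch, seq, hidden)
        intent_logits = self.intent_head(hidden[:, 0]) # [CLS] position
        slot_logits = self.slot_head(hidden)           # every token
        return intent_logits, slot_logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = JointIntentSlotModel(n_intents=4, n_slot_labels=9)  # counts are assumptions
batch = tokenizer("set a timer for 5 minutes", return_tensors="pt")
intent_logits, slot_logits = model(batch["input_ids"], batch["attention_mask"])
# Joint loss = cross-entropy on the intent + cross-entropy over slot tags.
```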

Handling Ambiguity

User: "Play Taylor Swift"

Could mean:

  • Play songs by Taylor Swift
  • Play the song "Taylor Swift" (if it exists)
  • Play Taylor Swift playlist

Solutions:

  • Confidence thresholds (ask for clarification if uncertain; sketched in code below)
  • Dialogue context (previous queries help disambiguate)
  • User preferences (this user usually means artist)
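One possible policy combining confidence thresholds with user preferences; the 0.6 cutoff and the preference store are illustrative assumptions.

```python
CLARIFY_THRESHOLD = 0.6  # below this, don't guess -- ask (assumption)

def resolve_entity(candidates, user_prefs):
    """candidates: list of (interpretation, confidence), sorted descending.
    Returns either an interpretation to execute or a clarification prompt."""
    best, best_conf = candidates[0]
    if best_conf >= CLARIFY_THRESHOLD:
        return {"action": "execute", "interpretation": best}
    # Low confidence: fall back to what this user usually means.
    preferred = user_prefs.get(best["entity"])
    if preferred is not None:
        return {"action": "execute", "interpretation": preferred}
    # Still ambiguous: ask the user to choose between the top two.
    a, b = candidates[0][0], candidates[1][0]
    return {"action": "clarify",
            "prompt": f"Did you mean {a['label']} or {b['label']}?"}

candidates = [
    ({"entity": "taylor swift", "type": "artist",
      "label": "songs by Taylor Swift"}, 0.48),
    ({"entity": "taylor swift", "type": "playlist",
      "label": "your Taylor Swift playlist"}, 0.40),
]
print(resolve_entity(candidates, user_prefs={}))  # -> clarify prompt
```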

Dialogue Management

Track conversation state across turns.

State Tracking

[Diagram: dialogue state tracking]

Context Carryover

User: "What's the weather in Seattle?"
Assistant: "It's 55F and cloudy in Seattle."
User: "What about tomorrow?" <- needs context

Context: {location: Seattle, date: today}
Interpret: "What's the weather in Seattle tomorrow?"

Maintain context slots across turns. Expire after timeout or topic change.
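A minimal sketch of such a context store; the 120-second TTL is an assumption to tune.

```python
import time

CONTEXT_TTL_SECONDS = 120  # expiry timeout is an assumption

class DialogueContext:
    """Carry slots across turns; drop them on timeout or topic change."""
    def __init__(self):
        self.slots = {}
        self.updated_at = 0.0

    def update(self, intent: str, slots: dict):
        expired = time.time() - self.updated_at > CONTEXT_TTL_SECONDS
        topic_changed = self.slots.get("intent") not in (None, intent)
        if expired or topic_changed:
            self.slots.clear()
        self.slots["intent"] = intent
        self.slots.update(slots)
        self.updated_at = time.time()

    def resolve(self, slots: dict) -> dict:
        """Fill slots missing from this turn using prior turns."""
        merged = dict(self.slots)
        merged.update({k: v for k, v in slots.items() if v is not None})
        return merged

ctx = DialogueContext()
ctx.update("GetWeather", {"location": "Seattle", "date": "today"})
# Follow-up turn: "What about tomorrow?" supplies only a date.
print(ctx.resolve({"date": "tomorrow", "location": None}))
# -> {'intent': 'GetWeather', 'location': 'Seattle', 'date': 'tomorrow'}
```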

Execution Layer

Route intents to skills that fulfill them.

Skill Architecture

[Diagram: skill routing architecture]

Each skill (a minimal interface is sketched after this list):

  1. Validates required slots
  2. Calls external APIs if needed
  3. Generates response text
  4. Returns structured response
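One way that contract might look in code; the WeatherSkill name and the stubbed forecast call are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SkillResponse:
    speech_text: str                 # what TTS will say
    needs_clarification: bool = False
    data: dict = field(default_factory=dict)

class WeatherSkill:
    REQUIRED_SLOTS = ("location", "date")

    def handle(self, slots: dict) -> SkillResponse:
        # 1. Validate required slots.
        missing = [s for s in self.REQUIRED_SLOTS if not slots.get(s)]
        if missing:
            return SkillResponse(
                speech_text=f"Which {missing[0]} did you mean?",
                needs_clarification=True)
        # 2. Call external APIs if needed (stubbed out here).
        forecast = self._fetch_forecast(slots["location"], slots["date"])
        # 3. Generate response text. 4. Return a structured response.
        return SkillResponse(
            speech_text=(f"It's {forecast['temp_f']}F and "
                         f"{forecast['condition']} in {slots['location']}."),
            data=forecast)

    def _fetch_forecast(self, location, date):
        return {"temp_f": 55, "condition": "cloudy"}  # stand-in for a real API

print(WeatherSkill().handle({"location": "Seattle", "date": "today"}).speech_text)
```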

Response Generation

| Approach          | When to use                          |
|-------------------|--------------------------------------|
| Template          | Simple, predictable responses        |
| Retrieval         | FAQ-style responses                  |
| Neural generation | Complex, conversational responses    |

For voice assistants, templates work well. They are predictable and fast.
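A minimal template renderer might look like this; the template strings are illustrative.

```python
import random

# Templates keyed by intent; {slot} placeholders are filled from NLU
# output. A few variants per intent keep responses from sounding robotic.
TEMPLATES = {
    "SetTimer": [
        "Timer set for {duration}.",
        "OK, {duration}, starting now.",
    ],
    "GetWeather": [
        "It's {temp_f} degrees and {condition} in {location}.",
    ],
}

def render_response(intent: str, slots: dict) -> str:
    template = random.choice(TEMPLATES[intent])
    return template.format(**slots)

print(render_response("SetTimer", {"duration": "5 minutes"}))
```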

Text-to-Speech (TTS)

Convert response text to natural speech.

Modern TTS Pipeline

[Diagram: modern TTS pipeline — text normalization → acoustic model → vocoder]

Text normalization: Convert abbreviations, numbers, dates to spoken form.

Acoustic model: Text -> mel spectrogram. FastSpeech2 is fast and high quality.

Vocoder: Mel spectrogram -> waveform. HiFi-GAN is fast and sounds natural.
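To make the text normalization step concrete, a toy sketch follows; real systems typically use weighted FSTs or neural verbalizers, and the abbreviation list here is illustrative.

```python
import re

# Spell out a handful of small integers; larger numbers, dates, and
# currencies need a real verbalizer.
UNITS = {0: "zero", 1: "one", 2: "two", 3: "three", 4: "four",
         5: "five", 6: "six", 7: "seven", 8: "eight", 9: "nine",
         10: "ten", 11: "eleven", 12: "twelve"}
# Abbreviation patterns -> spoken forms (illustrative).
ABBREVIATIONS = {r"\bDr\.": "Doctor", r"\bSt\.": "Street",
                 r"\bmin\b": "minutes"}

def normalize(text: str) -> str:
    for pattern, spoken in ABBREVIATIONS.items():
        text = re.sub(pattern, spoken, text)
    return re.sub(r"\b(\d+)\b",
                  lambda m: UNITS.get(int(m.group(1)), m.group(1)),
                  text)

print(normalize("Timer set for 5 min"))  # -> "Timer set for five minutes"
```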

Voice Customization

| Feature          | Implementation                    |
|------------------|-----------------------------------|
| Multiple voices  | Train separate models per voice   |
| Prosody control  | Adjust pitch, speed, emotion      |
| SSML support     | Markup for pronunciation, pauses  |
| Celebrity voices | Voice cloning with consent        |

On-Device vs Cloud

Privacy vs Capability Trade-off

[Diagram: on-device vs cloud split]

Keep on-device:

  • Wake word detection (always listening)
  • Simple commands that don't need cloud
  • Audio encryption

Send to cloud:

  • Complex speech recognition
  • Knowledge queries
  • Skill execution
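A toy routing policy along these lines; the on-device intent whitelist and confidence cutoff are illustrative assumptions, and real systems also consider connectivity and user privacy settings.

```python
# Intents simple enough to fulfill without the cloud (assumption).
ON_DEVICE_INTENTS = {"SetTimer", "StopTimer", "VolumeUp", "VolumeDown"}

def route(intent: str, confidence: float) -> str:
    if intent in ON_DEVICE_INTENTS and confidence > 0.9:
        return "on_device"  # fast path: no audio leaves the device
    return "cloud"          # full ASR/NLU, knowledge queries, skills

print(route("SetTimer", 0.95))    # -> on_device
print(route("GetWeather", 0.95))  # -> cloud
```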

Latency Optimization

End-to-End Budget: 2 seconds

| Component         | Budget  | Optimization                    |
|-------------------|---------|---------------------------------|
| Audio capture     | 200 ms  | Streaming, endpoint detection   |
| Network           | 100 ms  | Edge servers, compression       |
| ASR               | 500 ms  | Streaming models, GPU inference |
| NLU               | 50 ms   | Distilled models                |
| Skill execution   | 500 ms  | Caching, async calls            |
| TTS               | 400 ms  | Pre-synthesized common phrases  |
| Response playback | 250 ms  | Streaming audio                 |

Streaming Architecture

Don't wait for each component to finish:

[Diagram: streaming pipeline with overlapped components]

NLU can start processing partial transcripts. Skills can pre-fetch data based on partial understanding.
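A sketch of that pipelining with asyncio; the partial transcripts, stub NLU, and skill warm-up are stand-ins for real components.

```python
import asyncio

async def asr_partials():
    """Stand-in for a streaming ASR emitting growing partial transcripts."""
    for partial in ["set a", "set a timer", "set a timer for 5 minutes"]:
        await asyncio.sleep(0.2)  # simulated chunk cadence
        yield partial

async def warm_up_skill(intent):
    await asyncio.sleep(0.1)  # e.g., open a connection, load user timers

async def pipeline():
    intent, prefetched = None, None
    async for text in asr_partials():
        intent = "SetTimer" if "timer" in text else None  # stub NLU
        # Pre-fetch the skill as soon as the intent stabilizes,
        # instead of waiting for the final transcript.
        if intent and prefetched is None:
            prefetched = asyncio.create_task(warm_up_skill(intent))
    if prefetched:
        await prefetched
    print(f"execute {intent!r} for {text!r}")

asyncio.run(pipeline())
```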

Monitoring

Quality Metrics

| Metric              | How to measure                        |
|---------------------|---------------------------------------|
| ASR accuracy        | Sample and human-transcribe           |
| NLU accuracy        | Sample and human-label                |
| End-to-end success  | Did the user get what they wanted?    |
| Latency percentiles | Instrument each component             |
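A sketch of tail-latency monitoring against the 2-second budget; the latency samples here are synthetic.

```python
import numpy as np

# Per-request end-to-end latencies in ms, e.g. from tracing spans
# wrapping each component. These values are synthetic.
rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=7.0, sigma=0.3, size=10_000)

for p in (50, 95, 99):
    print(f"p{p}: {np.percentile(latencies_ms, p):.0f} ms")

# Alert on the tail, not the average -- means hide the slow requests
# users actually notice.
assert np.percentile(latencies_ms, 95) < 2000, "p95 exceeds latency budget"
```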

User Feedback Signals

| Signal            | Indicates                       |
|-------------------|---------------------------------|
| Explicit feedback | "That's wrong" / "Thank you"    |
| Repetition        | User repeats the command        |
| Reformulation     | User rephrases the command      |
| Abandonment       | User gives up mid-request       |
| Barge-in          | User interrupts the response    |

Reference

| Topic               | Description                                                                                                               |
|---------------------|---------------------------------------------------------------------------------------------------------------------------|
| Multiple languages  | Multilingual ASR model, or language detection followed by a language-specific model. NLU can be multilingual or separate.   |
| Personalization     | Speaker identification for multi-user devices. Learn preferences (music, news sources). Adapt to the user's speech patterns.|
| Noisy environments  | Multi-microphone arrays with beamforming. Noise-robust ASR training. Confidence thresholds that ask for repetition.         |
| Mistake handling    | Allow corrections ("No, I said Seattle"). Implicit feedback from repetitions. Easy escalation to a screen or app.           |
| Privacy vs accuracy | On-device is private but less capable. Trade off based on command type.                                                     |
| Latency vs quality  | Faster responses vs better understanding. Streaming enables both.                                                           |