Design a Content Moderation System
Design a machine learning system to detect and remove harmful content (hate speech, violence, spam, misinformation) across text, images, and video at scale.
Requirements
Functional:
- Classify content across multiple harm categories
- Handle text, images, video, and combinations
- Support human review workflow
- Adapt to emerging threats quickly
- Respect regional legal requirements
Non-functional:
- Latency < 500ms for user-facing content
- Process 1M+ pieces of content per hour
- High precision (don't remove legitimate content)
- Reasonable recall (catch most violations)
Metrics
Model Metrics
| Metric | Description | Target |
|---|---|---|
| Precision | Violations caught / Flagged as violation | > 95% |
| Recall | Violations caught / Total violations | > 80% |
| False Positive Rate | Good content removed / Total good content | < 0.1% |
| Latency | Time to classify | p99 < 500ms |
Business Metrics
| Metric | Description |
|---|---|
| Prevalence | Harmful content seen by users / Total content views |
| User reports | Reports per million views |
| Appeal success rate | Overturned decisions / Total appeals |
| Reviewer efficiency | Content reviewed per hour |
Architecture
Harm Categories
| Category | Examples | Challenges |
|---|---|---|
| Violence | Gore, threats, self-harm | Medical/news exceptions |
| Hate Speech | Slurs, dehumanization | Cultural context, reclaimed terms |
| Sexual Content | Nudity, exploitation | Art vs pornography |
| Spam | Scams, fake engagement | Adversarial adaptation |
| Misinformation | False health claims, election interference | Requires fact-checking |
| Harassment | Bullying, doxxing | Relationship context matters |
Each category may require its own model with category-specific features and thresholds.
Feature Engineering
Text Features
| Feature | Type | Description |
|---|---|---|
| Token embeddings | Embedding | BERT/RoBERTa contextual embeddings |
| Toxicity lexicon | Numerical | Match against known toxic terms |
| Named entities | Categorical | People, groups, locations mentioned |
| Sentiment | Numerical | Positive/negative/neutral |
| All caps ratio | Numerical | Shouting indicator |
| Punctuation density | Numerical | Excessive !!! or ??? |
| Character repetition | Numerical | Obfuscation like "haaaate" |
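Several of the surface-level features above are cheap to compute directly; a minimal sketch follows (the regexes and the decision to treat runs of three or more repeated characters as obfuscation are illustrative assumptions; embeddings and lexicon matches would come from separate models and term lists):

```python
import re

def surface_text_features(text: str) -> dict:
    """Cheap lexical signals: shouting, punctuation abuse, character stretching."""
    letters = [c for c in text if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)

    # Density of exclamation/question marks relative to text length.
    punct_density = len(re.findall(r"[!?]", text)) / max(len(text), 1)

    # Obfuscation via stretched characters, e.g. "haaaate" -> runs of 3+ repeats.
    repeat_runs = len(re.findall(r"(.)\1{2,}", text))

    return {
        "all_caps_ratio": caps_ratio,
        "punctuation_density": punct_density,
        "char_repetition_runs": repeat_runs,
    }

print(surface_text_features("I HAAAATE this!!! Why??"))
```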
Image Features
| Feature | Type | Description |
|---|---|---|
| Image embeddings | Embedding | ViT or ResNet features |
| Object detection | Categorical | Weapons, nudity, violence indicators |
| OCR text | Text | Extract text for text classification |
| Perceptual hash | Hash | Match against known violating images |
| Face detection | Numerical | Number of faces, ages |
| Skin pixel ratio | Numerical | Fraction of skin-colored pixels; crude NSFW indicator |
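The skin pixel ratio can be approximated with a simple color-space heuristic; a minimal sketch, assuming a YCbCr threshold box (the threshold values are common rules of thumb, not tuned constants, and a real system would rely on a learned NSFW classifier):

```python
import numpy as np
from PIL import Image

def skin_pixel_ratio(path: str) -> float:
    """Fraction of pixels falling in a rough skin-tone range (YCbCr heuristic)."""
    img = Image.open(path).convert("YCbCr")
    ycbcr = np.asarray(img, dtype=np.float32)
    y, cb, cr = ycbcr[..., 0], ycbcr[..., 1], ycbcr[..., 2]

    # Classic skin-region box in the Cb/Cr plane; values are illustrative assumptions.
    skin = (y > 80) & (cb > 77) & (cb < 127) & (cr > 133) & (cr < 173)
    return float(skin.mean())
```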
Video Features
| Feature | Type | Description |
|---|---|---|
| Keyframe embeddings | Embedding | Sample frames, pool embeddings |
| Audio transcript | Text | Speech-to-text for audio analysis |
| Scene changes | Numerical | Rapid cuts (common in harmful content) |
| Audio features | Embedding | Screaming, gunshots, music type |
| Thumbnail | Image | Often designed to attract clicks |
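Keyframe embeddings are typically produced by sampling frames and pooling their image embeddings into one vector; a minimal sketch, where the frame-embedding function and the sampling stride are assumptions:

```python
import numpy as np

def pool_keyframe_embeddings(frames, embed_frame, stride: int = 30) -> np.ndarray:
    """Sample every `stride`-th frame, embed each with `embed_frame` (any image
    embedding model, assumed here), and mean-pool into a single video vector."""
    keyframes = frames[::stride]                          # crude keyframe sampling
    embeddings = np.stack([embed_frame(f) for f in keyframes])
    return embeddings.mean(axis=0)                        # simple mean pooling over frames
```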
Cross-Modal Features
| Feature | Type | Description |
|---|---|---|
| Text-image alignment | Numerical | Does text match image content? |
| Sarcasm indicators | Binary | Text tone vs image content mismatch |
| Engagement prediction | Numerical | Likely to go viral? |
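One common way to score text-image alignment is cosine similarity between CLIP text and image embeddings; a sketch using the Hugging Face `transformers` CLIP classes (the checkpoint choice is an assumption, and any dual-encoder model would work):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint is an illustrative assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_image_alignment(caption: str, image_path: str) -> float:
    """Cosine similarity between caption and image embeddings (higher = better aligned)."""
    image = Image.open(image_path)
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(text_emb, image_emb).item()
```

A low alignment score on a benign-looking image with charged text (or vice versa) is a useful signal to route the item to the deeper multimodal model.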
Model Architecture
Multi-Task Learning
Train a single model with multiple heads for different policy violations:
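A minimal PyTorch sketch of a shared encoder with one classification head per harm category (the encoder, hidden size, and category list are illustrative assumptions):

```python
import torch
import torch.nn as nn

CATEGORIES = ["violence", "hate_speech", "sexual", "spam", "misinfo", "harassment"]

class MultiTaskModerationModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.encoder = encoder                      # shared text/image/video encoder (assumed)
        self.heads = nn.ModuleDict({cat: nn.Linear(hidden_dim, 1) for cat in CATEGORIES})

    def forward(self, inputs) -> dict:
        shared = self.encoder(inputs)               # (batch, hidden_dim) pooled representation
        # One sigmoid score per policy category from a single forward pass.
        return {cat: torch.sigmoid(head(shared)).squeeze(-1)
                for cat, head in self.heads.items()}

# Training would sum per-head binary cross-entropy losses, masking categories
# that are unlabeled for a given example.
```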
Benefits of multi-task learning:
- Shared representations across related tasks
- More training signal from each example
- Faster inference (one forward pass)
Handling Adversarial Content
Bad actors attempt to evade detection:
| Evasion Technique | Detection Approach |
|---|---|
| Leetspeak (h4te) | Character normalization, learned spelling variants |
| Text in images | OCR pipeline |
| Unicode tricks | Unicode normalization |
| Homoglyphs (Greek capital eta 'Η' for Latin 'H') | Confusable character mapping |
| Word splitting (ha te) | Character-level models |
| Coded language | Monitor emerging terms, user reports |
Train with adversarial examples. Augment data with common evasion patterns.
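A sketch of normalization applied before tokenization, folding several of the evasion techniques above; the leetspeak and homoglyph maps are small illustrative samples (real ones are much larger), and word splitting is left to character-level models:

```python
import re
import unicodedata

# Illustrative samples only; production maps cover far more characters and terms.
LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o", "$": "s", "@": "a"})
HOMOGLYPHS = {"\u0397": "H", "\u0410": "A", "\u0415": "E", "\u041E": "O"}  # Greek/Cyrillic lookalikes

def normalize_for_detection(text: str) -> str:
    """Fold common evasion tricks before tokenization."""
    text = unicodedata.normalize("NFKC", text)              # unicode tricks
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)   # confusable characters
    text = text.translate(LEET_MAP)                          # common leetspeak substitutions
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)               # "haaaate" -> "haate"
    return text.lower()

print(normalize_for_detection("h4te"))  # -> "hate"
```

The same normalization function can also be applied to training data as an augmentation, so the model sees both raw and evasion-style variants of each example.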
Near-Duplicate Detection
Previously removed content often gets re-uploaded. Use perceptual hashing:
Perceptual hashes are resistant to minor modifications (cropping, compression, color changes).
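A minimal average-hash sketch using Pillow and numpy; a production system would use a hardened perceptual hash (e.g., pHash or PDQ) and an indexed lookup rather than a linear scan, and the distance threshold here is an illustrative assumption:

```python
import numpy as np
from PIL import Image

def average_hash(path: str, size: int = 8) -> int:
    """Downscale to 8x8 grayscale and threshold against the mean: a 64-bit perceptual hash."""
    pixels = np.asarray(Image.open(path).convert("L").resize((size, size)), dtype=np.float32)
    bits = (pixels > pixels.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming_distance(h1: int, h2: int) -> int:
    return bin(h1 ^ h2).count("1")

def is_known_violation(path: str, banned_hashes: set, max_distance: int = 5) -> bool:
    """Linear scan for illustration; production systems index hashes for sublinear lookup."""
    h = average_hash(path)
    return any(hamming_distance(h, banned) <= max_distance for banned in banned_hashes)
```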
Serving
Two-Phase Architecture
Content arrives in bulk but needs different latency guarantees:
Fast path: Catches obvious violations immediately (known hashes, high-confidence text).
Deep path: Runs full multimodal analysis for uncertain content.
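A sketch of the routing decision between the two paths; the thresholds, the `predict`/`enqueue` interfaces, and the policy of serving uncertain content while the deep path runs are all assumptions:

```python
def route_content(item, banned_hashes, fast_text_model, deep_queue,
                  block_threshold=0.98, clear_threshold=0.05):
    """Fast path: hash match or a confident lightweight score decides immediately.
    Anything uncertain is enqueued for the slower multimodal model."""
    if item.perceptual_hash in banned_hashes:
        return "remove"                           # known violating content

    score = fast_text_model.predict(item.text)    # cheap model, runs inline within the latency budget
    if score >= block_threshold:
        return "remove"
    if score <= clear_threshold:
        return "allow"

    deep_queue.enqueue(item)                      # full multimodal analysis, asynchronous
    return "allow_pending_review"                 # serve content while the deep path runs
```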
Regional Deployment
Content policies vary by jurisdiction:
Same base model, different thresholds per region.
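A sketch of per-region thresholds applied to the same model scores; the region codes and values are illustrative assumptions, with real values set by per-region policy and legal review:

```python
# Illustrative thresholds only.
REGIONAL_THRESHOLDS = {
    "default": {"hate_speech": 0.90, "violence": 0.85},
    "DE":      {"hate_speech": 0.70, "violence": 0.85},   # stricter hate-speech law (e.g., NetzDG)
    "US":      {"hate_speech": 0.92, "violence": 0.85},
}

def violated_categories(scores: dict, region: str) -> list:
    """Return the categories whose scores exceed this region's thresholds."""
    thresholds = REGIONAL_THRESHOLDS.get(region, REGIONAL_THRESHOLDS["default"])
    return [cat for cat, score in scores.items() if score >= thresholds.get(cat, 1.0)]

# Same score, different outcome by region:
# violated_categories({"hate_speech": 0.8}, "DE") -> ["hate_speech"]
# violated_categories({"hate_speech": 0.8}, "US") -> []
```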
Human Review Workflow
ML alone cannot handle everything. Build a robust human review pipeline with the tooling and feedback loop described below.
Reviewer Tools
Provide reviewers:
- Original content + context
- ML model explanation (why the content was flagged)
- User history and account signals
- Clear policy guidelines
- One-click decisions with required rationale
Feedback Loop
Reviewer decisions improve the model:
- Reviewer labels content
- Labels feed back to training data
- Model retrained weekly
- Monitor reviewer agreement (inter-rater reliability)
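Reviewer agreement can be tracked with Cohen's kappa on overlapping review samples; a minimal sketch using scikit-learn, with illustrative labels:

```python
from sklearn.metrics import cohen_kappa_score

# Decisions from two reviewers on the same sampled content (illustrative labels).
reviewer_a = ["remove", "allow", "remove", "allow", "remove"]
reviewer_b = ["remove", "allow", "allow",  "allow", "remove"]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")   # persistently low agreement suggests unclear policy guidelines
```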
Monitoring
Real-Time Dashboards
| Metric | Purpose | Alert Threshold |
|---|---|---|
| Flagging rate | Model behavior shift | > 2x baseline |
| Precision (sampled) | Quality check | < 90% |
| Review queue depth | Capacity planning | > 1 hour backlog |
| Appeal rate | User satisfaction | > 5% of decisions |
| Latency p99 | System health | > 500ms |
Emerging Threat Detection
New harmful content patterns emerge constantly:
Monitor user reports for spikes. Cluster similar reports to identify new patterns.
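One way to surface new patterns is to cluster embeddings of recent user reports and flag dense new clusters for analyst review; a sketch using DBSCAN, where the embedding source, `eps`, and `min_samples` are assumptions that need tuning:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def find_emerging_patterns(report_embeddings: np.ndarray, reports: list,
                           eps: float = 0.3, min_samples: int = 10) -> dict:
    """Group similar user reports; a dense new cluster suggests an emerging harm pattern.
    `report_embeddings` come from any text encoder (assumed here)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(report_embeddings)
    clusters = {}
    for label, report in zip(labels, reports):
        if label != -1:                      # -1 is DBSCAN noise
            clusters.setdefault(label, []).append(report)
    return clusters                          # large clusters go to policy analysts for labeling
```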
Reference
| Topic | Description |
|---|---|
| Precision vs recall | Removing more harm risks silencing legitimate speech. Balance based on harm severity. |
| Speed vs accuracy | Fast lightweight model vs. slow accurate model. Use the fast model as an initial filter. |
| Automation vs human review | Scale vs. nuance and context. Reserve review for uncertain decisions. |
| Global vs local | Consistent policy vs. cultural sensitivity. Same model, different thresholds. |
| Transparency vs gaming | Explaining decisions helps users but helps bad actors evade. |
| Context handling | Include surrounding content (parent post, thread history, account history). Train on full context. |
| New harm types | User reports + anomaly detection surface new patterns. Analysts label examples. Rapid model update pipeline. |