Design a Content Moderation System
Design a machine learning system to detect and remove harmful content (hate speech, violence, spam, misinformation) across text, images, and video at scale.
Requirements
Functional:
- Classify content across multiple harm categories
- Handle text, images, video, and combinations
- Support human review workflow
- Adapt to emerging threats quickly
- Respect regional legal requirements
Non-functional:
- Latency < 500ms for user-facing content
- Process 1M+ pieces of content per hour
- High precision (don't remove legitimate content)
- Reasonable recall (catch most violations)
Metrics
Model Metrics
| Metric | Description | Target |
|---|---|---|
| Precision | Violations caught / Flagged as violation | > 95% |
| Recall | Violations caught / Total violations | > 80% |
| False Positive Rate | Good content removed / Total good content | < 0.1% |
| Latency | Time to classify | p99 < 500ms |
Business Metrics
| Metric | Description |
|---|---|
| Prevalence | Harmful content seen by users / Total content views |
| User reports | Reports per million views |
| Appeal success rate | Overturned decisions / Total appeals |
| Reviewer efficiency | Content reviewed per hour |
Architecture
Harm Categories
| Category | Examples | Challenges |
|---|---|---|
| Violence | Gore, threats, self-harm | Medical/news exceptions |
| Hate Speech | Slurs, dehumanization | Cultural context, reclaimed terms |
| Sexual Content | Nudity, exploitation | Art vs pornography |
| Spam | Scams, fake engagement | Adversarial adaptation |
| Misinformation | False health claims, election interference | Requires fact-checking |
| Harassment | Bullying, doxxing | Relationship context matters |
Each category may require its own model with category-specific features and thresholds.
Feature Engineering
Text Features
| Feature | Type | Description |
|---|---|---|
| Token embeddings | Embedding | BERT/RoBERTa contextual embeddings |
| Toxicity lexicon | Numerical | Match against known toxic terms |
| Named entities | Categorical | People, groups, locations mentioned |
| Sentiment | Numerical | Positive/negative/neutral |
| All caps ratio | Numerical | Shouting indicator |
| Punctuation density | Numerical | Excessive !!! or ??? |
| Character repetition | Numerical | Obfuscation like "haaaate" |
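Several of the surface-level features above are cheap to compute directly; a minimal sketch follows (the regexes and the decision to treat runs of three or more repeated characters as obfuscation are illustrative assumptions; embeddings and lexicon matches would come from separate models and term lists):

```python
import re

def surface_text_features(text: str) -> dict:
    """Cheap lexical signals: shouting, punctuation abuse, character stretching."""
    letters = [c for c in text if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)

    # Density of exclamation/question marks relative to text length.
    punct_density = len(re.findall(r"[!?]", text)) / max(len(text), 1)

    # Obfuscation via stretched characters, e.g. "haaaate" -> runs of 3+ repeats.
    repeat_runs = len(re.findall(r"(.)\1{2,}", text))

    return {
        "all_caps_ratio": caps_ratio,
        "punctuation_density": punct_density,
        "char_repetition_runs": repeat_runs,
    }

print(surface_text_features("I HAAAATE this!!! Why??"))
```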
Image Features
| Feature | Type | Description |
|---|---|---|
| Image embeddings | Embedding | ViT or ResNet features |
| Object detection | Categorical | Weapons, nudity, violence indicators |
| OCR text | Text | Extract text for text classification |
| Perceptual hash | Hash | Match against known violating images |
| Face detection | Numerical | Number of faces, ages |
| Skin pixel ratio | Numerical | Fraction of skin-colored pixels; crude NSFW indicator |
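The skin pixel ratio can be approximated with a simple color-space heuristic; a minimal sketch, assuming a YCbCr threshold box (the threshold values are common rules of thumb, not tuned constants, and a real system would rely on a learned NSFW classifier):

```python
import numpy as np
from PIL import Image

def skin_pixel_ratio(path: str) -> float:
    """Fraction of pixels falling in a rough skin-tone range (YCbCr heuristic)."""
    img = Image.open(path).convert("YCbCr")
    ycbcr = np.asarray(img, dtype=np.float32)
    y, cb, cr = ycbcr[..., 0], ycbcr[..., 1], ycbcr[..., 2]

    # Classic skin-region box in the Cb/Cr plane; values are illustrative assumptions.
    skin = (y > 80) & (cb > 77) & (cb < 127) & (cr > 133) & (cr < 173)
    return float(skin.mean())
```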
Video Features
| Feature | Type | Description |
|---|---|---|
| Keyframe embeddings | Embedding | Sample frames, pool embeddings |
| Audio transcript | Text | Speech-to-text for audio analysis |
| Scene changes | Numerical | Rapid cuts (common in harmful content) |
| Audio features | Embedding | Screaming, gunshots, music type |
| Thumbnail | Image | Often designed to attract clicks |
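Keyframe embeddings are typically produced by sampling frames and pooling their image embeddings into one vector; a minimal sketch, where the frame-embedding function and the sampling stride are assumptions:

```python
import numpy as np

def pool_keyframe_embeddings(frames, embed_frame, stride: int = 30) -> np.ndarray:
    """Sample every `stride`-th frame, embed each with `embed_frame` (any image
    embedding model, assumed here), and mean-pool into a single video vector."""
    keyframes = frames[::stride]                          # crude keyframe sampling
    embeddings = np.stack([embed_frame(f) for f in keyframes])
    return embeddings.mean(axis=0)                        # simple mean pooling over frames
```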
Cross-Modal Features
| Feature | Type | Description |
|---|---|---|
| Text-image alignment | Numerical | Does text match image content? |
| Sarcasm indicators | Binary | Text tone vs image content mismatch |
| Engagement prediction | Numerical | Likely to go viral? |
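One common way to score text-image alignment is cosine similarity between CLIP text and image embeddings; a sketch using the Hugging Face `transformers` CLIP classes (the checkpoint choice is an assumption, and any dual-encoder model would work):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint is an illustrative assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_image_alignment(caption: str, image_path: str) -> float:
    """Cosine similarity between caption and image embeddings (higher = better aligned)."""
    image = Image.open(image_path)
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(text_emb, image_emb).item()
```

A low alignment score on a benign-looking image with charged text (or vice versa) is a useful signal to route the item to the deeper multimodal model.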
Model Architecture
Multi-Task Learning
Train a single model with multiple heads for different policy violations:
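A minimal PyTorch sketch of a shared encoder with one classification head per harm category (the encoder, hidden size, and category list are illustrative assumptions):

```python
import torch
import torch.nn as nn

CATEGORIES = ["violence", "hate_speech", "sexual", "spam", "misinfo", "harassment"]

class MultiTaskModerationModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.encoder = encoder                      # shared text/image/video encoder (assumed)
        self.heads = nn.ModuleDict({cat: nn.Linear(hidden_dim, 1) for cat in CATEGORIES})

    def forward(self, inputs) -> dict:
        shared = self.encoder(inputs)               # (batch, hidden_dim) pooled representation
        # One sigmoid score per policy category from a single forward pass.
        return {cat: torch.sigmoid(head(shared)).squeeze(-1)
                for cat, head in self.heads.items()}

# Training would sum per-head binary cross-entropy losses, masking categories
# that are unlabeled for a given example.
```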
Benefits of multi-task learning:
- Shared representations across related tasks
- More training signal from each example
- Faster inference (one forward pass)
Handling Adversarial Content
Bad actors attempt to evade detection:
| Evasion Technique | Detection Approach |
|---|---|
| Leetspeak (h4te) | Character normalization, learned spelling variants |
| Text in images | OCR pipeline |
| Unicode tricks | Unicode normalization |
| Homoglyphs (Greek capital eta 'Η' for Latin 'H') | Confusable character mapping |
| Word splitting (ha te) | Character-level models |
| Coded language | Monitor emerging terms, user reports |
Train with adversarial examples. Augment data with common evasion patterns.
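A sketch of normalization applied before tokenization, folding several of the evasion techniques above; the leetspeak and homoglyph maps are small illustrative samples (real ones are much larger), and word splitting is left to character-level models:

```python
import re
import unicodedata

# Illustrative samples only; production maps cover far more characters and terms.
LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o", "$": "s", "@": "a"})
HOMOGLYPHS = {"\u0397": "H", "\u0410": "A", "\u0415": "E", "\u041E": "O"}  # Greek/Cyrillic lookalikes

def normalize_for_detection(text: str) -> str:
    """Fold common evasion tricks before tokenization."""
    text = unicodedata.normalize("NFKC", text)              # unicode tricks
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)   # confusable characters
    text = text.translate(LEET_MAP)                          # common leetspeak substitutions
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)               # "haaaate" -> "haate"
    return text.lower()

print(normalize_for_detection("h4te"))  # -> "hate"
```

The same normalization function can also be applied to training data as an augmentation, so the model sees both raw and evasion-style variants of each example.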
Near-Duplicate Detection
Previously removed content often gets re-uploaded. Use perceptual hashing:
Perceptual hashes are resistant to minor modifications (cropping, compression, color changes).
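A minimal average-hash sketch using Pillow and numpy; a production system would use a hardened perceptual hash (e.g., pHash or PDQ) and an indexed lookup rather than a linear scan, and the distance threshold here is an illustrative assumption:

```python
import numpy as np
from PIL import Image

def average_hash(path: str, size: int = 8) -> int:
    """Downscale to 8x8 grayscale and threshold against the mean: a 64-bit perceptual hash."""
    pixels = np.asarray(Image.open(path).convert("L").resize((size, size)), dtype=np.float32)
    bits = (pixels > pixels.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def hamming_distance(h1: int, h2: int) -> int:
    return bin(h1 ^ h2).count("1")

def is_known_violation(path: str, banned_hashes: set, max_distance: int = 5) -> bool:
    """Linear scan for illustration; production systems index hashes for sublinear lookup."""
    h = average_hash(path)
    return any(hamming_distance(h, banned) <= max_distance for banned in banned_hashes)
```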
Serving
Two-Phase Architecture
Content arrives in bulk but needs different latency guarantees:
Fast path: Catches obvious violations immediately (known hashes, high-confidence text).
Deep path: Runs full multimodal analysis for uncertain content.
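A sketch of the routing decision between the two paths; the thresholds, the `predict`/`enqueue` interfaces, and the policy of serving uncertain content while the deep path runs are all assumptions:

```python
def route_content(item, banned_hashes, fast_text_model, deep_queue,
                  block_threshold=0.98, clear_threshold=0.05):
    """Fast path: hash match or a confident lightweight score decides immediately.
    Anything uncertain is enqueued for the slower multimodal model."""
    if item.perceptual_hash in banned_hashes:
        return "remove"                           # known violating content

    score = fast_text_model.predict(item.text)    # cheap model, runs inline within the latency budget
    if score >= block_threshold:
        return "remove"
    if score <= clear_threshold:
        return "allow"

    deep_queue.enqueue(item)                      # full multimodal analysis, asynchronous
    return "allow_pending_review"                 # serve content while the deep path runs
```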
Regional Deployment
Content policies vary by jurisdiction:
Same base model, different thresholds per region.
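A sketch of per-region thresholds applied to the same model scores; the region codes and values are illustrative assumptions, with real values set by per-region policy and legal review:

```python
# Illustrative thresholds only.
REGIONAL_THRESHOLDS = {
    "default": {"hate_speech": 0.90, "violence": 0.85},
    "DE":      {"hate_speech": 0.70, "violence": 0.85},   # stricter hate-speech law (e.g., NetzDG)
    "US":      {"hate_speech": 0.92, "violence": 0.85},
}

def violated_categories(scores: dict, region: str) -> list:
    """Return the categories whose scores exceed this region's thresholds."""
    thresholds = REGIONAL_THRESHOLDS.get(region, REGIONAL_THRESHOLDS["default"])
    return [cat for cat, score in scores.items() if score >= thresholds.get(cat, 1.0)]

# Same score, different outcome by region:
# violated_categories({"hate_speech": 0.8}, "DE") -> ["hate_speech"]
# violated_categories({"hate_speech": 0.8}, "US") -> []
```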
Human Review Workflow
ML alone cannot handle everything. Build a robust human review pipeline with the tooling and feedback loop described below.
Reviewer Tools
Provide reviewers:
- Original content + context
- ML model explanation (why the content was flagged)
- User history and account signals
- Clear policy guidelines
- One-click decisions with required rationale
Feedback Loop
Reviewer decisions improve the model:
- Reviewer labels content
- Labels feed back to training data
- Model retrained weekly
- Monitor reviewer agreement (inter-rater reliability)
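Reviewer agreement can be tracked with Cohen's kappa on overlapping review samples; a minimal sketch using scikit-learn, with illustrative labels:

```python
from sklearn.metrics import cohen_kappa_score

# Decisions from two reviewers on the same sampled content (illustrative labels).
reviewer_a = ["remove", "allow", "remove", "allow", "remove"]
reviewer_b = ["remove", "allow", "allow",  "allow", "remove"]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")   # persistently low agreement suggests unclear policy guidelines
```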
Monitoring
Real-Time Dashboards
| Metric | Purpose | Alert Threshold |
|---|---|---|
| Flagging rate | Model behavior shift | > 2x baseline |
| Precision (sampled) | Quality check | < 90% |
| Review queue depth | Capacity planning | > 1 hour backlog |
| Appeal rate | User satisfaction | > 5% of decisions |
| Latency p99 | System health | > 500ms |
Emerging Threat Detection
New harmful content patterns emerge constantly:
Monitor user reports for spikes. Cluster similar reports to identify new patterns.
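One way to surface new patterns is to cluster embeddings of recent user reports and flag dense new clusters for analyst review; a sketch using DBSCAN, where the embedding source, `eps`, and `min_samples` are assumptions that need tuning:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def find_emerging_patterns(report_embeddings: np.ndarray, reports: list,
                           eps: float = 0.3, min_samples: int = 10) -> dict:
    """Group similar user reports; a dense new cluster suggests an emerging harm pattern.
    `report_embeddings` come from any text encoder (assumed here)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(report_embeddings)
    clusters = {}
    for label, report in zip(labels, reports):
        if label != -1:                      # -1 is DBSCAN noise
            clusters.setdefault(label, []).append(report)
    return clusters                          # large clusters go to policy analysts for labeling
```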
Reference
| Topic | Description |
|---|---|
| Precision vs recall | Removing more harm risks silencing legitimate speech. Balance based on harm severity. |
| Speed vs accuracy | Fast lightweight model vs. slow accurate model. Use the fast model as an initial filter. |
| Automation vs human review | Scale vs. nuance and context. Reserve review for uncertain decisions. |
| Global vs local | Consistent policy vs. cultural sensitivity. Same model, different thresholds. |
| Transparency vs gaming | Explaining decisions helps users but helps bad actors evade. |
| Context handling | Include surrounding content (parent post, thread history, account history). Train on full context. |
| New harm types | User reports + anomaly detection surface new patterns. Analysts label examples. Rapid model update pipeline. |