Design a Content Moderation System

Design a machine learning system to detect and remove harmful content (hate speech, violence, spam, misinformation) across text, images, and video at scale.

Requirements

Functional:

  • Classify content across multiple harm categories
  • Handle text, images, video, and combinations
  • Support human review workflow
  • Adapt to emerging threats quickly
  • Respect regional legal requirements

Non-functional:

  • Latency < 500ms for user-facing content
  • Process 1M+ pieces of content per hour
  • High precision (don't remove legitimate content)
  • Reasonable recall (catch most violations)

Metrics

Model Metrics

| Metric | Description | Target |
|---|---|---|
| Precision | Violations caught / Flagged as violation | > 95% |
| Recall | Violations caught / Total violations | > 80% |
| False Positive Rate | Good content removed / Total good content | < 0.1% |
| Latency | Time to classify | p99 < 500ms |
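
The first three metrics fall out of confusion-matrix counts. A minimal Python sketch, with made-up counts for illustration:

```python
def moderation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Model metrics from confusion-matrix counts.

    tp: violations correctly flagged, fp: good content flagged,
    fn: violations missed, tn: good content correctly passed.
    """
    return {
        "precision": tp / (tp + fp),            # flagged content that truly violates
        "recall": tp / (tp + fn),               # violations that were caught
        "false_positive_rate": fp / (fp + tn),  # good content incorrectly flagged
    }

# Illustrative counts: 950 true hits, 50 false flags, 200 misses, 99,800 clean passes
print(moderation_metrics(950, 50, 200, 99_800))
# {'precision': 0.95, 'recall': 0.826..., 'false_positive_rate': 0.0005...}
```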

Business Metrics

| Metric | Description |
|---|---|
| Prevalence | Harmful content seen by users / Total content views |
| User reports | Reports per million views |
| Appeal success rate | Overturned decisions / Total appeals |
| Reviewer efficiency | Content reviewed per hour |

Architecture

[Architecture diagram]

Harm Categories

| Category | Examples | Challenges |
|---|---|---|
| Violence | Gore, threats, self-harm | Medical/news exceptions |
| Hate Speech | Slurs, dehumanization | Cultural context, reclaimed terms |
| Sexual Content | Nudity, exploitation | Art vs. pornography |
| Spam | Scams, fake engagement | Adversarial adaptation |
| Misinformation | False health claims, election interference | Requires fact-checking |
| Harassment | Bullying, doxxing | Relationship context matters |

Each category may require its own model with category-specific features and thresholds.

Feature Engineering

Text Features

| Feature | Type | Description |
|---|---|---|
| Token embeddings | Embedding | BERT/RoBERTa contextual embeddings |
| Toxicity lexicon | Numerical | Match against known toxic terms |
| Named entities | Categorical | People, groups, locations mentioned |
| Sentiment | Numerical | Positive/negative/neutral |
| All caps ratio | Numerical | Shouting indicator |
| Punctuation density | Numerical | Excessive !!! or ??? |
| Character repetition | Numerical | Obfuscation like "haaaate" |
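
The lexical features in the bottom half of the table are cheap to compute at serving time. A sketch of how they might be extracted (the exact feature names and definitions here are assumptions, not a fixed spec):

```python
import re

def heuristic_text_features(text: str) -> dict:
    """Lightweight lexical signals from the table above (illustrative sketch)."""
    letters = [c for c in text if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
    punct_density = len(re.findall(r"[!?]", text)) / max(len(text), 1)
    # Longest run of one repeated character, e.g. "haaaate" -> 4
    longest_run = max((len(m.group()) for m in re.finditer(r"(.)\1*", text)), default=0)
    return {
        "all_caps_ratio": caps_ratio,
        "punctuation_density": punct_density,
        "char_repetition": longest_run,
    }

print(heuristic_text_features("I HAAAATE this!!!"))
```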

Image Features

| Feature | Type | Description |
|---|---|---|
| Image embeddings | Embedding | ViT or ResNet features |
| Object detection | Categorical | Weapons, nudity, violence indicators |
| OCR text | Text | Extract text for text classification |
| Perceptual hash | Hash | Match against known violating images |
| Face detection | Numerical | Number of faces, ages |
| Skin tone ratio | Numerical | NSFW indicator |

Video Features

| Feature | Type | Description |
|---|---|---|
| Keyframe embeddings | Embedding | Sample frames, pool embeddings |
| Audio transcript | Text | Speech-to-text for audio analysis |
| Scene changes | Numerical | Rapid cuts (common in harmful content) |
| Audio features | Embedding | Screaming, gunshots, music type |
| Thumbnail | Image | Often designed to attract clicks |

Cross-Modal Features

| Feature | Type | Description |
|---|---|---|
| Text-image alignment | Numerical | Does text match image content? |
| Sarcasm indicators | Binary | Text tone vs. image content mismatch |
| Engagement prediction | Numerical | Likely to go viral? |
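
Text-image alignment can be scored with a joint embedding model. CLIP is one common choice, though the table above does not prescribe a model; a sketch using the Hugging Face transformers API and the openai/clip-vit-base-patch32 checkpoint:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_image_alignment(caption: str, image: Image.Image) -> float:
    """Cosine similarity between CLIP text and image embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(text_emb, image_emb).item()
```

A low alignment score on its own proves nothing; it is one feature among many fed to the classifier.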

Model Architecture

Multi-Task Learning

Train a single model with multiple heads for different policy violations:

[Diagram: shared encoder feeding per-category classification heads]

Benefits of multi-task learning:

  • Shared representations across related tasks
  • More training signal from each example
  • Faster inference (one forward pass)
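
In PyTorch, the shared-encoder, per-head structure might look like the following sketch; the encoder choice, head sizes, and category names are illustrative:

```python
import torch.nn as nn
from transformers import AutoModel

class MultiTaskModerationModel(nn.Module):
    """Shared text encoder with one binary classification head per harm category."""

    CATEGORIES = ["violence", "hate_speech", "sexual", "spam",
                  "misinformation", "harassment"]

    def __init__(self, encoder_name: str = "roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict(
            {cat: nn.Linear(hidden, 1) for cat in self.CATEGORIES}
        )

    def forward(self, input_ids, attention_mask):
        # One forward pass through the shared encoder...
        pooled = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]  # first-token (<s>/[CLS]) representation
        # ...then a cheap linear head per category
        return {cat: head(pooled).squeeze(-1) for cat, head in self.heads.items()}
```

At training time, sum a binary cross-entropy loss per head, masking heads that have no label for a given example so every labeled example still contributes signal.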

Handling Adversarial Content

Bad actors attempt to evade detection:

| Evasion Technique | Detection Approach |
|---|---|
| Leetspeak ("h4te") | Character normalization, learned spelling variants |
| Text in images | OCR pipeline |
| Unicode tricks | Unicode normalization |
| Homoglyphs (Greek capital eta for Latin H) | Confusable character mapping |
| Word splitting ("ha te") | Character-level models |
| Coded language | Monitor emerging terms, user reports |

Train with adversarial examples and augment training data with common evasion patterns.
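
A normalization pass can fold several of these evasions away before text classification. A sketch, where the leetspeak map is a small illustrative subset (real systems learn variants from data, and true homoglyphs need a dedicated confusables table):

```python
import re
import unicodedata

# Small illustrative leetspeak subset
LEET_MAP = str.maketrans("013457$@", "oieastsa")

def normalize(text: str) -> str:
    """Fold common evasion patterns before classification (illustrative sketch)."""
    # Unicode tricks: NFKC folds width/compatibility variants; homoglyphs like
    # Greek capital eta still need a separate confusable-character mapping
    text = unicodedata.normalize("NFKC", text).casefold()
    text = text.translate(LEET_MAP)             # leetspeak: "h4te" -> "hate"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # repetition: "haaaate" -> "haate"
    return text

print(normalize("H4TE"))  # -> "hate"
```

Word splitting ("ha te") resists simple rules, which is why the table points to character-level models instead.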

Near-Duplicate Detection

Previously removed content often gets re-uploaded. Use perceptual hashing:

[Diagram: near-duplicate detection via perceptual hashing]

Perceptual hashes are resistant to minor modifications (cropping, compression, color changes).
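
The open-source imagehash library implements pHash and makes Hamming-distance comparison trivial. A sketch, where the file paths, the linear scan, and the distance cutoff of 8 are assumptions; a production system would use an indexed store (e.g., a BK-tree) rather than scanning every banned hash:

```python
import imagehash  # pip install imagehash
from PIL import Image

# Hashes of previously removed images (in production: an indexed hash store)
BANNED_HASHES = {
    imagehash.phash(Image.open(p)) for p in ["removed1.jpg", "removed2.jpg"]
}

def is_near_duplicate(path: str, max_distance: int = 8) -> bool:
    """True if the image is within Hamming distance of a known violating image."""
    h = imagehash.phash(Image.open(path))
    return any(h - banned <= max_distance for banned in BANNED_HASHES)
```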

Serving

Two-Phase Architecture

Content arrives in bulk but needs different latency guarantees:

[Diagram: two-phase serving (fast path and deep path)]

Fast path: Catches obvious violations immediately (known hashes, high-confidence text).

Deep path: Runs full multimodal analysis for uncertain content.
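
The routing logic between the two phases might look like this sketch. hash_match, fast_text_score, and enqueue_deep_scan are hypothetical helpers standing in for the hash index, the lightweight model, and the async queue, and the thresholds are illustrative:

```python
def moderate(content) -> str:
    """Route content through the two-phase pipeline (illustrative sketch)."""
    # Fast path: known violating hashes and high-confidence lightweight scoring
    if hash_match(content):
        return "remove"
    score = fast_text_score(content)
    if score > 0.98:
        return "remove"
    if score < 0.02:
        return "allow"
    # Deep path: full multimodal analysis runs asynchronously for the
    # uncertain middle band, so user-facing latency stays within budget
    enqueue_deep_scan(content)
    return "allow_pending_review"
```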

Regional Deployment

Content policies vary by jurisdiction:

[Diagram: per-region policy configuration]

Same base model, different thresholds per region.
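
Concretely, per-region thresholds can be plain configuration applied to the same model scores. The regions, categories, and values below are hypothetical:

```python
# Hypothetical decision thresholds; one base model, regional overrides
REGION_THRESHOLDS = {
    "default": {"hate_speech": 0.90, "violence": 0.85},
    "DE":      {"hate_speech": 0.70, "violence": 0.85},  # stricter hate-speech law
}

def violated_categories(scores: dict, region: str) -> list:
    """Categories whose score crosses the threshold for this region."""
    thresholds = REGION_THRESHOLDS.get(region, REGION_THRESHOLDS["default"])
    return [cat for cat, t in thresholds.items() if scores.get(cat, 0.0) >= t]
```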

Human Review Workflow

ML alone cannot handle everything. Build a robust human review pipeline:

[Diagram: human review workflow]

Reviewer Tools

Provide reviewers with:

  • Original content + context
  • ML model explanation (why it flagged)
  • User history and account signals
  • Clear policy guidelines
  • One-click decisions with required rationale

Feedback Loop

Reviewer decisions improve the model:

  1. Reviewer labels content
  2. Labels feed back to training data
  3. Model retrained weekly
  4. Monitor reviewer agreement (inter-rater reliability)
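
Inter-rater reliability is commonly measured with Cohen's kappa, which scikit-learn provides directly. A small sketch with made-up labels:

```python
from sklearn.metrics import cohen_kappa_score

# Two reviewers' labels on the same content sample (1 = violation)
reviewer_a = [1, 0, 1, 1, 0, 0, 1, 0]
reviewer_b = [1, 0, 1, 0, 0, 0, 1, 1]

kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 0.50: only moderate agreement here
```

Low kappa suggests ambiguous policy guidelines or inconsistent training, and it caps the quality of labels fed back into the model.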

Monitoring

Real-Time Dashboards

| Metric | Purpose | Alert Threshold |
|---|---|---|
| Flagging rate | Model behavior shift | > 2x baseline |
| Precision (sampled) | Quality check | < 90% |
| Review queue depth | Capacity planning | > 1 hour backlog |
| Appeal rate | User satisfaction | > 5% of decisions |
| Latency p99 | System health | > 500ms |
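
The alert thresholds in the table reduce to a simple check; the metric keys below are assumptions about how the dashboard names things:

```python
def dashboard_alerts(metrics: dict, baseline_flagging_rate: float) -> list:
    """Names of metrics that breach the alert thresholds in the table above."""
    alerts = []
    if metrics["flagging_rate"] > 2 * baseline_flagging_rate:
        alerts.append("flagging_rate")
    if metrics["sampled_precision"] < 0.90:
        alerts.append("sampled_precision")
    if metrics["queue_backlog_hours"] > 1.0:
        alerts.append("review_queue_depth")
    if metrics["appeal_rate"] > 0.05:
        alerts.append("appeal_rate")
    if metrics["latency_p99_ms"] > 500:
        alerts.append("latency_p99")
    return alerts
```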

Emerging Threat Detection

New harmful content patterns emerge constantly:

[Diagram: emerging threat detection]

Monitor user reports for spikes. Cluster similar reports to identify new patterns.
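
Report clustering could be sketched with sentence embeddings plus density-based clustering; the embedding model and DBSCAN parameters here are assumptions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

def cluster_reports(report_texts: list) -> dict:
    """Group similar user reports; a dense new cluster suggests an emerging pattern."""
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(report_texts)
    labels = DBSCAN(eps=0.3, min_samples=5, metric="cosine").fit_predict(embeddings)
    clusters = {}
    for text, label in zip(report_texts, labels):
        if label != -1:  # -1 = noise / unclustered
            clusters.setdefault(label, []).append(text)
    return clusters
```

Clusters that did not exist last week are candidates for analyst labeling and a rapid model update.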

Reference

| Topic | Description |
|---|---|
| Precision vs. recall | Removing more harm risks silencing legitimate speech. Balance based on harm severity. |
| Speed vs. accuracy | Fast lightweight model vs. slow accurate model. Use the fast model as an initial filter. |
| Automation vs. human review | Scale vs. nuance and context. Reserve review for uncertain decisions. |
| Global vs. local | Consistent policy vs. cultural sensitivity. Same model, different thresholds. |
| Transparency vs. gaming | Explaining decisions helps users but also helps bad actors evade. |
| Context handling | Include surrounding content (parent post, thread history, account history). Train on full context. |
| New harm types | User reports + anomaly detection surface new patterns. Analysts label examples. Rapid model-update pipeline. |