Design a Bot Detection System

Design a machine learning system to detect bot traffic on a social media platform or e-commerce site.

Requirements

Functional:

Detect bot accounts at registration
Detect bot behavior in real-time
Score existing accounts for bot likelihood
Enable human review workflow

Non-functional:

Real-time detection with under 100ms latency
Handle millions of requests per minute
High precision (avoid blocking real users)
Adapt to evolving bot techniques

Metrics

Offline Metrics

Metric	Description	Target
Precision	Correctly flagged bots / All accounts flagged as bots	> 0.95
Recall	Detected bots / All bots	> 0.80
F1 Score	Harmonic mean of precision and recall	> 0.85
AUC-ROC	Overall discrimination	> 0.95
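
As a rough illustration of how the precision, recall, and F1 targets above relate, the following sketch computes them from confusion counts. The function name and the evaluation counts are hypothetical, chosen only to show the arithmetic.

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 for the bot class from confusion counts.

    tp: bots correctly flagged, fp: humans wrongly flagged, fn: bots missed.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical offline evaluation: 1,000 true bots, 850 accounts flagged.
metrics = classification_metrics(tp=820, fp=30, fn=180)
meets_targets = (
    metrics["precision"] > 0.95
    and metrics["recall"] > 0.80
    and metrics["f1"] > 0.85
)
```

Because F1 is the harmonic mean of the other two, any model that meets both the precision and recall targets also clears the F1 target.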

Online Metrics

Metric	Description
Detection Rate	Bots caught / Total bots
False Positive Rate	Real users blocked / Total real users
Time to Detection	How quickly bots are caught
Evasion Rate	Bots that bypass detection

Business Metrics

Spam content reduction
Fake account reduction
User-reported bot accounts
Platform trust scores

Architecture

Feature Engineering

Device & Network Features

Feature	Type	Description
ip_reputation	Numerical	Known bad IP score
ip_type	Categorical	Residential, datacenter, VPN, Tor
device_fingerprint	Hash	Browser/device fingerprint
user_agent_anomaly	Numerical	Suspicious user agent
geolocation_mismatch	Binary	IP location vs claimed location
connection_type	Categorical	Direct, proxy, VPN
request_rate	Numerical	Requests per minute from IP
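
The request_rate feature could be maintained with a sliding-window counter. The sketch below is an in-memory stand-in under stated assumptions: the class name is hypothetical, timestamps are assumed to arrive in non-decreasing order per IP, and a production system would more likely keep these counters in a shared store.

```python
from collections import defaultdict, deque


class SlidingWindowRateCounter:
    """Counts requests per IP over the most recent window (in-memory sketch)."""

    def __init__(self, window_seconds: float = 60.0):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # ip -> timestamps inside the window

    def record(self, ip: str, timestamp: float) -> int:
        """Record one request and return the request count in the current window."""
        events = self._events[ip]
        events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)


counter = SlidingWindowRateCounter(window_seconds=60.0)
for t in (0.0, 10.0, 20.0, 65.0):
    rate = counter.record("203.0.113.7", t)
# At t=65 the request at t=0 has expired, so three requests remain in the window.
```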

Behavioral Features

Feature	Type	Description
typing_speed	Numerical	Characters per second
mouse_movement	Embedding	Mouse trajectory patterns
session_duration	Numerical	Time on site
page_view_pattern	Sequence	Navigation sequence
action_timing	Distribution	Time between actions
form_fill_time	Numerical	Time to complete forms
scroll_behavior	Embedding	Scrolling patterns
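
One simple way to summarize action_timing is the coefficient of variation of the gaps between actions: scripted clients tend to produce very regular gaps, while human timing is usually more irregular. This is a minimal sketch; the function name is hypothetical, and a single summary statistic is only one of many ways to represent the distribution.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def inter_action_cv(timestamps: Sequence[float]) -> Optional[float]:
    """Coefficient of variation (std / mean) of the gaps between actions.

    Returns None when there are fewer than two gaps to compare.
    """
    if len(timestamps) < 3:
        return None
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return 0.0  # all actions share one timestamp
    return pstdev(gaps) / average_gap


scripted_cv = inter_action_cv([0.0, 2.0, 4.0, 6.0, 8.0])    # perfectly regular
irregular_cv = inter_action_cv([0.0, 1.2, 7.5, 9.1, 30.4])  # uneven gaps
```

A value near zero indicates metronome-like timing. Any threshold on this value would need to be tuned on labeled data.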

Account Features

Feature	Type	Description
account_age	Numerical	Days since creation
profile_completeness	Numerical	% of profile filled
email_domain	Categorical	Email provider type
username_pattern	Numerical	Randomness score
profile_photo	Binary	Has profile photo
bio_sentiment	Numerical	Bio text analysis
verification_status	Binary	Email/phone verified

Activity Features

Feature	Type	Description
posts_per_day	Numerical	Posting frequency
posting_regularity	Numerical	Variance in posting times
content_similarity	Numerical	Similarity across posts
engagement_ratio	Numerical	Received / Given engagement
follow_velocity	Numerical	Follows per day
follow_ratio	Numerical	Followers / Following
action_diversity	Numerical	Variety of actions
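
As one possible way to compute content_similarity, the sketch below averages the Jaccard similarity of word-bigram sets over all pairs of an account's posts. The helper names and the choice of bigrams are assumptions for illustration; embeddings or MinHash would be reasonable alternatives at scale.

```python
from itertools import combinations
from typing import List, Set, Tuple


def word_shingles(text: str, k: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles in a post (lowercased, whitespace split)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def content_similarity(posts: List[str], k: int = 2) -> float:
    """Mean pairwise Jaccard similarity across an account's posts."""
    shingle_sets = [word_shingles(post, k) for post in posts]
    pairs = list(combinations(shingle_sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


repeated = content_similarity(["Buy cheap watches now", "buy cheap watches now"])
varied = content_similarity(["hello world today", "completely different text here"])
```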

Network/Graph Features

Feature	Type	Description
follower_bot_ratio	Numerical	% followers that are bots
cluster_coefficient	Numerical	Network clustering
connection_age	Distribution	Age of connections
shared_ip_accounts	Numerical	Accounts from same IP
coordinated_behavior	Numerical	Acting in sync with others

Model Architecture

Multi-Stage Detection

Stage	Technology	Latency	Purpose
1. Rules	WAF	Under 1ms	Known bad IPs, rate limits
2. ML Model	Endpoint	Under 50ms	Request-level scoring
3. Behavioral	Async	Async	Sequence and graph analysis
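
The three stages in the table could be wired together roughly as follows. This is a sketch under stated assumptions: the class and field names are hypothetical, the scoring function is a placeholder for the deployed model endpoint, a plain list stands in for the asynchronous queue, and treating a rule hit as a score of 1.0 is an illustrative choice.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Request:
    ip: str
    account_id: str
    requests_last_minute: int
    features: Dict[str, float]


@dataclass
class DetectionPipeline:
    blocked_ips: Set[str]
    rate_limit_per_minute: int
    score_fn: Callable[[Dict[str, float]], float]  # placeholder for the ML endpoint
    behavioral_queue: List[str] = field(default_factory=list)  # stand-in for an async queue

    def handle(self, request: Request) -> Dict[str, object]:
        # Stage 1: cheap rules (known bad IPs, rate limits) short-circuit the rest.
        if request.ip in self.blocked_ips:
            return {"stage": "rules", "reason": "blocked_ip", "bot_score": 1.0}
        if request.requests_last_minute > self.rate_limit_per_minute:
            return {"stage": "rules", "reason": "rate_limit", "bot_score": 1.0}
        # Stage 2: request-level scoring on the synchronous path.
        bot_score = self.score_fn(request.features)
        # Stage 3: defer sequence and graph analysis so it adds no request latency.
        self.behavioral_queue.append(request.account_id)
        return {"stage": "ml_model", "reason": "scored", "bot_score": bot_score}


def toy_score(features: Dict[str, float]) -> float:
    """Toy scorer: uses ip_reputation directly, clamped to [0, 1]."""
    return min(1.0, max(0.0, features.get("ip_reputation", 0.0)))


pipeline = DetectionPipeline({"198.51.100.9"}, 120, toy_score)
blocked = pipeline.handle(Request("198.51.100.9", "acct-1", 3, {}))
scored = pipeline.handle(Request("203.0.113.5", "acct-2", 3, {"ip_reputation": 0.2}))
```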

Real-time Model

Feature	Source	Description
IP Reputation	External feed (cache)	Known bad IP score
Device Score	Device fingerprint	Browser/device anomalies
Request Rate	Rate counter (Redis)	Requests per minute
User Agent	Request header	Suspicious patterns
Timing Anomaly	Request timing	Abnormal patterns

Behavioral Sequence Model

Layer	Purpose	Output
Embedding	Convert actions to vectors	64-dim per action
LSTM	Capture temporal patterns	128-dim hidden state
Classifier	Final prediction	Bot probability (0-1)

Graph-Based Detection

Detects coordinated bot networks by analyzing:

Accounts with shared IPs
Similar activity patterns
Synchronized actions
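
The shared-IP signal above can be sketched with a union-find structure: accounts seen on the same IP are merged, and sufficiently large groups become candidates for review. The function name, the input shape, and the minimum cluster size are assumptions; a real system would combine this with the activity and synchronization signals before acting.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def shared_ip_clusters(logins: Iterable[Tuple[str, str]], min_size: int = 3) -> List[Set[str]]:
    """Group accounts that are linked, directly or transitively, by shared IPs.

    logins: (account_id, ip) pairs. Returns clusters with at least min_size accounts.
    """
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    accounts_by_ip = defaultdict(list)
    for account, ip in logins:
        find(account)  # register the account even if it shares no IP
        accounts_by_ip[ip].append(account)

    # Every account seen on an IP is merged with the first account seen on it.
    for accounts in accounts_by_ip.values():
        for other in accounts[1:]:
            union(accounts[0], other)

    clusters = defaultdict(set)
    for account in parent:
        clusters[find(account)].add(account)
    return [members for members in clusters.values() if len(members) >= min_size]


logins = [
    ("a", "203.0.113.1"), ("b", "203.0.113.1"),
    ("b", "203.0.113.2"), ("c", "203.0.113.2"),
    ("d", "203.0.113.3"),
]
clusters = shared_ip_clusters(logins)
```

Shared IPs alone produce false positives (offices, campuses, carrier NAT), so cluster membership is better used as a feature than as a verdict.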

Training

Labeled Data Sources

Label Type	Sources	Confidence
Confirmed Bots	Banned accounts, Honeypot captures, Known bot networks	High
Confirmed Humans	Phone/ID verified, Long-standing active, Premium subscribers	High
Uncertain	New accounts, Low activity	Low (use for semi-supervised learning)

Handling Class Imbalance

Bots are the minority class (typically under 5% of accounts). Techniques:

Technique	Description	When to Use
SMOTE	Generate synthetic bot examples	Small bot dataset
Undersampling	Reduce human examples	Very large dataset
Class Weights	Penalize bot misclassification more	Default approach
Anomaly Detection	Frame as outlier detection	Very rare bots
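
For the class-weights approach, a common heuristic weights each class inversely to its frequency: n_samples / (n_classes * class_count). The sketch below applies it to a hypothetical label set with 5% bots; the function name and the label encoding (1 = bot, 0 = human) are assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical training set: 950 humans (0) and 50 bots (1).
labels = [0] * 950 + [1] * 50
weights = balanced_class_weights(labels)
# Bots get weight 1000 / (2 * 50) = 10.0; humans get 1000 / (2 * 950), about 0.53.
```

These weights would typically be passed to the training loss so that each misclassified bot costs about 19 times as much as a misclassified human in this example.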

Adversarial Training

Bots evolve to evade detection. Counter with:

Strategy	Implementation	Purpose
Red Team Testing	Internal team attempts evasion	Find blind spots
Adversarial Examples	Perturb features of known bots to simulate evasion, then retrain	Harden model
Honeypots	Deploy traps for bots	Catch new patterns
Threat Intel	External bot network feeds	Stay current

Serving

Decision Engine

Challenge Mechanisms

Bot Score	Challenge Type
Above 0.9	Phone Verification
0.7-0.9	CAPTCHA
0.5-0.7	Invisible Challenge
Under 0.5	None (Allow)
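
The table above maps directly to a small lookup function. The table does not say which band the boundary values belong to, so the choices here (exactly 0.9 and 0.7 map to CAPTCHA, exactly 0.5 maps to the invisible challenge) are assumptions, as are the function name and return labels.

```python
def select_challenge(bot_score: float) -> str:
    """Map a bot score in [0, 1] to a challenge type.

    Boundary handling is an assumption: "Above 0.9" is treated as strictly
    greater than 0.9, and each lower band includes its lower bound.
    """
    if not 0.0 <= bot_score <= 1.0:
        raise ValueError("bot_score must be between 0 and 1")
    if bot_score > 0.9:
        return "phone_verification"
    if bot_score >= 0.7:
        return "captcha"
    if bot_score >= 0.5:
        return "invisible_challenge"
    return "allow"


decisions = {score: select_challenge(score) for score in (0.95, 0.8, 0.6, 0.1)}
```

Escalating friction with the score keeps the cost of a false positive low for borderline cases, which fits the requirement to avoid blocking real users.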

Monitoring

Detection Quality

Metric	Target	Alert Threshold
Detection rate by bot type	Above 80%	Under 60%
False positive rate	Under 0.5%	Above 1%
Time to detection	Under 1 minute	Above 5 minutes
Evasion rate	Under 10%	Above 20%

Alerts

Alert	Condition	Severity
Spike in bot traffic	Bot rate above 2x baseline	High
High false positives	FP rate above 1%	Critical
Model latency	p99 above 100ms	Medium
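
The alert conditions in the table could be evaluated with a check like the one below. This is a sketch: the function name and parameter names are hypothetical, rates are assumed to be fractions (0.01 means 1%), and how the baseline bot rate is computed is left open.

```python
from typing import List, Tuple


def evaluate_alerts(
    bot_rate: float,
    baseline_bot_rate: float,
    false_positive_rate: float,
    p99_latency_ms: float,
) -> List[Tuple[str, str]]:
    """Return (alert name, severity) pairs for every condition that fires."""
    alerts = []
    if bot_rate > 2 * baseline_bot_rate:
        alerts.append(("Spike in bot traffic", "High"))
    if false_positive_rate > 0.01:
        alerts.append(("High false positives", "Critical"))
    if p99_latency_ms > 100:
        alerts.append(("Model latency", "Medium"))
    return alerts


# Hypothetical snapshot: bot rate more than doubled, other metrics within limits.
fired = evaluate_alerts(
    bot_rate=0.09, baseline_bot_rate=0.04,
    false_positive_rate=0.004, p99_latency_ms=80,
)
```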

Continuous Improvement

Feedback loop: Label decisions based on downstream outcomes
Honeypots: Deploy traps to catch new bot patterns
Red team: Regular adversarial testing
External intel: Subscribe to threat intelligence feeds

Reference

Topic	Description
Precision vs recall	Catching more bots risks blocking real users. Prioritize precision for user experience.
Latency vs accuracy	Real-time decisions use fewer features than batch analysis.
Transparency vs security	Explaining detection methods helps users but also helps bot operators.
Adversarial nature	Bots evolve. Models require continuous updating as detection methods become known.
