Design a Bot Detection System

Design a machine learning system to detect bot traffic on a social media platform or e-commerce site.

Requirements

Functional:

  • Detect bot accounts at registration
  • Detect bot behavior in real time
  • Score existing accounts for bot likelihood
  • Enable human review workflow

Non-functional:

  • Real-time detection with under 100ms latency
  • Handle millions of requests per minute
  • High precision (avoid blocking real users)
  • Adapt to evolving bot techniques

Metrics

Offline Metrics

| Metric | Description | Target |
|---|---|---|
| Precision | True bots / Predicted bots | > 0.95 |
| Recall | Detected bots / All bots | > 0.80 |
| F1 Score | Harmonic mean of precision and recall | > 0.85 |
| AUC-ROC | Overall discrimination | > 0.95 |
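
As a rough illustration of how the threshold-based metrics in the table relate, the sketch below computes precision, recall, and F1 from confusion-matrix counts, treating "bot" as the positive class. The function name `offline_metrics` and the counts are hypothetical and chosen only for illustration.

```python
def offline_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from confusion-matrix counts (bot = positive class)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical evaluation counts: 800 bots caught, 40 humans flagged, 200 bots missed.
metrics = offline_metrics(tp=800, fp=40, fn=200)
print({name: round(value, 3) for name, value in metrics.items()})
```

With these made-up counts, precision lands just above the 0.95 target while recall sits at 0.80, which shows how tight the precision target is: a few dozen extra false positives would push it below the line.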

Online Metrics

| Metric | Description |
|---|---|
| Detection Rate | Bots caught / Total bots |
| False Positive Rate | Real users blocked / Total real users |
| Time to Detection | How quickly bots are caught |
| Evasion Rate | Bots that bypass detection |

Business Metrics

  • Spam content reduction
  • Fake account reduction
  • User-reported bot accounts
  • Platform trust scores

Architecture

Feature Engineering

Device & Network Features

| Feature | Type | Description |
|---|---|---|
| ip_reputation | Numerical | Known bad IP score |
| ip_type | Categorical | Residential, datacenter, VPN, Tor |
| device_fingerprint | Hash | Browser/device fingerprint |
| user_agent_anomaly | Numerical | Suspicious user agent |
| geolocation_mismatch | Binary | IP location vs claimed location |
| connection_type | Categorical | Direct, proxy, VPN |
| request_rate | Numerical | Requests per minute from IP |

Behavioral Features

| Feature | Type | Description |
|---|---|---|
| typing_speed | Numerical | Characters per second |
| mouse_movement | Embedding | Mouse trajectory patterns |
| session_duration | Numerical | Time on site |
| page_view_pattern | Sequence | Navigation sequence |
| action_timing | Distribution | Time between actions |
| form_fill_time | Numerical | Time to complete forms |
| scroll_behavior | Embedding | Scrolling patterns |

Account Features

| Feature | Type | Description |
|---|---|---|
| account_age | Numerical | Days since creation |
| profile_completeness | Numerical | % of profile filled |
| email_domain | Categorical | Email provider type |
| username_pattern | Numerical | Randomness score |
| profile_photo | Binary | Has profile photo |
| bio_sentiment | Numerical | Bio text analysis |
| verification_status | Binary | Email/phone verified |
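
The table does not say how the `username_pattern` randomness score is computed. One plausible option, shown here purely as an assumption, is the Shannon entropy of the username's character distribution: machine-generated names such as `x7qk2vz9` tend to score higher than human-chosen ones. The function name `username_randomness` is hypothetical.

```python
import math
from collections import Counter


def username_randomness(username: str) -> float:
    """Shannon entropy (bits per character) of the username's characters.

    Assumption: entropy stands in for the 'randomness score'; the source
    does not specify the actual scoring method.
    """
    name = username.lower()
    if not name:
        return 0.0
    total = len(name)
    counts = Counter(name)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Higher entropy for a random-looking handle than for a repetitive one.
print(username_randomness("x7qk2vz9"), username_randomness("anna"))
```

Entropy alone is a weak signal (short names always score low), so in practice it would be one input among the other account features rather than a decision on its own.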

Activity Features

| Feature | Type | Description |
|---|---|---|
| posts_per_day | Numerical | Posting frequency |
| posting_regularity | Numerical | Variance in posting times |
| content_similarity | Numerical | Similarity across posts |
| engagement_ratio | Numerical | Received / Given engagement |
| follow_velocity | Numerical | Follows per day |
| follow_ratio | Numerical | Followers / Following |
| action_diversity | Numerical | Variety of actions |
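
Two of the activity features can be sketched directly from raw events. The example below assumes that "variance in posting times" means the variance of the gaps between consecutive posts, and that `follow_ratio` should avoid division by zero by treating zero followings as one; both choices are assumptions, as are the function names.

```python
from statistics import pvariance
from typing import Optional


def posting_regularity(post_times: list[float]) -> Optional[float]:
    """Variance of gaps (in seconds) between consecutive posts.

    A value near zero suggests a fixed schedule, which is typical of
    automation. Returns None when there are fewer than two gaps.
    """
    ts = sorted(post_times)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return pvariance(gaps) if len(gaps) >= 2 else None


def follow_ratio(followers: int, following: int) -> float:
    """Followers / Following, guarding against division by zero."""
    return followers / max(following, 1)


# Hypothetical timestamps (seconds): one post exactly every hour.
scheduled = posting_regularity([0, 3600, 7200, 10800])
# Hypothetical timestamps with irregular, human-like gaps.
irregular = posting_regularity([0, 100, 5000, 5200])
```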

Network/Graph Features

| Feature | Type | Description |
|---|---|---|
| follower_bot_ratio | Numerical | % followers that are bots |
| cluster_coefficient | Numerical | Network clustering |
| connection_age | Distribution | Age of connections |
| shared_ip_accounts | Numerical | Accounts from same IP |
| coordinated_behavior | Numerical | Acting in sync with others |

Model Architecture

Multi-Stage Detection

| Stage | Technology | Latency | Purpose |
|---|---|---|---|
| 1. Rules | WAF | Under 1ms | Known bad IPs, rate limits |
| 2. ML Model | Endpoint | Under 50ms | Request-level scoring |
| 3. Behavioral | Async | Async | Sequence and graph analysis |
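
The staged flow in the table can be sketched as a short-circuiting pipeline: cheap rules run first, the model scores only what the rules let through, and the behavioral stage is fed through a queue so it never adds to request latency. Everything named here (`Request`, `BLOCKED_IPS`, `RATE_LIMIT`, the placeholder scoring function, the in-process queue) is a hypothetical stand-in, not a real WAF or model endpoint.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    ip: str
    account_id: str
    requests_last_minute: int


BLOCKED_IPS = {"203.0.113.7"}  # hypothetical blocklist (documentation IP range)
RATE_LIMIT = 120               # hypothetical per-minute limit
async_queue: deque = deque()   # stand-in for a real message queue


def rule_stage(req: Request) -> bool:
    """Stage 1: cheap deterministic checks. True means block immediately."""
    return req.ip in BLOCKED_IPS or req.requests_last_minute > RATE_LIMIT


def model_stage(req: Request) -> float:
    """Stage 2: placeholder for the request-level model; returns a score in [0, 1]."""
    return min(1.0, req.requests_last_minute / RATE_LIMIT)


def handle(req: Request) -> str:
    if rule_stage(req):
        return "block"
    score = model_stage(req)
    async_queue.append(req.account_id)  # Stage 3 runs off the request path
    return "challenge" if score >= 0.5 else "allow"
```

Ordering the stages by cost means most obviously bad traffic never reaches the model, which helps keep the synchronous path inside its latency budget.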

Real-time Model

| Feature | Source | Description |
|---|---|---|
| IP Reputation | External feed (cache) | Known bad IP score |
| Device Score | Device fingerprint | Browser/device anomalies |
| Request Rate | Rate counter (Redis) | Requests per minute |
| User Agent | Request header | Suspicious patterns |
| Timing Anomaly | Request timing | Abnormal patterns |
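
To make the "Request Rate" feature concrete, here is a minimal in-memory sliding-window counter. The table names Redis as the backing store; this sketch deliberately uses a plain dictionary instead so it stays self-contained, and the class name `SlidingWindowCounter` is hypothetical. A production version would need shared state across servers, which is what the Redis counter provides.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts requests per IP over the last `window_seconds` (in-memory sketch)."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.events = defaultdict(deque)  # ip -> timestamps, oldest first

    def hit(self, ip: str, now: float) -> int:
        """Record a request at time `now` and return the count within the window."""
        timestamps = self.events[ip]
        timestamps.append(now)
        # Evict timestamps that have fallen out of the window.
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        return len(timestamps)


counter = SlidingWindowCounter(window_seconds=60)
counter.hit("198.51.100.1", now=0.0)
counter.hit("198.51.100.1", now=30.0)
rate = counter.hit("198.51.100.1", now=61.0)  # the request at t=0 has expired
```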

Behavioral Sequence Model

| Layer | Purpose | Output |
|---|---|---|
| Embedding | Convert actions to vectors | 64-dim per action |
| LSTM | Capture temporal patterns | 128-dim hidden state |
| Classifier | Final prediction | Bot probability (0-1) |

Graph-Based Detection

Graph-based detection identifies coordinated bot networks by analyzing:

  • Accounts with shared IPs
  • Similar activity patterns
  • Synchronized actions
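
The first signal in the list, accounts with shared IPs, can be sketched as a connected-components problem: link any two accounts seen on the same IP, then flag groups above a size threshold. The union-find approach, the function name `shared_ip_clusters`, and the `min_size` threshold are illustrative assumptions; the source does not specify a graph algorithm.

```python
from collections import defaultdict


def shared_ip_clusters(logins: list[tuple[str, str]], min_size: int = 3) -> list[set[str]]:
    """Group accounts connected through shared IPs (connected components via union-find).

    `logins` is a list of (account_id, ip) pairs.
    """
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_account_on_ip: dict[str, str] = {}
    for account, ip in logins:
        find(account)  # register the account even if it never shares an IP
        if ip in first_account_on_ip:
            union(first_account_on_ip[ip], account)
        else:
            first_account_on_ip[ip] = account

    clusters = defaultdict(set)
    for account in parent:
        clusters[find(account)].add(account)
    return [members for members in clusters.values() if len(members) >= min_size]


# Hypothetical login records: a, b, c are linked through two shared IPs; d is alone.
logins = [
    ("a", "198.51.100.1"), ("b", "198.51.100.1"),
    ("c", "198.51.100.2"), ("b", "198.51.100.2"),
    ("d", "198.51.100.3"),
]
clusters = shared_ip_clusters(logins)
```

Shared IPs alone produce false positives (offices, campuses, carrier NAT), so a cluster found this way is better treated as a candidate for the activity-similarity and synchronization checks than as a verdict.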

Training

Labeled Data Sources

| Label Type | Sources | Confidence |
|---|---|---|
| Confirmed Bots | Banned accounts, Honeypot captures, Known bot networks | High |
| Confirmed Humans | Phone/ID verified, Long-standing active, Premium subscribers | High |
| Uncertain | New accounts, Low activity | Use for semi-supervised learning |

Handling Class Imbalance

Bots are the minority class (typically under 5%). Techniques:

| Technique | Description | When to Use |
|---|---|---|
| SMOTE | Generate synthetic bot examples | Small bot dataset |
| Undersampling | Reduce human examples | Very large dataset |
| Class Weights | Penalize bot misclassification more | Default approach |
| Anomaly Detection | Frame as outlier detection | Very rare bots |
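
For the class-weights row, a common heuristic is to weight each class inversely to its frequency: `n_samples / (n_classes * class_count)`, the same formula scikit-learn uses for `class_weight="balanced"`. The sketch below applies it to a hypothetical label set with 5% bots; the function name is an assumption.

```python
from collections import Counter


def balanced_class_weights(labels: list[int]) -> dict[int, float]:
    """Inverse-frequency class weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical training labels: 5 bots (1) and 95 humans (0).
labels = [1] * 5 + [0] * 95
weights = balanced_class_weights(labels)
print(weights)  # the bot class gets weight 10.0; the human class gets about 0.53
```

With these weights, misclassifying one bot costs the loss function roughly as much as misclassifying nineteen humans, which offsets the imbalance without changing the dataset itself.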

Adversarial Training

Bots evolve to evade detection. Counter with:

| Strategy | Implementation | Purpose |
|---|---|---|
| Red Team Testing | Internal team attempts evasion | Find blind spots |
| Adversarial Examples | Perturb features to evade | Harden model |
| Honeypots | Deploy traps for bots | Catch new patterns |
| Threat Intel | External bot network feeds | Stay current |

Serving

Decision Engine

Challenge Mechanisms

| Bot Score | Challenge Type |
|---|---|
| Above 0.9 | Phone Verification |
| 0.7-0.9 | CAPTCHA |
| 0.5-0.7 | Invisible Challenge |
| Under 0.5 | None (Allow) |
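
The score bands above map naturally to a small lookup function. The table leaves the band edges ambiguous (0.7 appears in two ranges), so this sketch assumes a score exactly on a boundary falls into the stricter of the two bands it touches, except 0.9, which stays in the CAPTCHA band because phone verification is listed as "above 0.9". The function name and return labels are hypothetical.

```python
def select_challenge(bot_score: float) -> str:
    """Map a bot score in [0, 1] to a challenge type.

    Boundary handling is an assumption: 0.5 and 0.7 go to the stricter band,
    and only scores strictly above 0.9 trigger phone verification.
    """
    if not 0.0 <= bot_score <= 1.0:
        raise ValueError(f"bot_score must be in [0, 1], got {bot_score}")
    if bot_score > 0.9:
        return "phone_verification"
    if bot_score >= 0.7:
        return "captcha"
    if bot_score >= 0.5:
        return "invisible_challenge"
    return "allow"


print(select_challenge(0.95), select_challenge(0.6), select_challenge(0.2))
```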

Monitoring

Detection Quality

| Metric | Target | Alert Threshold |
|---|---|---|
| Detection rate by bot type | Above 80% | Under 60% |
| False positive rate | Under 0.5% | Above 1% |
| Time to detection | Under 1 minute | Above 5 minutes |
| Evasion rate | Under 10% | Above 20% |

Alerts

| Alert | Condition | Severity |
|---|---|---|
| Spike in bot traffic | Bot rate above 2x baseline | High |
| High false positives | FP rate above 1% | Critical |
| Model latency | p99 above 100ms | Medium |
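
The three alert conditions can be expressed as simple checks over a metrics snapshot. The `Snapshot` fields and the `evaluate_alerts` function below are hypothetical names; how the underlying metrics are collected and how the baseline is computed are outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    bot_rate: float             # fraction of traffic currently scored as bot
    baseline_bot_rate: float    # assumed to come from a trailing average
    false_positive_rate: float  # fraction of real users blocked
    p99_latency_ms: float       # 99th-percentile model latency


def evaluate_alerts(s: Snapshot) -> list[tuple[str, str]]:
    """Return (alert, severity) pairs for every condition that currently fires."""
    alerts = []
    if s.bot_rate > 2 * s.baseline_bot_rate:
        alerts.append(("Spike in bot traffic", "High"))
    if s.false_positive_rate > 0.01:
        alerts.append(("High false positives", "Critical"))
    if s.p99_latency_ms > 100:
        alerts.append(("Model latency", "Medium"))
    return alerts


# Hypothetical snapshot: bot traffic has tripled, other metrics are healthy.
fired = evaluate_alerts(Snapshot(0.09, 0.03, 0.004, 80.0))
```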

Continuous Improvement

  1. Feedback loop: Label decisions based on downstream outcomes
  2. Honeypots: Deploy traps to catch new bot patterns
  3. Red team: Regular adversarial testing
  4. External intel: Subscribe to threat intelligence feeds

Reference

| Topic | Description |
|---|---|
| Precision vs recall | Catching more bots risks blocking real users. Prioritize precision for user experience. |
| Latency vs accuracy | Real-time decisions use fewer features than batch analysis. |
| Transparency vs security | Explaining detection methods helps users but also helps bot operators. |
| Adversarial nature | Bots evolve. Models require continuous updating as detection methods become known. |