Design a Bot Detection System
Design a machine learning system to detect bot traffic on a social media platform or e-commerce site.
Requirements
Functional:
- Detect bot accounts at registration
- Detect bot behavior in real-time
- Score existing accounts for bot likelihood
- Enable human review workflow
Non-functional:
- Real-time detection in under 100ms
- Handle millions of requests per minute
- High precision (avoid blocking real users)
- Adapt to evolving bot techniques
Metrics
Offline Metrics
| Metric | Description | Target |
|---|---|---|
| Precision | Correctly flagged bots / All accounts flagged as bots | > 0.95 |
| Recall | Correctly flagged bots / All actual bots | > 0.80 |
| F1 Score | Harmonic mean of precision and recall | > 0.85 |
| AUC-ROC | Overall discrimination | > 0.95 |
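
The precision, recall, and F1 definitions in the table can be sketched as a small helper. The function name and the confusion-matrix counts below are made up for illustration; they are not measured results.

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 with bots as the positive class.

    tp: bots correctly flagged, fp: humans wrongly flagged, fn: bots missed.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only: 850 bots caught, 30 humans flagged, 150 bots missed.
metrics = classification_metrics(tp=850, fp=30, fn=150)
```

With these hypothetical counts the recall is 0.85 and the precision is just above 0.96, so both offline targets in the table would be met.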
Online Metrics
| Metric | Description |
|---|---|
| Detection Rate | Bots caught / Total bots |
| False Positive Rate | Real users blocked / Total real users |
| Time to Detection | How quickly bots are caught |
| Evasion Rate | Bots that bypass detection |
Business Metrics
- Spam content reduction
- Fake account reduction
- User-reported bot accounts
- Platform trust scores
Architecture
Feature Engineering
Device & Network Features
| Feature | Type | Description |
|---|---|---|
| ip_reputation | Numerical | Known bad IP score |
| ip_type | Categorical | Residential, datacenter, VPN, Tor |
| device_fingerprint | Hash | Browser/device fingerprint |
| user_agent_anomaly | Numerical | Anomaly score for the user-agent string |
| geolocation_mismatch | Binary | IP location vs claimed location |
| connection_type | Categorical | Direct, proxy, VPN |
| request_rate | Numerical | Requests per minute from IP |
Behavioral Features
| Feature | Type | Description |
|---|---|---|
| typing_speed | Numerical | Characters per second |
| mouse_movement | Embedding | Mouse trajectory patterns |
| session_duration | Numerical | Time on site |
| page_view_pattern | Sequence | Navigation sequence |
| action_timing | Distribution | Time between actions |
| form_fill_time | Numerical | Time to complete forms |
| scroll_behavior | Embedding | Scrolling patterns |
Account Features
| Feature | Type | Description |
|---|---|---|
| account_age | Numerical | Days since creation |
| profile_completeness | Numerical | % of profile filled |
| email_domain | Categorical | Email provider type |
| username_pattern | Numerical | Randomness score |
| profile_photo | Binary | Has profile photo |
| bio_sentiment | Numerical | Bio text analysis |
| verification_status | Binary | Email/phone verified |
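
As a rough illustration, a few of the account features above could be derived from a profile record as follows. The field names in `PROFILE_FIELDS`, the dictionary layout, and the digit-ratio proxy for `username_pattern` are assumptions made for this sketch, not part of the design above.

```python
from datetime import date

# Hypothetical profile fields that count toward completeness.
PROFILE_FIELDS = ("display_name", "bio", "photo_url", "location")


def account_features(profile: dict, created: date, today: date) -> dict:
    """Compute a handful of account-level features from a profile record."""
    filled = sum(1 for name in PROFILE_FIELDS if profile.get(name))
    username = profile.get("username", "")
    digits = sum(ch.isdigit() for ch in username)
    return {
        "account_age": (today - created).days,
        "profile_completeness": filled / len(PROFILE_FIELDS),
        # Crude stand-in for a randomness score: share of digits in the name.
        "username_digit_ratio": digits / len(username) if username else 0.0,
        "profile_photo": int(bool(profile.get("photo_url"))),
    }


example = account_features(
    {"username": "user84721", "display_name": "Sam", "bio": "", "photo_url": None},
    created=date(2024, 1, 1),
    today=date(2024, 1, 31),
)
```

A production randomness score would likely use something stronger than a digit ratio, such as a character n-gram model trained on real usernames.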
Activity Features
| Feature | Type | Description |
|---|---|---|
| posts_per_day | Numerical | Posting frequency |
| posting_regularity | Numerical | Variance in posting times |
| content_similarity | Numerical | Similarity across posts |
| engagement_ratio | Numerical | Received / Given engagement |
| follow_velocity | Numerical | Follows per day |
| follow_ratio | Numerical | Followers / Following |
| action_diversity | Numerical | Variety of actions |
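
Two of the activity features can be sketched with the standard library alone. Using the variance of gaps between posts for `posting_regularity` and mean pairwise Jaccard similarity of word sets for `content_similarity` are assumptions chosen for simplicity; embedding-based similarity would be a common alternative.

```python
from itertools import combinations
from statistics import pvariance
from typing import Optional


def posting_regularity(timestamps: list[float]) -> Optional[float]:
    """Variance of the gaps (in seconds) between consecutive posts.

    A value near zero suggests scheduled posting. Returns None when there
    are fewer than three posts, since a variance is not meaningful then.
    """
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return pvariance(gaps) if len(gaps) >= 2 else None


def content_similarity(posts: list[str]) -> float:
    """Mean pairwise Jaccard similarity of lowercase word sets across posts."""
    token_sets = [set(post.lower().split()) for post in posts]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0

    def jaccard(a: set, b: set) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

For example, posts made exactly one hour apart give a regularity of zero, and two posts that share three of five distinct words give a similarity of 0.6.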
Network/Graph Features
| Feature | Type | Description |
|---|---|---|
| follower_bot_ratio | Numerical | % of followers flagged as bots |
| cluster_coefficient | Numerical | Network clustering |
| connection_age | Distribution | Age of connections |
| shared_ip_accounts | Numerical | Accounts from same IP |
| coordinated_behavior | Numerical | Acting in sync with others |
Model Architecture
Multi-Stage Detection
| Stage | Technology | Latency | Purpose |
|---|---|---|---|
| 1. Rules | WAF | Under 1ms | Known bad IPs, rate limits |
| 2. ML Model | Endpoint | Under 50ms | Request-level scoring |
| 3. Behavioral | Async | Async | Sequence and graph analysis |
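
The three stages can be read as a cascade: cheap rules run first, the model scores whatever the rules let through, and deeper analysis is queued for later. The sketch below assumes a hypothetical blocklist, rate limit, `Request` type, and caller-supplied `score_fn` and `enqueue` callables; none of these names come from the design above.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

BLOCKED_IPS = {"203.0.113.7"}  # hypothetical blocklist entry (documentation IP range)
RATE_LIMIT_PER_MINUTE = 600    # hypothetical threshold


@dataclass
class Request:
    ip: str
    requests_last_minute: int
    features: dict = field(default_factory=dict)


def rules_stage(req: Request) -> bool:
    """Stage 1: return True if a cheap rule blocks the request outright."""
    return req.ip in BLOCKED_IPS or req.requests_last_minute > RATE_LIMIT_PER_MINUTE


def detect(
    req: Request,
    score_fn: Callable[[dict], float],
    enqueue: Callable[[Request], None],
) -> tuple[str, Optional[float]]:
    """Run the cascade and return (outcome, bot score or None)."""
    if rules_stage(req):
        return "blocked_by_rules", None
    score = score_fn(req.features)  # Stage 2: request-level ML scoring
    enqueue(req)                    # Stage 3: hand off for async behavioral analysis
    return "scored", score
```

Short-circuiting on the rules stage keeps known-bad traffic from consuming model capacity, which matters at millions of requests per minute.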
Real-time Model
| Feature | Source | Description |
|---|---|---|
| IP Reputation | External feed (cache) | Known bad IP score |
| Device Score | Device fingerprint | Browser/device anomalies |
| Request Rate | Rate counter (Redis) | Requests per minute |
| User Agent | Request header | Suspicious patterns |
| Timing Anomaly | Request timing | Abnormal patterns |
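
The request-rate feature relies on a per-key counter over a recent time window. The table places this counter in Redis; the sketch below is an in-process stand-in that shows only the sliding-window logic, with a hypothetical class name and a caller-supplied clock so the behavior is deterministic.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key within the last `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.events: dict[str, deque] = defaultdict(deque)

    def record(self, key: str, now: float) -> int:
        """Record one event at time `now` and return the count in the window."""
        timestamps = self.events[key]
        timestamps.append(now)
        cutoff = now - self.window
        # Drop events that have aged out of the window.
        while timestamps and timestamps[0] <= cutoff:
            timestamps.popleft()
        return len(timestamps)


counter = SlidingWindowCounter(window_seconds=60.0)
```

In a shared deployment the same idea is usually implemented against a central store so that every serving node sees the same counts.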
Behavioral Sequence Model
| Layer | Purpose | Output |
|---|---|---|
| Embedding | Convert actions to vectors | 64-dim per action |
| LSTM | Capture temporal patterns | 128-dim hidden state |
| Classifier | Final prediction | Bot probability (0-1) |
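
Before an embedding layer can turn actions into vectors, each action name has to be mapped to an integer ID and each session padded or truncated to a fixed length. The sketch below covers only that preprocessing step; the reserved IDs, function names, and the choice to keep the most recent actions when truncating are assumptions.

```python
PAD_ID, UNK_ID = 0, 1  # reserved IDs for padding and unseen actions


def build_vocab(sequences: list[list[str]]) -> dict[str, int]:
    """Assign a stable integer ID to every action seen in training sessions."""
    actions = sorted({action for seq in sequences for action in seq})
    return {action: idx for idx, action in enumerate(actions, start=2)}


def encode(sequence: list[str], vocab: dict[str, int], max_len: int) -> list[int]:
    """Map actions to IDs, keep the most recent `max_len`, and right-pad."""
    ids = [vocab.get(action, UNK_ID) for action in sequence[-max_len:]]
    return ids + [PAD_ID] * (max_len - len(ids))


vocab = build_vocab([["login", "view", "post"], ["login", "follow"]])
encoded = encode(["login", "like", "post"], vocab, max_len=5)
```

Here `like` was never seen while building the vocabulary, so it maps to the unknown ID, and the two trailing zeros are padding.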
Graph-Based Detection
Detects coordinated bot networks by analyzing:
- Accounts with shared IPs
- Similar activity patterns
- Synchronized actions
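
The shared-IP signal can be illustrated with a union-find pass that groups accounts linked, directly or transitively, through a common IP. This is a minimal sketch: the function name, the `(account, ip)` input format, and the `min_size` cutoff are assumptions, and a real system would also weigh activity similarity and timing.

```python
from collections import defaultdict


def shared_ip_clusters(logins: list[tuple[str, str]], min_size: int = 3) -> list[set[str]]:
    """Group accounts connected through shared IPs; keep clusters >= min_size."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    accounts_by_ip: dict[str, list[str]] = defaultdict(list)
    for account, ip in logins:
        find(account)  # register the account even if it shares no IP
        accounts_by_ip[ip].append(account)

    # Every account seen on an IP is merged with the first account on that IP.
    for accounts in accounts_by_ip.values():
        for other in accounts[1:]:
            union(accounts[0], other)

    clusters: dict[str, set[str]] = defaultdict(set)
    for account in parent:
        clusters[find(account)].add(account)
    return [members for members in clusters.values() if len(members) >= min_size]
```

Shared IPs alone produce false clusters behind carrier-grade NAT or campus networks, so the output is better treated as a candidate list for further scoring than as a verdict.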
Training
Labeled Data Sources
| Label Type | Sources | Confidence |
|---|---|---|
| Confirmed Bots | Banned accounts, Honeypot captures, Known bot networks | High |
| Confirmed Humans | Phone/ID verified, Long-standing active, Premium subscribers | High |
| Uncertain | New accounts, Low activity | Low (use for semi-supervised learning) |
Handling Class Imbalance
Bots are the minority class (typically under 5%). Techniques:
| Technique | Description | When to Use |
|---|---|---|
| SMOTE | Generate synthetic bot examples | Small bot dataset |
| Undersampling | Reduce human examples | Very large dataset |
| Class Weights | Penalize bot misclassification more | Default approach |
| Anomaly Detection | Frame as outlier detection | Very rare bots |
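
For the class-weights row, one common heuristic weights each class by `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn uses for its "balanced" option. The sketch below computes it by hand; the 95/5 label split is illustrative only.

```python
from collections import Counter


def balanced_class_weights(labels: list[int]) -> dict[int, float]:
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative split: 95 humans (label 0) and 5 bots (label 1).
weights = balanced_class_weights([0] * 95 + [1] * 5)
```

With this split a misclassified bot costs 10.0 while a misclassified human costs about 0.53, so the loss pays far more attention to the rare class.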
Adversarial Training
Bots evolve to evade detection. Counter with:
| Strategy | Implementation | Purpose |
|---|---|---|
| Red Team Testing | Internal team attempts evasion | Find blind spots |
| Adversarial Examples | Perturb features to simulate evasion attempts | Harden model |
| Honeypots | Deploy traps for bots | Catch new patterns |
| Threat Intel | External bot network feeds | Stay current |
Serving
Decision Engine
Challenge Mechanisms
| Bot Score | Challenge Type |
|---|---|
| Above 0.9 | Phone Verification |
| 0.7-0.9 | CAPTCHA |
| 0.5-0.7 | Invisible Challenge |
| Under 0.5 | None (Allow) |
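
The score bands above map directly onto a small decision function. The table does not say which band owns the boundary values 0.9, 0.7, and 0.5, so the sketch assumes that exactly 0.9 falls under CAPTCHA and that 0.7 and 0.5 belong to the higher band; the function name and return labels are likewise hypothetical.

```python
def choose_challenge(bot_score: float) -> str:
    """Map a bot probability in [0, 1] to a challenge type."""
    if not 0.0 <= bot_score <= 1.0:
        raise ValueError(f"bot_score must be in [0, 1], got {bot_score}")
    if bot_score > 0.9:
        return "phone_verification"
    if bot_score >= 0.7:   # assumption: boundary values go to the stricter band
        return "captcha"
    if bot_score >= 0.5:
        return "invisible_challenge"
    return "allow"
```

Keeping the thresholds in one place makes it straightforward to tune them when the false positive rate drifts.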
Monitoring
Detection Quality
| Metric | Target | Alert Threshold |
|---|---|---|
| Detection rate by bot type | Above 80% | Under 60% |
| False positive rate | Under 0.5% | Above 1% |
| Time to detection | Under 1 minute | Above 5 minutes |
| Evasion rate | Under 10% | Above 20% |
Alerts
| Alert | Condition | Severity |
|---|---|---|
| Spike in bot traffic | Bot rate above 2x baseline | High |
| High false positives | FP rate above 1% | Critical |
| Model latency | p99 above 100ms | Medium |
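
The alert table can be expressed as data-driven rules evaluated against a metrics snapshot. The metric key names (`bot_rate`, `baseline_bot_rate`, `fp_rate`, `p99_latency_ms`) are assumptions for this sketch; the conditions and severities come from the table.

```python
# Each rule: (alert name, severity, condition over a metrics snapshot).
ALERT_RULES = [
    ("Spike in bot traffic", "High", lambda m: m["bot_rate"] > 2 * m["baseline_bot_rate"]),
    ("High false positives", "Critical", lambda m: m["fp_rate"] > 0.01),
    ("Model latency", "Medium", lambda m: m["p99_latency_ms"] > 100),
]


def evaluate_alerts(metrics: dict) -> list[tuple[str, str]]:
    """Return (alert name, severity) for every rule whose condition holds."""
    return [(name, severity) for name, severity, condition in ALERT_RULES if condition(metrics)]


# Illustrative snapshot, not real measurements.
snapshot = {"bot_rate": 0.09, "baseline_bot_rate": 0.04, "fp_rate": 0.004, "p99_latency_ms": 120}
fired = evaluate_alerts(snapshot)
```

In this snapshot the bot rate is more than twice the baseline and p99 latency exceeds 100ms, so two alerts fire while the false positive alert stays quiet.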
Continuous Improvement
- Feedback loop: Label decisions based on downstream outcomes
- Honeypots: Deploy traps to catch new bot patterns
- Red team: Regular adversarial testing
- External intel: Subscribe to threat intelligence feeds
Reference
| Topic | Description |
|---|---|
| Precision vs recall | Catching more bots risks blocking real users. Prioritize precision for user experience. |
| Latency vs accuracy | Real-time decisions use fewer features than batch analysis. |
| Transparency vs security | Explaining detection methods helps users but also helps bot operators. |
| Adversarial nature | Bots evolve. Models require continuous updating as detection methods become known. |