System Design Concept Questions

Theory and concept questions for backend system design interviews covering distributed systems fundamentals.

Scalability

Q1: Vertical vs horizontal scaling

Vertical (Scale Up): Larger machine (more CPU, RAM, storage)

  • Simpler, no code changes
  • Hardware limits
  • Single point of failure

Horizontal (Scale Out): More machines

  • Theoretically unlimited scale
  • Requires distributed architecture
  • More complex (load balancing, data partitioning)

Most systems use both: scale up until cost-prohibitive, then scale out.

Q2: Single point of failure (SPOF) elimination

SPOF: Component whose failure causes entire system failure.

Elimination strategies:

  • Redundancy: Multiple instances of critical components
  • Load balancers: Distribute traffic, provide failover
  • Database replication: Primary-replica setup
  • Multi-AZ/region: Geographic redundancy
  • Graceful degradation: System continues with reduced functionality

Q3: Read replicas

Read replicas: Copies of primary database that handle read queries.

Benefits:

  • Offload reads: Primary handles writes only
  • Reduce latency: Place replicas closer to users
  • Improve availability: Replicas can be promoted if primary fails

Considerations:

  • Replication lag: Reads may return stale data
  • Write bottleneck remains on primary
  • Replica failover handling required

Use case: Read-heavy workloads, tolerance for slight staleness.

CAP Theorem and Consistency

Q4: CAP theorem

In a distributed system, at most two of the following three properties can be guaranteed simultaneously:

  • Consistency: All nodes see same data at same time
  • Availability: Every request gets a response (not error)
  • Partition Tolerance: System works despite network failures

Network partitions are unavoidable in practice, so the real choice is between:

  • CP: Consistent but may be unavailable during partition (bank transactions)
  • AP: Available but may return stale data (social media feed)

Q5: Strong vs eventual consistency

Strong consistency: After write completes, all reads return the new value

  • Easier to reason about
  • Higher latency, lower availability
  • Use for: Financial transactions, inventory

Eventual consistency: Reads may return stale data, but will converge

  • Lower latency, higher availability
  • More complex client handling
  • Use for: Social media, analytics, caching

Many systems offer tunable consistency (e.g., Cassandra's consistency levels).

Q6: PACELC theorem

Extension of CAP addressing normal operation:

If Partition:

  • Choose Availability or Consistency (same as CAP)

Else (no partition):

  • Choose Latency or Consistency

Examples:

  • DynamoDB: PA/EL (available during partition, low latency otherwise)
  • Traditional RDBMS: PC/EC (consistent always, higher latency)
  • Cassandra: Tunable (configurable per query)

Database Design

Q7: SQL vs NoSQL selection

| Factor | SQL | NoSQL |
|---|---|---|
| Schema | Fixed, structured | Flexible, schema-less |
| Relationships | Complex joins | Denormalized, embedded |
| Transactions | ACID guaranteed | Usually eventual consistency |
| Scaling | Vertical primarily | Horizontal by design |
| Query flexibility | Ad-hoc queries | Limited query patterns |

Choose SQL: Complex queries, transactions, data integrity critical.
Choose NoSQL: High scale, flexible schema, specific access patterns.

Q8: Database sharding strategies

1. Range-based sharding:

  • Partition by value ranges (users A-M, N-Z)
  • Advantage: Range queries efficient
  • Disadvantage: Hotspots if data not uniform

2. Hash-based sharding:

  • Hash(key) mod N determines shard
  • Advantage: Even distribution
  • Disadvantage: Range queries across all shards

3. Directory-based sharding:

  • Lookup service maps keys to shards
  • Advantage: Flexible, can rebalance
  • Disadvantage: Lookup service is SPOF, bottleneck

4. Geographic sharding:

  • Data stored by region
  • Advantage: Low latency, data locality compliance
  • Disadvantage: Cross-region queries complex
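The hash-based strategy above can be sketched in a few lines (the key format and shard count here are arbitrary; MD5 is just one stable hash choice):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Hash-based sharding: a stable hash of the key picks the shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# The same key always maps to the same shard, and keys spread evenly.
shard = shard_for("user:alice", 4)
```

Note the `mod N` weakness mentioned under consistent hashing (Q18): changing `num_shards` remaps almost every key, which is why rings with virtual nodes are preferred for elastic clusters.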

Q9: Denormalization

Denormalization: Adding redundancy to improve read performance.

Examples:

  • Storing count in parent table instead of counting children
  • Duplicating user name in posts table to avoid join
  • Precomputing aggregations

Appropriate when:

  • Read-heavy workloads
  • Joins are too expensive
  • Slight inconsistency acceptable

Trade-offs:

  • Writes become more complex (update multiple places)
  • Data inconsistency risk
  • More storage

Q10: Database indexes

Index: Data structure that speeds up queries at cost of write performance.

B-tree index (default):

  • Suitable for: Range queries, equality, sorting
  • Columns in WHERE, JOIN, ORDER BY

Hash index:

  • Suitable for: Exact equality only
  • O(1) lookup

Composite index:

  • Multiple columns, leftmost prefix rule
  • (a, b, c) works for queries on (a), (a, b), (a, b, c), not (b) or (c)

Trade-offs:

  • Slower writes (must update index)
  • Storage overhead
  • Too many indexes hurt write performance

Caching

Q11: Cache invalidation strategies

1. TTL (Time-to-Live):

  • Cache expires after fixed time
  • Simple, works for eventually consistent data
  • May serve stale data until expiry

2. Write-through:

  • Write to cache and DB simultaneously
  • Cache always fresh
  • Higher write latency

3. Write-behind (write-back):

  • Write to cache, async write to DB
  • Low write latency
  • Risk of data loss

4. Cache-aside:

  • Application manages cache (check cache -> miss -> read DB -> update cache)
  • Most flexible
  • Application complexity
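The cache-aside flow (check cache, miss, read DB, update cache) can be sketched as follows; `db_read` is a hypothetical stand-in for a real query, and the in-memory dict stands in for Redis or Memcached:

```python
import time

cache = {}   # key -> (value, expires_at); stand-in for Redis/Memcached
TTL = 60.0   # seconds

def db_read(key):
    # Hypothetical stand-in for a real database query.
    return f"row-for-{key}"

def get(key):
    """Cache-aside: check the cache; on a miss, read the DB and populate."""
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                      # cache hit
    value = db_read(key)                     # cache miss -> read from DB
    cache[key] = (value, time.time() + TTL)  # populate with a TTL
    return value
```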

Q12: Cache stampede prevention

Cache stampede: Many requests hit the database simultaneously when cache expires.

Prevention strategies:

  • Locking: Only one request fetches from DB, others wait
  • Probabilistic early expiration: Randomly refresh before TTL
  • Background refresh: Async job refreshes cache before expiry
  • Fallback to stale: Serve stale data while refreshing
  • Request coalescing: Collapse duplicate requests
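The locking strategy can be sketched with a double-checked lock; this is in-process only (a distributed cache would need a distributed lock, e.g. a Redis `SET NX` key), and the slow loader is simulated:

```python
import threading
import time

cache = {}
lock = threading.Lock()

def load_from_db(key):
    time.sleep(0.01)  # simulate a slow DB query
    return f"value-{key}"

def get(key):
    """Only one thread rebuilds a missing entry; the rest wait, then hit cache."""
    if key in cache:
        return cache[key]
    with lock:
        # Re-check after acquiring the lock: another thread may have
        # already refilled the entry while we were waiting.
        if key not in cache:
            cache[key] = load_from_db(key)
    return cache[key]
```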

Q13: Redis vs Memcached

| Feature | Redis | Memcached |
|---|---|---|
| Data structures | Strings, lists, sets, hashes, sorted sets | Strings only |
| Persistence | Yes (RDB, AOF) | No |
| Replication | Built-in | No |
| Clustering | Redis Cluster | Client-side sharding |
| Pub/Sub | Yes | No |
| Transactions | Yes (MULTI) | No |

Choose Redis: Need data structures, persistence, or advanced features.
Choose Memcached: Simple caching, slightly lower latency.

Message Queues

Q14: Message queue use cases

Use cases:

  • Async processing: Non-blocking slow operations (email, notifications)
  • Decoupling: Services do not need to know about each other
  • Load leveling: Smooth out traffic spikes
  • Reliability: Persist messages if consumer down
  • Fan-out: One message to multiple consumers

Not needed when:

  • Synchronous response required
  • Simple request-response
  • Tight latency requirements

Q15: Message queue vs event stream

| Aspect | Queue (SQS, RabbitMQ) | Stream (Kafka) |
|---|---|---|
| Consumption | Message deleted after consume | Messages retained, replayable |
| Consumers | Competing consumers (one gets message) | Consumer groups (each group gets all messages) |
| Ordering | FIFO within queue | Ordered within partition |
| Retention | Until consumed | Time-based (days/weeks) |
| Use case | Task distribution | Event sourcing, audit log, streaming |

Q16: Exactly-once message processing

Exactly-once is difficult. Approaches:

1. Idempotent consumers:

  • Process message multiple times, same result
  • Use unique message ID to detect duplicates
  • Most practical approach

2. Transactional outbox:

  • Write to DB and outbox table in same transaction
  • Separate process reads outbox, publishes to queue
  • Guarantees delivery without duplication

3. Two-phase commit:

  • Coordinate DB and queue in transaction
  • Complex, performance impact
  • Not commonly used

Practical approach: At-least-once delivery + idempotent consumers.
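An idempotent-consumer sketch, assuming each message carries a unique `id`; the durable dedup store is simplified here to an in-memory set (in production it would live in Redis or the database, updated in the same transaction as the side effect):

```python
processed = set()  # simplified; production would use a durable store

def handle(message: dict) -> str:
    """At-least-once delivery + dedup by message ID ~= exactly-once effect."""
    msg_id = message["id"]
    if msg_id in processed:
        return "duplicate-skipped"
    # ... real side effect goes here (send email, charge card, ...) ...
    processed.add(msg_id)
    return "processed"

# Redelivery of the same message is now harmless:
handle({"id": "m-1"})  # processed
handle({"id": "m-1"})  # duplicate-skipped
```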

Load Balancing

Q17: Load balancing algorithms

| Algorithm | Mechanism | Use Case |
|---|---|---|
| Round Robin | Rotate through servers | Servers are identical |
| Weighted Round Robin | More requests to stronger servers | Heterogeneous servers |
| Least Connections | Send to server with fewest active connections | Long-lived connections |
| IP Hash | Hash client IP to server | Session affinity needed |
| Random | Random server | Simple, decent distribution |

Layer 4 (TCP) vs Layer 7 (HTTP):

  • L4: Faster, fewer features
  • L7: Content-based routing, SSL termination, caching
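Two of the algorithms above can be sketched in a few lines (server names are placeholders; `active` maps each server to its in-flight request count):

```python
import itertools

def round_robin(servers):
    """Round robin: rotate through servers in order, forever."""
    return itertools.cycle(servers)

def least_connections(active):
    """Least connections: pick the server with fewest in-flight requests."""
    return min(active, key=active.get)

rr = round_robin(["s1", "s2", "s3"])
print(next(rr), next(rr), next(rr), next(rr))  # s1 s2 s3 s1
```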

Q18: Consistent hashing

Problem: Adding/removing servers causes massive redistribution with modulo hashing.

Solution:

  1. Map both servers and keys to a ring (hash space)
  2. Walk clockwise from key position to find server
  3. Adding server only moves keys between neighbors
  4. Virtual nodes improve distribution

On average only K/N keys move when adding/removing a node (vs nearly all keys with modulo hashing).

Used in: DynamoDB, Cassandra, CDNs.
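The four steps above can be sketched as a minimal ring with virtual nodes (MD5 and 100 vnodes per server are arbitrary illustrative choices):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                # Each physical node appears vnodes times on the ring,
                # which smooths out the key distribution.
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get(self, key: str):
        """Walk clockwise from the key's position to the next node."""
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]
```

Adding a node only inserts its vnodes into the sorted list, so keys move only from each vnode's clockwise neighbor, not across the whole cluster.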

Availability and Reliability

Q19: Availability vs reliability

Availability: System is operational when needed

  • Measured as uptime percentage (99.9% = 8.76 hours down/year)
  • Focus on reducing downtime

Reliability: System performs correctly over time

  • Measured as MTBF (Mean Time Between Failures)
  • Focus on reducing failures

A system can be available but unreliable (up but giving wrong answers). A system can be reliable but unavailable (works perfectly when running, but often down).

Q20: The nines

| Availability | Annual Downtime |
|---|---|
| 99% (two nines) | 3.65 days |
| 99.9% (three nines) | 8.76 hours |
| 99.99% (four nines) | 52.6 minutes |
| 99.999% (five nines) | 5.26 minutes |

Each nine is 10x harder to achieve. Five nines requires:

  • Redundancy at every level
  • Automated failover
  • Zero-downtime deployments
  • Geographic distribution
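The downtime figures in the table follow from simple arithmetic on a 365-day year:

```python
def annual_downtime_hours(availability: float) -> float:
    """Downtime budget per year at a given availability level."""
    return (1 - availability) * 365 * 24

annual_downtime_hours(0.999)    # ~8.76 hours  -> three nines
annual_downtime_hours(0.99999)  # ~0.0876 hours ~= 5.26 minutes -> five nines
```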

Q21: Circuit breaker pattern

Prevents cascading failures when a service is down.

States:

  1. Closed: Requests flow normally, count failures
  2. Open: After threshold failures, reject requests immediately
  3. Half-open: After timeout, allow test requests

Benefits:

  • Fails fast (no waiting for timeouts)
  • Prevents overwhelming failing service
  • Allows recovery time
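The three states map onto a small state machine; a sketch follows (the threshold and timeout values are illustrative, and production libraries add details like rolling failure windows):

```python
import time

class CircuitBreaker:
    """Minimal closed -> open -> half-open circuit breaker."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.time() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # allow a test request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed test request, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.time()
            raise
        self.failures = 0
        self.state = "closed"  # success closes the circuit again
        return result
```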

API Design

Q22: REST vs GraphQL vs gRPC

| Aspect | REST | GraphQL | gRPC |
|---|---|---|---|
| Protocol | HTTP | HTTP | HTTP/2 |
| Format | JSON | JSON | Protocol Buffers |
| Typing | Optional (OpenAPI) | Strong schema | Strong (protobuf) |
| Over-fetching | Common | No (client specifies) | No |
| Caching | HTTP caching works | Complex (POST) | Complex |
| Use case | Public APIs, CRUD | Mobile apps, complex UIs | Internal microservices |

Q23: API versioning

1. URL path versioning:

  • /api/v1/users, /api/v2/users
  • Clear, easy caching
  • Not "pure" REST

2. Header versioning:

  • Accept: application/vnd.api+json;version=2
  • Clean URLs
  • Harder to test

3. Query parameter:

  • /api/users?version=2
  • Simple
  • Not cacheable

URL versioning is most common and practical.

Q24: Rate limiting

Rate limiting: Restrict number of requests from a client.

Algorithms:

  • Token bucket: Tokens refill at fixed rate, request consumes token
  • Leaky bucket: Requests queue, process at fixed rate
  • Fixed window: Count requests per time window
  • Sliding window: Rolling window for smoother limiting

Implementation:

  • Redis for distributed systems (INCR with TTL)
  • Return 429 Too Many Requests
  • Include Retry-After header
  • Different limits for different tiers/endpoints
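The token bucket algorithm can be sketched as follows (single-process; a distributed limiter would keep the bucket state in Redis, as noted above):

```python
import time

class TokenBucket:
    """Token bucket: tokens refill at a fixed rate; each request spends one."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should respond 429 with a Retry-After header
```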

Microservices

Q25: Microservices vs monolith

| Aspect | Monolith | Microservices |
|---|---|---|
| Complexity | Lower | Higher (network, deployment) |
| Deployment | All or nothing | Independent |
| Scaling | Scale entire app | Scale individual services |
| Data consistency | Transactions easy | Distributed transactions difficult |
| Team organization | Single team | Multiple autonomous teams |
| Debugging | Stack traces | Distributed tracing needed |

Start with monolith, extract microservices when:

  • Team grows beyond what one codebase supports
  • Different scaling needs for components
  • Need independent deployment

Q26: Distributed transactions

1. Saga pattern:

  • Sequence of local transactions with compensating actions
  • If step fails, execute compensations for completed steps
  • Eventual consistency

2. Two-phase commit (2PC):

  • Coordinator asks participants to prepare
  • If all ready, commit; else abort
  • Blocking, performance impact

3. Outbox pattern:

  • Write to local DB + outbox table atomically
  • Separate process publishes events
  • Reliable event delivery

Best practice: Avoid distributed transactions. Design services with bounded contexts.
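The saga pattern above can be sketched as a list of (action, compensation) pairs; the order/payment step names are hypothetical:

```python
def run_saga(steps):
    """Run local transactions in order; on failure, run the compensating
    actions of the already-completed steps in reverse order."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        raise

log = []

def reserve_stock():  log.append("reserve-stock")
def release_stock():  log.append("release-stock")
def charge_card():    raise RuntimeError("payment declined")
def refund_card():    log.append("refund")

try:
    run_saga([(reserve_stock, release_stock), (charge_card, refund_card)])
except RuntimeError:
    pass

print(log)  # ['reserve-stock', 'release-stock']
```

Note that only completed steps are compensated: the failed payment step never charged, so no refund runs.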

Q27: Service discovery

Service discovery: How services find each other's network locations.

Client-side discovery:

  • Client queries registry, selects instance
  • More control, but client complexity
  • Example: Netflix Eureka

Server-side discovery:

  • Load balancer queries registry
  • Client simpler, but extra hop
  • Example: AWS ALB, Kubernetes

Registry options: etcd, Consul, ZooKeeper, DNS-based (Kubernetes)

Needed because: Container IPs change, instances scale dynamically.

Security

Q28: API security

  1. Authentication:

    • API keys (simple, limited)
    • JWT tokens (stateless, self-contained)
    • OAuth 2.0 (delegated access)
  2. Authorization:

    • RBAC (Role-Based Access Control)
    • Check permissions on every request
  3. Transport:

    • HTTPS everywhere
    • TLS 1.2+ minimum
  4. Input validation:

    • Never trust client input
    • Parameterized queries (prevent SQL injection)
    • Sanitize output (prevent XSS)
  5. Rate limiting and throttling

  6. Logging and monitoring:

    • Audit sensitive operations
    • Detect anomalies

Q29: OAuth 2.0 flows

Authorization Code (most secure):

  1. User redirected to auth server
  2. User authenticates, gets authorization code
  3. Backend exchanges code for access token
  4. Backend uses token for API calls

Client Credentials:

  • Machine-to-machine (no user)
  • Client directly requests token with credentials

Implicit (deprecated):

  • Token returned directly in redirect
  • Not secure (token exposed in URL)

Refresh tokens:

  • Long-lived token to get new access tokens
  • Access tokens should be short-lived

Monitoring and Observability

Q30: Three pillars of observability

1. Logs:

  • Record of discrete events
  • Structured logs (JSON) for parsing
  • Centralized (ELK, CloudWatch)

2. Metrics:

  • Numeric measurements over time
  • Aggregatable (counters, gauges, histograms)
  • Prometheus, CloudWatch, Datadog

3. Traces:

  • Request path through distributed system
  • Correlation IDs across services
  • Jaeger, X-Ray, Zipkin

Together they answer:

  • What happened? (logs)
  • How much/how fast? (metrics)
  • Where was time spent? (traces)
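Correlation IDs are what tie logs and traces together across services; a sketch of a structured (JSON) log line carrying one (the field names and service names are illustrative):

```python
import json
import time
import uuid

def log_event(service: str, message: str, correlation_id: str, **fields) -> dict:
    """Emit one structured log line; the correlation ID lets a single
    request be followed across every service that handled it."""
    record = {
        "ts": time.time(),
        "service": service,
        "correlation_id": correlation_id,
        "message": message,
        **fields,
    }
    print(json.dumps(record))
    return record

# Generated at the edge, then propagated to downstream services via headers.
cid = str(uuid.uuid4())
log_event("api-gateway", "request received", cid, path="/orders")
log_event("order-service", "order created", cid, order_id=123)
```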