Data Engineering Concept Questions
This section covers frequently asked conceptual questions in data engineering interviews, organized by topic area.
Data Pipeline Fundamentals
Q1: Explain the difference between ETL and ELT.
| Aspect | ETL | ELT |
|---|---|---|
| Transform location | Before loading (staging server) | After loading (in warehouse) |
| Compute requirements | Requires dedicated transform infrastructure | Uses warehouse compute resources |
| Time until data is available | Higher (transform must finish before load) | Lower (raw data lands immediately) |
| Flexibility | Lower (schema defined early) | Higher (raw data preserved for future transformations) |
ELT has become the predominant pattern due to:
- Scalable compute in cloud warehouses (Snowflake, BigQuery, Redshift)
- Preservation of raw data for new transformation requirements
- Version-controlled transformations through tools like dbt
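A minimal sketch of the ELT flow, using SQLite as a stand-in for a cloud warehouse (table and column names are illustrative, and the JSON functions assume a reasonably recent SQLite build): raw payloads are loaded untouched, and the transformation runs afterwards inside the database, which is the step a tool like dbt would version-control.

```python
import json
import sqlite3

# Stand-in "warehouse": load raw JSON payloads without transforming them first.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT)")

events = [
    {"user_id": 1, "amount": "19.99", "ts": "2024-01-01T10:00:00"},
    {"user_id": 2, "amount": "5.00", "ts": "2024-01-01T11:30:00"},
]
conn.executemany(
    "INSERT INTO raw_events (payload) VALUES (?)",
    [(json.dumps(e),) for e in events],
)

# Transform *after* loading, using the warehouse's own SQL engine.
conn.execute("""
    CREATE TABLE daily_revenue AS
    SELECT date(json_extract(payload, '$.ts')) AS day,
           SUM(CAST(json_extract(payload, '$.amount') AS REAL)) AS revenue
    FROM raw_events
    GROUP BY day
""")
print(conn.execute("SELECT * FROM daily_revenue").fetchall())
```

Because the raw table is preserved, a new transformation requirement only needs a new query, not a re-extraction from the source system.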
Q2: What is data lineage and why does it matter?
Data lineage tracks three aspects of data:
- Origin (source systems)
- Transformations applied
- Downstream destinations
| Use Case | Description |
|---|---|
| Debugging | Tracing incorrect values back through transformation chains |
| Impact analysis | Identifying downstream effects of schema changes |
| Compliance | Documenting personal data flows for GDPR and similar regulations |
| Trust | Providing transparency for data consumers |
Common tools: Apache Atlas, DataHub, Amundsen, dbt docs.
Q3: How do you handle schema evolution?
Schema evolution refers to changes in data structure over time (new columns, type changes, field removals).
Strategies:
| Strategy | Description |
|---|---|
| Schema Registry | Validates new schemas for compatibility before allowing changes (Kafka approach) |
| Avro/Protobuf | Serialization formats with built-in schema evolution support |
| Schema-on-read | Store raw JSON/data, apply schema at query time |
| Versioned tables | Maintain separate table versions (orders_v1, orders_v2) |
Compatibility types:
| Type | Definition | Use Case |
|---|---|---|
| Backward | New code reads old data | Most common, safest default |
| Forward | Old code reads new data | Useful during migrations |
| Full | Both directions supported | Maximum flexibility |
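A sketch of backward compatibility using Avro, assuming the `fastavro` package (record and field names are illustrative): the new reader schema adds a field with a default, so new code can still read records written with the old schema.

```python
import io
import fastavro

# Old (writer) schema: the shape of data that has already been written.
writer_schema = fastavro.parse_schema({
    "type": "record", "name": "Order",
    "fields": [{"name": "id", "type": "long"}],
})

# New (reader) schema: adds a field with a default, keeping it backward compatible.
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "status", "type": "string", "default": "unknown"},
    ],
})

buf = io.BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"id": 42})
buf.seek(0)

# The missing field is filled from the default value.
print(fastavro.schemaless_reader(buf, writer_schema, reader_schema))
# {'id': 42, 'status': 'unknown'}
```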
Q4: What is idempotency and why is it important in data pipelines?
An idempotent operation leaves the system in the same final state whether it runs once or many times.
Significance:
| Scenario | Without Idempotency | With Idempotency |
|---|---|---|
| Job retry after failure | Duplicate records | Consistent state |
| Backfill operations | Data corruption | Clean reprocessing |
| Partial failures | Unknown state | Deterministic recovery |
Implementation techniques:
- Use MERGE or UPSERT instead of INSERT
- Partition by date with overwrite mode
- Include deduplication keys
- Use transactional writes
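A minimal sketch of the upsert technique using SQLite's `INSERT ... ON CONFLICT` (table and key names are illustrative): replaying the same batch after a retry leaves the table unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id  INTEGER PRIMARY KEY,   -- deduplication key
        amount    REAL,
        loaded_at TEXT
    )
""")

def load_batch(rows):
    # Upsert instead of plain INSERT: a retry or backfill that replays
    # the same batch overwrites rather than duplicates.
    conn.executemany(
        """
        INSERT INTO orders (order_id, amount, loaded_at)
        VALUES (?, ?, datetime('now'))
        ON CONFLICT(order_id) DO UPDATE SET
            amount = excluded.amount,
            loaded_at = excluded.loaded_at
        """,
        rows,
    )

batch = [(1, 9.99), (2, 25.00)]
load_batch(batch)
load_batch(batch)  # simulate a retry -- still exactly two rows
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone())  # (2,)
```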
Q5: Explain the difference between data lake, data warehouse, and lakehouse.
| Aspect | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Data format | Raw, any format | Structured tables | Both |
| Schema | Schema-on-read | Schema-on-write | Configurable |
| Use case | ML, data exploration | BI, reporting | Combined workloads |
| Cost profile | Low storage, variable query cost | Higher storage, optimized queries | Balanced |
| Examples | S3 + Spark | Snowflake, BigQuery | Databricks, Delta Lake |
The lakehouse architecture combines warehouse features (ACID transactions, schema enforcement, optimized queries) with lake storage (low cost, format flexibility). Delta Lake and Apache Iceberg are the primary table formats enabling this pattern.
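A small sketch of lakehouse-style writes, assuming the `deltalake` package (the delta-rs Python bindings) and a purely illustrative local path: the table is stored as Parquet files plus a transaction log, and overwrites are atomic, versioned commits.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "/tmp/events_delta"  # illustrative; could equally be an s3:// URI

# Initial load creates Parquet data files plus the _delta_log transaction log.
write_deltalake(path, pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}))

# Overwrite is an atomic, versioned commit rather than a file-level delete.
write_deltalake(path, pd.DataFrame({"id": [3], "value": ["c"]}), mode="overwrite")

dt = DeltaTable(path)
print(dt.version())    # 1
print(dt.to_pandas())  # only the overwritten data
print(dt.history())    # both commits remain visible in the log
```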
Batch vs Streaming
Q6: When would you choose batch vs streaming processing?
| Factor | Batch | Streaming |
|---|---|---|
| Latency tolerance | Minutes to hours acceptable | Seconds required |
| Data completeness | Full dataset required | Partial data acceptable |
| Implementation complexity | Lower | Higher |
| Infrastructure cost | Generally lower | Higher |
| Example use cases | Reports, ML training | Fraud detection, live dashboards |
Batch processing is appropriate when latency requirements permit and data completeness is required. Streaming adds operational complexity and should be used when sub-minute latency is a business requirement.
Q7: Explain event time vs processing time.
| Concept | Definition |
|---|---|
| Event time | Timestamp when the event occurred |
| Processing time | Timestamp when the system processes the event |
These values differ when events are delayed (network issues, offline devices, buffering).
Example: A user action at 2:00 PM on an offline device may reach the processing system at 2:45 PM when connectivity is restored. Using processing time would attribute this event to the wrong time window.
Late data handling mechanisms:
| Mechanism | Description |
|---|---|
| Watermarks | System estimate of event time progress ("all events up to time T have arrived") |
| Allowed lateness | Threshold for accepting late events (e.g., 1 hour) |
| Side outputs | Separate handling path for extremely late events |
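A sketch of event-time windowing with a watermark in PySpark Structured Streaming, using the built-in `rate` source so it runs without external infrastructure (window sizes and lateness threshold are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-time-demo").getOrCreate()

# The rate source emits rows with an event-time "timestamp" column.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

counts = (
    events
    # Watermark: accept events up to 10 minutes late, then drop their state.
    .withWatermark("timestamp", "10 minutes")
    # Aggregate by event time, not by when rows happen to be processed.
    .groupBy(F.window(F.col("timestamp"), "1 minute"))
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination(30)  # run briefly for the demo
query.stop()
```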
Q8: What is backpressure and how do you handle it?
Backpressure builds up when the rate of incoming data exceeds the rate at which downstream components can process it, forcing data to queue somewhere in the system.
Consequences:
- Memory exhaustion
- System crashes
- Data loss
Mitigation strategies:
| Strategy | Description |
|---|---|
| Buffering | Queue messages with defined limits |
| Dropping | Sample or discard events (acceptable for some metrics) |
| Source throttling | Signal producers to reduce rate |
| Horizontal scaling | Add consumer instances |
| Auto-scaling | Dynamic infrastructure adjustment |
Kafka handles backpressure through consumer-controlled polling. Consumers read at their own pace; unprocessed messages remain in Kafka with measurable "lag."
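A sketch of the consumer-side view, assuming the `confluent-kafka` client with an illustrative broker, topic, and threshold; `handle` is a hypothetical stand-in for real processing. The consumer pulls at its own pace, and per-partition lag is the signal to alert or scale on.

```python
from confluent_kafka import Consumer, TopicPartition

def handle(value: bytes) -> None:
    """Stand-in for real processing; replace with actual logic."""

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # illustrative
    "group.id": "metrics-pipeline",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])               # illustrative topic

while True:  # typical long-running consumer loop
    msg = consumer.poll(1.0)                 # consumer pulls at its own pace
    if msg is None or msg.error():
        continue
    handle(msg.value())

    # Lag for this partition = log end offset - next offset to consume.
    tp = TopicPartition(msg.topic(), msg.partition())
    _, high = consumer.get_watermark_offsets(tp)
    lag = high - (msg.offset() + 1)
    if lag > 100_000:                        # illustrative threshold
        pass  # emit a metric / trigger scaling instead of crashing
```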
Q9: Explain exactly-once semantics in streaming.
| Delivery Guarantee | Description | Trade-off |
|---|---|---|
| At-most-once | Messages may be lost, never duplicated | Lowest latency, simplest implementation |
| At-least-once | Messages never lost, may be duplicated | Requires idempotent consumers |
| Exactly-once | No loss, no duplicates | Highest complexity, additional latency |
Exactly-once requirements:
- Idempotent producers (source deduplication)
- Transactional writes (atomic commits)
- Checkpointing (recovery state persistence)
- End-to-end coordination (all components must participate)
In practice, most production systems implement at-least-once delivery with idempotent consumers. True exactly-once semantics adds complexity and latency.
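A sketch of the producer side, assuming the `confluent-kafka` client (broker address, topic, and transactional id are illustrative): idempotence lets the broker deduplicate producer retries, and the transaction makes a batch of writes atomic.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # illustrative
    "enable.idempotence": True,             # broker deduplicates producer retries
    "transactional.id": "orders-etl-1",     # enables atomic multi-message commits
})

producer.init_transactions()
producer.begin_transaction()
try:
    for i in range(3):
        producer.produce("orders-enriched", key=str(i), value=f"payload-{i}")
    producer.commit_transaction()           # all messages become visible together
except Exception:
    producer.abort_transaction()            # none of them become visible
    raise
```

Downstream consumers must also read with `isolation.level=read_committed` for the guarantee to hold end to end.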
Q10: What are the Lambda and Kappa architectures?
Lambda Architecture:
| Layer | Purpose |
|---|---|
| Batch layer | Complete, accurate historical processing |
| Speed layer | Real-time approximate results |
| Serving layer | Merges batch and speed layer outputs |
Trade-off: Maintains two separate codebases for similar logic.
Kappa Architecture:
Single streaming system for all processing. Historical reprocessing is performed by replaying events through the same pipeline.
| Aspect | Lambda | Kappa |
|---|---|---|
| Codebases | Two (batch + streaming) | One |
| Complexity | Higher | Lower |
| Reprocessing | Fast (dedicated batch) | Slower (stream replay) |
Kappa architecture is generally preferred for simplicity. A batch layer can be added later if specific limitations arise.
Data Quality
Q11: How do you ensure data quality in pipelines?
Data quality validation operates at multiple levels:
| Level | Description | Example |
|---|---|---|
| Schema validation | Column existence, data types | Expected columns present with correct types |
| Constraint validation | Primary key uniqueness, foreign key validity | No duplicate IDs, valid references |
| Statistical validation | Volume, null rates, distributions | Row count within expected range |
| Business rule validation | Domain-specific logic | Revenue values positive, dates within valid range |
Tools: Great Expectations, dbt tests, Soda.
Best practices:
- Validate at each pipeline stage (fail fast)
- Alert on anomalies rather than logging only
- Quarantine invalid data for investigation rather than discarding
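A minimal sketch of the validation levels above in plain pandas (column names and thresholds are illustrative); in practice a framework such as Great Expectations or dbt tests expresses the same checks declaratively.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list:
    """Return a list of failed checks; an empty list means the batch passes."""
    failures = []

    # Schema validation: expected columns must exist before anything else runs.
    if not {"order_id", "amount", "order_date"}.issubset(df.columns):
        return ["missing expected columns"]  # fail fast: later checks need them

    # Constraint validation: primary-key uniqueness.
    if df["order_id"].duplicated().any():
        failures.append("duplicate order_id values")

    # Statistical validation: volume and null rate.
    if len(df) < 1000:                       # illustrative threshold
        failures.append("row count below expected range")
    if df["amount"].isna().mean() > 0.01:
        failures.append("amount null rate above 1%")

    # Business rule validation.
    if (df["amount"] < 0).any():
        failures.append("negative revenue values")

    return failures

# On any failure: alert and quarantine the batch rather than loading or discarding it.
```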
Q12: What is SLA for data pipelines and how do you ensure it?
A Service Level Agreement (SLA) defines commitments to downstream consumers:
| SLA Type | Example |
|---|---|
| Freshness | Data updated by 8 AM daily |
| Completeness | All records from prior day included |
| Accuracy | Values match source systems |
Enforcement strategies:
| Strategy | Description |
|---|---|
| Monitoring | Detect issues before consumers notice |
| Buffer time | Schedule jobs with margin (7 AM job for 8 AM SLA) |
| Automated retries | Exponential backoff for transient failures |
| Redundancy | Backup systems for critical pipelines |
| Clear ownership | Defined on-call responsibilities |
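A small sketch of a freshness check (SLA values and thresholds are illustrative): compare the latest successful load against the commitment and alert before consumers notice.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=24)   # e.g. "data updated by 8 AM daily"
ALERT_BUFFER = timedelta(hours=1)     # page an hour before the SLA breaks

def check_freshness(last_loaded_at: datetime) -> str:
    age = datetime.now(timezone.utc) - last_loaded_at
    if age > FRESHNESS_SLA:
        return "breach"     # consumers are already affected
    if age > FRESHNESS_SLA - ALERT_BUFFER:
        return "warning"    # act now, before the commitment is missed
    return "ok"

# Example: a table last refreshed 23.5 hours ago is in the warning window.
print(check_freshness(datetime.now(timezone.utc) - timedelta(hours=23, minutes=30)))
```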
Q13: How do you handle data skew?
Data skew occurs when data distribution is uneven across partitions, causing some tasks to take significantly longer than others.
Symptoms:
- Single Spark tasks running 10x longer than others
- Out-of-memory errors on specific executors
- Straggler tasks visible in Spark UI
Solutions:
| Technique | Description |
|---|---|
| Salting | Add random suffix to hot keys, aggregate in two passes |
| Broadcast joins | Broadcast small tables to all executors |
| Adaptive Query Execution | Spark 3.0+ automatic skew handling |
| Pre-aggregation | Reduce cardinality before joins |
| Hot key isolation | Process high-cardinality keys separately |
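A sketch of the salting technique in PySpark (column names and the salt factor are illustrative): spread each hot key across N sub-keys, aggregate partially, then combine.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()
df = spark.createDataFrame(
    [("hot_customer", 1.0)] * 1000 + [("other", 2.0)] * 10,
    ["customer_id", "amount"],
)

SALT_BUCKETS = 16  # illustrative

# Pass 1: aggregate on (key, salt), so the hot key is split across 16 tasks.
partial = (
    df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
      .groupBy("customer_id", "salt")
      .agg(F.sum("amount").alias("partial_sum"))
)

# Pass 2: combine the partial aggregates per key.
totals = partial.groupBy("customer_id").agg(F.sum("partial_sum").alias("total"))
totals.show()
```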
Distributed Systems
Q14: Explain partitioning strategies for distributed data.
| Strategy | Method | Advantages | Disadvantages |
|---|---|---|---|
| Hash | hash(key) mod N | Even distribution | Poor for range queries |
| Range | Key ranges to partitions | Efficient range queries | Potential hot spots |
| Round-robin | Rotate through partitions | Simple, even spread | No key locality |
Common applications:
- Kafka: Hash partitioning on message key
- Spark: Configurable (hash default)
- Databases: Hash or range sharding
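A toy illustration of hash partitioning (the keys, hash function, and partition count are arbitrary, and this is not Kafka's actual murmur2 partitioner): the same key always maps to the same partition, which is the property per-key ordering relies on.

```python
from hashlib import md5

def hash_partition(key: str, num_partitions: int) -> int:
    """Stable hash(key) mod N; avoids Python's per-process hash randomization."""
    return int(md5(key.encode()).hexdigest(), 16) % num_partitions

for key in ["user-1", "user-2", "user-1"]:
    print(key, "->", hash_partition(key, num_partitions=8))
# "user-1" lands on the same partition both times.
```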
Q15: What is the CAP theorem?
The CAP theorem states that a distributed system cannot simultaneously guarantee all three of the following properties; when a network partition occurs, it must sacrifice either consistency or availability:
| Property | Definition |
|---|---|
| Consistency | Every read returns the most recent write |
| Availability | Every request receives a response |
| Partition tolerance | System operates despite network failures |
Since network partitions are inevitable in distributed systems, the practical choice is between:
| Type | Behavior | Example Use Case |
|---|---|---|
| CP | Consistent but may be unavailable during partitions | Financial transactions |
| AP | Available but may return stale data during partitions | Social media feeds |
Q16: Explain how Kafka works at a high level.
Kafka is a distributed commit log with the following components:
| Component | Description |
|---|---|
| Topics | Named streams of messages |
| Partitions | Subdivisions of topics for parallelism; ordering guaranteed within partition only |
| Producers | Write messages to topics |
| Consumers | Read messages, tracking position via offsets |
| Consumer groups | Multiple consumers sharing load; each partition assigned to one consumer per group |
| Brokers | Kafka server instances |
| Replication | Partitions replicated across brokers for fault tolerance |
Performance characteristics:
- Sequential disk writes (high throughput)
- Consumer-controlled replay from any offset
- Append-only writes, sequential reads
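A brief sketch tying the components together, assuming the `confluent-kafka` client and illustrative broker/topic names: the producer keys messages (same key, same partition), and the consumer joins a group and tracks its position via offsets.

```python
from confluent_kafka import Consumer, Producer

# Producer: messages with the same key hash to the same partition,
# which preserves per-key ordering.
producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("page-views", key="user-42", value='{"page": "/home"}')
producer.flush()

# Consumer: joins a group; partitions are divided among group members,
# and progress is tracked as committed offsets.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics-loader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["page-views"])
msg = consumer.poll(10.0)
if msg is not None and not msg.error():
    print(msg.partition(), msg.offset(), msg.value())
consumer.close()
```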
Q17: How do you handle failures in distributed data processing?
| Failure Type | Recovery Strategy |
|---|---|
| Task failure | Retry on same or different node |
| Node failure | Reassign tasks to healthy nodes |
| State corruption | Restore from checkpoint |
| Full job failure | Restart from beginning (requires idempotent pipeline) |
Recovery mechanisms:
| Mechanism | Description |
|---|---|
| Checkpointing | Periodic state snapshots for recovery |
| Write-ahead logs | Record operations before execution |
| Lineage (Spark) | Recompute lost data from parent RDDs |
| Exponential backoff | Increasing delay between retry attempts |
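A small sketch of retries with exponential backoff and jitter (the retry budget and base delay are arbitrary, and `load_partition` in the usage note is a hypothetical helper):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Call fn(); on failure wait ~1s, 2s, 4s, ... (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                   # budget exhausted, surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

# Usage: wrap a flaky operation such as an API call or a warehouse load.
# retry_with_backoff(lambda: load_partition("2024-01-01"))
```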
Q18: What is data locality and why does it matter?
Data locality refers to processing data where it resides, minimizing network transfer.
Locality levels (best to worst):
| Level | Description |
|---|---|
| Process local | Data in JVM memory |
| Node local | Data on local disk |
| Rack local | Data on same rack |
| Network | Data requires network transfer |
Network transfer is typically the bottleneck in distributed systems, so Spark and Hadoop schedulers try to place tasks on the nodes that already hold the required data.
Tools and Technologies
Q19: Compare Spark vs Flink.
| Aspect | Spark | Flink |
|---|---|---|
| Origin | Batch-first, streaming added later | Streaming-first, batch added later |
| Streaming model | Micro-batches | True event-at-a-time |
| Latency | Seconds | Milliseconds |
| State management | External checkpoints | Built-in managed state |
| Ecosystem | Large, mature | Smaller, growing |
| Best for | Batch workloads, SQL analytics | Real-time processing, stateful streaming |
Selection criteria:
- Spark: Primarily batch workloads, existing team expertise
- Flink: Sub-second latency requirements, complex stateful stream processing
Q20: When would you use a columnar vs row-based storage format?
| Format Type | Storage Pattern | Advantages | Disadvantages |
|---|---|---|---|
| Row (CSV, JSON, Avro) | All fields of record together | Efficient single-record access | Inefficient for column aggregations |
| Columnar (Parquet, ORC) | All values of column together | Efficient aggregations, better compression | Inefficient for full-record access |
Columnar advantages for analytics:
- Read only required columns
- Superior compression (similar values grouped)
- Vectorized processing optimization
Guidelines:
- Parquet: Data lakes, warehouses, ML workloads
- Avro: CDC streams, message passing
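A sketch of column pruning with `pyarrow` (file path and schema are illustrative): because Parquet stores each column contiguously, the reader can pull only the requested column instead of every field of every row.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table in columnar (Parquet) format.
table = pa.table({
    "user_id": [1, 2, 3],
    "amount": [9.99, 25.00, 3.50],
    "payload": ["...", "...", "..."],   # wide column we usually don't need
})
pq.write_table(table, "/tmp/orders.parquet")

# Column pruning: only the 'amount' column is read from storage.
amounts = pq.read_table("/tmp/orders.parquet", columns=["amount"])
print(amounts["amount"].to_pylist())    # [9.99, 25.0, 3.5]
```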
Q21: Explain the role of a data catalog.
A data catalog provides centralized metadata management for data assets.
| Function | Description |
|---|---|
| Discovery | Search for datasets by name, description, tags |
| Metadata | Schemas, owners, update timestamps |
| Lineage | Upstream dependencies and downstream consumers |
| Quality signals | Freshness, completeness, trust scores |
| Access control | Permission management |
Tools: DataHub, Amundsen, Alation, AWS Glue Catalog.
A data catalog becomes necessary when:
- Table count exceeds tribal knowledge capacity
- New team members require onboarding
- Compliance requires data flow documentation
Q22: What is CDC (Change Data Capture)?
CDC captures insert, update, and delete operations from databases and streams them to downstream systems.
Implementation approaches:
| Approach | Method | Source Impact | Captures Deletes |
|---|---|---|---|
| Query-based | Poll for rows where updated_at > last_check | Medium-high | No |
| Trigger-based | Database triggers on changes | High | Yes |
| Log-based | Read database transaction log | Minimal | Yes |
Log-based CDC is preferred:
- Minimal impact on the source database (reads logs the database already writes)
- Captures all change types including deletes
- Low latency
Tools: Debezium (open source), AWS DMS, Fivetran, Airbyte.
Use cases:
- Data warehouse synchronization
- Event-driven architectures
- Cache invalidation
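A minimal sketch of the query-based approach from the table above, using SQLite (table and watermark column are illustrative); it shows why polling is simple but blind to deletes: a removed row simply never matches the `updated_at` filter again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', '2024-01-01T10:00:00')")
conn.execute("INSERT INTO customers VALUES (2, 'Lin', '2024-01-02T09:00:00')")

def poll_changes(last_check: str):
    # Query-based CDC: fetch rows modified since the last watermark.
    return conn.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (last_check,),
    ).fetchall()

print(poll_changes("2024-01-01T12:00:00"))   # picks up Lin's row (changed after the watermark)

conn.execute("DELETE FROM customers WHERE id = 1")
print(poll_changes("2024-01-01T12:00:00"))   # Ada's delete is invisible to polling
```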
Design and Architecture
Q23: How would you design a real-time analytics pipeline?
Architecture layers:
| Layer | Purpose | Example Technologies |
|---|---|---|
| Ingestion | Event capture | Kafka, Kinesis |
| Processing | Stream transformations | Flink, Spark Streaming |
| Storage | Time-series data | InfluxDB, Druid, ClickHouse |
| Serving | Query API | REST/GraphQL endpoints |
Design considerations:
| Consideration | Approach |
|---|---|
| Exactly-once delivery | Flink checkpoints + idempotent writes |
| Late data | Watermarks + allowed lateness configuration |
| Backpressure | Consumer scaling, rate limiting |
| Horizontal scaling | Partition-based parallelism |
Q24: How do you handle slowly changing dimensions (SCD)?
Slowly changing dimensions are attributes that change over time (customer address, product price).
| Type | Behavior | History Preserved |
|---|---|---|
| Type 1 | Overwrite | No |
| Type 2 | Add new row with date range | Full history |
| Type 3 | Add column for previous value | Limited (one prior value) |
Type 2 example:
A customer who moved from NYC to LA would have two rows: one with NYC (valid from 2020-01-01 to 2023-01-01, is_current = false) and one with LA (valid from 2023-01-01 to 9999-12-31, is_current = true).
Type 2 is most common as it enables historical queries ("customer's city at time of order").
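A sketch of the Type 2 pattern in SQL, run here through SQLite (table and column names are illustrative): an address change closes the current row and inserts a new one, so history stays queryable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_id INTEGER,
        city        TEXT,
        valid_from  TEXT,
        valid_to    TEXT,
        is_current  INTEGER
    );
    INSERT INTO dim_customer VALUES (100, 'NYC', '2020-01-01', '9999-12-31', 1);
""")

def apply_scd2(customer_id: int, new_city: str, change_date: str) -> None:
    # Close out the current row...
    conn.execute(
        """UPDATE dim_customer
           SET valid_to = ?, is_current = 0
           WHERE customer_id = ? AND is_current = 1""",
        (change_date, customer_id),
    )
    # ...and insert a new current row, preserving full history.
    conn.execute(
        "INSERT INTO dim_customer VALUES (?, ?, ?, '9999-12-31', 1)",
        (customer_id, new_city, change_date),
    )

apply_scd2(100, "LA", "2023-01-01")
# "Customer's city at time of order" = join on order_date BETWEEN valid_from AND valid_to.
print(conn.execute("SELECT * FROM dim_customer ORDER BY valid_from").fetchall())
```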
Q25: What factors do you consider when choosing a data storage solution?
| Factor | Considerations |
|---|---|
| Scale | Current volume, projected growth |
| Query patterns | OLTP (many small operations) vs OLAP (large aggregations) |
| Latency requirements | Milliseconds vs seconds vs minutes acceptable |
| Consistency needs | Strong vs eventual consistency |
| Schema stability | Fixed vs frequently evolving |
| Cost | Storage, compute, operational overhead |
| Team expertise | Existing knowledge and maintenance capability |
Team expertise is a significant factor. A well-understood technology often outperforms a theoretically superior but unfamiliar alternative.