
Data Engineering Concept Questions

This section covers frequently asked conceptual questions in data engineering interviews, organized by topic area.

Data Pipeline Fundamentals

Q1: Explain the difference between ETL and ELT.

| Aspect | ETL | ELT |
|---|---|---|
| Transform location | Before loading (staging server) | After loading (in warehouse) |
| Compute requirements | Requires dedicated transform infrastructure | Uses warehouse compute resources |
| Latency | Higher (transformation precedes loading) | Lower (raw data loaded first) |
| Flexibility | Lower (schema defined early) | Higher (raw data preserved for future transformations) |

ELT has become the predominant pattern due to:

  • Scalable compute in cloud warehouses (Snowflake, BigQuery, Redshift)
  • Preservation of raw data for new transformation requirements
  • Version-controlled transformations through tools like dbt
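
A minimal ELT sketch, using sqlite3 as a stand-in for a cloud warehouse (table and column names are illustrative): raw rows are loaded unchanged, then transformed with SQL inside the warehouse.

```python
# ELT sketch: load raw records first, transform inside the "warehouse".
# sqlite3 stands in for a cloud warehouse; table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")

# 1. Load: copy source rows as-is, no transformation yet.
raw_rows = [(1, 1999, "shipped"), (2, 500, "cancelled"), (3, 2500, "shipped")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# 2. Transform: derive a clean model using warehouse compute (plain SQL here;
#    in practice this step is often a version-controlled dbt model).
conn.execute("""
    CREATE TABLE fct_orders AS
    SELECT id, amount_cents / 100.0 AS amount_usd
    FROM raw_orders
    WHERE status != 'cancelled'
""")
print(conn.execute("SELECT * FROM fct_orders").fetchall())
```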

Q2: What is data lineage and why does it matter?

Data lineage tracks three aspects of data:

  1. Origin (source systems)
  2. Transformations applied
  3. Downstream destinations

| Use Case | Description |
|---|---|
| Debugging | Tracing incorrect values back through transformation chains |
| Impact analysis | Identifying downstream effects of schema changes |
| Compliance | Documenting personal data flows for GDPR and similar regulations |
| Trust | Providing transparency for data consumers |

Common tools: Apache Atlas, DataHub, Amundsen, dbt docs.

Q3: How do you handle schema evolution?

Schema evolution refers to changes in data structure over time (new columns, type changes, field removals).

Strategies:

| Strategy | Description |
|---|---|
| Schema Registry | Validates new schemas for compatibility before allowing changes (Kafka approach) |
| Avro/Protobuf | Serialization formats with built-in schema evolution support |
| Schema-on-read | Store raw data (e.g., JSON); apply schema at query time |
| Versioned tables | Maintain separate table versions (orders_v1, orders_v2) |

Compatibility types:

| Type | Definition | Use Case |
|---|---|---|
| Backward | New code reads old data | Most common, safest default |
| Forward | Old code reads new data | Useful during migrations |
| Full | Both directions supported | Maximum flexibility |
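
A small backward-compatibility sketch using Avro via the fastavro library (assumed as a dependency; schemas are illustrative): the new reader schema adds an optional field with a default, so it can still read records written with the old schema.

```python
# Backward compatibility: new reader schema resolves records written with the old schema.
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

old_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"}],
})
new_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        # New optional field with a default keeps the change backward compatible.
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, old_schema, {"id": 42})           # producer still on old schema
buf.seek(0)
record = schemaless_reader(buf, old_schema, new_schema)  # new code reads old data
print(record)  # {'id': 42, 'email': None}
```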

Q4: What is idempotency and why is it important in data pipelines?

An idempotent operation produces the same result regardless of how many times it executes.

Significance:

| Scenario | Without Idempotency | With Idempotency |
|---|---|---|
| Job retry after failure | Duplicate records | Consistent state |
| Backfill operations | Data corruption | Clean reprocessing |
| Partial failures | Unknown state | Deterministic recovery |

Implementation techniques:

  • Use MERGE or UPSERT instead of INSERT
  • Partition by date with overwrite mode
  • Include deduplication keys
  • Use transactional writes
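
A minimal sketch of the upsert technique, using sqlite3 as a stand-in for the target store (table and column names are illustrative): re-running the load produces the same final state.

```python
# Idempotent load: an UPSERT keyed on a natural key makes retries harmless.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

def load(batch):
    # Re-running load(batch) any number of times leaves identical rows.
    conn.executemany(
        """INSERT INTO orders (order_id, amount) VALUES (?, ?)
           ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount""",
        batch,
    )

batch = [(1, 19.99), (2, 5.00)]
load(batch)
load(batch)  # retry after a (simulated) failure: no duplicates
print(conn.execute("SELECT * FROM orders").fetchall())  # [(1, 19.99), (2, 5.0)]
```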

Q5: Explain the difference between data lake, data warehouse, and lakehouse.

| Aspect | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Data format | Raw, any format | Structured tables | Both |
| Schema | Schema-on-read | Schema-on-write | Configurable |
| Use case | ML, data exploration | BI, reporting | Combined workloads |
| Cost profile | Low storage, variable query cost | Higher storage, optimized queries | Balanced |
| Examples | S3 + Spark | Snowflake, BigQuery | Databricks, Delta Lake |

The lakehouse architecture combines warehouse features (ACID transactions, schema enforcement, optimized queries) with lake storage (low cost, format flexibility). Delta Lake and Apache Iceberg are the primary table formats enabling this pattern.

Batch vs Streaming

Q6: When would you choose batch vs streaming processing?

| Factor | Batch | Streaming |
|---|---|---|
| Latency tolerance | Minutes to hours acceptable | Seconds required |
| Data completeness | Full dataset required | Partial data acceptable |
| Implementation complexity | Lower | Higher |
| Infrastructure cost | Generally lower | Higher |
| Example use cases | Reports, ML training | Fraud detection, live dashboards |

Batch processing is appropriate when latency requirements permit and data completeness is required. Streaming adds operational complexity and should be used when sub-minute latency is a business requirement.

Q7: Explain event time vs processing time.

| Concept | Definition |
|---|---|
| Event time | Timestamp when the event occurred |
| Processing time | Timestamp when the system processes the event |

These values differ when events are delayed (network issues, offline devices, buffering).

Example: A user action at 2:00 PM on an offline device may reach the processing system at 2:45 PM when connectivity is restored. Using processing time would attribute this event to the wrong time window.

Late data handling mechanisms:

| Mechanism | Description |
|---|---|
| Watermarks | System estimate of event-time progress ("all events up to time T have arrived") |
| Allowed lateness | Threshold for accepting late events (e.g., 1 hour) |
| Side outputs | Separate handling path for extremely late events |
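
A sketch of event-time windowing with a watermark in Spark Structured Streaming (assumes pyspark is installed; the built-in rate source stands in for a real event stream, with its timestamp column playing the role of event time).

```python
# Event-time aggregation with a watermark that tolerates up to 1 hour of lateness.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

counts = (
    events
    .withWatermark("timestamp", "1 hour")         # accept events up to 1 hour late
    .groupBy(window("timestamp", "10 minutes"))   # aggregate by event-time window
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination(30)  # run briefly for the demo
query.stop()
```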

Q8: What is backpressure and how do you handle it?

Backpressure occurs when data ingestion rate exceeds processing capacity.

Consequences:

  • Memory exhaustion
  • System crashes
  • Data loss

Mitigation strategies:

| Strategy | Description |
|---|---|
| Buffering | Queue messages with defined limits |
| Dropping | Sample or discard events (acceptable for some metrics) |
| Source throttling | Signal producers to reduce rate |
| Horizontal scaling | Add consumer instances |
| Auto-scaling | Dynamic infrastructure adjustment |

Kafka handles backpressure through consumer-controlled polling. Consumers read at their own pace; unprocessed messages remain in Kafka with measurable "lag."
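
A consumer-side sketch using the kafka-python client (broker address, topic, and handler are illustrative): the consumer pulls batches at its own pace, so unread messages accumulate as lag rather than overwhelming the process.

```python
# Consumer-controlled polling: the consumer decides when and how much to read.
from kafka import KafkaConsumer

def process(value: bytes) -> None:
    """Placeholder for real processing work."""
    print(len(value))

consumer = KafkaConsumer(
    "clicks",                          # illustrative topic name
    bootstrap_servers="localhost:9092",
    group_id="clicks-processor",
    enable_auto_commit=False,
    max_poll_records=500,              # cap how much work each poll pulls in
)

while True:
    batch = consumer.poll(timeout_ms=1000)   # pull only when ready for more work
    for tp, records in batch.items():
        for record in records:
            process(record.value)
    consumer.commit()                        # commit offsets after processing
```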

Q9: Explain exactly-once semantics in streaming.

| Delivery Guarantee | Description | Trade-off |
|---|---|---|
| At-most-once | Messages may be lost, never duplicated | Lowest latency, simplest implementation |
| At-least-once | Messages never lost, may be duplicated | Requires idempotent consumers |
| Exactly-once | No loss, no duplicates | Highest complexity, additional latency |

Exactly-once requirements:

  1. Idempotent producers (source deduplication)
  2. Transactional writes (atomic commits)
  3. Checkpointing (recovery state persistence)
  4. End-to-end coordination (all components must participate)

In practice, most production systems implement at-least-once delivery with idempotent consumers. True exactly-once semantics adds complexity and latency.
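
A minimal sketch of at-least-once delivery paired with an idempotent consumer (a plain dict stands in for a keyed store such as a database table with a primary key).

```python
# At-least-once + idempotent consumer: duplicates may arrive, but applying each
# event is keyed by event_id, so redelivery has no effect.

store = {}  # stands in for a keyed store (primary-keyed table, key-value store)

def apply_event(event):
    # Keyed writes are naturally idempotent: a second delivery of the same
    # event_id overwrites the row with identical content.
    store[event["event_id"]] = {"user": event["user"], "amount": event["amount"]}

events = [
    {"event_id": "e-1", "user": "a", "amount": 10},
    {"event_id": "e-2", "user": "b", "amount": 5},
    {"event_id": "e-1", "user": "a", "amount": 10},  # redelivered duplicate
]
for e in events:
    apply_event(e)

print(len(store))  # 2 -- the duplicate had no effect
```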

Q10: What are the Lambda and Kappa architectures?

Lambda Architecture:

| Layer | Purpose |
|---|---|
| Batch layer | Complete, accurate historical processing |
| Speed layer | Real-time approximate results |
| Serving layer | Merges batch and speed layer outputs |

Trade-off: Maintains two separate codebases for similar logic.

Kappa Architecture:

Single streaming system for all processing. Historical reprocessing is performed by replaying events through the same pipeline.

| Aspect | Lambda | Kappa |
|---|---|---|
| Codebases | Two (batch + streaming) | One |
| Complexity | Higher | Lower |
| Reprocessing | Fast (dedicated batch) | Slower (stream replay) |

Kappa architecture is generally preferred for simplicity. A batch layer can be added later if specific limitations arise.

Data Quality

Q11: How do you ensure data quality in pipelines?

Data quality validation operates at multiple levels:

| Level | Description | Example |
|---|---|---|
| Schema validation | Column existence, data types | Expected columns present with correct types |
| Constraint validation | Primary key uniqueness, foreign key validity | No duplicate IDs, valid references |
| Statistical validation | Volume, null rates, distributions | Row count within expected range |
| Business rule validation | Domain-specific logic | Revenue values positive, dates within valid range |

Tools: Great Expectations, dbt tests, Soda.

Best practices:

  • Validate at each pipeline stage (fail fast)
  • Alert on anomalies rather than logging only
  • Quarantine invalid data for investigation rather than discarding
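
A minimal hand-rolled sketch of these validation levels in pandas (tools like Great Expectations or dbt express the same checks declaratively; column names and thresholds are illustrative).

```python
# Validation at four levels; any failure stops the pipeline (fail fast).
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "revenue": [10.0, 5.5, 20.0]})
failures = []

# Schema validation: expected columns present
missing = {"order_id", "revenue"} - set(df.columns)
if missing:
    failures.append(f"missing columns: {missing}")

# Constraint validation: primary key uniqueness
if df["order_id"].duplicated().any():
    failures.append("duplicate order_id values")

# Statistical validation: volume within expected range
if not (1 <= len(df) <= 1_000_000):
    failures.append(f"unexpected row count: {len(df)}")

# Business rule validation: revenue must be positive
if (df["revenue"] <= 0).any():
    failures.append("non-positive revenue values")

if failures:
    raise ValueError(f"data quality checks failed: {failures}")
```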

Q12: What is SLA for data pipelines and how do you ensure it?

A Service Level Agreement (SLA) defines commitments to downstream consumers:

| SLA Type | Example |
|---|---|
| Freshness | Data updated by 8 AM daily |
| Completeness | All records from prior day included |
| Accuracy | Values match source systems |

Enforcement strategies:

| Strategy | Description |
|---|---|
| Monitoring | Detect issues before consumers notice |
| Buffer time | Schedule jobs with margin (7 AM job for 8 AM SLA) |
| Automated retries | Exponential backoff for transient failures |
| Redundancy | Backup systems for critical pipelines |
| Clear ownership | Defined on-call responsibilities |
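
A sketch of a freshness-SLA check (the threshold and alert hook are illustrative; real deployments would wire this to a scheduler and an alerting tool).

```python
# Freshness check: compare the latest load timestamp against the SLA and alert if stale.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=24)

def alert(message: str) -> None:
    print(f"ALERT: {message}")  # placeholder for a Slack/PagerDuty integration

def check_freshness(latest_loaded_at: datetime) -> None:
    age = datetime.now(timezone.utc) - latest_loaded_at
    if age > FRESHNESS_SLA:
        alert(f"orders table is stale: last load {age} ago")

# Simulate a table last loaded 30 hours ago -> triggers the alert.
check_freshness(datetime.now(timezone.utc) - timedelta(hours=30))
```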

Q13: How do you handle data skew?

Data skew occurs when data distribution is uneven across partitions, causing some tasks to take significantly longer than others.

Symptoms:

  • Single Spark tasks running 10x longer than others
  • Out-of-memory errors on specific executors
  • Straggler tasks visible in Spark UI

Solutions:

| Technique | Description |
|---|---|
| Salting | Add random suffix to hot keys, aggregate in two passes |
| Broadcast joins | Broadcast small tables to all executors |
| Adaptive Query Execution | Automatic skew handling in Spark 3.0+ |
| Pre-aggregation | Reduce cardinality before joins |
| Hot key isolation | Process high-frequency (hot) keys separately |
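
A salting sketch in PySpark (column names and salt count are illustrative): the hot key is spread across several salted groups, then the partial results are combined in a second pass.

```python
# Two-pass aggregation with salting so no single task owns all rows of a hot key.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()
N_SALTS = 8

df = spark.createDataFrame(
    [("hot_key", 1)] * 1000 + [("cold_key", 1)] * 10,
    ["key", "value"],
)

# Pass 1: aggregate on (key, salt) -- work for hot_key is split across N_SALTS groups.
salted = df.withColumn("salt", (F.rand() * N_SALTS).cast("int"))
partial = salted.groupBy("key", "salt").agg(F.sum("value").alias("partial_sum"))

# Pass 2: combine the partial results per key (small output, no skew).
result = partial.groupBy("key").agg(F.sum("partial_sum").alias("total"))
result.show()
```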

Distributed Systems

Q14: Explain partitioning strategies for distributed data.

| Strategy | Method | Advantages | Disadvantages |
|---|---|---|---|
| Hash | hash(key) mod N | Even distribution | Poor for range queries |
| Range | Key ranges to partitions | Efficient range queries | Potential hot spots |
| Round-robin | Rotate through partitions | Simple, even spread | No key locality |

Common applications:

  • Kafka: Hash partitioning on message key
  • Spark: Configurable (hash default)
  • Databases: Hash or range sharding
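
A minimal hash-partitioning sketch (partition count and keys are illustrative); a stable hash is used so the key-to-partition mapping stays consistent across processes, unlike Python's randomized built-in hash().

```python
# hash(key) mod N routing: identical keys always land on the same partition.
import hashlib

N_PARTITIONS = 4

def partition_for(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_PARTITIONS

for key in ["user-1", "user-2", "user-3", "user-1"]:
    print(key, "->", partition_for(key))  # identical keys co-locate
```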

Q15: What is the CAP theorem?

The CAP theorem states that a distributed system cannot simultaneously provide all three of the following guarantees; when a network partition occurs, it must choose between consistency and availability:

| Property | Definition |
|---|---|
| Consistency | Every read returns the most recent write |
| Availability | Every request receives a response |
| Partition tolerance | System operates despite network failures |

Since network partitions are inevitable in distributed systems, the practical choice is between:

| Type | Behavior | Example Use Case |
|---|---|---|
| CP | Consistent but may be unavailable during partitions | Financial transactions |
| AP | Available but may return stale data during partitions | Social media feeds |

Q16: Explain how Kafka works at a high level.

Kafka is a distributed commit log with the following components:

| Component | Description |
|---|---|
| Topics | Named streams of messages |
| Partitions | Subdivisions of topics for parallelism; ordering guaranteed within a partition only |
| Producers | Write messages to topics |
| Consumers | Read messages, tracking position via offsets |
| Consumer groups | Multiple consumers sharing load; each partition assigned to one consumer per group |
| Brokers | Kafka server instances |
| Replication | Partitions replicated across brokers for fault tolerance |

Performance characteristics:

  • Sequential disk writes (high throughput)
  • Consumer-controlled replay from any offset
  • Append-only writes, sequential reads
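
A producer-side sketch using the kafka-python client (broker address and topic are illustrative): messages that share a key hash to the same partition, which is what provides per-key ordering.

```python
# Keyed producer: all events for the same key land on the same partition, in order.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Both events for user-42 go to the same partition and preserve their order.
producer.send("clicks", key="user-42", value={"page": "/home"})
producer.send("clicks", key="user-42", value={"page": "/checkout"})
producer.flush()
```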

Q17: How do you handle failures in distributed data processing?

| Failure Type | Recovery Strategy |
|---|---|
| Task failure | Retry on same or different node |
| Node failure | Reassign tasks to healthy nodes |
| State corruption | Restore from checkpoint |
| Full job failure | Restart from beginning (requires idempotent pipeline) |

Recovery mechanisms:

| Mechanism | Description |
|---|---|
| Checkpointing | Periodic state snapshots for recovery |
| Write-ahead logs | Record operations before execution |
| Lineage (Spark) | Recompute lost data from parent RDDs |
| Exponential backoff | Increasing delay between retry attempts |
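
A minimal exponential-backoff sketch (the retried function is hypothetical): the delay doubles on each failed attempt, with jitter so many failing workers do not retry in lockstep.

```python
# Exponential backoff with jitter for transient failures.
import random
import time

def retry_with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # retries exhausted: surface the failure
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

# Usage: retry_with_backoff(lambda: flaky_extract_call())  # hypothetical call
```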

Q18: What is data locality and why does it matter?

Data locality refers to processing data where it resides, minimizing network transfer.

Locality levels (best to worst):

| Level | Description |
|---|---|
| Process local | Data in JVM memory |
| Node local | Data on local disk |
| Rack local | Data on same rack |
| Network | Data requires network transfer |

Network transfer is typically the bottleneck in distributed systems. Spark and Hadoop schedulers attempt to schedule tasks on nodes containing the required data.

Tools and Technologies

Q19: How do you choose between Apache Spark and Apache Flink?

| Aspect | Spark | Flink |
|---|---|---|
| Origin | Batch-first, streaming added later | Streaming-first, batch added later |
| Streaming model | Micro-batches | True event-at-a-time |
| Latency | Seconds | Milliseconds |
| State management | External checkpoints | Built-in managed state |
| Ecosystem | Large, mature | Smaller, growing |
| Best for | Batch workloads, SQL analytics | Real-time processing, stateful streaming |

Selection criteria:

  • Spark: Primarily batch workloads, existing team expertise
  • Flink: Sub-second latency requirements, complex stateful stream processing

Q20: When would you use a columnar vs row-based storage format?

| Format Type | Storage Pattern | Advantages | Disadvantages |
|---|---|---|---|
| Row (CSV, JSON, Avro) | All fields of a record together | Efficient single-record access | Inefficient for column aggregations |
| Columnar (Parquet, ORC) | All values of a column together | Efficient aggregations, better compression | Inefficient for full-record access |

Columnar advantages for analytics:

  • Read only required columns
  • Superior compression (similar values grouped)
  • Vectorized processing optimization

Guidelines:

  • Parquet: Data lakes, warehouses, ML workloads
  • Avro: CDC streams, message passing
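
A small column-pruning sketch with pandas and Parquet (assumes pandas with a Parquet engine such as pyarrow installed; file and column names are illustrative): only the requested columns are read.

```python
# Columnar read: the wide "notes" column is never decoded for the aggregate.
import pandas as pd

df = pd.DataFrame({
    "order_id": range(1000),
    "amount": [1.0] * 1000,
    "notes": ["..."] * 1000,   # wide column not needed for the aggregation
})
df.to_parquet("orders.parquet")

# Only the two needed columns are read from the columnar file.
slim = pd.read_parquet("orders.parquet", columns=["order_id", "amount"])
print(slim["amount"].sum())
```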

Q21: Explain the role of a data catalog.

A data catalog provides centralized metadata management for data assets.

| Function | Description |
|---|---|
| Discovery | Search for datasets by name, description, tags |
| Metadata | Schemas, owners, update timestamps |
| Lineage | Upstream dependencies and downstream consumers |
| Quality signals | Freshness, completeness, trust scores |
| Access control | Permission management |

Tools: DataHub, Amundsen, Alation, AWS Glue Catalog.

A data catalog becomes necessary when:

  • Table count exceeds tribal knowledge capacity
  • New team members require onboarding
  • Compliance requires data flow documentation

Q22: What is CDC (Change Data Capture)?

CDC captures insert, update, and delete operations from databases and streams them to downstream systems.

Implementation approaches:

| Approach | Method | Source Impact | Captures Deletes |
|---|---|---|---|
| Query-based | Poll for rows where updated_at > last_check | Medium-high | No |
| Trigger-based | Database triggers on changes | High | Yes |
| Log-based | Read database transaction log | Minimal | Yes |
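
A sketch of the query-based approach, using sqlite3 as a stand-in for the source database (table, rows, and timestamps are illustrative); note that a hard delete leaves no row behind for the poll to find, which is why this approach misses deletes.

```python
# Query-based CDC: poll for rows changed since the last checkpoint via updated_at.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "2024-01-01T10:00:00"), (2, "Grace", "2024-01-02T09:30:00")],
)

last_checkpoint = "2024-01-01T12:00:00"
changes = conn.execute(
    "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
    (last_checkpoint,),
).fetchall()
print(changes)  # only Grace's row; a deleted row would never appear here
```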

Log-based CDC is preferred:

  • Minimal impact on the source database (reads existing transaction logs)
  • Captures all change types including deletes
  • Low latency

Tools: Debezium (open source), AWS DMS, Fivetran, Airbyte.

Use cases:

  • Data warehouse synchronization
  • Event-driven architectures
  • Cache invalidation

Design and Architecture

Q23: How would you design a real-time analytics pipeline?

Architecture layers:

| Layer | Purpose | Example Technologies |
|---|---|---|
| Ingestion | Event capture | Kafka, Kinesis |
| Processing | Stream transformations | Flink, Spark Streaming |
| Storage | Time-series data | InfluxDB, Druid, ClickHouse |
| Serving | Query API | REST/GraphQL endpoints |

Design considerations:

| Consideration | Approach |
|---|---|
| Exactly-once delivery | Flink checkpoints + idempotent writes |
| Late data | Watermarks + allowed lateness configuration |
| Backpressure | Consumer scaling, rate limiting |
| Horizontal scaling | Partition-based parallelism |

Q24: How do you handle slowly changing dimensions (SCD)?

Slowly changing dimensions are attributes that change over time (customer address, product price).

| Type | Behavior | History Preserved |
|---|---|---|
| Type 1 | Overwrite | No |
| Type 2 | Add new row with date range | Full history |
| Type 3 | Add column for previous value | Limited (one prior value) |

Type 2 example:

A customer who moved from NYC to LA would have two rows: one with NYC (valid from 2020-01-01 to 2023-01-01, is_current = false) and one with LA (valid from 2023-01-01 to 9999-12-31, is_current = true).

Type 2 is most common as it enables historical queries ("customer's city at time of order").
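
A Type 2 sketch in pandas following the example above (column names and the open-ended end date are illustrative): the current row is closed out and a new row is appended.

```python
# SCD Type 2: close the current row, append a new current row.
import pandas as pd

dim = pd.DataFrame([{
    "customer_id": 1, "city": "NYC",
    "valid_from": "2020-01-01", "valid_to": "9999-12-31", "is_current": True,
}])

def apply_change(dim, customer_id, new_city, change_date):
    current = (dim["customer_id"] == customer_id) & dim["is_current"]
    dim.loc[current, "valid_to"] = change_date     # close out the old row
    dim.loc[current, "is_current"] = False
    new_row = {
        "customer_id": customer_id, "city": new_city,
        "valid_from": change_date, "valid_to": "9999-12-31", "is_current": True,
    }
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim = apply_change(dim, customer_id=1, new_city="LA", change_date="2023-01-01")
print(dim)  # two rows: historical NYC row plus current LA row
```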

Q25: What factors do you consider when choosing a data storage solution?

| Factor | Considerations |
|---|---|
| Scale | Current volume, projected growth |
| Query patterns | OLTP (many small operations) vs OLAP (large aggregations) |
| Latency requirements | Milliseconds vs seconds vs minutes acceptable |
| Consistency needs | Strong vs eventual consistency |
| Schema stability | Fixed vs frequently evolving |
| Cost | Storage, compute, operational overhead |
| Team expertise | Existing knowledge and maintenance capability |

Team expertise is a significant factor. A well-understood technology often outperforms a theoretically superior but unfamiliar alternative.