
Data Engineering Concept Questions

This section covers frequently asked conceptual questions in data engineering interviews, organized by topic area.

Data Pipeline Fundamentals

Q1: Explain the difference between ETL and ELT.

| Aspect | ETL | ELT |
|---|---|---|
| Transform location | Before loading (staging server) | After loading (in warehouse) |
| Compute requirements | Requires dedicated transform infrastructure | Uses warehouse compute resources |
| Latency | Higher (transformation precedes loading) | Lower (raw data loaded first) |
| Flexibility | Lower (schema defined early) | Higher (raw data preserved for future transformations) |

ELT has become the predominant pattern due to:

  • Scalable compute in cloud warehouses (Snowflake, BigQuery, Redshift)
  • Preservation of raw data for new transformation requirements
  • Version-controlled transformations through tools like dbt
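
A minimal ELT sketch, using sqlite3 as a stand-in for a cloud warehouse (table and column names are illustrative): raw rows are loaded unchanged, then transformed with SQL inside the warehouse.

```python
# ELT sketch: load raw records first, transform inside the "warehouse".
# sqlite3 stands in for a cloud warehouse; table and column names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")

# 1. Load: copy source rows as-is, no transformation yet.
raw_rows = [(1, 1999, "shipped"), (2, 500, "cancelled"), (3, 2500, "shipped")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# 2. Transform: derive a clean model using warehouse compute (plain SQL here;
#    in practice this step is often a version-controlled dbt model).
conn.execute("""
    CREATE TABLE fct_orders AS
    SELECT id, amount_cents / 100.0 AS amount_usd
    FROM raw_orders
    WHERE status != 'cancelled'
""")
print(conn.execute("SELECT * FROM fct_orders").fetchall())
```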

Q2: What is data lineage and why does it matter?

Data lineage tracks three aspects of data:

  1. Origin (source systems)
  2. Transformations applied
  3. Downstream destinations

| Use Case | Description |
|---|---|
| Debugging | Tracing incorrect values back through transformation chains |
| Impact analysis | Identifying downstream effects of schema changes |
| Compliance | Documenting personal data flows for GDPR and similar regulations |
| Trust | Providing transparency for data consumers |

Common tools: Apache Atlas, DataHub, Amundsen, dbt docs.

Q3: How do you handle schema evolution?

Schema evolution refers to changes in data structure over time (new columns, type changes, field removals).

Strategies:

| Strategy | Description |
|---|---|
| Schema Registry | Validates new schemas for compatibility before allowing changes (Kafka approach) |
| Avro/Protobuf | Serialization formats with built-in schema evolution support |
| Schema-on-read | Store raw data (e.g., JSON); apply schema at query time |
| Versioned tables | Maintain separate table versions (orders_v1, orders_v2) |

Compatibility types:

| Type | Definition | Use Case |
|---|---|---|
| Backward | New code reads old data | Most common, safest default |
| Forward | Old code reads new data | Useful during migrations |
| Full | Both directions supported | Maximum flexibility |
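
A small backward-compatibility sketch using Avro via the fastavro library (assumed as a dependency; schemas are illustrative): the new reader schema adds an optional field with a default, so it can still read records written with the old schema.

```python
# Backward compatibility: new reader schema resolves records written with the old schema.
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

old_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "id", "type": "long"}],
})
new_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        # New optional field with a default keeps the change backward compatible.
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, old_schema, {"id": 42})           # producer still on old schema
buf.seek(0)
record = schemaless_reader(buf, old_schema, new_schema)  # new code reads old data
print(record)  # {'id': 42, 'email': None}
```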

Q4: What is idempotency and why is it important in data pipelines?

An idempotent operation produces the same result regardless of how many times it executes.

Significance:

| Scenario | Without Idempotency | With Idempotency |
|---|---|---|
| Job retry after failure | Duplicate records | Consistent state |
| Backfill operations | Data corruption | Clean reprocessing |
| Partial failures | Unknown state | Deterministic recovery |

Implementation techniques:

  • Use MERGE or UPSERT instead of INSERT
  • Partition by date with overwrite mode
  • Include deduplication keys
  • Use transactional writes
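
A minimal sketch of the upsert technique, using sqlite3 as a stand-in for the target store (table and column names are illustrative): re-running the load produces the same final state.

```python
# Idempotent load: an UPSERT keyed on a natural key makes retries harmless.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

def load(batch):
    # Re-running load(batch) any number of times leaves identical rows.
    conn.executemany(
        """INSERT INTO orders (order_id, amount) VALUES (?, ?)
           ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount""",
        batch,
    )

batch = [(1, 19.99), (2, 5.00)]
load(batch)
load(batch)  # retry after a (simulated) failure: no duplicates
print(conn.execute("SELECT * FROM orders").fetchall())  # [(1, 19.99), (2, 5.0)]
```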

Q5: Explain the difference between data lake, data warehouse, and lakehouse.

| Aspect | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Data format | Raw, any format | Structured tables | Both |
| Schema | Schema-on-read | Schema-on-write | Configurable |
| Use case | ML, data exploration | BI, reporting | Combined workloads |
| Cost profile | Low storage, variable query cost | Higher storage, optimized queries | Balanced |
| Examples | S3 + Spark | Snowflake, BigQuery | Databricks, Delta Lake |

The lakehouse architecture combines warehouse features (ACID transactions, schema enforcement, optimized queries) with lake storage (low cost, format flexibility). Delta Lake and Apache Iceberg are the primary table formats enabling this pattern.

Batch vs Streaming

Q6: When would you choose batch vs streaming processing?

| Factor | Batch | Streaming |
|---|---|---|
| Latency tolerance | Minutes to hours acceptable | Seconds required |
| Data completeness | Full dataset required | Partial data acceptable |
| Implementation complexity | Lower | Higher |
| Infrastructure cost | Generally lower | Higher |
| Example use cases | Reports, ML training | Fraud detection, live dashboards |

Batch processing is appropriate when latency requirements permit and data completeness is required. Streaming adds operational complexity and should be used when sub-minute latency is a business requirement.

Q7: Explain event time vs processing time.

| Concept | Definition |
|---|---|
| Event time | Timestamp when the event occurred |
| Processing time | Timestamp when the system processes the event |

These values differ when events are delayed (network issues, offline devices, buffering).

Example: A user action at 2:00 PM on an offline device may reach the processing system at 2:45 PM when connectivity is restored. Using processing time would attribute this event to the wrong time window.

Late data handling mechanisms:

| Mechanism | Description |
|---|---|
| Watermarks | System estimate of event-time progress ("all events up to time T have arrived") |
| Allowed lateness | Threshold for accepting late events (e.g., 1 hour) |
| Side outputs | Separate handling path for extremely late events |
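
A sketch of event-time windowing with a watermark in Spark Structured Streaming (assumes pyspark is installed; the built-in rate source stands in for a real event stream, with its timestamp column playing the role of event time).

```python
# Event-time aggregation with a watermark that tolerates up to 1 hour of lateness.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

counts = (
    events
    .withWatermark("timestamp", "1 hour")         # accept events up to 1 hour late
    .groupBy(window("timestamp", "10 minutes"))   # aggregate by event-time window
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination(30)  # run briefly for the demo
query.stop()
```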

Q8: What is backpressure and how do you handle it?

Backpressure occurs when data ingestion rate exceeds processing capacity.

Consequences:

  • Memory exhaustion
  • System crashes
  • Data loss

Mitigation strategies:

| Strategy | Description |
|---|---|
| Buffering | Queue messages with defined limits |
| Dropping | Sample or discard events (acceptable for some metrics) |
| Source throttling | Signal producers to reduce rate |
| Horizontal scaling | Add consumer instances |
| Auto-scaling | Dynamic infrastructure adjustment |

Kafka handles backpressure through consumer-controlled polling. Consumers read at their own pace; unprocessed messages remain in Kafka with measurable "lag."
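
A consumer-side sketch using the kafka-python client (broker address, topic, and handler are illustrative): the consumer pulls batches at its own pace, so unread messages accumulate as lag rather than overwhelming the process.

```python
# Consumer-controlled polling: the consumer decides when and how much to read.
from kafka import KafkaConsumer

def process(value: bytes) -> None:
    """Placeholder for real processing work."""
    print(len(value))

consumer = KafkaConsumer(
    "clicks",                          # illustrative topic name
    bootstrap_servers="localhost:9092",
    group_id="clicks-processor",
    enable_auto_commit=False,
    max_poll_records=500,              # cap how much work each poll pulls in
)

while True:
    batch = consumer.poll(timeout_ms=1000)   # pull only when ready for more work
    for tp, records in batch.items():
        for record in records:
            process(record.value)
    consumer.commit()                        # commit offsets after processing
```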

Q9: Explain exactly-once semantics in streaming.

| Delivery Guarantee | Description | Trade-off |
|---|---|---|
| At-most-once | Messages may be lost, never duplicated | Lowest latency, simplest implementation |
| At-least-once | Messages never lost, may be duplicated | Requires idempotent consumers |
| Exactly-once | No loss, no duplicates | Highest complexity, additional latency |

Exactly-once requirements:

  1. Idempotent producers (source deduplication)
  2. Transactional writes (atomic commits)
  3. Checkpointing (recovery state persistence)
  4. End-to-end coordination (all components must participate)

In practice, most production systems implement at-least-once delivery with idempotent consumers. True exactly-once semantics adds complexity and latency.
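
A minimal sketch of at-least-once delivery paired with an idempotent consumer (a plain dict stands in for a keyed store such as a database table with a primary key).

```python
# At-least-once + idempotent consumer: duplicates may arrive, but applying each
# event is keyed by event_id, so redelivery has no effect.

store = {}  # stands in for a keyed store (primary-keyed table, key-value store)

def apply_event(event):
    # Keyed writes are naturally idempotent: a second delivery of the same
    # event_id overwrites the row with identical content.
    store[event["event_id"]] = {"user": event["user"], "amount": event["amount"]}

events = [
    {"event_id": "e-1", "user": "a", "amount": 10},
    {"event_id": "e-2", "user": "b", "amount": 5},
    {"event_id": "e-1", "user": "a", "amount": 10},  # redelivered duplicate
]
for e in events:
    apply_event(e)

print(len(store))  # 2 -- the duplicate had no effect
```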

Q10: What are the Lambda and Kappa architectures?

Lambda Architecture:

| Layer | Purpose |
|---|---|
| Batch layer | Complete, accurate historical processing |
| Speed layer | Real-time approximate results |
| Serving layer | Merges batch and speed layer outputs |

Trade-off: Maintains two separate codebases for similar logic.

Kappa Architecture:

Single streaming system for all processing. Historical reprocessing is performed by replaying events through the same pipeline.

| Aspect | Lambda | Kappa |
|---|---|---|
| Codebases | Two (batch + streaming) | One |
| Complexity | Higher | Lower |
| Reprocessing | Fast (dedicated batch) | Slower (stream replay) |

Kappa architecture is generally preferred for simplicity. A batch layer can be added later if specific limitations arise.

Data Quality

Q11: How do you ensure data quality in pipelines?

Data quality validation operates at multiple levels:

| Level | Description | Example |
|---|---|---|
| Schema validation | Column existence, data types | Expected columns present with correct types |
| Constraint validation | Primary key uniqueness, foreign key validity | No duplicate IDs, valid references |
| Statistical validation | Volume, null rates, distributions | Row count within expected range |
| Business rule validation | Domain-specific logic | Revenue values positive, dates within valid range |

Tools: Great Expectations, dbt tests, Soda.

Best practices:

  • Validate at each pipeline stage (fail fast)
  • Alert on anomalies rather than logging only
  • Quarantine invalid data for investigation rather than discarding
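
A minimal hand-rolled sketch of these validation levels in pandas (tools like Great Expectations or dbt express the same checks declaratively; column names and thresholds are illustrative).

```python
# Validation at four levels; any failure stops the pipeline (fail fast).
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "revenue": [10.0, 5.5, 20.0]})
failures = []

# Schema validation: expected columns present
missing = {"order_id", "revenue"} - set(df.columns)
if missing:
    failures.append(f"missing columns: {missing}")

# Constraint validation: primary key uniqueness
if df["order_id"].duplicated().any():
    failures.append("duplicate order_id values")

# Statistical validation: volume within expected range
if not (1 <= len(df) <= 1_000_000):
    failures.append(f"unexpected row count: {len(df)}")

# Business rule validation: revenue must be positive
if (df["revenue"] <= 0).any():
    failures.append("non-positive revenue values")

if failures:
    raise ValueError(f"data quality checks failed: {failures}")
```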

Q12: What is SLA for data pipelines and how do you ensure it?

A Service Level Agreement (SLA) defines commitments to downstream consumers:

| SLA Type | Example |
|---|---|
| Freshness | Data updated by 8 AM daily |
| Completeness | All records from prior day included |
| Accuracy | Values match source systems |

Enforcement strategies:

| Strategy | Description |
|---|---|
| Monitoring | Detect issues before consumers notice |
| Buffer time | Schedule jobs with margin (7 AM job for 8 AM SLA) |
| Automated retries | Exponential backoff for transient failures |
| Redundancy | Backup systems for critical pipelines |
| Clear ownership | Defined on-call responsibilities |
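
A sketch of a freshness-SLA check (the threshold and alert hook are illustrative; real deployments would wire this to a scheduler and an alerting tool).

```python
# Freshness check: compare the latest load timestamp against the SLA and alert if stale.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=24)

def alert(message: str) -> None:
    print(f"ALERT: {message}")  # placeholder for a Slack/PagerDuty integration

def check_freshness(latest_loaded_at: datetime) -> None:
    age = datetime.now(timezone.utc) - latest_loaded_at
    if age > FRESHNESS_SLA:
        alert(f"orders table is stale: last load {age} ago")

# Simulate a table last loaded 30 hours ago -> triggers the alert.
check_freshness(datetime.now(timezone.utc) - timedelta(hours=30))
```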

Q13: How do you handle data skew?

Data skew occurs when data distribution is uneven across partitions, causing some tasks to take significantly longer than others.

Symptoms:

  • Single Spark tasks running 10x longer than others
  • Out-of-memory errors on specific executors
  • Straggler tasks visible in Spark UI

Solutions:

| Technique | Description |
|---|---|
| Salting | Add random suffix to hot keys, aggregate in two passes |
| Broadcast joins | Broadcast small tables to all executors |
| Adaptive Query Execution | Automatic skew handling in Spark 3.0+ |
| Pre-aggregation | Reduce cardinality before joins |
| Hot key isolation | Process high-frequency (hot) keys separately |
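
A salting sketch in PySpark (column names and salt count are illustrative): the hot key is spread across several salted groups, then the partial results are combined in a second pass.

```python
# Two-pass aggregation with salting so no single task owns all rows of a hot key.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()
N_SALTS = 8

df = spark.createDataFrame(
    [("hot_key", 1)] * 1000 + [("cold_key", 1)] * 10,
    ["key", "value"],
)

# Pass 1: aggregate on (key, salt) -- work for hot_key is split across N_SALTS groups.
salted = df.withColumn("salt", (F.rand() * N_SALTS).cast("int"))
partial = salted.groupBy("key", "salt").agg(F.sum("value").alias("partial_sum"))

# Pass 2: combine the partial results per key (small output, no skew).
result = partial.groupBy("key").agg(F.sum("partial_sum").alias("total"))
result.show()
```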

Distributed Systems

Q14: Explain partitioning strategies for distributed data.

| Strategy | Method | Advantages | Disadvantages |
|---|---|---|---|
| Hash | hash(key) mod N | Even distribution | Poor for range queries |
| Range | Key ranges to partitions | Efficient range queries | Potential hot spots |
| Round-robin | Rotate through partitions | Simple, even spread | No key locality |

Common applications:

  • Kafka: Hash partitioning on message key
  • Spark: Configurable (hash default)
  • Databases: Hash or range sharding
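
A minimal hash-partitioning sketch (partition count and keys are illustrative); a stable hash is used so the key-to-partition mapping stays consistent across processes, unlike Python's randomized built-in hash().

```python
# hash(key) mod N routing: identical keys always land on the same partition.
import hashlib

N_PARTITIONS = 4

def partition_for(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % N_PARTITIONS

for key in ["user-1", "user-2", "user-3", "user-1"]:
    print(key, "->", partition_for(key))  # identical keys co-locate
```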

Q15: What is the CAP theorem?

The CAP theorem states that a distributed system cannot simultaneously provide all three of the following guarantees; when a network partition occurs, it must choose between consistency and availability:

| Property | Definition |
|---|---|
| Consistency | Every read returns the most recent write |
| Availability | Every request receives a response |
| Partition tolerance | System operates despite network failures |

Since network partitions are inevitable in distributed systems, the practical choice is between:

| Type | Behavior | Example Use Case |
|---|---|---|
| CP | Consistent but may be unavailable during partitions | Financial transactions |
| AP | Available but may return stale data during partitions | Social media feeds |

Q16: Explain how Kafka works at a high level.

Kafka is a distributed commit log with the following components:

| Component | Description |
|---|---|
| Topics | Named streams of messages |
| Partitions | Subdivisions of topics for parallelism; ordering guaranteed within a partition only |
| Producers | Write messages to topics |
| Consumers | Read messages, tracking position via offsets |
| Consumer groups | Multiple consumers sharing load; each partition assigned to one consumer per group |
| Brokers | Kafka server instances |
| Replication | Partitions replicated across brokers for fault tolerance |

Performance characteristics:

  • Sequential disk writes (high throughput)
  • Consumer-controlled replay from any offset
  • Append-only writes, sequential reads
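
A producer-side sketch using the kafka-python client (broker address and topic are illustrative): messages that share a key hash to the same partition, which is what provides per-key ordering.

```python
# Keyed producer: all events for the same key land on the same partition, in order.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Both events for user-42 go to the same partition and preserve their order.
producer.send("clicks", key="user-42", value={"page": "/home"})
producer.send("clicks", key="user-42", value={"page": "/checkout"})
producer.flush()
```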

Q17: How do you handle failures in distributed data processing?

| Failure Type | Recovery Strategy |
|---|---|
| Task failure | Retry on same or different node |
| Node failure | Reassign tasks to healthy nodes |
| State corruption | Restore from checkpoint |
| Full job failure | Restart from beginning (requires idempotent pipeline) |

Recovery mechanisms:

| Mechanism | Description |
|---|---|
| Checkpointing | Periodic state snapshots for recovery |
| Write-ahead logs | Record operations before execution |
| Lineage (Spark) | Recompute lost data from parent RDDs |
| Exponential backoff | Increasing delay between retry attempts |
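
A minimal exponential-backoff sketch (the retried function is hypothetical): the delay doubles on each failed attempt, with jitter so many failing workers do not retry in lockstep.

```python
# Exponential backoff with jitter for transient failures.
import random
import time

def retry_with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # retries exhausted: surface the failure
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)

# Usage: retry_with_backoff(lambda: flaky_extract_call())  # hypothetical call
```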

Q18: What is data locality and why does it matter?

Data locality refers to processing data where it resides, minimizing network transfer.

Locality levels (best to worst):

| Level | Description |
|---|---|
| Process local | Data in JVM memory |
| Node local | Data on local disk |
| Rack local | Data on same rack |
| Network | Data requires network transfer |

Network transfer is typically the bottleneck in distributed systems. Spark and Hadoop schedulers attempt to schedule tasks on nodes containing the required data.

Tools and Technologies

Q19: How do you choose between Apache Spark and Apache Flink?

| Aspect | Spark | Flink |
|---|---|---|
| Origin | Batch-first, streaming added later | Streaming-first, batch added later |
| Streaming model | Micro-batches | True event-at-a-time |
| Latency | Seconds | Milliseconds |
| State management | External checkpoints | Built-in managed state |
| Ecosystem | Large, mature | Smaller, growing |
| Best for | Batch workloads, SQL analytics | Real-time processing, stateful streaming |

Selection criteria:

  • Spark: Primarily batch workloads, existing team expertise
  • Flink: Sub-second latency requirements, complex stateful stream processing

Q20: When would you use a columnar vs row-based storage format?

| Format Type | Storage Pattern | Advantages | Disadvantages |
|---|---|---|---|
| Row (CSV, JSON, Avro) | All fields of a record together | Efficient single-record access | Inefficient for column aggregations |
| Columnar (Parquet, ORC) | All values of a column together | Efficient aggregations, better compression | Inefficient for full-record access |

Columnar advantages for analytics:

  • Read only required columns
  • Superior compression (similar values grouped)
  • Vectorized processing optimization

Guidelines:

  • Parquet: Data lakes, warehouses, ML workloads
  • Avro: CDC streams, message passing
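
A small column-pruning sketch with pandas and Parquet (assumes pandas with a Parquet engine such as pyarrow installed; file and column names are illustrative): only the requested columns are read.

```python
# Columnar read: the wide "notes" column is never decoded for the aggregate.
import pandas as pd

df = pd.DataFrame({
    "order_id": range(1000),
    "amount": [1.0] * 1000,
    "notes": ["..."] * 1000,   # wide column not needed for the aggregation
})
df.to_parquet("orders.parquet")

# Only the two needed columns are read from the columnar file.
slim = pd.read_parquet("orders.parquet", columns=["order_id", "amount"])
print(slim["amount"].sum())
```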

Q21: Explain the role of a data catalog.

A data catalog provides centralized metadata management for data assets.

| Function | Description |
|---|---|
| Discovery | Search for datasets by name, description, tags |
| Metadata | Schemas, owners, update timestamps |
| Lineage | Upstream dependencies and downstream consumers |
| Quality signals | Freshness, completeness, trust scores |
| Access control | Permission management |

Tools: DataHub, Amundsen, Alation, AWS Glue Catalog.

A data catalog becomes necessary when:

  • Table count exceeds tribal knowledge capacity
  • New team members require onboarding
  • Compliance requires data flow documentation

Q22: What is CDC (Change Data Capture)?

CDC captures insert, update, and delete operations from databases and streams them to downstream systems.

Implementation approaches:

| Approach | Method | Source Impact | Captures Deletes |
|---|---|---|---|
| Query-based | Poll for rows where updated_at > last_check | Medium-high | No |
| Trigger-based | Database triggers on changes | High | Yes |
| Log-based | Read database transaction log | Minimal | Yes |
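
A sketch of the query-based approach, using sqlite3 as a stand-in for the source database (table, rows, and timestamps are illustrative); note that a hard delete leaves no row behind for the poll to find, which is why this approach misses deletes.

```python
# Query-based CDC: poll for rows changed since the last checkpoint via updated_at.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "2024-01-01T10:00:00"), (2, "Grace", "2024-01-02T09:30:00")],
)

last_checkpoint = "2024-01-01T12:00:00"
changes = conn.execute(
    "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
    (last_checkpoint,),
).fetchall()
print(changes)  # only Grace's row; a deleted row would never appear here
```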

Log-based CDC is preferred:

  • Minimal impact on the source database (reads existing transaction logs)
  • Captures all change types including deletes
  • Low latency

Tools: Debezium (open source), AWS DMS, Fivetran, Airbyte.

Use cases:

  • Data warehouse synchronization
  • Event-driven architectures
  • Cache invalidation

Design and Architecture

Q23: How would you design a real-time analytics pipeline?

Architecture layers:

| Layer | Purpose | Example Technologies |
|---|---|---|
| Ingestion | Event capture | Kafka, Kinesis |
| Processing | Stream transformations | Flink, Spark Streaming |
| Storage | Time-series data | InfluxDB, Druid, ClickHouse |
| Serving | Query API | REST/GraphQL endpoints |

Design considerations:

| Consideration | Approach |
|---|---|
| Exactly-once delivery | Flink checkpoints + idempotent writes |
| Late data | Watermarks + allowed lateness configuration |
| Backpressure | Consumer scaling, rate limiting |
| Horizontal scaling | Partition-based parallelism |

Q24: How do you handle slowly changing dimensions (SCD)?

Slowly changing dimensions are attributes that change over time (customer address, product price).

| Type | Behavior | History Preserved |
|---|---|---|
| Type 1 | Overwrite | No |
| Type 2 | Add new row with date range | Full history |
| Type 3 | Add column for previous value | Limited (one prior value) |

Type 2 example:

A customer who moved from NYC to LA would have two rows: one with NYC (valid from 2020-01-01 to 2023-01-01, is_current = false) and one with LA (valid from 2023-01-01 to 9999-12-31, is_current = true).

Type 2 is most common as it enables historical queries ("customer's city at time of order").
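
A Type 2 sketch in pandas following the example above (column names and the open-ended end date are illustrative): the current row is closed out and a new row is appended.

```python
# SCD Type 2: close the current row, append a new current row.
import pandas as pd

dim = pd.DataFrame([{
    "customer_id": 1, "city": "NYC",
    "valid_from": "2020-01-01", "valid_to": "9999-12-31", "is_current": True,
}])

def apply_change(dim, customer_id, new_city, change_date):
    current = (dim["customer_id"] == customer_id) & dim["is_current"]
    dim.loc[current, "valid_to"] = change_date     # close out the old row
    dim.loc[current, "is_current"] = False
    new_row = {
        "customer_id": customer_id, "city": new_city,
        "valid_from": change_date, "valid_to": "9999-12-31", "is_current": True,
    }
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim = apply_change(dim, customer_id=1, new_city="LA", change_date="2023-01-01")
print(dim)  # two rows: historical NYC row plus current LA row
```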

Q25: What factors do you consider when choosing a data storage solution?

| Factor | Considerations |
|---|---|
| Scale | Current volume, projected growth |
| Query patterns | OLTP (many small operations) vs OLAP (large aggregations) |
| Latency requirements | Milliseconds vs seconds vs minutes acceptable |
| Consistency needs | Strong vs eventual consistency |
| Schema stability | Fixed vs frequently evolving |
| Cost | Storage, compute, operational overhead |
| Team expertise | Existing knowledge and maintenance capability |

Team expertise is a significant factor. A well-understood technology often outperforms a theoretically superior but unfamiliar alternative.