SQL for Data Science Interviews

SQL is a required skill for data science roles. This section covers query patterns and concepts commonly tested in interviews.

Query Execution Order

SQL statements execute in a specific order that differs from the written syntax:

Order	Clause	Function
1	FROM / JOIN	Identify source tables
2	WHERE	Filter individual rows
3	GROUP BY	Create groups
4	HAVING	Filter groups
5	SELECT	Choose columns
6	ORDER BY	Sort results
7	LIMIT	Restrict output rows

Implication: Column aliases defined in SELECT cannot be referenced in WHERE because SELECT executes after WHERE.

Filtering with WHERE

This query selects all columns from the users table where the status is 'active', the account was created after January 1, 2024, the country is one of US, UK, or CA, and the email is not null.

Comparison Operators

Operator	Description
`=`, `!=`, `<`, `>`, `<=`, `>=`	Standard comparisons
`IN (list)`	Matches any value in list
`BETWEEN x AND y`	Inclusive range
`LIKE 'pattern%'`	Pattern matching (% = any characters)
`IS NULL`, `IS NOT NULL`	NULL value checks

Aggregation

This query groups users by country and calculates: the total number of users per country, the count of distinct cities, the average revenue, and the total revenue. It filters to only include countries with more than 100 users and orders results by total revenue in descending order.

Requirement: All non-aggregated columns in SELECT must appear in GROUP BY.

JOIN Operations

INNER JOIN

Returns only rows with matches in both tables.

An INNER JOIN between users and orders on the user ID returns only users who have placed orders, along with their order amounts.

LEFT JOIN

Returns all rows from left table, with NULL for non-matching right table rows.

A LEFT JOIN between users and orders returns all users, with order amounts for those who have orders and NULL for those who have not placed any orders.

Self Join

Joins a table to itself.

A self-join on the employees table matching employee manager_id to manager id returns each employee paired with their manager's name.

JOIN Comparison

JOIN Type	Left Table Rows	Right Table Rows
INNER	Only matching	Only matching
LEFT	All	Only matching (NULL if none)
RIGHT	Only matching (NULL if none)	All
FULL OUTER	All	All

Window Functions

Window functions compute values across related rows without collapsing them into groups.

Syntax

Window functions use the OVER clause with optional PARTITION BY (to create partitions similar to GROUP BY) and ORDER BY (to order rows within each partition).

Ranking Functions

This query selects employee name, department, and salary, then assigns three types of rankings within each department ordered by salary (descending): ROW_NUMBER assigns unique sequential numbers, RANK assigns the same rank for ties and skips the next number, and DENSE_RANK assigns the same rank for ties without skipping numbers.

Function	Tie Handling	Example Sequence
ROW_NUMBER()	No ties	1, 2, 3, 4
RANK()	Same rank, skip next	1, 2, 2, 4
DENSE_RANK()	Same rank, no skip	1, 2, 2, 3

LAG and LEAD

Access values from previous or subsequent rows.

This query selects date and revenue from daily_sales, then uses LAG to get the previous day's revenue and calculates the daily change by subtracting the previous day's revenue from the current day's revenue.

Running Totals

This query calculates a running total by using SUM as a window function ordered by date, which accumulates the revenue from the beginning up to the current row.

Common Table Expressions (CTEs)

CTEs create named temporary result sets for improved readability.

This query uses two CTEs: the first (active_users) selects user_id and country for active users, and the second (user_revenue) calculates total order amount per user. The main query then joins these CTEs to calculate average revenue per country for active users.

Common Query Patterns

Top N Per Group

Retrieve the top 3 products by revenue in each category:

This query uses a CTE to assign row numbers within each category (partitioned by category, ordered by revenue descending), then filters to keep only the top 3 products per category.

Year-Over-Year Comparison

This query first aggregates sales data by month using DATE_TRUNC, then uses LAG with an offset of 12 to get the same month from the previous year. It calculates year-over-year growth as a percentage by comparing current revenue to the revenue from 12 months prior.

Retention Analysis

Calculate the percentage of users who signed up in one month and were active the following month:

This query creates two CTEs: one for user cohorts (grouping users by their signup month) and one for activity (distinct user activity by month). It then joins these and calculates: the total users in each cohort, how many were active in the following month, and the retention rate as a percentage.

Finding Duplicates

This query groups users by email and filters to show only emails that appear more than once, effectively identifying duplicate email addresses in the database.

Percentiles

This query calculates the median (50th percentile) and 90th percentile of salaries using PERCENTILE_CONT, which performs continuous interpolation between values.

Cumulative Distribution

This query uses CUME_DIST to calculate the cumulative distribution (percentile rank) for each user's revenue, showing what percentage of users have revenue at or below that value.

NULL Handling

NULL values require special handling in SQL.

Expression	Result
`NULL = NULL`	NULL (not TRUE)
`NULL != value`	NULL
`COUNT(*)`	Counts all rows
`COUNT(column)`	Counts non-NULL values only

Important: A query filtering for status not equal to 'active' will NOT find rows where status is NULL. To include NULL values, you must explicitly add an OR condition checking IS NULL.

Division Operations

Division by Zero

Dividing success by total directly can cause errors when total is zero. The safe approach uses a CASE statement to check if total is greater than zero before performing the division, returning 0 otherwise.

Integer Division

Some databases perform integer division by default. To force a decimal result, either use a decimal literal (like 5.0 instead of 5) or explicitly CAST one of the operands to DECIMAL.

Performance Considerations

Technique	Description
Filter early	Place conditions in WHERE rather than HAVING when possible
Select specific columns	Avoid SELECT * in production queries
Use EXISTS for existence checks	Often faster than IN for large subqueries
Index awareness	Filter and join on indexed columns

Query Execution Order​

Filtering with WHERE​

Comparison Operators​

Aggregation​

JOIN Operations​

INNER JOIN​

LEFT JOIN​

Self Join​

JOIN Comparison​

Window Functions​

Syntax​

Ranking Functions​

LAG and LEAD​

Running Totals​

Common Table Expressions (CTEs)​

Common Query Patterns​

Top N Per Group​

Year-Over-Year Comparison​

Retention Analysis​

Finding Duplicates​

Percentiles​

Cumulative Distribution​

NULL Handling​

Division Operations​

Division by Zero​

Integer Division​

Performance Considerations​

Table of Contents