Skip to main content

SQL for Data Science Interviews

SQL is a required skill for data science roles. This section covers query patterns and concepts commonly tested in interviews.

Query Execution Order

SQL statements execute in a specific order that differs from the written syntax:

OrderClauseFunction
1FROM / JOINIdentify source tables
2WHEREFilter individual rows
3GROUP BYCreate groups
4HAVINGFilter groups
5SELECTChoose columns
6ORDER BYSort results
7LIMITRestrict output rows

Implication: Column aliases defined in SELECT cannot be referenced in WHERE because SELECT executes after WHERE.

Filtering with WHERE

This query selects all columns from the users table where the status is 'active', the account was created after January 1, 2024, the country is one of US, UK, or CA, and the email is not null.

Comparison Operators

OperatorDescription
=, !=, <, >, <=, >=Standard comparisons
IN (list)Matches any value in list
BETWEEN x AND yInclusive range
LIKE 'pattern%'Pattern matching (% = any characters)
IS NULL, IS NOT NULLNULL value checks

Aggregation

This query groups users by country and calculates: the total number of users per country, the count of distinct cities, the average revenue, and the total revenue. It filters to only include countries with more than 100 users and orders results by total revenue in descending order.

Requirement: All non-aggregated columns in SELECT must appear in GROUP BY.

JOIN Operations

INNER JOIN

Returns only rows with matches in both tables.

An INNER JOIN between users and orders on the user ID returns only users who have placed orders, along with their order amounts.

LEFT JOIN

Returns all rows from left table, with NULL for non-matching right table rows.

A LEFT JOIN between users and orders returns all users, with order amounts for those who have orders and NULL for those who have not placed any orders.

Self Join

Joins a table to itself.

A self-join on the employees table matching employee manager_id to manager id returns each employee paired with their manager's name.

JOIN Comparison

JOIN TypeLeft Table RowsRight Table Rows
INNEROnly matchingOnly matching
LEFTAllOnly matching (NULL if none)
RIGHTOnly matching (NULL if none)All
FULL OUTERAllAll

Window Functions

Window functions compute values across related rows without collapsing them into groups.

Syntax

Window functions use the OVER clause with optional PARTITION BY (to create partitions similar to GROUP BY) and ORDER BY (to order rows within each partition).

Ranking Functions

This query selects employee name, department, and salary, then assigns three types of rankings within each department ordered by salary (descending): ROW_NUMBER assigns unique sequential numbers, RANK assigns the same rank for ties and skips the next number, and DENSE_RANK assigns the same rank for ties without skipping numbers.

FunctionTie HandlingExample Sequence
ROW_NUMBER()No ties1, 2, 3, 4
RANK()Same rank, skip next1, 2, 2, 4
DENSE_RANK()Same rank, no skip1, 2, 2, 3

LAG and LEAD

Access values from previous or subsequent rows.

This query selects date and revenue from daily_sales, then uses LAG to get the previous day's revenue and calculates the daily change by subtracting the previous day's revenue from the current day's revenue.

Running Totals

This query calculates a running total by using SUM as a window function ordered by date, which accumulates the revenue from the beginning up to the current row.

Common Table Expressions (CTEs)

CTEs create named temporary result sets for improved readability.

This query uses two CTEs: the first (active_users) selects user_id and country for active users, and the second (user_revenue) calculates total order amount per user. The main query then joins these CTEs to calculate average revenue per country for active users.

Common Query Patterns

Top N Per Group

Retrieve the top 3 products by revenue in each category:

This query uses a CTE to assign row numbers within each category (partitioned by category, ordered by revenue descending), then filters to keep only the top 3 products per category.

Year-Over-Year Comparison

This query first aggregates sales data by month using DATE_TRUNC, then uses LAG with an offset of 12 to get the same month from the previous year. It calculates year-over-year growth as a percentage by comparing current revenue to the revenue from 12 months prior.

Retention Analysis

Calculate the percentage of users who signed up in one month and were active the following month:

This query creates two CTEs: one for user cohorts (grouping users by their signup month) and one for activity (distinct user activity by month). It then joins these and calculates: the total users in each cohort, how many were active in the following month, and the retention rate as a percentage.

Finding Duplicates

This query groups users by email and filters to show only emails that appear more than once, effectively identifying duplicate email addresses in the database.

Percentiles

This query calculates the median (50th percentile) and 90th percentile of salaries using PERCENTILE_CONT, which performs continuous interpolation between values.

Cumulative Distribution

This query uses CUME_DIST to calculate the cumulative distribution (percentile rank) for each user's revenue, showing what percentage of users have revenue at or below that value.

NULL Handling

NULL values require special handling in SQL.

ExpressionResult
NULL = NULLNULL (not TRUE)
NULL != valueNULL
COUNT(*)Counts all rows
COUNT(column)Counts non-NULL values only

Important: A query filtering for status not equal to 'active' will NOT find rows where status is NULL. To include NULL values, you must explicitly add an OR condition checking IS NULL.

Division Operations

Division by Zero

Dividing success by total directly can cause errors when total is zero. The safe approach uses a CASE statement to check if total is greater than zero before performing the division, returning 0 otherwise.

Integer Division

Some databases perform integer division by default. To force a decimal result, either use a decimal literal (like 5.0 instead of 5) or explicitly CAST one of the operands to DECIMAL.

Performance Considerations

TechniqueDescription
Filter earlyPlace conditions in WHERE rather than HAVING when possible
Select specific columnsAvoid SELECT * in production queries
Use EXISTS for existence checksOften faster than IN for large subqueries
Index awarenessFilter and join on indexed columns