SQL for Data Science Interviews
SQL is a required skill for data science roles. This section covers query patterns and concepts commonly tested in interviews.
Query Execution Order
SQL statements execute in a specific order that differs from the written syntax:
| Order | Clause | Function |
|---|---|---|
| 1 | FROM / JOIN | Identify source tables |
| 2 | WHERE | Filter individual rows |
| 3 | GROUP BY | Create groups |
| 4 | HAVING | Filter groups |
| 5 | SELECT | Choose columns |
| 6 | ORDER BY | Sort results |
| 7 | LIMIT | Restrict output rows |
Implication: Column aliases defined in SELECT cannot be referenced in WHERE because SELECT executes after WHERE.
Filtering with WHERE
This query selects all columns from the users table where the status is 'active', the account was created after January 1, 2024, the country is one of US, UK, or CA, and the email is not null.
Comparison Operators
| Operator | Description |
|---|---|
=, !=, <, >, <=, >= | Standard comparisons |
IN (list) | Matches any value in list |
BETWEEN x AND y | Inclusive range |
LIKE 'pattern%' | Pattern matching (% = any characters) |
IS NULL, IS NOT NULL | NULL value checks |
Aggregation
This query groups users by country and calculates: the total number of users per country, the count of distinct cities, the average revenue, and the total revenue. It filters to only include countries with more than 100 users and orders results by total revenue in descending order.
Requirement: All non-aggregated columns in SELECT must appear in GROUP BY.
JOIN Operations
INNER JOIN
Returns only rows with matches in both tables.
An INNER JOIN between users and orders on the user ID returns only users who have placed orders, along with their order amounts.
LEFT JOIN
Returns all rows from left table, with NULL for non-matching right table rows.
A LEFT JOIN between users and orders returns all users, with order amounts for those who have orders and NULL for those who have not placed any orders.
Self Join
Joins a table to itself.
A self-join on the employees table matching employee manager_id to manager id returns each employee paired with their manager's name.
JOIN Comparison
| JOIN Type | Left Table Rows | Right Table Rows |
|---|---|---|
| INNER | Only matching | Only matching |
| LEFT | All | Only matching (NULL if none) |
| RIGHT | Only matching (NULL if none) | All |
| FULL OUTER | All | All |
Window Functions
Window functions compute values across related rows without collapsing them into groups.
Syntax
Window functions use the OVER clause with optional PARTITION BY (to create partitions similar to GROUP BY) and ORDER BY (to order rows within each partition).
Ranking Functions
This query selects employee name, department, and salary, then assigns three types of rankings within each department ordered by salary (descending): ROW_NUMBER assigns unique sequential numbers, RANK assigns the same rank for ties and skips the next number, and DENSE_RANK assigns the same rank for ties without skipping numbers.
| Function | Tie Handling | Example Sequence |
|---|---|---|
| ROW_NUMBER() | No ties | 1, 2, 3, 4 |
| RANK() | Same rank, skip next | 1, 2, 2, 4 |
| DENSE_RANK() | Same rank, no skip | 1, 2, 2, 3 |
LAG and LEAD
Access values from previous or subsequent rows.
This query selects date and revenue from daily_sales, then uses LAG to get the previous day's revenue and calculates the daily change by subtracting the previous day's revenue from the current day's revenue.
Running Totals
This query calculates a running total by using SUM as a window function ordered by date, which accumulates the revenue from the beginning up to the current row.
Common Table Expressions (CTEs)
CTEs create named temporary result sets for improved readability.
This query uses two CTEs: the first (active_users) selects user_id and country for active users, and the second (user_revenue) calculates total order amount per user. The main query then joins these CTEs to calculate average revenue per country for active users.
Common Query Patterns
Top N Per Group
Retrieve the top 3 products by revenue in each category:
This query uses a CTE to assign row numbers within each category (partitioned by category, ordered by revenue descending), then filters to keep only the top 3 products per category.
Year-Over-Year Comparison
This query first aggregates sales data by month using DATE_TRUNC, then uses LAG with an offset of 12 to get the same month from the previous year. It calculates year-over-year growth as a percentage by comparing current revenue to the revenue from 12 months prior.
Retention Analysis
Calculate the percentage of users who signed up in one month and were active the following month:
This query creates two CTEs: one for user cohorts (grouping users by their signup month) and one for activity (distinct user activity by month). It then joins these and calculates: the total users in each cohort, how many were active in the following month, and the retention rate as a percentage.
Finding Duplicates
This query groups users by email and filters to show only emails that appear more than once, effectively identifying duplicate email addresses in the database.
Percentiles
This query calculates the median (50th percentile) and 90th percentile of salaries using PERCENTILE_CONT, which performs continuous interpolation between values.
Cumulative Distribution
This query uses CUME_DIST to calculate the cumulative distribution (percentile rank) for each user's revenue, showing what percentage of users have revenue at or below that value.
NULL Handling
NULL values require special handling in SQL.
| Expression | Result |
|---|---|
NULL = NULL | NULL (not TRUE) |
NULL != value | NULL |
COUNT(*) | Counts all rows |
COUNT(column) | Counts non-NULL values only |
Important: A query filtering for status not equal to 'active' will NOT find rows where status is NULL. To include NULL values, you must explicitly add an OR condition checking IS NULL.
Division Operations
Division by Zero
Dividing success by total directly can cause errors when total is zero. The safe approach uses a CASE statement to check if total is greater than zero before performing the division, returning 0 otherwise.
Integer Division
Some databases perform integer division by default. To force a decimal result, either use a decimal literal (like 5.0 instead of 5) or explicitly CAST one of the operands to DECIMAL.
Performance Considerations
| Technique | Description |
|---|---|
| Filter early | Place conditions in WHERE rather than HAVING when possible |
| Select specific columns | Avoid SELECT * in production queries |
| Use EXISTS for existence checks | Often faster than IN for large subqueries |
| Index awareness | Filter and join on indexed columns |