Python Coding for Data Science Interviews
Python interviews for data science roles focus on practical data manipulation with pandas, NumPy, and the standard library. This section covers the patterns and techniques most commonly tested.
Interview Question Categories
| Type | Focus Area | Example |
|---|---|---|
| Data manipulation | pandas/numpy operations | Clean dataset and compute metrics |
| Algorithms | Basic computer science concepts | Find duplicates, implement moving average |
| Statistics | Implementing statistical functions | Write correlation function from scratch |
| Data structures | Lists, dictionaries, sets | Group records efficiently |
Python Fundamentals
List Comprehensions
Basic comprehension: Creates a list of squares for numbers 0-9 by iterating through range(10) and squaring each value.
With condition: Creates squares only for even numbers by adding a conditional filter (if x % 2 == 0) to the comprehension.
Nested comprehension: Creates a 3x3 matrix where each element is the product of its row and column indices.
Dictionary comprehension: Creates a dictionary mapping each word to its length.
Set comprehension: Creates a set of unique first characters from a list of words.
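A compact sketch of each pattern described above:

```python
squares = [x**2 for x in range(10)]                     # basic comprehension
even_squares = [x**2 for x in range(10) if x % 2 == 0]  # with condition
matrix = [[i * j for j in range(3)] for i in range(3)]  # nested: 3x3 of i*j

words = ["data", "science", "python"]
lengths = {w: len(w) for w in words}                    # dict comprehension
first_chars = {w[0] for w in words}                     # set comprehension
```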
Lambda Functions
Sort by second element: Uses sorted() with a lambda key function that returns the second element of each pair for comparison.
Filter positive numbers: Uses filter() with a lambda that returns True only for positive values.
Map transformation: Uses map() with a lambda to double each number in the list.
pandas apply: Applies a lambda to a DataFrame column to create a new column, categorizing values as 'high' if greater than 100, otherwise 'low'.
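Minimal examples of these patterns (the DataFrame column names here are illustrative):

```python
import pandas as pd

pairs = [(1, 3), (2, 1), (4, 2)]
by_second = sorted(pairs, key=lambda p: p[1])               # sort by 2nd element
positives = list(filter(lambda x: x > 0, [-2, 5, -1, 3]))   # keep positives
doubled = list(map(lambda x: x * 2, [1, 2, 3]))             # double each value

df = pd.DataFrame({"value": [50, 150, 120]})
df["level"] = df["value"].apply(lambda v: "high" if v > 100 else "low")
```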
Collections Module
Counter: Creates a dictionary-like object that counts occurrences of each element. The most_common(n) method returns the n most frequent items.
defaultdict: A dictionary that automatically creates default values for missing keys. Using defaultdict(list) allows appending to lists without first checking if the key exists.
deque: A double-ended queue that supports efficient append and pop operations from both ends. Setting maxlen creates a fixed-size sliding window that automatically discards old items.
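A short sketch of all three tools:

```python
from collections import Counter, defaultdict, deque

counts = Counter("mississippi")
counts.most_common(2)            # [('i', 4), ('s', 4)] -- two most frequent

groups = defaultdict(list)       # missing keys default to an empty list
for key, value in [("a", 1), ("b", 2), ("a", 3)]:
    groups[key].append(value)    # no need to check whether key exists first

window = deque(maxlen=3)         # fixed-size sliding window
for x in range(5):
    window.append(x)             # oldest item is discarded automatically
```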
Itertools Module
groupby: Groups consecutive elements with the same key. Data must be sorted by the grouping key first. Iterates over groups yielding (key, group_iterator) pairs.
chain: Flattens multiple iterables into a single sequence by chaining them together without creating intermediate lists.
combinations: Generates all unique r-length combinations from an iterable. For example, combinations of 2 items from [1, 2, 3] produces (1,2), (1,3), (2,3).
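In code:

```python
from itertools import groupby, chain, combinations

data = [("a", 1), ("a", 2), ("b", 3)]         # already sorted by key
for key, group in groupby(data, key=lambda t: t[0]):
    print(key, list(group))                   # a [('a', 1), ('a', 2)] / b [('b', 3)]

flat = list(chain([1, 2], [3, 4], [5]))       # [1, 2, 3, 4, 5]
pairs = list(combinations([1, 2, 3], 2))      # [(1, 2), (1, 3), (2, 3)]
```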
NumPy Operations
Array Creation
From list: Convert a Python list to a NumPy array using np.array().
Common patterns (sketched in the example after this list):
- zeros: Creates an array filled with zeros of specified shape (e.g., 3x4 matrix)
- ones: Creates an array filled with ones of specified shape
- arange: Creates evenly spaced values within a range with specified step size
- linspace: Creates evenly spaced values between start and end with specified count
- random.randn: Generates random samples from a standard normal distribution
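```python
import numpy as np

arr = np.array([1, 2, 3])      # from a Python list
z = np.zeros((3, 4))           # 3x4 matrix of zeros
o = np.ones((2, 2))            # 2x2 matrix of ones
r = np.arange(0, 10, 2)        # [0 2 4 6 8] -- start, stop, step
lin = np.linspace(0, 1, 5)     # 5 evenly spaced values from 0 to 1
n = np.random.randn(3, 3)      # 3x3 samples from a standard normal
```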
Vectorized Operations
Vectorized operations run in NumPy's compiled code and are typically orders of magnitude faster than equivalent Python loops.
Avoid loops: Iterating through an array with a for loop to perform element-wise operations is slow.
Preferred vectorized approach: Apply operations directly to the array (e.g., arr * 2 + 1), which NumPy executes efficiently in compiled code.
Boolean indexing: Use conditions to select elements (arr[arr > 0] returns positive values) or modify elements in place (setting all negative values to zero).
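The contrast in code:

```python
import numpy as np

arr = np.array([-2, -1, 0, 1, 2])

# Slow: element-wise Python loop
result = np.empty_like(arr)
for i in range(len(arr)):
    result[i] = arr[i] * 2 + 1

# Fast: the same operation, vectorized
result = arr * 2 + 1

positives = arr[arr > 0]       # boolean indexing: select elements
arr[arr < 0] = 0               # modify in place: clip negatives to zero
```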
Statistical Operations
Basic statistics: NumPy provides methods for mean, standard deviation, variance, min, max, median, and percentiles.
Aggregation along axis: Specifying axis=0 aggregates down the rows, producing one result per column; axis=1 aggregates across the columns, producing one result per row.
Reshape: The reshape method changes array dimensions while preserving data. The flatten method converts any array to 1D.
Concatenate: Use concatenate to join arrays along an existing axis, vstack to stack arrays vertically (row-wise), and hstack to stack horizontally (column-wise).
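Illustrating these operations on a small matrix:

```python
import numpy as np

m = np.arange(12).reshape(3, 4)    # reshape 1D data into a 3x4 matrix

m.mean(), m.std(), m.var()         # basic statistics
m.min(), m.max()
np.median(m), np.percentile(m, 90)

m.sum(axis=0)                      # one total per column (collapses rows)
m.sum(axis=1)                      # one total per row (collapses columns)

flat = m.flatten()                 # back to 1D
stacked = np.vstack([m, m])        # 6x4: stack row-wise
wide = np.hstack([m, m])           # 3x8: stack column-wise
joined = np.concatenate([m, m], axis=0)  # same as vstack here
```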
Broadcasting
NumPy automatically expands dimensions when shapes are compatible.
Example: Adding a 1D row array to a 2D matrix broadcasts the row to match the matrix shape, adding it to each row.
Column normalization: Calculate column means and standard deviations, then subtract means and divide by standard deviations. Broadcasting applies these operations element-wise across all rows.
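Both examples in code:

```python
import numpy as np

matrix = np.arange(6.0).reshape(2, 3)
row = np.array([10.0, 20.0, 30.0])
shifted = matrix + row             # row is broadcast to each matrix row

# Column normalization: subtract column means, divide by column stds
means = matrix.mean(axis=0)        # shape (3,)
stds = matrix.std(axis=0)          # shape (3,)
normalized = (matrix - means) / stds
```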
Pandas Operations
DataFrame Creation and Selection
From dictionary: Create a DataFrame by passing a dictionary where keys become column names and values are lists of column data.
Selection methods (sketched in the example after this list):
- Single column: Use bracket notation with column name to get a Series
- Multiple columns: Pass a list of column names to get a DataFrame
- Row by label: Use .loc[label] for label-based indexing
- Row by position: Use .iloc[position] for integer-based indexing
- Specific cell: Combine .loc with row label and column name
- Filter rows: Pass a boolean condition to .loc
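```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Cal"],
                   "age": [30, 25, 35],
                   "city": ["NY", "LA", "SF"]})

df["age"]                   # single column -> Series
df[["name", "age"]]         # multiple columns -> DataFrame
df.loc[0]                   # row by label
df.iloc[-1]                 # row by position
df.loc[0, "name"]           # specific cell
df.loc[df["age"] > 28]      # filter rows with a boolean condition
```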
Filtering
Single condition: Filter by passing a boolean Series to bracket notation.
Multiple conditions: Combine conditions using & (and) and | (or) operators. Each condition must be in parentheses. Do not use Python's 'and' and 'or' keywords.
Query method: Alternative syntax using a string expression, more readable for complex filters.
isin method: Check if values are in a list of acceptable values.
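The four styles side by side (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [30, 25, 35], "city": ["NY", "LA", "SF"]})

adults = df[df["age"] > 28]                          # single boolean condition
both = df[(df["age"] > 24) & (df["city"] != "SF")]   # parenthesize each condition
q = df.query("age > 24 and city != 'SF'")            # string-expression form
coastal = df[df["city"].isin(["NY", "LA"])]          # membership against a list
```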
GroupBy Operations
Basic aggregation: Group by a column and apply an aggregation function to another column.
Multiple aggregations: Use .agg() with a dictionary specifying different aggregations for different columns.
Named aggregations: Use .agg() with keyword arguments specifying output column names and (input_column, function) tuples.
Transform: Apply a function to each group while preserving the original DataFrame shape. Useful for adding group-level statistics back to individual rows (e.g., normalizing values within each group).
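A minimal sketch of each groupby pattern (the team/score columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"team": ["a", "a", "b", "b"],
                   "score": [10, 20, 30, 50]})

df.groupby("team")["score"].mean()                   # basic aggregation

df.groupby("team").agg({"score": ["mean", "max"]})   # multiple aggregations

df.groupby("team").agg(                              # named aggregations
    avg_score=("score", "mean"),
    max_score=("score", "max"),
)

# Transform: group-level statistic broadcast back to each row
df["team_mean"] = df.groupby("team")["score"].transform("mean")
df["z_in_team"] = df.groupby("team")["score"].transform(
    lambda s: (s - s.mean()) / s.std()
)
```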
Merging and Joining
merge (SQL-like JOIN): Combine DataFrames on common columns. Specify the join column with 'on' and join type with 'how' (inner, left, right, outer).
Different join keys: When column names differ between DataFrames, use left_on and right_on to specify the join columns.
Multiple keys: Pass a list of column names to join on multiple columns.
concat (stack DataFrames): Use concat to combine DataFrames by stacking rows (default) or columns (axis=1). Use ignore_index=True to reset the index.
| Join Type | Description |
|---|---|
| inner | Only matching rows |
| left | All left rows, matching right rows |
| right | Matching left rows, all right rows |
| outer | All rows from both tables |
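A minimal sketch of these join patterns (the tables and column names are illustrative):

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["Ann", "Bob", "Cal"]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10, 20, 15]})

inner = users.merge(orders, on="user_id", how="inner")    # SQL-style join
left = users.merge(orders, on="user_id", how="left")      # keep all users

# Different key names on each side
events = pd.DataFrame({"uid": [2, 3], "event": ["click", "view"]})
joined = users.merge(events, left_on="user_id", right_on="uid")

stacked = pd.concat([users, users], ignore_index=True)    # stack rows, new index
side = pd.concat([users, users.add_suffix("_2")], axis=1) # stack columns
```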
Handling Missing Data
Check for nulls: Use isnull().sum() to count nulls per column, or isna() to get a boolean mask.
Drop nulls: Use dropna() to remove rows with any null values, or specify a subset of columns to check.
Fill nulls: Use fillna() to replace nulls with a constant value, column means, forward-fill from previous values, or specify different fill values per column using a dictionary.
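In code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, "x", "y"]})

df.isnull().sum()                     # null count per column
df.dropna()                           # drop rows with any null
df.dropna(subset=["a"])               # only check column "a"
df.fillna({"a": df["a"].mean(), "b": "missing"})  # per-column fill values
df["a"].ffill()                       # forward-fill from previous values
```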
Time Series Operations
Convert to datetime: Use pd.to_datetime() to convert string columns to datetime objects.
Extract components: Access .dt accessor to extract year, month, day of week, and other datetime components.
Resample: Group time series data by frequency (D=daily, M=monthly, etc.) and apply aggregations. Requires datetime index.
Rolling window: Calculate statistics over a moving window of specified size (e.g., 7-day moving average).
Shift (lag): Create lagged values using shift() or calculate percentage changes with pct_change().
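A sketch of these operations (the date/sales data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2024-01-01", "2024-01-02", "2024-01-03"],
                   "sales": [100, 120, 90]})
df["date"] = pd.to_datetime(df["date"])    # strings -> datetime
df["year"] = df["date"].dt.year            # .dt accessor for components
df["dow"] = df["date"].dt.dayofweek

ts = df.set_index("date")["sales"]
monthly = ts.resample("M").sum()           # month-end frequency ("ME" in newer pandas)
rolling = ts.rolling(window=2).mean()      # moving average
lagged = ts.shift(1)                       # previous value
pct = ts.pct_change()                      # period-over-period change
```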
Common Interview Patterns
Moving Average
Pure Python implementation: Iterate through the array, returning None for the first (window-1) elements, then computing the average of the previous window elements for each position.
NumPy implementation: Use np.convolve with a kernel of ones divided by window size. The 'valid' mode returns only complete windows.
pandas implementation: Use the rolling() method with mean() aggregation, which handles edge cases and returns NaN for incomplete windows.
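All three approaches:

```python
import numpy as np
import pandas as pd

def moving_average(values, window):
    """Pure Python: None until a full window is available."""
    result = []
    for i in range(len(values)):
        if i < window - 1:
            result.append(None)
        else:
            result.append(sum(values[i - window + 1 : i + 1]) / window)
    return result

values = [1, 2, 3, 4, 5]
moving_average(values, 3)                          # [None, None, 2.0, 3.0, 4.0]

np.convolve(values, np.ones(3) / 3, mode="valid")  # only complete windows
pd.Series(values).rolling(3).mean()                # NaN for incomplete windows
```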
Finding Duplicates
In a list: Maintain two sets: one for seen elements and one for duplicates. For each element, check if it is in seen (add to duplicates if so), then add to seen.
In pandas: Use the duplicated() method with keep=False to mark all duplicate rows (not just the first or last). Specify subset to check only certain columns.
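Both versions:

```python
import pandas as pd

def find_duplicates(items):
    seen, dupes = set(), set()
    for item in items:
        if item in seen:
            dupes.add(item)         # second or later occurrence
        seen.add(item)
    return dupes

find_duplicates([1, 2, 2, 3, 1])    # {1, 2}

df = pd.DataFrame({"id": [1, 2, 2], "val": ["a", "b", "b"]})
df[df.duplicated(keep=False)]                  # all fully duplicate rows
df[df.duplicated(subset=["id"], keep=False)]   # duplicates by "id" only
```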
Two Sum
Find two numbers that sum to a target value.
Algorithm: Use a dictionary to store each number's index. For each number, calculate its complement (target minus current number). If the complement exists in the dictionary, return both indices. Otherwise, store the current number and index. This achieves O(n) time complexity with a single pass through the array.
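The standard dictionary-based solution:

```python
def two_sum(nums, target):
    """Return indices of two numbers summing to target; O(n) single pass."""
    seen = {}                       # value -> index
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return seen[complement], i
        seen[num] = i
    return None                     # no pair found

two_sum([2, 7, 11, 15], 9)          # (0, 1)
```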
Implementing Correlation
Pearson correlation coefficient from scratch: Calculate means of both arrays, then compute the numerator as the sum of products of deviations from means. The denominator is the product of the standard deviations (square roots of sum of squared deviations). Divide numerator by denominator.
NumPy implementation: Use np.corrcoef(x, y) which returns a correlation matrix; extract the [0, 1] element for the correlation between x and y.
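A from-scratch version alongside the NumPy one-liner:

```python
import math
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient from scratch."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mean_x) ** 2 for a in x))
           * math.sqrt(sum((b - mean_y) ** 2 for b in y)))
    return num / den

x, y = [1, 2, 3, 4], [2, 4, 5, 9]
pearson(x, y)                 # ~0.965
np.corrcoef(x, y)[0, 1]       # same value from NumPy
```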
Sampling Methods
Random sample without replacement: Use random.sample() to select k unique items from a population.
Random sample with replacement (bootstrap): Use random.choices() with the k parameter, or random.choice() in a list comprehension, to allow the same item to be selected multiple times.
Weighted random sample: Use random.choices() with a weights parameter to bias selection toward certain items.
pandas sampling: Use df.sample() to randomly select rows. Specify n for count, frac for fraction, replace=True for bootstrap sampling, and weights for probability-weighted selection.
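A sketch of each method:

```python
import random
import pandas as pd

population = list(range(100))
no_replace = random.sample(population, k=10)                # unique items
bootstrap = [random.choice(population) for _ in range(10)]  # with replacement
weighted = random.choices(["a", "b", "c"], weights=[5, 1, 1], k=10)

df = pd.DataFrame({"x": range(100)})
df.sample(n=10)                        # 10 random rows
df.sample(frac=0.1, replace=True)      # 10% bootstrap sample
```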
Reservoir Sampling
Sample k items from a stream of unknown size.
Algorithm: Fill the reservoir with the first k items. For each subsequent item at position i, generate a random integer j between 0 and i (inclusive). If j is less than k, replace the item at position j in the reservoir. This ensures each item has equal probability of being in the final sample, regardless of stream size.
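The algorithm in code:

```python
import random

def reservoir_sample(stream, k):
    """Uniformly sample k items from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill with the first k items
        else:
            j = random.randint(0, i)     # inclusive on both ends
            if j < k:
                reservoir[j] = item      # keep item with probability k/(i+1)
    return reservoir

reservoir_sample(range(1_000_000), 5)
```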
Pivot Tables
Basic pivot: Use pivot_table() with values (column to aggregate), index (row labels), columns (column labels), and aggfunc (aggregation function) to reshape data.
Multiple aggregations: Pass a list of aggregation functions to aggfunc to compute multiple statistics in one pivot table.
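Both forms (the region/product data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"region": ["E", "E", "W", "W"],
                   "product": ["a", "b", "a", "b"],
                   "sales": [10, 20, 30, 40]})

pd.pivot_table(df, values="sales", index="region",
               columns="product", aggfunc="sum")

pd.pivot_table(df, values="sales", index="region",
               aggfunc=["mean", "sum"])    # multiple statistics at once
```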
Performance Optimization
Avoid DataFrame Iteration
Avoid row iteration: Using iterrows() in a loop is slow because it creates a new Series for each row.
Preferred vectorized approach: Apply operations directly to columns. pandas executes these operations in optimized compiled code, often 100x faster than row-by-row iteration.
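The contrast in code:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0], "qty": [3, 4]})

# Slow: builds a Series object for every row
total = 0.0
for _, row in df.iterrows():
    total += row["price"] * row["qty"]

# Fast: one vectorized expression over whole columns
total = (df["price"] * df["qty"]).sum()
```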
Data Type Optimization
Convert to categorical: Use astype('category') for columns with repeated string values. Categorical storage uses integer codes internally, reducing memory usage significantly.
Downcast numeric types: Use pd.to_numeric with downcast parameter to automatically select the smallest integer or float type that can represent the data.
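Both optimizations:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", "NY", "LA"] * 1000,
                   "count": [1, 2, 3, 4] * 1000})

df["city"] = df["city"].astype("category")                    # integer codes internally
df["count"] = pd.to_numeric(df["count"], downcast="integer")  # smallest int type
df.memory_usage(deep=True)                                    # verify the savings
```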
Chunked Reading
For large files that exceed available memory:
Chunked processing: Use the chunksize parameter in read_csv to read the file in smaller pieces. Iterate through chunks, process each one, collect results, then concatenate. This allows processing files larger than available RAM.
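A minimal sketch of the pattern (the file name and column are hypothetical):

```python
import pandas as pd

results = []
for chunk in pd.read_csv("large_file.csv", chunksize=100_000):
    processed = chunk[chunk["amount"] > 0]    # per-chunk processing step
    results.append(processed)
combined = pd.concat(results, ignore_index=True)
```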
Practice Problem Types
| Problem | Concepts Tested |
|---|---|
| Calculate user retention by cohort | groupby, joins, date manipulation |
| Find top N items per category | groupby, sorting, rank |
| Detect anomalies in time series | rolling statistics, z-scores |
| Implement A/B test analysis | statistics, group comparison |
| Clean and merge multiple data sources | data wrangling, joins |
| Compute similarity between users | matrix operations, cosine similarity |