
Python Coding for Data Science Interviews

Data science Python interviews focus on practical data manipulation using pandas, numpy, and standard library tools. This section covers the patterns and techniques commonly tested.

Interview Question Categories

Type | Focus Area | Example
Data manipulation | pandas/numpy operations | Clean dataset and compute metrics
Algorithms | Basic computer science concepts | Find duplicates, implement moving average
Statistics | Implementing statistical functions | Write correlation function from scratch
Data structures | Lists, dictionaries, sets | Group records efficiently

Python Fundamentals

List Comprehensions

Basic comprehension: Creates a list of squares for numbers 0-9 by iterating through range(10) and squaring each value.

With condition: Creates squares only for even numbers by adding a conditional filter (if x % 2 == 0) to the comprehension.

Nested comprehension: Creates a 3x3 matrix where each element is the product of its row and column indices.

Dictionary comprehension: Creates a dictionary mapping each word to its length.

Set comprehension: Creates a set of unique first characters from a list of words.
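
A minimal sketch of all five patterns (the word list is illustrative toy data):

```python
# Basic comprehension: squares of 0-9
squares = [x**2 for x in range(10)]

# With condition: squares of even numbers only
even_squares = [x**2 for x in range(10) if x % 2 == 0]

# Nested comprehension: 3x3 matrix of row * column products
matrix = [[row * col for col in range(3)] for row in range(3)]

# Dictionary comprehension: word -> length
words = ["data", "science", "python"]
lengths = {word: len(word) for word in words}

# Set comprehension: unique first characters
first_chars = {word[0] for word in words}
```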

Lambda Functions

Sort by second element: Uses sorted() with a lambda key function that returns the second element of each pair for comparison.

Filter positive numbers: Uses filter() with a lambda that returns True only for positive values.

Map transformation: Uses map() with a lambda to double each number in the list.

pandas apply: Applies a lambda to a DataFrame column to create a new column, categorizing values as 'high' if greater than 100, otherwise 'low'.
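
A sketch of these four patterns, using made-up toy data:

```python
import pandas as pd

pairs = [(1, 'b'), (2, 'a'), (0, 'c')]
# Sort by the second element of each pair
sorted_pairs = sorted(pairs, key=lambda p: p[1])

nums = [-2, -1, 0, 1, 2]
# Keep only positive values
positives = list(filter(lambda x: x > 0, nums))
# Double each number
doubled = list(map(lambda x: x * 2, nums))

# Categorize each value as 'high' or 'low'
df = pd.DataFrame({'value': [50, 150, 120]})
df['category'] = df['value'].apply(lambda v: 'high' if v > 100 else 'low')
```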

Collections Module

Counter: Creates a dictionary-like object that counts occurrences of each element. The most_common(n) method returns the n most frequent items.

defaultdict: A dictionary that automatically creates default values for missing keys. Using defaultdict(list) allows appending to lists without first checking if the key exists.

deque: A double-ended queue that supports efficient append and pop operations from both ends. Setting maxlen creates a fixed-size sliding window that automatically discards old items.
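
A compact sketch of all three containers in use (input data is illustrative):

```python
from collections import Counter, defaultdict, deque

# Counter: count occurrences, then fetch the 2 most frequent items
counts = Counter(['a', 'b', 'a', 'c', 'a', 'b'])
print(counts.most_common(2))        # [('a', 3), ('b', 2)]

# defaultdict(list): append without checking whether the key exists
groups = defaultdict(list)
for key, value in [('x', 1), ('y', 2), ('x', 3)]:
    groups[key].append(value)       # {'x': [1, 3], 'y': [2]}

# deque with maxlen: fixed-size sliding window
window = deque(maxlen=3)
for item in range(5):
    window.append(item)             # oldest items drop off automatically
print(list(window))                 # [2, 3, 4]
```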

Itertools Module

groupby: Groups consecutive elements with the same key. Data must be sorted by the grouping key first. Iterates over groups yielding (key, group_iterator) pairs.

chain: Flattens multiple iterables into a single sequence by chaining them together without creating intermediate lists.

combinations: Generates all unique r-length combinations from an iterable. For example, combinations of 2 items from [1, 2, 3] produces (1,2), (1,3), (2,3).
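
A minimal sketch of the three tools (records are toy data):

```python
from itertools import groupby, chain, combinations

# groupby: input must already be sorted by the grouping key
records = sorted([('a', 1), ('b', 2), ('a', 3)], key=lambda r: r[0])
for key, group in groupby(records, key=lambda r: r[0]):
    print(key, list(group))         # a [('a', 1), ('a', 3)], then b [('b', 2)]

# chain: flatten multiple iterables without intermediate lists
flat = list(chain([1, 2], [3, 4], [5]))   # [1, 2, 3, 4, 5]

# combinations: all unique 2-item pairs
pairs = list(combinations([1, 2, 3], 2))  # [(1, 2), (1, 3), (2, 3)]
```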

NumPy Operations

Array Creation

From list: Convert a Python list to a NumPy array using np.array().

Common patterns (combined in the sketch after this list):

  • zeros: Creates an array filled with zeros of specified shape (e.g., 3x4 matrix)
  • ones: Creates an array filled with ones of specified shape
  • arange: Creates evenly spaced values within a range with specified step size
  • linspace: Creates evenly spaced values between start and end with specified count
  • random.randn: Generates random samples from a standard normal distribution
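
```python
import numpy as np

arr = np.array([1, 2, 3])           # from a Python list
zeros = np.zeros((3, 4))            # 3x4 matrix of zeros
ones = np.ones((2, 2))              # 2x2 matrix of ones
steps = np.arange(0, 10, 2)         # [0 2 4 6 8]: range with step size
points = np.linspace(0, 1, 5)       # 5 evenly spaced values in [0, 1]
noise = np.random.randn(3)          # 3 samples from a standard normal
```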

Vectorized Operations

Vectorized operations run in NumPy's compiled code and are typically orders of magnitude faster than equivalent Python loops.

Avoid loops: Iterating through an array with a for loop to perform element-wise operations is slow.

Preferred vectorized approach: Apply operations directly to the array (e.g., arr * 2 + 1), which NumPy executes efficiently in compiled code.

Boolean indexing: Use conditions to select elements (arr[arr > 0] returns positive values) or modify elements in place (setting all negative values to zero).
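
A short sketch of the vectorized style and boolean indexing:

```python
import numpy as np

arr = np.array([-2, -1, 0, 1, 2])

# Vectorized arithmetic instead of a Python loop over elements
result = arr * 2 + 1

# Boolean indexing: select positives, then zero out negatives in place
positives = arr[arr > 0]
arr[arr < 0] = 0
```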

Statistical Operations

Basic statistics: NumPy provides methods for mean, standard deviation, variance, min, max, median, and percentiles.

Aggregation along axis: Specifying axis=0 aggregates across rows (column-wise), while axis=1 aggregates across columns (row-wise).

Reshape: The reshape method changes array dimensions while preserving data. The flatten method returns a 1D copy of any array.

Concatenate: Use concatenate to join arrays along an existing axis, vstack to stack arrays vertically (row-wise), and hstack to stack horizontally (column-wise).
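
A sketch covering the statistics, axis aggregation, reshaping, and joining calls above:

```python
import numpy as np

m = np.arange(12).reshape(3, 4)     # reshape 12 values into a 3x4 matrix

m.mean(), m.std(), m.var()          # basic statistics
m.min(), m.max()
np.median(m), np.percentile(m, 90)  # median and 90th percentile

m.mean(axis=0)                      # aggregate across rows: 4 column means
m.mean(axis=1)                      # aggregate across columns: 3 row means

flat = m.flatten()                  # 1D copy of the data

a, b = np.ones((2, 3)), np.zeros((2, 3))
np.concatenate([a, b], axis=0)      # join along an existing axis (4x3)
np.vstack([a, b])                   # stack vertically (4x3)
np.hstack([a, b])                   # stack horizontally (2x6)
```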

Broadcasting

NumPy automatically expands dimensions when shapes are compatible.

Example: Adding a 1D row array to a 2D matrix broadcasts the row to match the matrix shape, adding it to each row.

Column normalization: Calculate column means and standard deviations, then subtract means and divide by standard deviations. Broadcasting applies these operations element-wise across all rows.
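
A minimal sketch of both broadcasting cases (values are illustrative):

```python
import numpy as np

matrix = np.arange(6.0).reshape(2, 3)
row = np.array([10.0, 20.0, 30.0])

shifted = matrix + row              # row is broadcast onto each matrix row

# Column normalization: subtract column means, divide by column stds;
# both (3,)-shaped results broadcast across all rows
normalized = (matrix - matrix.mean(axis=0)) / matrix.std(axis=0)
```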

Pandas Operations

DataFrame Creation and Selection

From dictionary: Create a DataFrame by passing a dictionary where keys become column names and values are lists of column data.

Selection methods (combined in the sketch after this list):

  • Single column: Use bracket notation with column name to get a Series
  • Multiple columns: Pass a list of column names to get a DataFrame
  • Row by label: Use .loc[label] for label-based indexing
  • Row by position: Use .iloc[position] for integer-based indexing
  • Specific cell: Combine .loc with row label and column name
  • Filter rows: Pass a boolean condition to .loc

Filtering

Single condition: Filter by passing a boolean Series to bracket notation.

Multiple conditions: Combine conditions using & (and) and | (or) operators. Each condition must be in parentheses. Do not use Python's 'and' and 'or' keywords.

Query method: Alternative syntax using a string expression, which can be more readable for complex filters.

isin method: Check if values are in a list of acceptable values.
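
The same filter written all four ways, on a hypothetical sales table:

```python
import pandas as pd

df = pd.DataFrame({'dept': ['a', 'b', 'a', 'c'],
                   'sales': [100, 250, 300, 50]})

df[df['sales'] > 100]                              # single condition
df[(df['sales'] > 100) & (df['dept'] == 'a')]      # & / |, each in parentheses
df.query('sales > 100 and dept == "a"')            # string expression
df[df['dept'].isin(['a', 'b'])]                    # membership check
```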

GroupBy Operations

Basic aggregation: Group by a column and apply an aggregation function to another column.

Multiple aggregations: Use .agg() with a dictionary specifying different aggregations for different columns.

Named aggregations: Use .agg() with keyword arguments specifying output column names and (input_column, function) tuples.

Transform: Apply a function to each group while preserving the original DataFrame shape. Useful for adding group-level statistics back to individual rows (e.g., normalizing values within each group).
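
A sketch of each groupby pattern, with a made-up team/points table:

```python
import pandas as pd

df = pd.DataFrame({'team': ['x', 'x', 'y', 'y'],
                   'points': [10, 20, 30, 50]})

df.groupby('team')['points'].mean()                  # basic aggregation

df.groupby('team').agg({'points': ['mean', 'max']})  # multiple aggregations

df.groupby('team').agg(                              # named aggregations
    avg_points=('points', 'mean'),
    max_points=('points', 'max'),
)

# transform preserves the original shape: z-score within each team
group_mean = df.groupby('team')['points'].transform('mean')
group_std = df.groupby('team')['points'].transform('std')
df['z'] = (df['points'] - group_mean) / group_std
```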

Merging and Joining

merge (SQL-like JOIN): Combine DataFrames on common columns. Specify the join column with 'on' and join type with 'how' (inner, left, right, outer).

Different join keys: When column names differ between DataFrames, use left_on and right_on to specify the join columns.

Multiple keys: Pass a list of column names to join on multiple columns.

concat (stack DataFrames): Use concat to combine DataFrames by stacking rows (default) or columns (axis=1). Use ignore_index=True to reset the index.

Join Type | Description
inner | Only matching rows
left | All left rows, matching right rows
right | Matching left rows, all right rows
outer | All rows from both tables
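
A minimal sketch of merge and concat, with hypothetical users/orders tables:

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['a', 'b', 'c']})
orders = pd.DataFrame({'user_id': [1, 1, 2], 'amount': [10, 20, 30]})

pd.merge(users, orders, on='user_id', how='inner')   # SQL-like join

# Different key names on each side
renamed = orders.rename(columns={'user_id': 'uid'})
pd.merge(users, renamed, left_on='user_id', right_on='uid', how='left')

# Stack DataFrames row-wise, resetting the index
pd.concat([users, users], ignore_index=True)
```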

Handling Missing Data

Check for nulls: Use isnull().sum() to count nulls per column, or isna() to get a boolean mask.

Drop nulls: Use dropna() to remove rows with any null values, or specify a subset of columns to check.

Fill nulls: Use fillna() to replace nulls with a constant value, column means, forward-fill from previous values, or specify different fill values per column using a dictionary.
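
A short sketch of the null-handling calls above (toy data with injected NaNs):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 'x', 'y']})

df.isnull().sum()                    # null count per column
df.dropna()                          # drop rows with any null
df.dropna(subset=['a'])              # only consider column 'a'
df.fillna({'a': df['a'].mean(),      # different fill value per column
           'b': 'missing'})
df['a'].ffill()                      # forward-fill from previous values
```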

Time Series Operations

Convert to datetime: Use pd.to_datetime() to convert string columns to datetime objects.

Extract components: Access .dt accessor to extract year, month, day of week, and other datetime components.

Resample: Group time series data by frequency (D=daily, M=monthly, etc.) and apply aggregations. Requires datetime index.

Rolling window: Calculate statistics over a moving window of specified size (e.g., 7-day moving average).

Shift (lag): Create lagged values using shift() or calculate percentage changes with pct_change().
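
A sketch tying these steps together on a small hypothetical sales series:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
                   'sales': [100, 120, 90]})
df['date'] = pd.to_datetime(df['date'])       # strings -> datetime

df['year'] = df['date'].dt.year               # extract components
df['dow'] = df['date'].dt.dayofweek

ts = df.set_index('date')['sales']            # resample needs a datetime index
ts.resample('D').sum()                        # aggregate by frequency
ts.rolling(window=2).mean()                   # moving window statistic
ts.shift(1)                                   # lag by one period
ts.pct_change()                               # period-over-period change
```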

Common Interview Patterns

Moving Average

Pure Python implementation: Iterate through the array, returning None for the first (window-1) elements, then computing the average of the previous window elements for each position.

NumPy implementation: Use np.convolve with a kernel of ones divided by window size. The 'valid' mode returns only complete windows.

pandas implementation: Use the rolling() method with mean() aggregation, which handles edge cases and returns NaN for incomplete windows.
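
All three implementations side by side (function name is illustrative):

```python
import numpy as np
import pandas as pd

def moving_average(arr, window):
    """Pure Python: None until a full window is available."""
    return [None if i < window - 1
            else sum(arr[i - window + 1:i + 1]) / window
            for i in range(len(arr))]

arr = [1, 2, 3, 4, 5]
moving_average(arr, 3)                          # [None, None, 2.0, 3.0, 4.0]

np.convolve(arr, np.ones(3) / 3, mode='valid')  # only complete windows
pd.Series(arr).rolling(window=3).mean()         # NaN until the window fills
```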

Finding Duplicates

In a list: Maintain two sets: one for seen elements and one for duplicates. For each element, check if it is in seen (add to duplicates if so), then add to seen.

In pandas: Use the duplicated() method with keep=False to mark all duplicate rows (not just the first or last). Specify subset to check only certain columns.
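
A sketch of both approaches (helper name and data are illustrative):

```python
import pandas as pd

def find_duplicates(items):
    seen, dupes = set(), set()
    for item in items:
        if item in seen:
            dupes.add(item)          # second sighting: record as duplicate
        seen.add(item)
    return dupes

find_duplicates([1, 2, 2, 3, 1])     # {1, 2}

df = pd.DataFrame({'id': [1, 2, 2], 'v': [9, 8, 8]})
df[df.duplicated(subset=['id'], keep=False)]   # mark all duplicate rows
```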

Two Sum

Find two numbers that sum to a target value.

Algorithm: Use a dictionary to store each number's index. For each number, calculate its complement (target minus current number). If the complement exists in the dictionary, return both indices. Otherwise, store the current number and index. This achieves O(n) time complexity with a single pass through the array.
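
A minimal sketch of the single-pass dictionary approach:

```python
def two_sum(nums, target):
    seen = {}                                # value -> index
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:               # partner already seen
            return seen[complement], i
        seen[num] = i
    return None

two_sum([2, 7, 11, 15], 9)                   # (0, 1)
```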

Implementing Correlation

Pearson correlation coefficient from scratch: Calculate means of both arrays, then compute the numerator as the sum of products of deviations from means. The denominator is the product of the standard deviations (square roots of sum of squared deviations). Divide numerator by denominator.

NumPy implementation: Use np.corrcoef(x, y) which returns a correlation matrix; extract the [0, 1] element for the correlation between x and y.
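
A sketch of the from-scratch formula and the NumPy check (function name is illustrative):

```python
import numpy as np

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) ** 0.5
           * sum((b - my) ** 2 for b in y) ** 0.5)
    return num / den

x, y = [1, 2, 3, 4], [2, 4, 5, 8]
pearson(x, y)
np.corrcoef(x, y)[0, 1]              # same value via NumPy
```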

Sampling Methods

Random sample without replacement: Use random.sample() to select k unique items from a population.

Random sample with replacement (bootstrap): Use random.choice() in a loop or list comprehension to allow the same item to be selected multiple times.

Weighted random sample: Use random.choices() with a weights parameter to bias selection toward certain items.

pandas sampling: Use df.sample() to randomly select rows. Specify n for count, frac for fraction, replace=True for bootstrap sampling, and weights for probability-weighted selection.
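
A short sketch of each sampling call (population and weights are toy values):

```python
import random
import pandas as pd

population = list(range(100))
random.sample(population, 10)                    # without replacement
[random.choice(population) for _ in range(10)]   # with replacement (bootstrap)
random.choices(['a', 'b', 'c'], weights=[5, 3, 2], k=10)  # weighted

df = pd.DataFrame({'x': range(100)})
df.sample(n=10)                                  # 10 random rows
df.sample(frac=0.5, replace=True)                # bootstrap half the rows
```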

Reservoir Sampling

Sample k items from a stream of unknown size.

Algorithm: Fill the reservoir with the first k items. For each subsequent item at position i, generate a random integer j between 0 and i (inclusive). If j is less than k, replace the item at position j in the reservoir. This ensures each item has equal probability of being in the final sample, regardless of stream size.
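
A minimal sketch of the algorithm (function name is illustrative):

```python
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randint(0, i)     # random index in [0, i] inclusive
            if j < k:
                reservoir[j] = item      # keep item with probability k/(i+1)
    return reservoir

reservoir_sample(range(1_000_000), 5)    # uniform 5-item sample from a stream
```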

Pivot Tables

Basic pivot: Use pivot_table() with values (column to aggregate), index (row labels), columns (column labels), and aggfunc (aggregation function) to reshape data.

Multiple aggregations: Pass a list of aggregation functions to aggfunc to compute multiple statistics in one pivot table.
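
Both pivot patterns on a hypothetical region/product table:

```python
import pandas as pd

df = pd.DataFrame({'region': ['N', 'N', 'S', 'S'],
                   'product': ['a', 'b', 'a', 'b'],
                   'sales': [10, 20, 30, 40]})

pd.pivot_table(df, values='sales', index='region',
               columns='product', aggfunc='sum')

# Multiple statistics in one pivot table
pd.pivot_table(df, values='sales', index='region',
               aggfunc=['mean', 'sum'])
```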

Performance Optimization

Avoid DataFrame Iteration

Avoid row iteration: Using iterrows() in a loop is slow because it creates a new Series for each row.

Preferred vectorized approach: Apply operations directly to columns. pandas executes these operations in optimized compiled code, often 100x faster than row-by-row iteration.
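
A sketch contrasting the two styles (columns are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'price': [10.0, 20.0], 'qty': [3, 4]})

# Slow: row-by-row iteration builds a Series per row
totals = [row['price'] * row['qty'] for _, row in df.iterrows()]

# Fast: vectorized column arithmetic
df['total'] = df['price'] * df['qty']
```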

Data Type Optimization

Convert to categorical: Use astype('category') for columns with repeated string values. Categorical storage uses integer codes internally, reducing memory usage significantly.

Downcast numeric types: Use pd.to_numeric with downcast parameter to automatically select the smallest integer or float type that can represent the data.
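
A short sketch of both optimizations (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC'] * 1000,
                   'count': [1, 2, 3] * 1000})

df['city'] = df['city'].astype('category')            # integer codes internally
df['count'] = pd.to_numeric(df['count'], downcast='integer')
df.memory_usage(deep=True)                            # verify the savings
```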

Chunked Reading

For large files that exceed available memory:

Chunked processing: Use the chunksize parameter in read_csv to read the file in smaller pieces. Iterate through chunks, process each one, collect results, then concatenate. This allows processing files larger than available RAM.
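
A minimal sketch under assumptions: the file name large_file.csv and the key/value columns are hypothetical.

```python
import pandas as pd

results = []
# Read the CSV in 100k-row pieces instead of all at once
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    results.append(chunk.groupby('key')['value'].sum())

# Combine the per-chunk partial sums into final totals
combined = pd.concat(results).groupby(level=0).sum()
```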

Practice Problem Types

Problem | Concepts Tested
Calculate user retention by cohort | groupby, joins, date manipulation
Find top N items per category | groupby, sorting, rank
Detect anomalies in time series | rolling statistics, z-scores
Implement A/B test analysis | statistics, group comparison
Clean and merge multiple data sources | data wrangling, joins
Compute similarity between users | matrix operations, cosine similarity