Feature Engineering
Feature engineering transforms raw data into representations suitable for machine learning models. Feature quality often determines model performance more than algorithm selection.
Numerical Features
Scaling
Standardization (z-score): X_scaled = (X - mean) / std
Rescales each feature to have mean 0 and standard deviation 1.
Min-Max scaling: X_scaled = (X - min) / (max - min)
Scales data to the range [0, 1].
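A minimal sketch of both scalings with scikit-learn; the array `X` is made-up placeholder data. In practice, fit the scaler on the training split only and reuse it on validation and test splits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Placeholder numerical data: 3 samples, 2 features.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Standardization: each column rescaled to mean 0, std 1.
X_std = StandardScaler().fit_transform(X)

# Min-max scaling: each column mapped to [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)
```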
Transformations
| Transform | Use Case |
|---|---|
| Log transform | Compressing right-skewed distributions |
| Power transform (Box-Cox, Yeo-Johnson) | Making distributions more Gaussian |
| Binning | Converting continuous values into discrete categories |
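A short sketch of all three transforms with pandas and scikit-learn, on a made-up right-skewed series:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed feature (e.g., purchase amounts).
amounts = pd.Series([1.0, 2.0, 5.0, 10.0, 100.0, 1000.0])

# Log transform: log1p handles zeros safely.
log_amounts = np.log1p(amounts)

# Power transform: Yeo-Johnson also accepts non-positive values
# (use method="box-cox" for strictly positive data).
pt = PowerTransformer(method="yeo-johnson")
yj_amounts = pt.fit_transform(amounts.to_frame())

# Binning: convert continuous values into categorical ranges.
bins = pd.cut(amounts, bins=[0, 5, 50, np.inf],
              labels=["low", "mid", "high"])
```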
Categorical Features
Encoding Methods
| Method | Use Case | Characteristics |
|---|---|---|
| One-hot | Low cardinality | No ordering assumed |
| Label encoding | Ordinal categories | Compact representation |
| Target encoding | High cardinality | Uses target information |
| Embedding | Very high cardinality | Learned representations |
One-hot encoding: Convert each category into a binary column. A feature with k distinct categories becomes k binary features (or k-1 if one redundant column is dropped).
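A minimal pandas sketch; the `color` column is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One binary column per category (k columns for k categories).
onehot = pd.get_dummies(df["color"], prefix="color")

# drop_first=True keeps k-1 columns, dropping one redundant indicator.
onehot_compact = pd.get_dummies(df["color"], prefix="color", drop_first=True)
```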
Target encoding: Replace each category with the mean target value for that category. Requires care to prevent data leakage: compute the means on training data only, ideally out-of-fold.
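A leakage-aware sketch, assuming a hypothetical `city` column and binary `target`; the means come from training rows only, and unseen categories fall back to the global mean:

```python
import pandas as pd

# Hypothetical training data: a high-cardinality category and a binary target.
train = pd.DataFrame({
    "city":   ["a", "a", "b", "b", "c"],
    "target": [1,   0,   1,   1,   0],
})

# Per-category target means, computed on training rows only.
means = train.groupby("city")["target"].mean()
global_mean = train["target"].mean()

def target_encode(series, means, fallback):
    # Unseen categories at inference time map to the global mean.
    return series.map(means).fillna(fallback)

train["city_enc"] = target_encode(train["city"], means, global_mean)
```

A stricter variant computes each training row's encoding out-of-fold, so no row's own target value contributes to its encoding.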
Feature Creation
Aggregations
Group data by entity (e.g., user_id) and compute aggregate statistics:
| Aggregation | Example |
|---|---|
| Mean | Average purchase amount per user |
| Sum | Total purchase amount per user |
| Count | Number of purchases per user |
| Max | Most recent purchase date |
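A sketch of these aggregations using pandas named aggregation; the transaction log is fabricated for illustration:

```python
import pandas as pd

# Hypothetical transaction log.
purchases = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "amount":  [10.0, 20.0, 5.0, 5.0, 40.0],
    "date":    pd.to_datetime(["2024-01-01", "2024-02-01",
                               "2024-01-15", "2024-03-01", "2024-03-10"]),
})

# One row per user with aggregate statistics.
user_features = purchases.groupby("user_id").agg(
    mean_amount=("amount", "mean"),    # average purchase amount
    total_amount=("amount", "sum"),    # total purchase amount
    n_purchases=("amount", "count"),   # number of purchases
    last_purchase=("date", "max"),     # most recent purchase date
)
```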
Interactions
Create new features by combining existing features:
| Interaction | Formula | Interpretation |
|---|---|---|
| Ratio | price / sqft | Price per square foot |
| Product | age * income | Captures joint effect of age and income |
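Both interactions in pandas, on made-up housing-style data:

```python
import pandas as pd

# Hypothetical rows with the columns named in the table above.
df = pd.DataFrame({
    "price":  [300000.0, 450000.0],
    "sqft":   [1500.0, 2000.0],
    "age":    [30, 45],
    "income": [60000.0, 90000.0],
})

# Ratio feature: price per square foot.
df["price_per_sqft"] = df["price"] / df["sqft"]

# Product feature: joint effect of age and income.
df["age_x_income"] = df["age"] * df["income"]
```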
Time Features
Extract components from datetime fields:
| Feature | Extraction | Use Case |
|---|---|---|
| hour | Hour of day (0-23) | Time-of-day patterns |
| day_of_week | Day (0=Monday to 6=Sunday) | Weekly patterns |
| is_weekend | True if Saturday or Sunday | Weekend vs weekday behavior |
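The three extractions via the pandas `.dt` accessor, on made-up timestamps:

```python
import pandas as pd

events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-03-01 08:30", "2024-03-02 22:15"]),
})

ts = events["timestamp"].dt
events["hour"] = ts.hour                  # 0-23
events["day_of_week"] = ts.dayofweek      # 0=Monday .. 6=Sunday
events["is_weekend"] = ts.dayofweek >= 5  # Saturday or Sunday
```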
Feature Selection
Filter Methods
| Method | Description |
|---|---|
| Correlation | Strength of linear relationship with the target |
| Mutual information | Non-linear dependency measure |
| Chi-squared test | Categorical feature significance |
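A filter-method sketch using mutual information via scikit-learn's `SelectKBest`; the data is synthetic and the choice of `k=3` is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data stands in for a real feature matrix.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Score each feature by mutual information with the target; keep the top 3.
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_selected = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of kept features
```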
Wrapper Methods
| Method | Description |
|---|---|
| Forward selection | Add features incrementally |
| Backward elimination | Remove features incrementally |
| Recursive feature elimination | Iteratively remove least important |
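A wrapper-method sketch using scikit-learn's `RFE` with a logistic regression estimator; the data and feature counts are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Recursive feature elimination: repeatedly fit the model and
# drop the least important feature until 3 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of selected features
print(rfe.ranking_)   # 1 = selected; higher = eliminated earlier
```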
Embedded Methods
| Method | Description |
|---|---|
| Lasso (L1) regularization | Zeros out unimportant weights |
| Tree-based importance | Built-in feature ranking |
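Both embedded approaches in scikit-learn; the regression data and the `alpha` value are illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=3, noise=1.0, random_state=0)

# L1 regularization drives weak coefficients exactly to zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print(np.flatnonzero(lasso.coef_))  # features with nonzero weights

# Tree ensembles expose a built-in importance ranking.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(np.argsort(forest.feature_importances_)[::-1])  # most to least important
```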
Handling Missing Values
| Strategy | Use Case |
|---|---|
| Drop rows | Few missing values, random missingness |
| Mean/median | Numerical features, MCAR (missing completely at random) |
| Mode | Categorical features |
| Model-based | Complex missingness patterns |
| Indicator | Missingness is informative |
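A sketch combining median imputation with a missingness indicator; the `income` column is made up. The indicator is created before imputation so the missingness signal survives:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [50000.0, np.nan, 70000.0, np.nan]})

# Indicator first: preserves the fact that the value was missing.
df["income_missing"] = df["income"].isna().astype(int)

# Median imputation for the numerical feature.
imputer = SimpleImputer(strategy="median")
df["income"] = imputer.fit_transform(df[["income"]]).ravel()
```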
Reference
| Topic | Guidance |
|---|---|
| High-cardinality categoricals | Target encoding, frequency encoding, or embeddings. One-hot encoding creates excessive dimensionality. Target encoding requires leakage prevention. |
| Target encoding risks | Leakage if encodings are computed on the same rows used for model fitting. Compute encodings out-of-fold or on a separate holdout set. |
| Feature selection approach | Start with domain knowledge. Use correlation with target, tree-based importance, or regularization to identify weak features. Remove and validate. |
| Missing value handling | Depends on missingness mechanism. Random: impute. Non-random: create indicator. Excessive missingness: drop feature. |
| Domain-specific features | Consider signals predictive of the outcome. E-commerce example: purchase recency, price sensitivity, category preferences, time-of-day patterns. |