Update 2025-06-16:
- Elaborated on the concept of SVD and added a code example.
- Expanded the section on arbitrary value imputation with explanation and code example.
Effective data preparation is essential for building robust machine learning models. This document summarizes and elaborates on the key techniques involved in preparing data for supervised and unsupervised learning tasks.
1. Understanding Types of Data
There are two broad types of data: qualitative data describes the characteristics of an object, while quantitative data describes its quantity.
Qualitative (Categorical) Data
- Nominal: Named categories with no order (e.g., gender, country).
- Cannot perform arithmetic operations.
- Encoded using one-hot or label encoding.
- Ordinal: Categories with a natural order (e.g., satisfaction: low, medium, high).
- Often encoded with an integer mapping that preserves the order (see the encoding sketch after this list).
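As a quick, hedged illustration (the column names and categories below are made up), nominal features can be one-hot encoded with pandas, while ordinal features can be mapped to integers that preserve their order:
import pandas as pd
# Hypothetical example data
df_cat = pd.DataFrame({
    "country": ["US", "DE", "JP"],             # nominal: no order
    "satisfaction": ["low", "high", "medium"]  # ordinal: low < medium < high
})
# Nominal: one-hot encoding (one binary column per category)
one_hot = pd.get_dummies(df_cat["country"], prefix="country")
# Ordinal: integer mapping that preserves the natural order
satisfaction_order = {"low": 0, "medium": 1, "high": 2}
df_cat["satisfaction_encoded"] = df_cat["satisfaction"].map(satisfaction_order)
print(one_hot)
print(df_cat)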
Quantitative (Numerical) Data
- Interval: Numeric values with meaningful differences, but no true zero (e.g., temperature in Celsius).
- Can compute mean, median, std deviation.
- Ratio: Numeric values with a true zero (e.g., income, age).
- All arithmetic operations valid.
Discrete vs. Continuous Attributes
- Discrete: Countable values (e.g., number of children).
- Continuous: Infinite values within a range (e.g., height, weight).
2. Exploring and Summarizing Data
Once we have obtained data from the real world (data collection), we need to explore and summarize it (data analysis). Visualization is often used at this stage to understand the data distribution and spread.
Measures of Central Tendency
- Mean: Sensitive to outliers.
- Median: Robust to outliers, useful in skewed data.
- Mode: Most frequently occurring value.
Measures of Spread
- Variance & Standard Deviation: Show how data is distributed around the mean.
- Range, Quartiles, IQR: Help detect outliers and data skew (a short pandas sketch follows).
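As a small, hedged sketch (the numbers are made up), pandas computes these summaries directly:
import pandas as pd
values = pd.Series([2, 3, 3, 4, 5, 6, 40])   # small made-up sample with one extreme value
print(values.mean())                 # mean: pulled upward by the extreme value
print(values.median())               # median: robust to it
print(values.mode().iloc[0])         # mode: most frequent value
print(values.var(), values.std())    # spread around the mean
q1, q3 = values.quantile([0.25, 0.75])
print(q3 - q1)                       # interquartile range (IQR)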
3. Visualizing Data
There are many ways to visualize data; here are some common ones. The examples later in this document use the matplotlib and seaborn libraries for plotting.
Box Plot
- Visualizes five-number summary: min, Q1, median, Q3, max.
- Highlights outliers beyond 1.5 × IQR.
Histogram
- Shows frequency distribution.
- Helps identify skewness, modality, and spread.
Scatter Plot
- Used for bivariate relationships.
- Reveals correlation and patterns between two variables.
Cross-tabulation
- Used to explore relationships between categorical variables.
- Displays frequency distribution in a matrix.
4. Data Quality Issues and Remediation
Handling missing values and outliers is an important step in data preparation. Real-world data is, more often than not, far from perfect; missing values, outliers, and other quality issues need to be addressed at this stage for effective machine learning.
Missing Values
- Causes: Survey non-responses, manual entry errors, data corruption.
- Remedies:
- Deletion: Remove rows/columns with missing data (only if safe).
- Imputation:
- Mean/Median (numerical)
- Mode (categorical)
- Group-based imputation (e.g., by similar rows)
- Model-based estimation: Use predictive modeling or similarity functions.
Deletion is often applied when little information is lost by dropping some rows of data; in pandas this is typically done with the dropna() method. Imputation, on the other hand, is often the more practical choice: it preserves important data attributes by supplying substitute values for the missing entries, ideally without distorting the data distribution.
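A minimal sketch of both approaches on a made-up frame:
import numpy as np
import pandas as pd
df_miss = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
# Deletion: drop every row that contains a missing value
dropped = df_miss.dropna()
# Imputation: fill missing entries, here with each column's median
imputed = df_miss.fillna(df_miss.median(numeric_only=True))
print(dropped)
print(imputed)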
What is Imputation?
Imputation is the process of replacing missing data with substituted values. It’s crucial because most ML algorithms cannot handle missing values directly.
Common Imputation Methods:
- Mean/Median Imputation: Replace missing values with the mean or median of the column.
- Best for: Normally distributed data without outliers
- When to use: When data is missing completely at random
- Group-based Imputation: Replace missing values with the mean/median of a group
- Best for: When data has meaningful groups
- Example: Filling missing horsepower based on car cylinder count
- KNN Imputation: Use k-nearest neighbors to impute missing values
- Best for: When patterns exist in the data
- Most accurate but computationally expensive
- Arbitrary Value Imputation: Replace missing values with a distinct value like -999 or 9999
- Best for: Tree-based models (Decision Trees, Random Forests, XGBoost)
- When to use:
- When missingness itself might be informative
- When you want to highlight that the value was originally missing
- When missing values have a specific meaning (e.g., “not measured” vs “zero”)
- Why make them stand out?:
- Preserves information: The model can learn that “missing” is different from any actual value
- Pattern detection: The model might discover that missing values correlate with the target variable
- Prevents false patterns: Using mean/median might create artificial patterns that don’t exist
- Handles MNAR: Particularly useful when data is Missing Not At Random (MNAR)
- Example: In credit scoring, if income is missing, it might indicate self-employment or other special cases
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Sample credit scoring data
data = {
'age': [25, 30, 35, 40, 45, 50, 55, 60],
'income': [50000, 75000, np.nan, 90000, np.nan, 120000, 150000, np.nan],
'credit_score': [650, 700, 720, 680, 800, 750, 820, 780],
'default': [0, 0, 1, 0, 0, 1, 1, 0] # 1 = default, 0 = no default
}
df = pd.DataFrame(data)
# Before imputation
print("Before imputation:")
print(df)
# Using -999 as the arbitrary value for missing income
ARBITRARY_VALUE = -999
df_imputed = df.fillna(ARBITRARY_VALUE)
print("\nAfter imputation:")
print(df_imputed)
# Train a simple Random Forest to see how it handles the arbitrary value
X = df_imputed[['age', 'income', 'credit_score']]
y = df_imputed['default']
# The model will treat -999 as a special category
model = RandomForestClassifier(random_state=42)
model.fit(X, y)
# Feature importance will show how much the 'missingness' matters
feature_importance = pd.DataFrame({
'feature': X.columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature importance:")
print(feature_importance)
This example shows how missing income values are replaced with -999, and how a tree-based model can use this information to make predictions. The feature importance output will show how much the model relies on the ‘missingness’ of income as a predictive feature.
Outliers
Outliers are data points significantly different from other observations. They can be caused by measurement errors, data entry errors, or natural variations.
Impact of Outliers:
- Can skew statistical measures
- May affect model performance
- Can cause models to be overly influenced by extreme values
Detection Methods:
- IQR Method:
- Calculate Q1 (25th percentile) and Q3 (75th percentile)
- IQR = Q3 - Q1
- Lower bound = Q1 - 1.5*IQR
- Upper bound = Q3 + 1.5*IQR
- Points outside these bounds are considered outliers
- Z-score Method:
- Calculate z-scores: z = (x - mean) / std
- Points with |z| > 3 are typically considered outliers (see the detection sketch after this list)
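A minimal sketch of both detection rules on synthetic data (the 1.5 and 3 thresholds are conventions, not fixed laws):
import numpy as np
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(12, 1, 50), [95.0]])   # 50 ordinary points plus one extreme value
# IQR rule
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
# Z-score rule
z = (x - x.mean()) / x.std()
z_outliers = x[np.abs(z) > 3]
print("IQR rule flags:", iqr_outliers)
print("Z-score rule flags:", z_outliers)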
Handling Techniques:
- Capping (Winsorization): Replace outliers with the nearest non-outlier value (see the sketch after this list)
- Transformation: Apply log, square root, or other transformations
- Removal: If outliers are errors or not representative
- Separate Modeling: Create a separate model for outliers
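A minimal sketch of capping and a log transform on a made-up array:
import numpy as np
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
# Capping (winsorization): clip values to the IQR bounds
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
x_capped = np.clip(x, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
# Transformation: log1p compresses the influence of large values
x_logged = np.log1p(x)
print(x_capped)
print(x_logged.round(2))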
5. Feature Scaling
Many machine learning algorithms perform better or converge faster when features are on similar scales. Scaling also ensures that features with different magnitudes do not dominate model learning.
Standardization (Z-score)
\(x' = \frac{x - \mu}{\sigma}\)
- Centers data at 0 mean and unit variance.
- Used when data has outliers or normal distribution.
Normalization (Min-Max Scaling)
\(x' = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}\)
- Scales features to [0, 1] range.
- Sensitive to outliers.
Robust Scaling
- Uses median and IQR
- Formula: \(x' = \frac{x - \text{median}}{\text{IQR}}\)
- Best for: Data with outliers (a comparison sketch of the three scalers follows)
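A minimal comparison of the three scalers on a made-up column with one outlier; scikit-learn's RobustScaler implements the median/IQR formula above.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # single feature with an outlier
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(X).ravel().round(2)
    print(type(scaler).__name__, scaled)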
When to Scale?
- Required for:
- Distance-based algorithms (KNN, K-means, SVM with RBF kernel)
- Neural networks
- Regularized models (Ridge, Lasso)
- PCA
- Not needed for:
- Tree-based models (Decision Trees, Random Forest, XGBoost)
- Naive Bayes
6. Dimensionality Reduction
Reduces the number of features while preserving important information.
Why Reduce Dimensions?
- Curse of Dimensionality: As dimensions increase, data becomes sparse
- Reduces Overfitting: Fewer features mean fewer parameters to learn
- Speeds Up Training: Less computation required
- Improves Visualization: Easier to visualize 2D or 3D data
Principal Component Analysis (PCA)
- Projects data onto principal components that maximize variance
- Steps (see the NumPy sketch after this list):
- Standardize the data
- Calculate covariance matrix
- Calculate eigenvectors and eigenvalues
- Select top k eigenvectors
- Transform data to new space
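To make the steps concrete, a minimal NumPy sketch on a tiny made-up matrix (scikit-learn's PCA, used later in this document, wraps the same computation):
import numpy as np
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
# 1. Standardize (here: center the data; divide by the std as well for full standardization)
X_centered = X - X.mean(axis=0)
# 2. Covariance matrix
cov = np.cov(X_centered, rowvar=False)
# 3. Eigenvectors and eigenvalues
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# 4. Select the top k eigenvectors (largest eigenvalues first)
order = np.argsort(eigenvalues)[::-1]
top_k = eigenvectors[:, order[:1]]          # k = 1
# 5. Transform the data into the new space
X_reduced = X_centered @ top_k
print(X_reduced.round(3))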
When to Use:
- When features are correlated
- For visualization
- Before training models with many features
- For noise reduction
SVD (Singular Value Decomposition)
- Matrix factorization method that decomposes a matrix A into three matrices: A = UΣVᵀ (see the NumPy sketch below)
- U: Left singular vectors (orthonormal)
- Σ: Diagonal matrix of singular values (in descending order)
- V: Right singular vectors (orthonormal)
- Key properties:
- Captures latent features in data
- Preserves maximum variance with fewer dimensions
- Used in recommendation systems, image compression, and NLP
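A minimal NumPy sketch of the decomposition itself (the TruncatedSVD example further below applies the same idea to a document-term matrix):
import numpy as np
A = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])
# Thin SVD: A = U Σ Vᵀ
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True: the factors reconstruct A
# Rank-1 approximation: keep only the largest singular value
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0])
print(A_rank1.round(2))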
Example: Document-Term Matrix
Consider a simplified document-term matrix showing word counts in documents:
| | movie | film | show | book | read |
|---|---|---|---|---|---|
| Doc1 | 2 | 1 | 0 | 3 | 2 |
| Doc2 | 1 | 1 | 2 | 1 | 1 |
| Doc3 | 0 | 0 | 1 | 2 | 3 |
After applying SVD (using 2 components), we can identify:
- Latent topics: Underlying themes not directly visible in the raw data
- Example: Words like “movie”, “film”, “show” might form an “entertainment” topic
- Example: Words like “book”, “read” might form a “reading” topic
- These topics emerge from patterns in how words co-occur across documents
- Document similarity based on topic distribution
- Important terms that define each topic
Understanding Latent Topics:
- They represent hidden themes that aren’t explicitly labeled in the data
- Each document can be a mixture of multiple topics
- They help uncover the underlying structure in text data
- Useful for organizing, searching, and analyzing large text collections
Example Interpretation:
- If a document scores high on the “entertainment” topic and low on “reading”, it likely discusses movies/shows rather than books
- The strength of topic associations helps in document clustering and recommendation systems
Python Implementation
from sklearn.decomposition import TruncatedSVD
import numpy as np
# Example document-term matrix
dtm = np.array([
[2, 1, 0, 3, 2],
[1, 1, 2, 1, 1],
[0, 0, 1, 2, 3]
])
# Apply SVD with 2 components
svd = TruncatedSVD(n_components=2)
svd.fit(dtm)
# Transformed data (documents in latent space)
transformed = svd.transform(dtm)
print("Transformed document vectors:")
print(transformed)
# Explained variance ratio
print("\nExplained variance ratio:", svd.explained_variance_ratio_)
When to Use:
- For high-dimensional data reduction
- In recommendation systems (collaborative filtering)
- For document clustering and topic modeling
- When you need to capture latent relationships in data
7. Feature Selection
Selects the most relevant subset of features to:
- Reduce overfitting
- Improve model interpretability
- Lower computational cost
Types of Features:
- Irrelevant: Contribute no predictive power.
- Redundant: Duplicate information from other features.
Methods (one example of each family is sketched after this list):
- Filter Methods:
- Select features based on statistical tests
- Example: Correlation coefficient, Chi-square test
- Fast but doesn’t consider feature interactions
- Wrapper Methods:
- Use a subset of features and train a model
- Example: Recursive Feature Elimination (RFE)
- Computationally expensive but more accurate
- Embedded Methods:
- Feature selection as part of the model training process
- Example: Lasso regression, Decision trees
- Efficient and accurate but model-specific
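A minimal sketch of one method from each family on synthetic regression data:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression, Lasso
X, y = make_regression(n_samples=100, n_features=8, n_informative=3, random_state=42)
# Filter: univariate F-test, no model involved
filter_selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)
print("Filter keeps features:", np.where(filter_selector.get_support())[0])
# Wrapper: recursive feature elimination around a linear model
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
print("Wrapper keeps features:", np.where(rfe.get_support())[0])
# Embedded: Lasso zeroes out weak coefficients during training
lasso = Lasso(alpha=1.0).fit(X, y)
print("Embedded keeps features:", np.where(lasso.coef_ != 0)[0])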
Summary Table
Task | Technique |
---|---|
Identify variable types | Nominal, Ordinal, Interval, Ratio |
Summarize numeric data | Mean, Median, Std Dev, IQR |
Visualize data | Histogram, Box plot, Scatter plot |
Handle missing values | Drop, Impute, Predict |
Treat outliers | Remove, Cap, Investigate |
Scale features | Standardize, Normalize |
Reduce dimensions | PCA, SVD |
Select features | Filter, Wrapper, Embedded Methods |
This notebook illustrates key technical points from data preparation for machine learning using a fake dataset about car attributes and fuel efficiency.
Example Dataset
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression
# Fake dataset
data = {
"car_name": ["car_a", "car_b", "car_c", "car_d", "car_e", "car_f"],
"cylinders": [4, 6, 8, 4, 4, 8],
"displacement": [140, 200, 360, 150, 130, 3700],
"horsepower": [90, 105, 215, 92, np.nan, 220], # np (numpy - numeric python - library for scientific computing. nan: not a number/null)
"weight": [2400, 3000, 4300, 2500, 2200, 4400],
"acceleration": [15.5, 14.0, 12.5, 16.0, 15.0, 11.0],
"model_year": [80, 78, 76, 82, 81, 77],
"origin": [1, 1, 1, 2, 3, 1],
"mpg": [30.5, 24.0, 13.0, 29.5, 32.0, 10.0]
}
df = pd.DataFrame(data)
df
| | car_name | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | mpg |
|---|---|---|---|---|---|---|---|---|---|
| 0 | car_a | 4 | 140 | 90.0 | 2400 | 15.5 | 80 | 1 | 30.5 |
| 1 | car_b | 6 | 200 | 105.0 | 3000 | 14.0 | 78 | 1 | 24.0 |
| 2 | car_c | 8 | 360 | 215.0 | 4300 | 12.5 | 76 | 1 | 13.0 |
| 3 | car_d | 4 | 150 | 92.0 | 2500 | 16.0 | 82 | 2 | 29.5 |
| 4 | car_e | 4 | 130 | NaN | 2200 | 15.0 | 81 | 3 | 32.0 |
| 5 | car_f | 8 | 3700 | 220.0 | 4400 | 11.0 | 77 | 1 | 10.0 |
Data Types
- car_name: Nominal (categorical)
- cylinders, origin: Ordinal/Categorical
- displacement, horsepower, weight, acceleration, mpg: Ratio (numeric)
- model_year: Interval
Handling Missing Values
# 1. Handling Missing Values Example
print("=== Missing Values Before Imputation ===")
print(df.isna().sum())
# Mean imputation
mean_imputer = SimpleImputer(strategy='mean')
df['horsepower_mean'] = mean_imputer.fit_transform(df[['horsepower']])
# Group-based imputation
group_means = df.groupby('cylinders')['horsepower'].transform('mean')
df['horsepower_group'] = df['horsepower'].fillna(group_means)
# KNN imputation (with only this single column available, the missing entry falls back to the column mean)
knn_imputer = KNNImputer(n_neighbors=2)
df['horsepower_knn'] = knn_imputer.fit_transform(df[['horsepower']])
print("\n=== After Imputation ===")
df[['horsepower', 'horsepower_mean', 'horsepower_group', 'horsepower_knn']]
=== Missing Values Before Imputation ===
car_name 0
cylinders 0
displacement 0
horsepower 1
weight 0
acceleration 0
model_year 0
origin 0
mpg 0
dtype: int64
=== After Imputation ===
| | horsepower | horsepower_mean | horsepower_group | horsepower_knn |
|---|---|---|---|---|
| 0 | 90.0 | 90.0 | 90.0 | 90.0 |
| 1 | 105.0 | 105.0 | 105.0 | 105.0 |
| 2 | 215.0 | 215.0 | 215.0 | 215.0 |
| 3 | 92.0 | 92.0 | 92.0 | 92.0 |
| 4 | NaN | 144.4 | 91.0 | 144.4 |
| 5 | 220.0 | 220.0 | 220.0 | 220.0 |
Handling Outliers
# 2. Handling Outliers Example
def detect_and_handle_outliers(df, column):
# Calculate IQR
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Detect outliers
outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
print(f'Detected {len(outliers)} outliers in {column}')
# Visualize before and after
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.boxplot(y=df[column])
plt.title(f'Original {column}')
# Capping outliers
df[f'{column}_capped'] = np.where(df[column] > upper_bound, upper_bound,
np.where(df[column] < lower_bound, lower_bound, df[column]))
plt.subplot(1, 2, 2)
sns.boxplot(y=df[f'{column}_capped'])
plt.title(f'After Capping {column}')
plt.tight_layout()
plt.show()
return df
df = detect_and_handle_outliers(df, 'displacement')
Detected 1 outliers in displacement
Feature Scaling (Standardization)
# 3. Feature Scaling Example
# Original data
numeric_cols = ['weight', 'acceleration', 'displacement']
print('Original data:')
print(df[numeric_cols].head())
# Standardization
scaler = StandardScaler()
df_std = df.copy()
df_std[numeric_cols] = scaler.fit_transform(df[numeric_cols])
# Min-Max Scaling
minmax = MinMaxScaler()
df_minmax = df.copy()
df_minmax[numeric_cols] = minmax.fit_transform(df[numeric_cols])
print('\nStandardized data (mean=0, std=1):')
print(df_std[numeric_cols].head())
print('Min-Max Scaled data (range [0,1]):')
print(df_minmax[numeric_cols].head())
Original data:
weight acceleration displacement
0 2400 15.5 140
1 3000 14.0 200
2 4300 12.5 360
3 2500 16.0 150
4 2200 15.0 130
Standardized data (mean=0, std=1):
weight acceleration displacement
0 -0.820462 0.854242 -0.489225
1 -0.149175 0.000000 -0.443360
2 1.305280 -0.854242 -0.321054
3 -0.708580 1.138990 -0.481581
4 -1.044224 0.569495 -0.496869
Min-Max Scaled data (range [0,1]):
weight acceleration displacement
0 0.090909 0.9 0.002801
1 0.363636 0.6 0.019608
2 0.954545 0.3 0.064426
3 0.136364 1.0 0.005602
4 0.000000 0.8 0.000000
Box Plot Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8, 5))
sns.boxplot(data=df[['mpg', 'weight', 'acceleration']])
plt.title("Box Plot of Numeric Features")
plt.show()
Histogram
df[['acceleration']].hist(bins=5, figsize=(6, 4))
plt.title("Histogram of Acceleration")
plt.show()
Scatter Plot
sns.scatterplot(x='weight', y='mpg', data=df)
plt.title("Scatter Plot: Weight vs MPG")
plt.show()
Cross-Tabulation
pd.crosstab(df['origin'], df['cylinders'])
| origin \ cylinders | 4 | 6 | 8 |
|---|---|---|---|
| 1 | 1 | 1 | 2 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 |
Dimensionality Reduction (PCA)
# 4. Dimensionality Reduction Example
# Prepare data for PCA
X = df[['weight', 'acceleration', 'displacement_capped']]
y = df['mpg']
# Standardize the data first
X_scaled = StandardScaler().fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Create a new dataframe for the principal components
df_pca = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
df_pca['mpg'] = y.values
# Plot the results
plt.figure(figsize=(8, 6))
scatter = plt.scatter(df_pca['PC1'], df_pca['PC2'], c=df_pca['mpg'], cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.colorbar(scatter, label='MPG')
plt.title('PCA of Car Features')
plt.show()
print(f'Explained variance ratio: {pca.explained_variance_ratio_}')
print(f'Total explained variance: {sum(pca.explained_variance_ratio_):.2%}')
Explained variance ratio: [0.95929265 0.02632386]
Total explained variance: 98.56%
Feature Selection
We might drop car_name or model_year if they are found irrelevant using feature importance techniques.
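As a rough, hedged illustration only (six rows is far too little data for reliable selection), the selectors already imported above can rank the numeric columns against mpg:
# Rough illustration only: the toy dataset is far too small for reliable selection
features = ['cylinders', 'displacement_capped', 'horsepower_mean',
            'weight', 'acceleration', 'model_year']
X = df[features]
y = df['mpg']
# Filter method: univariate F-test against mpg
skb = SelectKBest(score_func=f_regression, k=3).fit(X, y)
print("SelectKBest keeps:", [f for f, keep in zip(features, skb.get_support()) if keep])
# Wrapper method: recursive feature elimination with linear regression
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
print("RFE keeps:", [f for f, keep in zip(features, rfe.get_support()) if keep])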