Key Techniques in Data Preparation for Machine Learning

06-13-2025

Update 2025-06-16:

  • Elaborated on the concept of SVD and added a code example.
  • Expanded the section on arbitrary value imputation with an explanation and a code example.

Effective data preparation is essential for building robust machine learning models. This document summarizes and elaborates on the key techniques involved in preparing data for supervised and unsupervised learning tasks.

1. Understanding Types of Data

There are two broad types of data: qualitative data describes the characteristics of an object, while quantitative data describes its quantities as numeric measurements.

Qualitative (Categorical) Data

  • Nominal: Named categories with no order (e.g., gender, country).
    • Cannot perform arithmetic operations.
    • Encoded using one-hot or label encoding.
  • Ordinal: Categories with a natural order (e.g., satisfaction: low, medium, high).
    • Often encoded with integer mapping, preserving order.
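
A minimal pandas sketch of both encodings (the column names here are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "country": ["US", "DE", "JP"],               # nominal
    "satisfaction": ["low", "high", "medium"],   # ordinal
})

# One-hot encode the nominal column (no order implied)
df_encoded = pd.get_dummies(df, columns=["country"])

# Map the ordinal column to integers that preserve its order
df_encoded["satisfaction"] = df["satisfaction"].map({"low": 0, "medium": 1, "high": 2})

print(df_encoded)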

Quantitative (Numerical) Data

  • Interval: Numeric values with meaningful differences, but no true zero (e.g., temperature in Celsius).
    • Can compute mean, median, std deviation.
  • Ratio: Numeric values with a true zero (e.g., income, age).
    • All arithmetic operations valid.

Discrete vs. Continuous Attributes

  • Discrete: Countable values (e.g., number of children).
  • Continuous: Infinite values within a range (e.g., height, weight).

2. Exploring and Summarizing Data

Once we have obtained data from the real world (data collection), we need to explore and summarize it (data analysis). Visualization is often used at this stage to understand the data distribution and measures of spread.

Measures of Central Tendency

  • Mean: Sensitive to outliers.
  • Median: Robust to outliers, useful in skewed data.
  • Mode: Most frequently occurring value.

Measures of Spread

  • Variance & Standard Deviation: Show how data is distributed around the mean.
  • Range, Quartiles, IQR: Help detect outliers and data skew.
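
As a quick illustration on synthetic values, these summary statistics can be computed directly with pandas:

import pandas as pd

values = pd.Series([2, 3, 3, 4, 5, 5, 5, 40])   # 40 is an outlier

print("Mean:  ", values.mean())     # pulled upward by the outlier
print("Median:", values.median())   # robust to the outlier
print("Mode:  ", values.mode().iloc[0])
print("Std:   ", values.std())
print("Range: ", values.max() - values.min())

q1, q3 = values.quantile(0.25), values.quantile(0.75)
print("IQR:   ", q3 - q1)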

3. Visualizing Data

There are many ways to visualize data; here are some common ones. The examples in this document use the matplotlib and seaborn libraries, and a compact sketch appears at the end of this section.

Box Plot

  • Visualizes five-number summary: min, Q1, median, Q3, max.
  • Highlights outliers beyond 1.5 × IQR.

Histogram

  • Shows frequency distribution.
  • Helps identify skewness, modality, and spread.

Scatter Plot

  • Used for bivariate relationships.
  • Reveals correlation and patterns between two variables.

Cross-tabulation

  • Used to explore relationships between categorical variables.
  • Displays frequency distribution in a matrix.
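
A compact sketch of these four views on synthetic data (matplotlib and pandas; the notebook at the end of the document repeats them on the car dataset):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=200)   # synthetic numeric variable
y = 2 * x + rng.normal(scale=15, size=200)   # a correlated variable

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(x)
axes[0].set_title("Box plot")
axes[1].hist(x, bins=20)
axes[1].set_title("Histogram")
axes[2].scatter(x, y, s=10)
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()

# Cross-tabulation of two categorical variables
cats = pd.DataFrame({"origin": ["US", "US", "EU", "EU", "JP"],
                     "cylinders": [8, 6, 4, 4, 4]})
print(pd.crosstab(cats["origin"], cats["cylinders"]))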

4. Data Quality Issues and Remediation

Handling missing values and outliers is an important step in data preparation. Real-world data are, more often than not, far from perfect: missing values, outliers, and many other issues need to be addressed at this stage for effective machine learning.

Missing Values

  • Causes: Survey non-responses, manual entry errors, data corruption.
  • Remedies:
    • Deletion: Remove rows/columns with missing data (only if safe).
    • Imputation:
      • Mean/Median (numerical)
      • Mode (categorical)
      • Group-based imputation (e.g., by similar rows)
    • Model-based estimation: Use predictive modeling or similarity functions.

Deletion is often applied when little information is lost by dropping some rows of data; in pandas this is typically done with the dropna() method. Imputation, on the other hand, can be the more practical choice: it preserves important data attributes by supplying substitute values for the missing entries, ideally without distorting the data distribution.
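
A minimal sketch contrasting the two approaches on a toy frame (values are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 30, np.nan, 45],
                   "income": [50000, np.nan, 62000, 58000]})

# Deletion: drop any row that contains a missing value
print(df.dropna())

# Imputation: fill missing values with each column's median
print(df.fillna(df.median(numeric_only=True)))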

What is Imputation?

Imputation is the process of replacing missing data with substituted values. It’s crucial because most ML algorithms cannot handle missing values directly.

Common Imputation Methods:

  1. Mean/Median Imputation: Replace missing values with the mean or median of the column.
    • Best for: Normally distributed data without outliers
    • When to use: When data is missing completely at random
  2. Group-based Imputation: Replace missing values with the mean/median of a group
    • Best for: When data has meaningful groups
    • Example: Filling missing horsepower based on car cylinder count
  3. KNN Imputation: Use k-nearest neighbors to impute missing values
    • Best for: When patterns exist in the data
    • Most accurate but computationally expensive
  4. Arbitrary Value Imputation: Replace missing values with a distinct value like -999 or 9999
    • Best for: Tree-based models (Decision Trees, Random Forests, XGBoost)
    • When to use:
      • When missingness itself might be informative
      • When you want to highlight that the value was originally missing
      • When missing values have a specific meaning (e.g., “not measured” vs “zero”)
    • Why make them stand out?:
      1. Preserves information: The model can learn that “missing” is different from any actual value
      2. Pattern detection: The model might discover that missing values correlate with the target variable
      3. Prevents false patterns: Using mean/median might create artificial patterns that don’t exist
      4. Handles MNAR: Particularly useful when data is Missing Not At Random (MNAR)
    • Example: In credit scoring, if income is missing, it might indicate self-employment or other special cases
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Sample credit scoring data
data = {
    'age': [25, 30, 35, 40, 45, 50, 55, 60],
    'income': [50000, 75000, np.nan, 90000, np.nan, 120000, 150000, np.nan],
    'credit_score': [650, 700, 720, 680, 800, 750, 820, 780],
    'default': [0, 0, 1, 0, 0, 1, 1, 0]  # 1 = default, 0 = no default
}

df = pd.DataFrame(data)

# Before imputation
print("Before imputation:")
print(df)

# Using -999 as the arbitrary value for missing income
ARBITRARY_VALUE = -999
df_imputed = df.fillna(ARBITRARY_VALUE)

print("\nAfter imputation:")
print(df_imputed)

# Train a simple Random Forest to see how it handles the arbitrary value
X = df_imputed[['age', 'income', 'credit_score']]
y = df_imputed['default']

# The model will treat -999 as a special category
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Feature importance will show how much the 'missingness' matters
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature importance:")
print(feature_importance)

This example shows how missing income values are replaced with -999, and how a tree-based model can use this information to make predictions. The feature importance output will show how much the model relies on the ‘missingness’ of income as a predictive feature.

Outliers

Outliers are data points significantly different from other observations. They can be caused by measurement errors, data entry errors, or natural variations.

Impact of Outliers:

  • Can skew statistical measures
  • May affect model performance
  • Can cause models to be overly influenced by extreme values

Detection Methods:

  1. IQR Method:
    • Calculate Q1 (25th percentile) and Q3 (75th percentile)
    • IQR = Q3 - Q1
    • Lower bound = Q1 - 1.5*IQR
    • Upper bound = Q3 + 1.5*IQR
    • Points outside these bounds are considered outliers
  2. Z-score Method:
    • Calculate z-scores: z = (x - mean) / std
    • Points with |z| > 3 are typically considered outliers (see the code sketch after the handling list below)

Handling Techniques:

  1. Capping (Winsorization): Replace outliers with the nearest non-outlier value
  2. Transformation: Apply log, square root, or other transformations
  3. Removal: If outliers are errors or not representative
  4. Separate Modeling: Create a separate model for outliers
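
A minimal NumPy/SciPy sketch of the IQR and z-score detection rules followed by capping (synthetic data; scipy is assumed to be available):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=50, scale=5, size=200), 120.0)   # 120 is an injected outlier

# IQR rule
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("IQR outliers:", x[(x < lower) | (x > upper)])

# Z-score rule (|z| > 3)
z = stats.zscore(x)
print("Z-score outliers:", x[np.abs(z) > 3])

# Capping (winsorization): clip values to the IQR bounds
x_capped = np.clip(x, lower, upper)
print("Max before/after capping:", x.max(), x_capped.max())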

5. Feature Scaling

Many machine learning algorithms perform better or converge faster when features are on similar scales. Scaling also ensures that features with different magnitudes do not dominate model learning.

Standardization (Z-score)

\(x' = \frac{x - \mu}{\sigma}\)

  • Centers data at 0 mean and unit variance.
  • Commonly used when features are approximately normally distributed; less sensitive to outliers than min-max scaling.

Normalization (Min-Max Scaling)

\(x' = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}\)

  • Scales features to [0, 1] range.
  • Sensitive to outliers.

Robust Scaling

  • Uses median and IQR
  • Formula: \(x' = \frac{x - \text{median}}{\text{IQR}}\)
  • Best for: Data with outliers
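
A minimal scikit-learn sketch comparing the three scalers on one synthetic column (values are illustrative only):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[10.0], [12.0], [11.0], [13.0], [500.0]])   # 500 is an outlier

for name, scaler in [("standard", StandardScaler()),
                     ("min-max", MinMaxScaler()),
                     ("robust", RobustScaler())]:
    print(name, scaler.fit_transform(X).ravel().round(2))

Min-max scaling squashes the four typical values toward zero because the outlier defines the maximum, while robust scaling keeps them spread out around the median.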

When to Scale?

  • Required for:
    • Distance-based algorithms (KNN, K-means, SVM with RBF kernel)
    • Neural networks
    • Regularized models (Ridge, Lasso)
    • PCA
  • Not needed for:
    • Tree-based models (Decision Trees, Random Forest, XGBoost)
    • Naive Bayes

6. Dimensionality Reduction

Reduces the number of features while preserving important information.

Why Reduce Dimensions?

  1. Curse of Dimensionality: As dimensions increase, data becomes sparse
  2. Reduces Overfitting: Fewer features mean fewer parameters to learn
  3. Speeds Up Training: Less computation required
  4. Improves Visualization: Easier to visualize 2D or 3D data

Principal Component Analysis (PCA)

  • Projects data onto principal components that maximize variance
  • Steps:
    1. Standardize the data
    2. Calculate covariance matrix
    3. Calculate eigenvectors and eigenvalues
    4. Select top k eigenvectors
    5. Transform data to new space
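
A minimal NumPy sketch of these five steps on synthetic data (not a production implementation; sklearn.decomposition.PCA does the same work internally):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # make two features correlated

# 1. Standardize the data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvalues and eigenvectors
eigvals, eigvecs = np.linalg.eigh(cov)            # returned in ascending order

# 4. Select the top k eigenvectors
k = 2
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:k]]

# 5. Transform the data to the new space
X_pca = X_std @ components
print(X_pca.shape)                                # (100, 2)
print(eigvals[order] / eigvals.sum())             # explained variance ratio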

When to Use:

  • When features are correlated
  • For visualization
  • Before training models with many features
  • For noise reduction

SVD (Singular Value Decomposition)

  • Matrix factorization method that decomposes a matrix A into three matrices: A = UΣVᵀ
    • U: Left singular vectors (orthonormal)
    • Σ: Diagonal matrix of singular values (in descending order)
    • V: Right singular vectors (orthonormal)
  • Key properties:
    • Captures latent features in data
    • Preserves maximum variance with fewer dimensions
    • Used in recommendation systems, image compression, and NLP

Example: Document-Term Matrix

Consider a simplified document-term matrix showing word counts in documents:

       movie  film  show  book  read
Doc1       2     1     0     3     2
Doc2       1     1     2     1     1
Doc3       0     0     1     2     3

After applying SVD (using 2 components), we can identify:

  • Latent topics: Underlying themes not directly visible in the raw data
    • Example: Words like “movie”, “film”, “show” might form an “entertainment” topic
    • Example: Words like “book”, “read” might form a “reading” topic
    • These topics emerge from patterns in how words co-occur across documents
  • Document similarity based on topic distribution
  • Important terms that define each topic

Understanding Latent Topics:

  • They represent hidden themes that aren’t explicitly labeled in the data
  • Each document can be a mixture of multiple topics
  • They help uncover the underlying structure in text data
  • Useful for organizing, searching, and analyzing large text collections

Example Interpretation:

  • If a document scores high on the “entertainment” topic and low on “reading”, it likely discusses movies/shows rather than books
  • The strength of topic associations helps in document clustering and recommendation systems

Python Implementation

from sklearn.decomposition import TruncatedSVD
import numpy as np

# Example document-term matrix
dtm = np.array([
    [2, 1, 0, 3, 2],
    [1, 1, 2, 1, 1],
    [0, 0, 1, 2, 3]
])

# Apply SVD with 2 components
svd = TruncatedSVD(n_components=2)
svd.fit(dtm)

# Transformed data (documents in latent space)
transformed = svd.transform(dtm)
print("Transformed document vectors:")
print(transformed)

# Explained variance ratio
print("\nExplained variance ratio:", svd.explained_variance_ratio_)

When to Use:

  • For high-dimensional data reduction
  • In recommendation systems (collaborative filtering)
  • For document clustering and topic modeling
  • When you need to capture latent relationships in data

7. Feature Selection

Selects the most relevant subset of features to:

  • Reduce overfitting
  • Improve model interpretability
  • Lower computational cost

Types of Features:

  • Irrelevant: Contribute no predictive power.
  • Redundant: Duplicate information from other features.

Methods:

  1. Filter Methods:
    • Select features based on statistical tests
    • Example: Correlation coefficient, Chi-square test
    • Fast but doesn’t consider feature interactions
  2. Wrapper Methods:
    • Use a subset of features and train a model
    • Example: Recursive Feature Elimination (RFE)
    • Computationally expensive but more accurate
  3. Embedded Methods:
    • Feature selection as part of the model training process
    • Example: Lasso regression, Decision trees
    • Efficient and accurate but model-specific

Summary Table

Task                      Technique
Identify variable types   Nominal, Ordinal, Interval, Ratio
Summarize numeric data    Mean, Median, Std Dev, IQR
Visualize data            Histogram, Box plot, Scatter plot
Handle missing values     Drop, Impute, Predict
Treat outliers            Remove, Cap, Investigate
Scale features            Standardize, Normalize
Reduce dimensions         PCA, SVD
Select features           Filter, Wrapper, Embedded Methods

This notebook illustrates key techniques from data preparation for machine learning using a small synthetic dataset of car attributes and fuel efficiency.

Example Dataset

# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression

# Fake dataset
data = {
    "car_name": ["car_a", "car_b", "car_c", "car_d", "car_e", "car_f"],
    "cylinders": [4, 6, 8, 4, 4, 8],
    "displacement": [140, 200, 360, 150, 130, 3700],
    "horsepower": [90, 105, 215, 92, np.nan, 220], # np (numpy - numeric python - library for scientific computing. nan: not a number/null)
    "weight": [2400, 3000, 4300, 2500, 2200, 4400],
    "acceleration": [15.5, 14.0, 12.5, 16.0, 15.0, 11.0],
    "model_year": [80, 78, 76, 82, 81, 77],
    "origin": [1, 1, 1, 2, 3, 1],
    "mpg": [30.5, 24.0, 13.0, 29.5, 32.0, 10.0]
}
df = pd.DataFrame(data)
df
  car_name  cylinders  displacement  horsepower  weight  acceleration  model_year  origin   mpg
0    car_a          4           140        90.0    2400          15.5          80       1  30.5
1    car_b          6           200       105.0    3000          14.0          78       1  24.0
2    car_c          8           360       215.0    4300          12.5          76       1  13.0
3    car_d          4           150        92.0    2500          16.0          82       2  29.5
4    car_e          4           130         NaN    2200          15.0          81       3  32.0
5    car_f          8          3700       220.0    4400          11.0          77       1  10.0

Data Types

  • car_name: Nominal (categorical)
  • cylinders: Discrete numeric, often treated as ordinal; origin: Nominal (integer-coded categories)
  • displacement, horsepower, weight, acceleration, mpg: Ratio (numeric)
  • model_year: Interval

Handling Missing Values

# 1. Handling Missing Values Example
print("=== Missing Values Before Imputation ===")
print(df.isna().sum())

# Mean imputation
mean_imputer = SimpleImputer(strategy='mean')
df['horsepower_mean'] = mean_imputer.fit_transform(df[['horsepower']])

# Group-based imputation
group_means = df.groupby('cylinders')['horsepower'].transform('mean')
df['horsepower_group'] = df['horsepower'].fillna(group_means)

# KNN imputation
knn_imputer = KNNImputer(n_neighbors=2)
df['horsepower_knn'] = knn_imputer.fit_transform(df[['horsepower']])

print("\n=== After Imputation ===")
df[['horsepower', 'horsepower_mean', 'horsepower_group', 'horsepower_knn']]
=== Missing Values Before Imputation ===
car_name        0
cylinders       0
displacement    0
horsepower      1
weight          0
acceleration    0
model_year      0
origin          0
mpg             0
dtype: int64

=== After Imputation ===
   horsepower  horsepower_mean  horsepower_group  horsepower_knn
0        90.0             90.0              90.0            90.0
1       105.0            105.0             105.0           105.0
2       215.0            215.0             215.0           215.0
3        92.0             92.0              92.0            92.0
4         NaN            144.4              91.0           144.4
5       220.0            220.0             220.0           220.0

Note: because only the single horsepower column was passed to KNNImputer, there are no other features to compute neighbor distances on, so it falls back to the column mean; that is why horsepower_knn matches horsepower_mean here. Group-based imputation instead uses the mean horsepower of the other 4-cylinder cars (90 and 92), giving 91.0.

Handling Outliers

# 2. Handling Outliers Example
def detect_and_handle_outliers(df, column):
    # Calculate IQR
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Detect outliers
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    print(f'Detected {len(outliers)} outliers in {column}')

    # Visualize before and after
    plt.figure(figsize=(12, 4))

    plt.subplot(1, 2, 1)
    sns.boxplot(y=df[column])
    plt.title(f'Original {column}')

    # Capping outliers
    df[f'{column}_capped'] = np.where(df[column] > upper_bound, upper_bound,
                                     np.where(df[column] < lower_bound, lower_bound, df[column]))

    plt.subplot(1, 2, 2)
    sns.boxplot(y=df[f'{column}_capped'])
    plt.title(f'After Capping {column}')

    plt.tight_layout()
    plt.show()

    return df

df = detect_and_handle_outliers(df, 'displacement')
Detected 1 outliers in displacement

[Figure: box plots of displacement before and after capping]

Feature Scaling (Standardization)

# 3. Feature Scaling Example
# Original data
numeric_cols = ['weight', 'acceleration', 'displacement']
print('Original data:')
print(df[numeric_cols].head())

# Standardization
scaler = StandardScaler()
df_std = df.copy()
df_std[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Min-Max Scaling
minmax = MinMaxScaler()
df_minmax = df.copy()
df_minmax[numeric_cols] = minmax.fit_transform(df[numeric_cols])

print('\nStandardized data (mean=0, std=1):')
print(df_std[numeric_cols].head())

print('Min-Max Scaled data (range [0,1]):')
print(df_minmax[numeric_cols].head())
Original data:
   weight  acceleration  displacement
0    2400          15.5           140
1    3000          14.0           200
2    4300          12.5           360
3    2500          16.0           150
4    2200          15.0           130

Standardized data (mean=0, std=1):
     weight  acceleration  displacement
0 -0.820462      0.854242     -0.489225
1 -0.149175      0.000000     -0.443360
2  1.305280     -0.854242     -0.321054
3 -0.708580      1.138990     -0.481581
4 -1.044224      0.569495     -0.496869
Min-Max Scaled data (range [0,1]):
     weight  acceleration  displacement
0  0.090909           0.9      0.002801
1  0.363636           0.6      0.019608
2  0.954545           0.3      0.064426
3  0.136364           1.0      0.005602
4  0.000000           0.8      0.000000

Box Plot Visualization

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
sns.boxplot(data=df[['mpg', 'weight', 'acceleration']])
plt.title("Box Plot of Numeric Features")
plt.show()

[Figure: box plot of mpg, weight, and acceleration]

Histogram

df[['acceleration']].hist(bins=5, figsize=(6, 4))
plt.title("Histogram of Acceleration")
plt.show()

[Figure: histogram of acceleration]

Scatter Plot

sns.scatterplot(x='weight', y='mpg', data=df)
plt.title("Scatter Plot: Weight vs MPG")
plt.show()

[Figure: scatter plot of weight vs. mpg]

Cross-Tabulation

pd.crosstab(df['origin'], df['cylinders'])
cylinders  4  6  8
origin
1          1  1  2
2          1  0  0
3          1  0  0

Dimensionality Reduction (PCA)

# 4. Dimensionality Reduction Example
# Prepare data for PCA
X = df[['weight', 'acceleration', 'displacement_capped']]
y = df['mpg']

# Standardize the data first
X_scaled = StandardScaler().fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Create a new dataframe for the principal components
df_pca = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
df_pca['mpg'] = y.values

# Plot the results
plt.figure(figsize=(8, 6))
scatter = plt.scatter(df_pca['PC1'], df_pca['PC2'], c=df_pca['mpg'], cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.colorbar(scatter, label='MPG')
plt.title('PCA of Car Features')
plt.show()

print(f'Explained variance ratio: {pca.explained_variance_ratio_}')
print(f'Total explained variance: {sum(pca.explained_variance_ratio_):.2%}')

[Figure: PCA of car features, colored by mpg]

Explained variance ratio: [0.95929265 0.02632386]
Total explained variance: 98.56%

Feature Selection

We might drop car_name or model_year if feature importance or feature selection techniques show that they contribute little predictive power, as in the sketch below.
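
A minimal sketch of filter (SelectKBest) and wrapper (RFE) selection on the numeric columns of this dataset, reusing the imports loaded above; keeping k = 2 features is an arbitrary choice for illustration:

# Use the imputed horsepower column so there are no NaNs
X = df[['cylinders', 'displacement', 'horsepower_mean', 'weight', 'acceleration', 'model_year']]
y = df['mpg']

# Filter method: keep the k features with the highest F-statistic against mpg
kbest = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print("SelectKBest keeps:", list(X.columns[kbest.get_support()]))

# Wrapper method: recursively eliminate features using a linear regression
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
print("RFE keeps:", list(X.columns[rfe.support_]))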