Multicollinearity

Multicollinearity is a statistical issue in regression analysis where two or more independent (predictor) variables are highly correlated with each other.

What this means

When predictors move together, the model has difficulty separating their individual effects on the dependent variable. As a result, coefficient estimates become unreliable.

Why it’s a problem

Multicollinearity usually has little effect on the model's overall predictive power, but it does undermine interpretation and inference:

  • Regression coefficients become unstable: small changes in the data can produce large swings in the estimates (demonstrated in the sketch after this list)
  • Standard errors are inflated, making predictors appear statistically insignificant
  • Coefficient signs or magnitudes may be counter-intuitive
  • It becomes hard to determine which variable actually matters
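
A minimal sketch of the instability (the data are simulated and all names are illustrative): two nearly collinear predictors are generated, and the same OLS model is refit on two random half-samples. The individual coefficients swing widely, while their sum stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two predictors that are near-copies of each other (correlation ~ 0.99)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
y = 3 * x1 + 2 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])

# Refit the same OLS model on two random half-samples of the data
for seed in (1, 2):
    idx = np.random.default_rng(seed).choice(n, size=n // 2, replace=False)
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    # Individual coefficients swing; their sum (the well-identified part) is stable
    print(f"half-sample {seed}: b1={beta[1]:+.2f}  b2={beta[2]:+.2f}  b1+b2={beta[1] + beta[2]:.2f}")
```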

Simple example

Suppose you regress house price on:

  • house_size_m2
  • number_of_rooms

Because these two predictors are strongly correlated, the model struggles to decide whether it is size or the number of rooms that explains the price, as the simulated fit below illustrates.
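
A hedged sketch of this example with simulated data (the column names and every number below are made up for illustration): the model explains price well overall, yet the standard errors on both correlated predictors are inflated.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 100

# Simulated, strongly correlated predictors (correlation ~ 0.97)
house_size_m2 = rng.normal(120, 30, size=n)
number_of_rooms = house_size_m2 / 25 + rng.normal(scale=0.3, size=n)

# Price is driven mainly by size; the model cannot tell whether rooms adds anything
price = 2000 * house_size_m2 + rng.normal(scale=20000, size=n)

X = sm.add_constant(np.column_stack([house_size_m2, number_of_rooms]))
fit = sm.OLS(price, X).fit()

print(f"corr(size, rooms) = {np.corrcoef(house_size_m2, number_of_rooms)[0, 1]:.2f}")
print(f"R^2 = {fit.rsquared:.2f}")       # good overall fit
print("std errors:", fit.bse[1:])        # both inflated by the collinearity
print("p-values:  ", fit.pvalues[1:])    # the rooms effect is indistinguishable from zero
```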

Common causes

  • Including redundant variables
  • Including a variable alongside one derived from it (e.g., total_sales and sales_per_day)
  • Polynomial terms without centering (e.g., x and x²); see the centering sketch after this list
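
A small illustration of the centering point (x here is an arbitrary positive-valued variable): x and x² are nearly collinear when x is far from zero, and subtracting the mean before squaring removes most of that correlation.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(10, 20, size=500)   # a positive-valued predictor

corr_raw = np.corrcoef(x, x**2)[0, 1]

xc = x - x.mean()                   # center before building the quadratic term
corr_centered = np.corrcoef(xc, xc**2)[0, 1]

print(f"corr(x,  x^2)  = {corr_raw:.3f}")       # close to 1.0
print(f"corr(xc, xc^2) = {corr_centered:.3f}")  # near 0 for a symmetric x
```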

How to detect it

  • Correlation matrix (high pairwise correlations)
  • Variance Inflation Factor (VIF), computed in the sketch after this list; common rules of thumb treat VIF > 5 as moderate and VIF > 10 as severe
  • Large standard errors and nonsignificant coefficients despite good overall model fit
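
VIF can be computed directly with statsmodels. This is a sketch: the DataFrame below is a simulated placeholder, and in practice you would pass your own predictors.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Placeholder data: two deliberately collinear predictors
rng = np.random.default_rng(0)
size = rng.normal(120, 30, size=100)
df = pd.DataFrame({
    "house_size_m2": size,
    "number_of_rooms": size / 25 + rng.normal(scale=0.3, size=100),
})

X = sm.add_constant(df)  # include the intercept when computing VIF
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    print(f"{name}: VIF = {variance_inflation_factor(X.values, i):.1f}")
```

With predictors this correlated, both VIFs land well above 10, flagging severe multicollinearity.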

How to fix or mitigate it

  • Remove or combine correlated predictors
  • Use feature selection
  • Apply regularization such as Ridge or Lasso (a Ridge sketch follows this list)
  • Use Principal Component Analysis (PCA)
  • Center variables when using polynomial terms
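
As a sketch of the regularization route (simulated data; alpha = 10 is an arbitrary value that would normally be chosen by cross-validation), Ridge pulls the two correlated coefficients toward similar, stable values where plain OLS leaves them erratic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly collinear with x1
y = 3 * x1 + 2 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)        # alpha would normally be cross-validated

print("OLS coefficients:  ", ols.coef_)    # individual effects poorly identified
print("Ridge coefficients:", ridge.coef_)  # pulled toward a shared, stable value
```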

One-sentence definition

Multicollinearity occurs when independent variables in a regression model are highly correlated, making coefficient estimates unstable and difficult to interpret.