
The Silent Killer of Your Regression Model

How Multicollinearity Destroys Trust in Your Analysis — and What PCA Can Do About It

Your report shows a high R². The CEO is happy. Budgets are adjusted. But a month later… sales drop. The campaign didn’t fail — your model did.


Welcome to multicollinearity — the invisible force that can quietly wreck your regression models.



🚨 A Real Problem, Not Just a Technicality

Let’s say you’re a retail analyst. You’ve built a regression model to predict monthly sales based on:

  • Email Campaign Budget

  • Social Media Ads

  • Search Engine Ads

  • Store Footfall


The model says:

  • Email doesn’t seem to matter

  • Search Ads have a weak positive effect

  • Social Media Ads are the strongest driver


But something’s off. You know from experience that email performs well. Why is your model suggesting otherwise?

You might be suffering from multicollinearity — when variables are so closely related that the model can’t distinguish their true effects.

👣 If You're New to Regression...

This article builds on concepts from our Intro to Regression under Inferential Analysis. 📘 Read that primer here


But if you’re already comfortable with regression basics, keep going.



🤔 What Exactly Is Multicollinearity?

Multicollinearity occurs when two or more independent variables are highly correlated — they say the same thing in different ways.

Imagine asking three advisors what to do, but they all just repeat each other. You’re no wiser — just more confused.

This messes up the math behind regression. Your model tries to assign credit (or blame) to each predictor, but if they overlap, it can’t do it reliably.
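To make that concrete, here is a small, purely hypothetical simulation in Python (numpy + statsmodels; every number is made up). Two predictors that are near copies of each other get wildly unstable coefficient estimates from one random sample to the next, even though both genuinely drive the outcome:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

for sample in range(1, 4):
    n = 200
    email = rng.normal(50, 10, n)            # hypothetical email budget
    social = email + rng.normal(0, 0.5, n)   # almost a copy of email -> multicollinearity
    sales = 2 * email + 2 * social + rng.normal(0, 20, n)

    X = sm.add_constant(np.column_stack([email, social]))
    fit = sm.OLS(sales, X).fit()
    # Although email and social each truly add about 2 units of sales,
    # their estimated coefficients swing around (and can even flip sign)
    # from one sample to the next.
    print(f"sample {sample}: email coef = {fit.params[1]:6.2f}, social coef = {fit.params[2]:6.2f}")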



🔎 How to Spot Multicollinearity (the Practical Way)


1. Look at the Correlation Matrix

If two predictors have a correlation above 0.8 or 0.9, red flags are flying.
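A quick way to run this check in Python with pandas (the figures below are invented purely to illustrate):

import pandas as pd

# Illustrative data only; in practice, load your own predictors here.
df = pd.DataFrame({
    "email_budget": [10, 12, 15, 9, 14, 11],
    "social_ads":   [11, 13, 16, 10, 15, 12],   # moves almost in lockstep with email
    "search_ads":   [5, 9, 4, 8, 6, 7],
})

print(df.corr().round(2))   # pairwise correlations above ~0.8 are a red flag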


2. Use VIF — Variance Inflation Factor

What is VIF? It tells you how much the variance of a regression coefficient is inflated due to multicollinearity. Here’s how to calculate it:


VIF = 1 / (1 - R² from an auxiliary regression)


To get the R²:

  • Run a regression of one predictor (e.g. Email Budget) against all the other predictors.

  • Plug the resulting R² into the formula.


✅ Example:

If Email Budget’s R² from that regression is 0.96:


VIF = 1 / (1 - 0.96) = 25

📊 Excel Note: Excel’s built-in regression won’t give VIF, but you can compute it manually using R² from separate regressions or use add-ins like Real Statistics or XLSTAT.
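In Python, the auxiliary regressions are handled for you by statsmodels. A minimal sketch (the marketing figures below are invented for illustration):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative predictors; replace with your own data.
X = pd.DataFrame({
    "email_budget": [10, 12, 15, 9, 14, 11, 13, 8],
    "social_ads":   [11, 13, 16, 10, 15, 12, 14, 9],   # nearly a copy of email_budget
    "search_ads":   [5, 9, 4, 8, 6, 7, 5, 8],
})

X_const = sm.add_constant(X)   # VIF is computed with an intercept in the model
for i, col in enumerate(X_const.columns):
    if col == "const":
        continue
    vif = variance_inflation_factor(X_const.values, i)
    print(f"{col}: VIF = {vif:.1f}")   # values above ~5-10 usually signal trouble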



📉 Is R² Lying Too?

R² tells you how much of the variation in your outcome is explained by the model.


The catch? R² never decreases when you add more variables, even if they’re redundant or irrelevant.

That’s why we use Adjusted R²:

🔍 Adjusted R² penalizes unnecessary complexity. It only goes up if your new predictor actually improves the model.
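For reference, with n observations and p predictors:

Adjusted R² = 1 - (1 - R²) × (n - 1) / (n - p - 1)

Every extra predictor increases the penalty, so a redundant variable has to earn its place.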

Think of R² as how impressed your boss is with your report. Adjusted R² is the audit — checking if you padded your results.


🧠 So How Do We Fix Multicollinearity?

One common fix is to drop redundant variables — but that isn’t always ideal, especially if all your variables matter in theory.


Another powerful solution? PCA — Principal Component Analysis



🧹 What PCA Does (and Why It Helps)

PCA transforms your correlated variables into a new set of uncorrelated components called Principal Components.


These components:

  • Are mathematically constructed combinations of your original variables

  • Are ordered by how much variation they explain (PC1 = the most)

  • Remove multicollinearity by design

Think of PCA like Marie Kondo — it takes a cluttered wardrobe of variables and folds them neatly into 1 or 2 compact components.

🧪 Example: Fitness Tracker

Let’s say you’re building a model to understand physical activity. Your data includes:

  • Steps per day

  • Minutes walked

  • Distance walked

  • Calories burned


These are all good indicators of activity, but they’re also highly correlated — essentially repeating the same story. That’s where PCA comes in.


🔄 What PCA Does with These Variables

When you apply PCA to these four inputs and keep the first two components, you end up with two clean, uncorrelated summaries of the data:


🧱 PC1 = Overall Activity

This is your main signal — the one that captures the general level of movement across all four measures. It’s driven by high values across steps, minutes, distance, and calories.

Think of PC1 as a Daily Activity Index. If someone moves a lot — in any form — PC1 picks it up.

🎯 PC2 = Movement Style or Intensity

While PC1 tells you how much someone moves, PC2 captures how they move. It highlights contrasts:

  • Someone who takes many steps but burns fewer calories (slow, flat walking)

  • Versus someone who takes fewer steps but burns more (faster, uphill, gym work)

PC2 helps differentiate between “steady walkers” and “intense movers” — even if their overall activity level is the same.

👥 Example Breakdown

Person | Steps | Minutes | Distance | Calories | PC1 (Activity) | PC2 (Style/Intensity)
A      | High  | High    | High     | High     | High           | Neutral
B      | High  | High    | Low      | Low      | Medium         | Negative
C      | Low   | Low     | High     | High     | Medium         | Positive

  • A is just highly active → High PC1

  • B moves a lot, but slowly → PC2 shifts negative

  • C moves less, but with power → PC2 shifts positive


✅ The Result?

Instead of juggling four overlapping metrics, your model now works with:

  • PC1: Overall activity

  • PC2: Movement pattern


Both are clean, uncorrelated, and far easier for your regression model to handle — without any multicollinearity.
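If you’d like to try this yourself, here is a minimal Python sketch with scikit-learn (the tracker numbers are invented for illustration). Standardizing first matters because steps, minutes, distance, and calories sit on very different scales; in this toy data, the first component should roughly track overall activity and the second the steps-versus-intensity contrast described above:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Invented fitness data: steps, minutes walked, distance (km), calories burned.
X = np.array([
    [12000, 110, 8.5, 520],   # very active overall (like person A)
    [11500, 105, 3.0, 260],   # many slow steps, little distance/energy (like B)
    [ 4000,  45, 7.0, 480],   # few steps but intense movement (like C)
    [ 9000,  80, 6.0, 400],
    [ 6000,  60, 4.5, 300],
    [ 3000,  30, 2.0, 150],   # barely active
])

X_std = StandardScaler().fit_transform(X)   # put all four variables on the same scale
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)           # two new, uncorrelated predictors

print(pca.explained_variance_ratio_)        # share of variation captured by PC1 and PC2
print(scores.round(2))                      # the component scores for each person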


⚖️ Trade-Off: Simplicity vs Interpretability

✅ Use PCA when...                    | ❌ Avoid PCA when...
You want a stable, predictive model   | You need to explain what each variable means
You have many correlated predictors   | You’re briefing stakeholders or policymakers


🧮 Can I Do PCA in Excel?

Yes — but it’s manual:

  1. Standardize the data

  2. Compute the covariance matrix

  3. Extract eigenvalues and eigenvectors

  4. Multiply data by the eigenvectors
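
If you would rather see those four steps in code than in Excel’s matrix formulas, here is a minimal numpy sketch of the same procedure (the data is invented; in practice you would load your own predictors):

import numpy as np

# Illustrative data: rows are observations, columns are correlated predictors.
X = np.array([
    [10.0, 11.0, 5.0],
    [12.0, 13.0, 9.0],
    [15.0, 16.0, 4.0],
    [ 9.0, 10.0, 8.0],
    [14.0, 15.0, 6.0],
])

# 1. Standardize the data
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Compute the covariance matrix (for standardized data this is the correlation matrix)
C = np.cov(Z, rowvar=False)

# 3. Extract eigenvalues and eigenvectors, largest eigenvalue first
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Multiply the data by the eigenvectors to get the principal components
scores = Z @ eigvecs

print(eigvals / eigvals.sum())   # proportion of variance explained by each component
print(scores.round(2))           # the uncorrelated principal component scores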


👉 Use the Real Statistics add-in or XLSTAT to simplify this
👉 Or use Python, R, or SPSS, where PCA and VIF are built in


🧠 Final Takeaway

Even the prettiest regression model can be misleading if multicollinearity sneaks in. Your variables start overlapping, your coefficients lose meaning, and your decisions may suffer.

PCA doesn’t just fix the math — it gives you clarity, stability, and confidence in your model.

🔜 What’s Next?

In our next article, we’ll walk through a real retail case study:

  • Build a regression model with real marketing data

  • Watch how multicollinearity causes chaos

  • Apply PCA to restore order

  • And compare the before-and-after results


📘 Stay tuned!

 
 
 
