
The Silent Killer of Your Regression Model

How Multicollinearity Destroys Trust in Your Analysis — and What PCA Can Do About It

Your report shows a high R². The CEO is happy. Budgets are adjusted. But a month later… sales drop. The campaign didn’t fail — your model did.


Welcome to multicollinearity — the invisible force that can quietly wreck your regression models.



🚨 A Real Problem, Not Just a Technicality

Let’s say you’re a retail analyst. You’ve built a regression model to predict monthly sales based on:

  • Email Campaign Budget

  • Social Media Ads

  • Search Engine Ads

  • Store Footfall


The model says:

  • Email doesn’t seem to matter

  • Search Ads have a weak positive effect

  • Social Media Ads are the strongest driver


But something’s off. You know from experience that email performs well. Why is your model suggesting otherwise?

You might be suffering from multicollinearity — when variables are so closely related that the model can’t distinguish their true effects.

👣 If You're New to Regression...

This article builds on concepts from our Intro to Regression under Inferential Analysis. 📘 Read that primer here


But if you’re already comfortable with regression basics, keep going.



🤔 What Exactly Is Multicollinearity?

Multicollinearity occurs when two or more independent variables are highly correlated — they say the same thing in different ways.

Imagine asking three advisors what to do, but they all just repeat each other. You’re no wiser — just more confused.

This messes up the math behind regression. Your model tries to assign credit (or blame) to each predictor, but if they overlap, it can’t do it reliably.
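To make that concrete, here is a small, purely hypothetical simulation in Python (numpy + statsmodels; every number is made up). Two predictors that are near copies of each other get wildly unstable coefficient estimates from one random sample to the next, even though both genuinely drive the outcome:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

for sample in range(1, 4):
    n = 200
    email = rng.normal(50, 10, n)            # hypothetical email budget
    social = email + rng.normal(0, 0.5, n)   # almost a copy of email -> multicollinearity
    sales = 2 * email + 2 * social + rng.normal(0, 20, n)

    X = sm.add_constant(np.column_stack([email, social]))
    fit = sm.OLS(sales, X).fit()
    # Although email and social each truly add about 2 units of sales,
    # their estimated coefficients swing around (and can even flip sign)
    # from one sample to the next.
    print(f"sample {sample}: email coef = {fit.params[1]:6.2f}, social coef = {fit.params[2]:6.2f}")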



🔎 How to Spot Multicollinearity (the Practical Way)


1. Look at the Correlation Matrix

If two predictors have a correlation above 0.8 or 0.9, red flags are flying.
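A quick way to run this check in Python with pandas (the figures below are invented purely to illustrate):

import pandas as pd

# Illustrative data only; in practice, load your own predictors here.
df = pd.DataFrame({
    "email_budget": [10, 12, 15, 9, 14, 11],
    "social_ads":   [11, 13, 16, 10, 15, 12],   # moves almost in lockstep with email
    "search_ads":   [5, 9, 4, 8, 6, 7],
})

print(df.corr().round(2))   # pairwise correlations above ~0.8 are a red flag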


2. Use VIF — Variance Inflation Factor

What is VIF? It tells you how much the variance of a regression coefficient is inflated due to multicollinearity. Here’s how to calculate it:


VIF = 1 / (1 - R² from an auxiliary regression)


To get the R²:

  • Run a regression of one predictor (e.g. Email Budget) against all the other predictors.

  • Plug the resulting R² into the formula.


✅ Example:

If Email Budget’s R² from that regression is 0.96:


VIF = 1 / (1 - 0.96) = 25

📊 Excel Note: Excel’s built-in regression won’t give VIF, but you can compute it manually using R² from separate regressions or use add-ins like Real Statistics or XLSTAT.
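In Python, the auxiliary regressions are handled for you by statsmodels. A minimal sketch (the marketing figures below are invented for illustration):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative predictors; replace with your own data.
X = pd.DataFrame({
    "email_budget": [10, 12, 15, 9, 14, 11, 13, 8],
    "social_ads":   [11, 13, 16, 10, 15, 12, 14, 9],   # nearly a copy of email_budget
    "search_ads":   [5, 9, 4, 8, 6, 7, 5, 8],
})

X_const = sm.add_constant(X)   # VIF is computed with an intercept in the model
for i, col in enumerate(X_const.columns):
    if col == "const":
        continue
    vif = variance_inflation_factor(X_const.values, i)
    print(f"{col}: VIF = {vif:.1f}")   # values above ~5-10 usually signal trouble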



📉 Is R² Lying Too?

R² tells you how much of the variation in your outcome is explained by the model.


The catch? R² never decreases when you add more variables, even if they’re redundant or irrelevant.

That’s why we use Adjusted R²:

🔍 Adjusted R² penalizes unnecessary complexity. It only goes up if your new predictor actually improves the model.
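For reference, with n observations and p predictors:

Adjusted R² = 1 - (1 - R²) × (n - 1) / (n - p - 1)

Every extra predictor increases the penalty, so a redundant variable has to earn its place.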

Think of R² as how impressed your boss is with your report. Adjusted R² is the audit — checking if you padded your results.


🧠 So How Do We Fix Multicollinearity?

One common fix is to drop redundant variables — but that isn’t always ideal, especially if all your variables matter in theory.


Another powerful solution? PCA — Principal Component Analysis



🧹 What PCA Does (and Why It Helps)

PCA transforms your correlated variables into a new set of uncorrelated components called Principal Components.


These components:

  • Are mathematically constructed combinations of your original variables

  • Are ordered by how much variation they explain (PC1 = the most)

  • Remove multicollinearity by design

Think of PCA like Marie Kondo — it takes a cluttered wardrobe of variables and folds them neatly into 1 or 2 compact components.

🧪 Example: Fitness Tracker

Let’s say you’re building a model to understand physical activity. Your data includes:

  • Steps per day

  • Minutes walked

  • Distance walked

  • Calories burned


These are all good indicators of activity, but they’re also highly correlated — essentially repeating the same story. That’s where PCA comes in.


🔄 What PCA Does with These Variables

When you apply PCA to these four inputs and keep the first two components, you end up with two clean, uncorrelated summaries of the data:


🧱 PC1 = Overall Activity

This is your main signal — the one that captures the general level of movement across all four measures. It’s driven by high values across steps, minutes, distance, and calories.

Think of PC1 as a Daily Activity Index. If someone moves a lot — in any form — PC1 picks it up.

🎯 PC2 = Movement Style or Intensity

While PC1 tells you how much someone moves, PC2 captures how they move. It highlights contrasts:

  • Someone who takes many steps but burns fewer calories (slow, flat walking)

  • Versus someone who takes fewer steps but burns more (faster, uphill, gym work)

PC2 helps differentiate between “steady walkers” and “intense movers” — even if their overall activity level is the same.

👥 Example Breakdown

Person | Steps | Minutes | Distance | Calories | PC1 (Activity) | PC2 (Style/Intensity)
A      | High  | High    | High     | High     | High           | Neutral
B      | High  | High    | Low      | Low      | Medium         | Negative
C      | Low   | Low     | High     | High     | Medium         | Positive

  • A is just highly active → High PC1

  • B moves a lot, but slowly → PC2 shifts negative

  • C moves less, but with power → PC2 shifts positive


✅ The Result?

Instead of juggling four overlapping metrics, your model now works with:

  • PC1: Overall activity

  • PC2: Movement pattern


Both are clean, uncorrelated, and far easier for your regression model to handle — without any multicollinearity.
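If you’d like to try this yourself, here is a minimal Python sketch with scikit-learn (the tracker numbers are invented for illustration). Standardizing first matters because steps, minutes, distance, and calories sit on very different scales; in this toy data, the first component should roughly track overall activity and the second the steps-versus-intensity contrast described above:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Invented fitness data: steps, minutes walked, distance (km), calories burned.
X = np.array([
    [12000, 110, 8.5, 520],   # very active overall (like person A)
    [11500, 105, 3.0, 260],   # many slow steps, little distance/energy (like B)
    [ 4000,  45, 7.0, 480],   # few steps but intense movement (like C)
    [ 9000,  80, 6.0, 400],
    [ 6000,  60, 4.5, 300],
    [ 3000,  30, 2.0, 150],   # barely active
])

X_std = StandardScaler().fit_transform(X)   # put all four variables on the same scale
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)           # two new, uncorrelated predictors

print(pca.explained_variance_ratio_)        # share of variation captured by PC1 and PC2
print(scores.round(2))                      # the component scores for each person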


⚖️ Trade-Off: Simplicity vs Interpretability

✅ Use PCA when...                    | ❌ Avoid PCA when...
You want a stable, predictive model   | You need to explain what each variable means
You have many correlated predictors   | You’re briefing stakeholders or policymakers


🧮 Can I Do PCA in Excel?

Yes — but it’s manual:

  1. Standardize the data

  2. Compute the covariance matrix

  3. Extract eigenvalues and eigenvectors

  4. Multiply data by the eigenvectors
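
If you would rather see those four steps in code than in Excel’s matrix formulas, here is a minimal numpy sketch of the same procedure (the data is invented; in practice you would load your own predictors):

import numpy as np

# Illustrative data: rows are observations, columns are correlated predictors.
X = np.array([
    [10.0, 11.0, 5.0],
    [12.0, 13.0, 9.0],
    [15.0, 16.0, 4.0],
    [ 9.0, 10.0, 8.0],
    [14.0, 15.0, 6.0],
])

# 1. Standardize the data
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. Compute the covariance matrix (for standardized data this is the correlation matrix)
C = np.cov(Z, rowvar=False)

# 3. Extract eigenvalues and eigenvectors, largest eigenvalue first
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Multiply the data by the eigenvectors to get the principal components
scores = Z @ eigvecs

print(eigvals / eigvals.sum())   # proportion of variance explained by each component
print(scores.round(2))           # the uncorrelated principal component scores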


👉 Use the Real Statistics add-in or XLSTAT to simplify this
👉 Or use Python, R, or SPSS, where PCA and VIF are built in


🧠 Final Takeaway

Even the prettiest regression model can be misleading if multicollinearity sneaks in. Your variables start overlapping, your coefficients lose meaning, and your decisions may suffer.

PCA doesn’t just fix the math — it gives you clarity, stability, and confidence in your model.

🔜 What’s Next?

In our next article, we’ll walk through a real retail case study:

  • Build a regression model with real marketing data

  • Watch how multicollinearity causes chaos

  • Apply PCA to restore order

  • And compare the before-and-after results


📘 Stay tuned!

 
 
 
