Looks Right, Feels Wrong
- Michael Lee, MBA
- Jun 21
- 2 min read
How Multicollinearity Destroys Trust in Your Regression Model — and What PCA Can Do About It
📘 This is Part 2 in our regression (PCA) series. If you're unfamiliar with multicollinearity, start with Part 1: The Silent Killer of Your Regression Model

🚨 The Setup: Retail Marketing Spend
Imagine you're a retail analyst building a regression model to understand what drives monthly sales. You include:
Email Campaign Budget
Social Media Ads Budget
Search Engine Ads Budget
Store Footfall
Seems like a solid list, right? Let’s run a regression.
🔢 The Problem: Your Model Doesn’t Know Who to Credit
Here’s the raw regression result:
Regression Before PCA
R²: 0.440
Adjusted R²: 0.417
P-values:
Email_Spend: 0.234 ❌
Social_Spend: 0.186 ❌
Search_Spend: 0.240 ❌
Footfall: 0.011 ✅
Despite a decent R², none of the marketing variables are significant. Why?
Let’s take a look under the hood.
🔍 Multicollinearity in Action
🔄 Correlation Heatmap
Email, Social, and Search spends are highly correlated — above 0.9. This means they are essentially repeating the same information.
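The heatmap is just a visualization of a plain correlation matrix, which pandas computes in one call (same assumed synthetic setup as above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 200
base = rng.normal(50, 10, n)   # shared driver behind all digital spend

df = pd.DataFrame({
    "Email_Spend":  base + rng.normal(0, 2, n),
    "Social_Spend": base + rng.normal(0, 2, n),
    "Search_Spend": base + rng.normal(0, 3, n),
    "Footfall":     rng.normal(200, 30, n),
})

corr = df.corr()
print(corr.round(2))
# To draw the heatmap itself: seaborn.heatmap(corr, annot=True, cmap="coolwarm")
```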

🔢 VIF Scores
| Variable | VIF |
| --- | --- |
| Email_Spend | 95.58 |
| Social_Spend | 61.34 |
| Search_Spend | 35.45 |
| Footfall | 1.06 |
❗️ A VIF above 10 signals serious multicollinearity. Here the scores are screaming: the model is confused about who deserves credit.

🧹 The Fix: Principal Component Analysis (PCA)
PCA creates new, uncorrelated variables (called principal components) by combining the original predictors.
Think of it like reorganizing your messy closet into neat drawers:
🧵 PC1: Overall Marketing Activity
Combines Email, Social, and Search into a single, powerful signal of digital spend intensity.
🛍️ PC2: Channel Mix — Offline vs Online
Differentiates between heavy store footfall and online channels. Helps us understand balance in strategy.
PCA Loadings
| Component | Email | Social | Search | Footfall |
| --- | --- | --- | --- | --- |
| PC1 | -0.58 | -0.58 | -0.57 | 0.09 |
| PC2 | 0.04 | 0.05 | 0.07 | 1.00 |
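Loadings like these fall out of scikit-learn's `PCA`. A sketch on the assumed synthetic data, standardizing first so no channel dominates just because of its scale:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 200
base = rng.normal(50, 10, n)

X = pd.DataFrame({
    "Email_Spend":  base + rng.normal(0, 2, n),
    "Social_Spend": base + rng.normal(0, 2, n),
    "Search_Spend": base + rng.normal(0, 3, n),
    "Footfall":     rng.normal(200, 30, n),
})

Z = StandardScaler().fit_transform(X)   # PCA is scale-sensitive
pca = PCA(n_components=2).fit(Z)

loadings = pd.DataFrame(pca.components_, columns=X.columns, index=["PC1", "PC2"])
print(loadings.round(2))   # signs are arbitrary; the magnitudes tell the story
```

You should see the article's pattern: the three spend variables load roughly equally on PC1, while Footfall dominates PC2.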
📊 Regression After PCA
Now we run the regression again, this time using PC1 and PC2 as the predictors.
Regression After PCA
R²: 0.421
Adjusted R²: 0.409
P-values:
PC1: < 0.001 ✅
PC2: 0.003 ✅
🚀 Both components are statistically significant. We now have a model that is cleaner, clearer, and no longer confused by overlapping variables.
💪 Takeaway: Fixing the Story Behind the Numbers
This time, your model isn’t just technically right — it feels right too.
PC1 gives credit to combined digital effort
PC2 adds insight into strategic channel balance
No variable was dropped. No signal was lost. Just smarter math.