The Statistical Mirage of Aggregated Data
Most people believe that if a treatment is better for men and better for women, it must be better for the population as a whole. In reality, it can look worse once the two groups are pooled. This happens when a “lurking variable,” typically the unequal size of the groups, weights the results unevenly. Simpson’s Paradox occurs when a trend that appears in several groups of data disappears or reverses when those groups are combined. It’s a sobering reminder that data without context isn’t just incomplete; it can point to the opposite conclusion.
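To make the reversal concrete, here is a minimal sketch with made-up counts. The options "A" and "B", the group names, and every number are hypothetical, chosen only so that A wins inside each group yet loses once the groups are pooled.

```python
from fractions import Fraction

# Hypothetical (successes, total) counts, invented for illustration only.
counts = {
    "A": {"group 1": (9, 10), "group 2": (30, 100)},
    "B": {"group 1": (80, 100), "group 2": (2, 10)},
}


def rate(successes, total):
    """Success rate as an exact fraction."""
    return Fraction(successes, total)


def aggregate(groups):
    """Pool the groups first, then compute a single rate."""
    successes = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return rate(successes, total)


# Within each group, A has the higher rate.
for name in ("group 1", "group 2"):
    a = rate(*counts["A"][name])
    b = rate(*counts["B"][name])
    print(f"{name}: A = {float(a):.0%}, B = {float(b):.0%}")

# After pooling, the order flips.
agg_a = aggregate(counts["A"])
agg_b = aggregate(counts["B"])
print(f"combined: A = {float(agg_a):.1%}, B = {float(agg_b):.1%}")
```

With these counts, A leads 90% to 80% in group 1 and 30% to 20% in group 2, but the pooled rates are 35.5% for A and 74.5% for B, because most of A's cases sit in the low-rate group.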
Visual Interpretation in Manim
The Manim animation visualizes this paradox through Vector Slopes. By representing success rates as the steepness of a line, we can see how two “steep” lines can combine to create a “shallow” result.
- The Grouped Vectors: Local Truths. Two distinct colors (e.g., Blue and Red) show individual group performance. In their own localized space, both show a positive upward trend.
- The Aggregate Vector: The Global Lie. A white dashed line represents the combined data. Notice how it tilts downward even though its components tilt upward.
- The HUD: Real-Time Ratios. The scoreboard remains stationary in the corner, showing the success ratios (e.g., 5/10 vs 90/100). This highlights the Weighting Bias, the real culprit behind the paradox.
Why it matters:
In medical trials or the UC Berkeley graduate admissions case, failing to account for group size can make a difference in group composition look like a causal effect.
The Visual Logic:
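Slopes represent rates ($\text{Success}/\text{Total}$). Adding vectors is not the same as adding or averaging slopes: the slope of a summed vector is an average of the component slopes weighted by each vector’s horizontal length (its total).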
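The gap between adding vectors and averaging slopes can be checked numerically. The sketch below is plain Python rather than Manim, and it treats the 5/10 and 90/100 ratios as two hypothetical group vectors of the form (total, successes). The names `small`, `large`, `slope`, and `add` are illustrative and are not taken from the animation's source.

```python
# Each group is a vector (total, successes); its slope is the success rate.
# Hypothetical counts for illustration.
small = (10, 5)     # slope 0.5
large = (100, 90)   # slope 0.9


def slope(vector):
    """Rise over run: successes divided by total."""
    total, successes = vector
    return successes / total


def add(u, v):
    """Tip-to-tail vector addition, i.e. pooling the two groups."""
    return (u[0] + v[0], u[1] + v[1])


combined = add(small, large)

# Plain mean of the two slopes ignores how long each vector is.
naive_mean = (slope(small) + slope(large)) / 2

# Weighting each slope by its share of the horizontal length
# reproduces the slope of the summed vector.
w_small = small[0] / combined[0]
w_large = large[0] / combined[0]
weighted_mean = w_small * slope(small) + w_large * slope(large)

print(f"slope of summed vector:  {slope(combined):.3f}")
print(f"plain mean of slopes:    {naive_mean:.3f}")
print(f"weighted mean of slopes: {weighted_mean:.3f}")
```

The summed vector has slope 95/110, about 0.864, which matches the weighted mean and sits much closer to the long vector's 0.9 than the plain mean of 0.7 does.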
Note: This phenomenon is a classic case study in Probability Theory and Causal Inference. It demonstrates how Confounding Variables can distort a Correlation until the aggregate no longer reflects the relationship that holds inside each underlying group.
The Mathematical Proof
Simpson’s Paradox occurs because of weighted averages. A high success rate in a small group cannot overcome a low success rate in a massive group when they are merged.
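Mathematically, writing $a_i / b_i$ and $c_i / d_i$ for the success rates of the two options in subgroup $i$, it is possible for these three inequalities to hold simultaneously:
$a_1 / b_1 > c_1 / d_1$
AND
$a_2 / b_2 > c_2 / d_2$
BUT
$(a_1 + a_2) / (b_1 + b_2) < (c_1 + c_2) / (d_1 + d_2)$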
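The weights (the denominators b and d) are the “lurking variables.” If the first option has most of its cases in the subgroup where success rates are low, while the second has most of its cases in the subgroup where rates are high, the first option’s total average is “dragged” down even though it performs better inside every subgroup.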
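The role of the denominators can be made explicit by rewriting each aggregate rate as a weighted average. This is a sketch under one assumption about notation: a and c count successes, b and d count totals, and the subscripts 1 and 2 label the two subgroups. The weights $w$ and $v$ are introduced here for illustration.

```latex
\frac{a_1 + a_2}{b_1 + b_2}
  = w \cdot \frac{a_1}{b_1} + (1 - w) \cdot \frac{a_2}{b_2},
  \qquad w = \frac{b_1}{b_1 + b_2}

\frac{c_1 + c_2}{d_1 + d_2}
  = v \cdot \frac{c_1}{d_1} + (1 - v) \cdot \frac{c_2}{d_2},
  \qquad v = \frac{d_1}{d_1 + d_2}
```

If $w = v$, both sides average their subgroup rates with the same weights, so winning in each subgroup guarantees winning overall. A reversal therefore requires $w \neq v$, meaning the two options distribute their cases across the subgroups differently.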
The Weighting Trap:
Total success is driven by volume, not just percentage. Massive groups dominate the final average.
Causal Direction:
To avoid the paradox, one must ask: “What is the cause?” and split data accordingly.
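One common way to act on this, assuming the splitting variable is a genuine confounder rather than a consequence of the treatment, is to compare the options with the same weights in every stratum (direct standardization). The sketch below uses hypothetical counts and stratum names, and it names a technique the text above does not mention explicitly.

```python
from fractions import Fraction

# Hypothetical (successes, total) counts per option and stratum.
data = {
    "A": {"mild": (9, 10), "severe": (30, 100)},
    "B": {"mild": (80, 100), "severe": (2, 10)},
}


def standardized_rate(strata, weights):
    """Average per-stratum rates using weights shared by every option."""
    return sum(weights[name] * Fraction(s, t) for name, (s, t) in strata.items())


# Shared weights: each stratum's share of the whole population,
# counted across both options.
stratum_totals = {
    name: sum(data[arm][name][1] for arm in data) for name in data["A"]
}
population = sum(stratum_totals.values())
weights = {name: Fraction(n, population) for name, n in stratum_totals.items()}

adjusted = {arm: standardized_rate(strata, weights) for arm, strata in data.items()}
for arm, value in adjusted.items():
    print(f"{arm}: adjusted rate = {float(value):.0%}")
```

Here both strata hold 110 cases, so each gets weight 1/2 and the adjusted rates are 60% for A and 50% for B, which agrees with the within-stratum ordering. Whether adjusting is the right move still depends on the causal question being asked: if the stratum were an effect of the treatment, the pooled figure could be the more relevant one.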
Source Code: Manim Implementation
