Chapter 12: Chi-Square Distribution
1. What is the Chi-Square distribution really?
The Chi-Square distribution (written χ²) is the distribution of the sum of squares of k independent standard normal random variables.
In simple words:
If you take k independent random numbers from a standard normal distribution (mean=0, sd=1), square each of them, and add them all together → the result follows a Chi-Square distribution with k degrees of freedom.
Mathematical definition:
Let Z₁, Z₂, …, Zₖ be independent random variables, each distributed N(0, 1). Then:
Q = Z₁² + Z₂² + … + Zₖ² ~ χ²(k)
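To make the definition concrete, here is a minimal simulation sketch that builds Q by hand from squared standard normals and compares it with SciPy's χ²(k). The choices k = 4, the seed, and the number of draws are arbitrary illustration values, not part of the definition.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)   # arbitrary seed, only for reproducibility
k = 4                                  # degrees of freedom (illustrative choice)
n_draws = 100_000                      # number of simulated values of Q

z = rng.standard_normal(size=(n_draws, k))   # each row: k independent N(0, 1) values
q = np.sum(z ** 2, axis=1)                   # Q = Z1² + ... + Zk² for every row

print("simulated mean:", q.mean())   # theory: k
print("simulated var: ", q.var())    # theory: 2k

# Kolmogorov-Smirnov distance between the simulated Q and the chi2(k) CDF;
# a small value means the two distributions agree closely
ks = stats.kstest(q, stats.chi2(df=k).cdf)
print("KS statistic:", ks.statistic)
```

If the definition holds, the simulated mean and variance should land near k and 2k, and the KS statistic should be close to zero.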
Key properties (write these down):
- Only defined for x ≥ 0 (because squares are non-negative)
- Degrees of freedom (df or k) is the only parameter
- Mean = k
- Variance = 2k
- Shape: always right-skewed, but becomes more symmetric as k increases
- When k ≥ 30 → looks quite similar to a normal distribution with mean k and variance 2k (Central Limit Theorem: Q is a sum of k independent, identically distributed terms)
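The properties above can be checked numerically. The sketch below reads the mean and variance from SciPy and measures how far the χ²(k) CDF is from the N(k, 2k) CDF. The helper name `max_cdf_gap` and the grid it uses are my own illustrative choices.

```python
import numpy as np
from scipy import stats

def max_cdf_gap(k, n_grid=2001):
    """Largest absolute gap between the chi2(k) CDF and the N(k, 2k) CDF on a grid."""
    sd = np.sqrt(2 * k)
    # Grid covers mean ± 6 sd, clipped at 0 because chi-square has no mass below 0
    x = np.linspace(max(0.0, k - 6 * sd), k + 6 * sd, n_grid)
    gap = np.abs(stats.chi2.cdf(x, df=k) - stats.norm.cdf(x, loc=k, scale=sd))
    return gap.max()

for k in [1, 5, 30, 100]:
    mean, var = stats.chi2.stats(k, moments="mv")   # theoretical mean and variance
    print(f"k={k:>3}  mean={float(mean):.0f}  var={float(var):.0f}  "
          f"max CDF gap vs normal={max_cdf_gap(k):.4f}")
```

The gap shrinks as k grows, which is the "becomes more symmetric" statement expressed as a number.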
2. Visual intuition – how the shape changes with degrees of freedom
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# A single set of axes, so all densities can be compared directly
plt.figure(figsize=(10, 6))

dfs = [1, 2, 5, 10, 20, 50]
x = np.linspace(0.01, 80, 1000)   # start just above 0: the df = 1 density is unbounded at 0

for df in dfs:
    y = stats.chi2.pdf(x, df=df)
    plt.plot(x, y, lw=2.2, label=f'df = {df}', alpha=0.9)

plt.title("Chi-Square density for different degrees of freedom", fontsize=14, pad=15)
plt.xlabel("Value (χ²)", fontsize=12)
plt.ylabel("Density", fontsize=12)
plt.xlim(0, 80)
plt.ylim(0, 0.5)   # clip the df = 1 spike so the other curves stay readable
plt.legend(title="Degrees of freedom (k)", fontsize=11, title_fontsize=12)
plt.show()
```
What you should observe:
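- df = 1 → very strong right skew; the density is unbounded as x → 0
- df = 2 → still skewed, with its maximum at 0, but flatter (this is an exponential distribution with mean 2)
- df = 5 → peak moves right (to x = 3), skew decreases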
- df = 10 → starting to look bell-like
- df = 20+ → almost symmetric, looks similar to normal
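These observations can be put into numbers. The sketch below locates the peak of each density on a grid and compares it with the closed-form mode max(k − 2, 0), and it prints the closed-form skewness √(8/k). The helper name `peak_location` and the grid resolution are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Grid starts just above 0 because the df = 1 density is unbounded at 0
x = np.linspace(0.001, 80, 80_000)

def peak_location(df):
    """Grid point where the chi2(df) density is largest."""
    return x[np.argmax(stats.chi2.pdf(x, df=df))]

for df in [1, 2, 5, 10, 20, 50]:
    formula_mode = max(df - 2, 0)     # closed-form mode of chi2(df)
    skewness = np.sqrt(8 / df)        # closed-form skewness of chi2(df)
    print(f"df={df:>2}  peak on grid={peak_location(df):6.2f}  "
          f"mode formula={formula_mode:>2}  skewness={skewness:.2f}")
```

For df = 1 and df = 2 the peak sits at the left edge of the grid, which matches the "peaks at 0" behaviour in the plot.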
3. Generating Chi-Square random numbers in NumPy / SciPy
```python
import numpy as np
from scipy import stats

# Chi-Square with 5 degrees of freedom
chi5 = stats.chi2.rvs(df=5, size=50000)

# Multiple at once: one column per df value (df broadcasts across the last axis)
chi_data = stats.chi2.rvs(df=[3, 8, 15, 30], size=(40000, 4))

print("Mean of df=5 sample:", chi5.mean().round(2))      # should be ≈ 5
print("Variance of df=5 sample:", chi5.var().round(2))   # should be ≈ 10
```
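The same samples can be drawn with NumPy alone. This is a small sketch using NumPy's `Generator.chisquare`; the seed and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # arbitrary seed, only for reproducibility

# Single df value
chi5_np = rng.chisquare(df=5, size=50_000)

# Several df values at once: df broadcasts across the last axis, one column per df
chi_multi = rng.chisquare(df=[3, 8, 15, 30], size=(40_000, 4))

print("Mean of df=5 sample:", chi5_np.mean().round(2))
print("Column means:", chi_multi.mean(axis=0).round(2))   # each should be close to its df
```

Using a `Generator` with an explicit seed makes the draws reproducible, which helps when comparing simulation results across runs.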
4. Where does Chi-Square appear in real life? (very important)
Most common situations you will actually meet
- Variance testing: sample variance of normal data → (n-1)S²/σ² ~ χ²(n-1)
- Goodness-of-fit test: comparing observed vs expected frequencies (classic χ² test)
- Independence test: contingency tables (χ² test of independence)
- Confidence interval for variance: used in quality control, process capability
- F-distribution (very important connection): F = (χ₁² / df₁) / (χ₂² / df₂) → used in ANOVA, regression
- Multiple linear regression: residual sum of squares / σ² ~ χ²(n-p)
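The independence test from this list is not worked through elsewhere in this chapter, so here is a minimal sketch. The 2×3 table of counts is invented purely for illustration. The statistic is computed by hand and then cross-checked against `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy import stats

# Hypothetical counts (made up for this example):
# rows = two groups, columns = three answer categories
observed = np.array([[30, 45, 25],
                     [20, 35, 45]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()   # counts expected under independence

chi2_stat = np.sum((observed - expected) ** 2 / expected)
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)   # (rows - 1) * (columns - 1)
p_value = stats.chi2.sf(chi2_stat, df)

# Cross-check with SciPy's built-in routine
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed, correction=False)

print("manual:", chi2_stat, p_value, df)
print("scipy: ", chi2_scipy, p_scipy, dof_scipy)
```

A small p-value would suggest that the row and column variables are not independent.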
5. Realistic examples & code you will actually write
Example 1 – Testing variance of measurements
```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Suppose true process variance σ² = 4 (sd = 2)
# We take n = 25 samples and compute the sample variance S²
n = 25
df = n - 1

# Simulate many such experiments
sample_variances_scaled = np.zeros(20000)
for i in range(20000):
    sample = np.random.normal(0, 2, n)              # sd = 2
    S2 = np.var(sample, ddof=1)                     # sample variance
    sample_variances_scaled[i] = (n - 1) * S2 / 4   # scaled → should be χ²(24)

sns.histplot(sample_variances_scaled, bins=80, stat="density", kde=True, color="teal", alpha=0.7)

# Overlay the theoretical density for comparison
x = np.linspace(0, 80, 1000)
plt.plot(x, stats.chi2.pdf(x, df=df), color="darkred", lw=2.8, label="Theoretical χ²(24)")

plt.title("Scaled sample variance ~ χ²(n-1)", fontsize=14)
plt.legend()
plt.show()
```
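The simulation shows the distribution; an actual test on one sample could look like the sketch below. It assumes normally distributed measurements and the same hypothesized variance σ² = 4. The seed, the 5% level, and the doubled-smaller-tail rule for the two-sided p-value are my own choices (that rule is one common convention, not the only one).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)             # arbitrary seed for reproducibility
sigma2_0 = 4.0                                  # variance claimed under H0
sample = rng.normal(loc=0, scale=2, size=25)    # simulated measurements, true sd = 2

n = sample.size
df = n - 1
s2 = sample.var(ddof=1)          # sample variance S²
stat = df * s2 / sigma2_0        # (n-1)S²/σ₀², which is χ²(n-1) if H0 is true

# Two-sided p-value: double the smaller tail probability
p_value = 2 * min(stats.chi2.cdf(stat, df), stats.chi2.sf(stat, df))

# 95% confidence interval for σ², obtained by inverting the same pivot
alpha = 0.05
ci_low = df * s2 / stats.chi2.ppf(1 - alpha / 2, df)
ci_high = df * s2 / stats.chi2.ppf(alpha / 2, df)

print("S² =", s2, " test statistic =", stat, " p-value =", p_value)
print("95% CI for σ²:", (ci_low, ci_high))
```

Note that the interval is not symmetric around S², because the Chi-Square distribution itself is skewed.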
Example 2 – Chi-Square goodness-of-fit (classic dice test)
```python
import numpy as np
from scipy import stats

# Simulate rolling a fair die 6000 times
observed = np.random.multinomial(6000, [1/6] * 6)   # counts for faces 1..6
expected = 6000 / 6                                  # 1000 per face if the die is fair

chi2_stat = np.sum((observed - expected)**2 / expected)
df = 6 - 1   # number of categories minus 1

print("Chi-square statistic:", chi2_stat.round(3))
print("p-value:", stats.chi2.sf(chi2_stat, df).round(5))   # upper-tail probability
```
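SciPy also provides this test as a single call. The sketch below repeats the manual computation on seeded data and compares it with `scipy.stats.chisquare`, whose expected frequencies default to equal counts per category. The seed and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)            # arbitrary seed for reproducibility
observed = rng.multinomial(6000, [1/6] * 6)    # simulated counts for a fair die

# Manual computation, as in the example above
expected = np.full(6, 6000 / 6)
manual_stat = np.sum((observed - expected) ** 2 / expected)
manual_p = stats.chi2.sf(manual_stat, df=5)

# Built-in version: equal expected frequencies and df = categories - 1 by default
scipy_stat, scipy_p = stats.chisquare(observed)

print("manual:", manual_stat, manual_p)
print("scipy: ", scipy_stat, scipy_p)
```

For a die suspected of a specific bias, unequal expected counts can be passed through the `f_exp` argument.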
6. Summary – Chi-Square Distribution Quick Reference
| Property | Value / Formula |
|---|---|
| Shape | Right-skewed (skew decreases as df increases) |
| Defined by | degrees of freedom k (or df) |
| Mean | k |
| Variance | 2k |
| Standard deviation | √(2k) |
| Support | x ≥ 0 |
| PDF | x^(k/2 − 1) · e^(−x/2) / (2^(k/2) · Γ(k/2)) for x > 0 (involves the gamma function) |
| Most common use cases | variance testing, goodness-of-fit, independence tests, F-distribution, regression diagnostics |
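The density looks intimidating on paper but is short in code. Here is a sketch that evaluates it directly with `math.gamma` and compares the result with `scipy.stats.chi2.pdf`. The function name `chi2_pdf` and the test points are illustrative choices.

```python
import math
from scipy import stats

def chi2_pdf(x, k):
    """Chi-square density for x > 0: x^(k/2 - 1) * exp(-x/2) / (2^(k/2) * Gamma(k/2))."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

# Compare the hand-written formula with SciPy at a few (k, x) pairs
for k, x in [(1, 0.5), (5, 3.0), (10, 12.0)]:
    print(f"k={k:>2}, x={x:>4}: formula={chi2_pdf(x, k):.6f}  scipy={stats.chi2.pdf(x, df=k):.6f}")
```

In practice the SciPy function is the safer choice, since the direct formula can overflow for large k.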
Final teacher messages
- Whenever you see “sum of squares of normal variables” or “scaled variance” → think Chi-Square.
- Chi-Square is the building block for other important distributions and procedures:
  - F-distribution
  - Chi-Square goodness-of-fit / independence tests
  - Confidence intervals for variance
- As df increases → Chi-Square becomes more symmetric → normal approximation becomes good (mean=k, variance=2k)
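The F-distribution connection can be checked by simulation. This sketch divides two independent scaled Chi-Square samples and compares the ratio with `scipy.stats.f`. The degrees of freedom (4 and 12), the seed, and the number of draws are arbitrary illustration values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)   # arbitrary seed for reproducibility
df1, df2 = 4, 12                       # illustrative degrees of freedom
n_draws = 100_000

num = rng.chisquare(df1, size=n_draws) / df1   # χ₁² / df₁
den = rng.chisquare(df2, size=n_draws) / df2   # χ₂² / df₂, drawn independently
f_ratio = num / den                            # should follow F(df1, df2)

# Kolmogorov-Smirnov distance between the simulated ratio and the F(df1, df2) CDF
ks = stats.kstest(f_ratio, stats.f(dfn=df1, dfd=df2).cdf)

print("simulated mean:", f_ratio.mean(), " theory:", df2 / (df2 - 2))
print("KS statistic:", ks.statistic)
```

A KS statistic near zero indicates that the ratio of scaled Chi-Square variables behaves like an F-distributed variable.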
