Chapter 15: Zipf Distribution
1. What is the Zipf distribution really?
The Zipf distribution is a discrete power-law distribution that describes phenomena where:
- A small number of items are extremely frequent / popular / large
- The vast majority of items are very rare / small / low-frequency
It is the discrete version of the Pareto distribution — but instead of continuous values, we deal with ranks or frequencies.
The famous Zipf’s law (in plain English):
The frequency of the k-th most frequent item is roughly proportional to 1/k^s (where s is usually close to 1)
This creates the classic long-tail pattern:
- Rank 1 item is enormously popular
- Rank 2 is about half as frequent (when s ≈ 1)
- Rank 10 is about 1/10th as frequent
- Rank 100 is about 1/100th as frequent
- … and it keeps going for a very long time
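The rank ratios listed above follow directly from the 1/k^s rule. A minimal sketch, assuming s = 1 and using a hypothetical helper name (`relative_frequency`) that is not part of any library:

```python
# Relative frequency of the rank-k item compared with the rank-1 item
# under Zipf's law: frequency(k) / frequency(1) = 1 / k**s
s = 1.0  # illustrative exponent (assumption: the "s ≈ 1" case from the text)


def relative_frequency(rank, s):
    """Frequency at `rank`, expressed as a fraction of the rank-1 frequency."""
    return 1.0 / rank ** s


for rank in (1, 2, 10, 100):
    print(f"rank {rank:>3}: {relative_frequency(rank, s):.4f} x the rank-1 frequency")
```

Raising `s` above 1 makes the drop-off steeper; lowering it flattens the curve.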
2. Classic real-world examples (you will see these everywhere)
| Phenomenon | Typical s (exponent) | What follows Zipf’s law |
|---|---|---|
| Word frequencies in natural language | 0.9 – 1.2 | “the” is #1, “of” #2, very long tail of rare words |
| City population sizes | ~1.0 | Few megacities, many small towns |
| Web page views / website traffic | 1.0 – 1.5 | Few extremely popular pages |
| YouTube video views | 1.2 – 1.8 | Few viral videos, millions with almost no views |
| Twitter / X followers | 1.5 – 2.5 | Few accounts with millions, most with very few |
| Book sales / music sales | 1.0 – 2.0 | Few bestsellers, long tail of niche titles |
| Company sizes / revenues | 1.0 – 1.5 | Few giant corporations |
| Number of links pointing to websites | ~1.0 | Few extremely linked sites |
3. Mathematical definition (two common forms)
Form 1 – Zipf’s law (approximation used in practice)
P(rank = k) ∝ 1 / k^s for k = 1, 2, 3, …
s is called the Zipf exponent or scaling parameter
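When the number of items is finite (say N distinct words), the proportionality in Form 1 can be turned into actual probabilities by dividing by the sum of the weights. The sketch below assumes a finite N and uses a hypothetical helper name, `zipf_pmf_finite`; it is one straightforward way to normalize, not the only one.

```python
import numpy as np


def zipf_pmf_finite(n_items, s):
    """Probabilities for ranks 1..n_items with P(k) proportional to 1 / k**s."""
    ranks = np.arange(1, n_items + 1, dtype=float)
    weights = ranks ** (-s)          # unnormalized Zipf weights
    return weights / weights.sum()   # divide by the sum so they add up to 1


pmf = zipf_pmf_finite(n_items=1000, s=1.0)
print("P(rank 1..5):", pmf[:5])
print("Sum of probabilities:", pmf.sum())
print("P(rank 1) / P(rank 2):", pmf[0] / pmf[1])
```

Because N is finite, this works for any s ≥ 0, including s ≤ 1 where the infinite sum would diverge.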
Form 2 – Zeta distribution (exact probability distribution)
The zeta distribution is the proper normalized version:
P(X = k) = 1 / (k^s × ζ(s)) for k = 1, 2, 3, …
where ζ(s) is the Riemann zeta function (normalization constant)
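In SciPy, the zeta distribution is implemented as scipy.stats.zipf (scipy.special.zeta is the zeta function itself, not a distribution). NumPy's random generators offer a zipf sampler with the same parameterization. Both require the exponent to be strictly greater than 1. Use them when you want exact probabilities.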
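To connect Form 2 to code: SciPy exposes the zeta distribution as `scipy.stats.zipf`, and the normalization constant ζ(s) is available as `scipy.special.zeta(s, 1)`. The following sketch evaluates the formula by hand and compares it with SciPy's PMF; the exponent 1.7 is just an illustrative choice.

```python
import numpy as np
from scipy import special, stats

a = 1.7                       # illustrative exponent (must be > 1)
k = np.arange(1, 11)          # ranks 1..10

# Form 2 written out directly: P(X = k) = 1 / (k**a * zeta(a))
pmf_manual = k.astype(float) ** (-a) / special.zeta(a, 1)

# The same probabilities from SciPy's built-in distribution
pmf_scipy = stats.zipf.pmf(k, a)

print("Manual:", np.round(pmf_manual, 4))
print("SciPy: ", np.round(pmf_scipy, 4))
print("Match: ", np.allclose(pmf_manual, pmf_scipy))
```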
4. Generating Zipf / zeta random numbers
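```python
import numpy as np
from scipy import stats

# SciPy's zipf distribution is the zeta distribution (exact Zipf law on 1, 2, 3, …)
# alpha = s (exponent); must be > 1 for the probabilities to sum to 1
# (the mean is finite only for alpha > 2)
alpha = 1.7
zipf_data = stats.zipf.rvs(a=alpha, size=100000)

print("First 20 values (ranks / frequencies):", zipf_data[:20])
print("Most common value:", stats.mode(zipf_data)[0])
# With alpha <= 2 the theoretical mean is infinite, so this number varies a lot between runs
print("Average value:", zipf_data.mean().round(2))
```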
Important note: The zeta distribution generates rank values (1, 2, 3, …) with probability decreasing as 1/k^α.
If you want frequencies (how many times each rank appears), you need to count them.
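Counting can be done with `np.unique`. This is a minimal sketch, assuming NumPy's `Generator.zipf` sampler and a fixed seed for reproducibility; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed (assumption, for repeatability)
samples = rng.zipf(a=1.7, size=100_000)    # each sample is a rank: 1, 2, 3, …

# Distinct ranks that occurred, and how many times each one occurred
values, counts = np.unique(samples, return_counts=True)

# Show the five most frequent ranks with their empirical frequencies
order = np.argsort(counts)[::-1]
for idx in order[:5]:
    print(f"rank {values[idx]}: drawn {counts[idx]} times "
          f"({counts[idx] / samples.size:.3f} of all samples)")
```

`np.unique` is used instead of `np.bincount` because a heavy-tailed sample can contain very large integers, and `bincount` would allocate an array as long as the largest value.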
5. Visualizing Zipf / zeta distribution
```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import special, stats

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5.5))

alphas = [1.3, 1.7, 2.5, 3.5]

# Left panel: sampled histogram with log-spaced bins on a log x-axis
for a in alphas:
    data = stats.zipf.rvs(a=a, size=50000)
    sns.histplot(data, bins=np.logspace(0, 5, 60), stat="probability",
                 label=f"α = {a}", alpha=0.7, ax=ax1)

ax1.set_title("Histogram – log-spaced bins, log x-axis", fontsize=13)
ax1.set_xlabel("Value (rank / frequency)", fontsize=11)
ax1.set_ylabel("Probability", fontsize=11)
ax1.set_xscale('log')
ax1.set_xlim(1, 10000)
ax1.legend(title="Shape parameter α")

# Right panel: log-log survival plot – the signature view
k = np.unique(np.logspace(0, 5, 1000).astype(int))  # integer values 1 … 100000
for a in alphas:
    # Exact survival function P(X > k) = ζ(a, k+1) / ζ(a),
    # which decays like k^-(a-1) for large k (the PMF decays like k^-a)
    y = special.zeta(a, k + 1.0) / special.zeta(a, 1)
    ax2.loglog(k, y, lw=2.4, label=f"α = {a}")

ax2.set_title("Log-log survival plot – straight line = power law", fontsize=13)
ax2.set_xlabel("Value k (log)", fontsize=11)
ax2.set_ylabel("P(X > k) (log)", fontsize=11)
ax2.legend(title="Shape parameter α")
ax2.grid(True, which="both", ls="--", alpha=0.4)

plt.tight_layout()
plt.show()
```
Key observations:
- On a linear scale → almost all the probability mass piles up at the smallest values, so the tail is invisible
- On log-log scale → a power law becomes a straight line (the PMF has slope about −α, the survival function about −(α − 1))
- Smaller α → much heavier tail (more extreme values)
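The straight-line claim can be checked numerically. The sketch below computes the exact survival function through the Hurwitz zeta function (`scipy.special.zeta(a, q)`) and fits a line to it on log-log axes. The helper name `zeta_survival` and the fitting range 10² to 10⁵ are illustrative assumptions.

```python
import numpy as np
from scipy import special


def zeta_survival(k, a):
    """Exact P(X > k) for the zeta distribution: ζ(a, k+1) / ζ(a)."""
    return special.zeta(a, k + 1.0) / special.zeta(a, 1.0)


# Integer values spread evenly on a log scale between 100 and 100000
k = np.round(np.logspace(2, 5, 50))

slopes = {}
for a in (1.3, 1.7, 2.5, 3.5):
    # Least-squares line through (log k, log P(X > k)); the first coefficient is the slope
    slope, _ = np.polyfit(np.log(k), np.log(zeta_survival(k, a)), 1)
    slopes[a] = slope
    print(f"α = {a}: fitted slope {slope:.3f}, expected about {-(a - 1):.3f}")
```

The fitted slopes sit close to −(α − 1), which is why a smaller α gives a flatter line and a heavier tail.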
6. Realistic code patterns you will actually write
Pattern 1 – Simulate word frequencies in a large text corpus
```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Typical Zipf exponent for English text ≈ 1.0–1.2
alpha = 1.15

# Simulate a corpus of 1,000,000 word tokens; each draw is the rank (identity) of a word
tokens = stats.zipf.rvs(a=alpha, size=1_000_000)

# Count how often each distinct word was drawn: these counts are the word frequencies
_, word_freq = np.unique(tokens, return_counts=True)

# Sort descending (most frequent first)
word_freq_sorted = np.sort(word_freq)[::-1]

# Plot rank vs frequency (classic Zipf plot)
plt.loglog(range(1, len(word_freq_sorted)+1), word_freq_sorted, '.', ms=3, alpha=0.7)
plt.title("Zipf plot – word frequency vs rank (log-log)", fontsize=14)
plt.xlabel("Rank (log)", fontsize=12)
plt.ylabel("Frequency (log)", fontsize=12)
plt.grid(True, which="both", ls="--", alpha=0.4)
plt.show()
```
Pattern 2 – Check how much the top-k items dominate
```python
# Uses word_freq and word_freq_sorted from Pattern 1
# (float sums avoid integer overflow when a few values are huge)
total = word_freq.sum(dtype=float)

top_1_percent = word_freq_sorted[:int(0.01 * len(word_freq_sorted))]
top_5_percent = word_freq_sorted[:int(0.05 * len(word_freq_sorted))]

print(f"Top 1% of words account for: {top_1_percent.sum(dtype=float) / total * 100:.1f}% of total frequency")
print(f"Top 5% of words account for: {top_5_percent.sum(dtype=float) / total * 100:.1f}% of total frequency")
```
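For comparison with the simulated numbers, the expected share of the top items can also be computed directly from the 1/k^s weights, without sampling. This sketch assumes a finite vocabulary of N items that follows Zipf's law exactly; `top_share` is a hypothetical helper name.

```python
import numpy as np


def top_share(n_items, s, fraction):
    """Share of total frequency held by the top `fraction` of ranks under exact Zipf weights."""
    ranks = np.arange(1, n_items + 1, dtype=float)
    weights = ranks ** (-s)               # frequency of rank k is proportional to 1 / k**s
    top_k = int(fraction * n_items)       # number of items in the top group
    return weights[:top_k].sum() / weights.sum()


share_1 = top_share(n_items=50_000, s=1.0, fraction=0.01)
share_5 = top_share(n_items=50_000, s=1.0, fraction=0.05)
print(f"Top 1% of items: {share_1 * 100:.1f}% of total frequency")
print(f"Top 5% of items: {share_5 * 100:.1f}% of total frequency")
```

Simulated shares will differ from these values because of sampling noise and because the simulation uses an unbounded set of ranks.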
Pattern 3 – Simulate YouTube video views (classic Zipf-like behavior)
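```python
import numpy as np
from scipy import stats

# α ≈ 1.5–1.8 for video views
alpha = 1.6
n_videos = 100000

views = stats.zipf.rvs(a=alpha, size=n_videos)

# Sort descending so the first entries really are the most-viewed videos
views_sorted = np.sort(views)[::-1]
total_views = views_sorted.sum(dtype=float)

print(f"Top 100 videos have {views_sorted[:100].sum(dtype=float) / total_views * 100:.1f}% of total views")
print(f"Top 1000 videos have {views_sorted[:1000].sum(dtype=float) / total_views * 100:.1f}% of total views")
```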
Summary – Zipf / Zeta Distribution Quick Reference
| Property | Value / Formula |
|---|---|
| Shape | Extremely heavy right tail (power-law) |
| Defined by | shape α (exponent), usually 1 < α < 3 |
| Support | k = 1, 2, 3, … (positive integers) |
| Mean (α > 2) | ζ(α−1) / ζ(α) |
| Variance (α > 3) | ζ(α−2) / ζ(α) − (ζ(α−1) / ζ(α))² |
| NumPy / SciPy | scipy.stats.zipf.rvs(a=α, size=…) or numpy.random.default_rng().zipf(a=α, size=…) |
| Most common use cases | word frequencies, city sizes, website traffic, video views, sales, citations, followers |
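The moment formulas can be sanity-checked numerically. The sketch below evaluates the mean and variance with `scipy.special.zeta` and compares the mean against a brute-force sum over the first million ranks. The exponent α = 3.5 is an illustrative choice, large enough for both moments to be finite.

```python
import numpy as np
from scipy import special, stats

a = 3.5  # illustrative exponent; > 3 so that both mean and variance are finite

# Closed-form moments of the zeta distribution
mean_formula = special.zeta(a - 1, 1) / special.zeta(a, 1)
var_formula = special.zeta(a - 2, 1) / special.zeta(a, 1) - mean_formula ** 2

# Brute-force mean: sum of k * P(X = k) over the first one million ranks
k = np.arange(1, 1_000_001, dtype=float)
pmf = k ** (-a) / special.zeta(a, 1)
mean_bruteforce = (k * pmf).sum()

print("Mean (formula):    ", mean_formula)
print("Mean (brute force):", mean_bruteforce)
print("Mean (SciPy):      ", stats.zipf.mean(a))
print("Variance (formula):", var_formula)
print("Variance (SciPy):  ", stats.zipf.var(a))
```

For α ≤ 2 the brute-force sum for the mean never settles as more terms are added, which is the practical meaning of "the mean does not exist".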
Final teacher messages
- Whenever you see “a few items dominate everything, and it keeps going for a very long tail” → think Zipf / power-law.
- Log-log plot showing a straight line is the strongest visual signature of Zipf / power-law behavior.
- α close to 1 → extremely unequal distributions (a tiny fraction owns almost everything)
- α > 2 → tails are still heavy, but the mean exists; the variance exists only for α > 3
Would you like to continue with any of these next?
- How to estimate α from real data (Hill estimator, log-log regression)
- Zipf vs Pareto — differences and when to use which
- Realistic mini-project: simulate word frequencies or YouTube views + analyze dominance
- Zipf’s law in natural language processing (vocabulary size, Heaps’ law connection)
- Comparing Zipf with log-normal (two main explanations for heavy tails)
Just tell me what you want to explore next! 😊
