Chapter 15: Zipf Distribution

1. What is the Zipf distribution really?

The Zipf distribution is a discrete power-law distribution that describes phenomena where:

  • A small number of items are extremely frequent / popular / large
  • The vast majority of items are very rare / small / low-frequency

It is the discrete counterpart of the Pareto distribution: instead of continuous values, it describes integer ranks or counts.

The famous Zipf’s law (in plain English):

The frequency of the k-th most frequent item is roughly proportional to 1/k^s (where s is usually close to 1)

This creates the classic long-tail pattern:

  • Rank 1 item is enormously popular
  • Rank 2 is about half as frequent as rank 1 (when s ≈ 1)
  • Rank 10 is about 1/10th as frequent as rank 1
  • Rank 100 is about 1/100th as frequent as rank 1
  • … and it keeps going for a very long time
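
The rank pattern above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name relative_frequency is made up for illustration and is not part of any library.

```python
def relative_frequency(rank: int, s: float = 1.0) -> float:
    """Frequency of the item at `rank`, relative to the rank-1 item (1 / rank**s)."""
    return 1.0 / rank ** s


# Compare a few ranks against rank 1, using the classic exponent s = 1
for rank in (1, 2, 10, 100):
    print(f"rank {rank:>3}: {relative_frequency(rank):.3f} x the rank-1 frequency")
```

With s = 1 the printed ratios are 1.000, 0.500, 0.100 and 0.010. Passing a larger s (for example relative_frequency(10, s=2.0)) makes the drop-off steeper.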

2. Classic real-world examples (you will see these everywhere)

Phenomenon                            | Typical s (exponent) | What follows Zipf’s law
Word frequencies in natural language  | 0.9 – 1.2            | “the” is #1, “of” #2, very long tail of rare words
City population sizes                 | ~1.0                 | Few megacities, many small towns
Web page views / website traffic      | 1.0 – 1.5            | Few extremely popular pages
YouTube video views                   | 1.2 – 1.8            | Few viral videos, millions with almost no views
Twitter / X followers                 | 1.5 – 2.5            | Few accounts with millions, most with very few
Book sales / music sales              | 1.0 – 2.0            | Few bestsellers, long tail of niche titles
Company sizes / revenues              | 1.0 – 1.5            | Few giant corporations
Number of links pointing to websites  | ~1.0                 | Few extremely linked sites

3. Mathematical definition (two common forms)

Form 1 – Zipf’s law (approximation used in practice)

P(rank = k) ∝ 1 / k^s for k = 1, 2, 3, …

s is called the Zipf exponent or scaling parameter

Form 2 – Zeta distribution (exact probability distribution)

The zeta distribution is the proper normalized version:

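P(X = k) = 1 / (k^s × ζ(s)) for k = 1, 2, 3, … and s > 1

where ζ(s) = 1/1^s + 1/2^s + 1/3^s + … is the Riemann zeta function, which acts as the normalization constant. The sum converges only for s > 1, so the zeta distribution is defined only for exponents above 1.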

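In NumPy and SciPy, both numpy.random.zipf and scipy.stats.zeta implement the zeta distribution. Use scipy.stats.zeta when you want exact probabilities (pmf, cdf) and not only random samples.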

4. Generating Zipf / zeta random numbers

Important note: The zeta distribution generates rank values (1, 2, 3, …) with probability proportional to 1/k^α, where α is the same exponent that was called s above.

If you want frequencies (how many times each rank appears), you need to count them.
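
The two points above can be sketched as follows: draw ranks with NumPy's sampler, then count how often each rank appears. The seed, the exponent α = 2.0 and the sample size are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # seeded so the run is repeatable
alpha = 2.0                            # exponent; NumPy's sampler requires alpha > 1

# Each draw is a rank: 1, 2, 3, ...
ranks = rng.zipf(alpha, size=100_000)

# Count how many times each rank was drawn
values, counts = np.unique(ranks, return_counts=True)

print("first draws:", ranks[:10])
for value, count in zip(values[:5], counts[:5]):
    print(f"rank {value}: drawn {count} times ({count / ranks.size:.1%})")
```

scipy.stats.zeta.rvs(alpha, size=100_000, random_state=42) is an alternative way to draw from the same distribution if you are already working with SciPy.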

5. Visualizing Zipf / zeta distribution
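
A minimal plotting sketch, assuming matplotlib is installed. The function name plot_zipf_samples, the three exponents and the sample size are illustrative choices. It draws the empirical probabilities twice: once on linear axes and once on log-log axes.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_zipf_samples(alphas=(1.5, 2.0, 3.0), size=50_000, seed=0):
    """Plot empirical P(X = k) on linear and log-log axes for several exponents."""
    rng = np.random.default_rng(seed)
    fig, (ax_linear, ax_loglog) = plt.subplots(1, 2, figsize=(11, 4))

    for alpha in alphas:
        samples = rng.zipf(alpha, size=size)
        values, counts = np.unique(samples, return_counts=True)
        probabilities = counts / size
        for ax in (ax_linear, ax_loglog):
            ax.plot(values, probabilities, marker=".", linestyle="none",
                    label=f"alpha = {alpha}")

    ax_linear.set_title("Linear scale")
    ax_loglog.set_title("Log-log scale")
    ax_loglog.set_xscale("log")
    ax_loglog.set_yscale("log")
    for ax in (ax_linear, ax_loglog):
        ax.set_xlabel("value k")
        ax.set_ylabel("empirical P(X = k)")
        ax.legend()
    fig.tight_layout()
    return fig


fig = plot_zipf_samples()
```

Call plt.show() after this to display the figure, or fig.savefig("zipf.png") to write it to a file.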

Key observations:

  • On a linear scale → almost everything looks like it’s near zero (the tail is invisible)
  • On a log-log scale → a power law becomes a straight line
  • Smaller α → much heavier tail (more extreme values)

6. Realistic code patterns you will actually write

Pattern 1 – Simulate word frequencies in a large text corpus
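
One possible way to write this pattern, as a sketch under stated assumptions. It uses a finite vocabulary with probabilities proportional to 1/rank^s (normalized by hand) and not the unbounded zeta distribution, because word exponents near 1 are not allowed by the zeta sampler. The vocabulary size, corpus length and exponent are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

vocab_size = 50_000      # hypothetical vocabulary size
n_tokens = 1_000_000     # hypothetical corpus length in tokens
s = 1.0                  # Zipf exponent; s <= 1 is fine for a finite vocabulary

# Probability of the word at each rank: proportional to 1 / rank^s
word_ranks = np.arange(1, vocab_size + 1)
weights = 1.0 / word_ranks ** s
probabilities = weights / weights.sum()

# Draw the corpus: each token is the rank of the word that was "written"
tokens = rng.choice(word_ranks, size=n_tokens, p=probabilities)

# Frequency table indexed by rank (counts[0] belongs to the rank-1 word)
counts = np.bincount(tokens, minlength=vocab_size + 1)[1:]
for rank in (1, 2, 10, 100, 1000):
    print(f"rank {rank:>4}: {counts[rank - 1]:>6} occurrences")

seen_once = np.count_nonzero(counts == 1)
never_seen = np.count_nonzero(counts == 0)
print(f"words seen exactly once: {seen_once}, never seen: {never_seen}")
```

The last line shows the long tail directly: even in a million tokens, some rare words in the vocabulary appear only once or never.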

Pattern 2 – Check how much the top-k items dominate
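
A short sketch of a dominance check. The helper top_k_share is a made-up name, and the data are hypothetical: 10,000 items whose sizes follow 1/rank^s exactly, with s = 1.

```python
import numpy as np


def top_k_share(values, k):
    """Fraction of the total that belongs to the k largest values."""
    ordered = np.sort(np.asarray(values, dtype=float))[::-1]   # largest first
    return ordered[:k].sum() / ordered.sum()


# Hypothetical data: item sizes proportional to 1 / rank^s
n_items = 10_000
s = 1.0
sizes = 1.0 / np.arange(1, n_items + 1) ** s

for k in (1, 10, 100, 1000):
    print(f"top {k:>4} of {n_items} items hold {top_k_share(sizes, k):.1%} of the total")
```

With these assumed sizes, the top 1% of items (100 of 10,000) hold a little over half of the total. The same function works on real counts, such as page views per URL or sales per title.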

Pattern 3 – Simulate YouTube video views (classic Zipf-like behavior)
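
A possible simulation, assuming each video's view count is an independent draw from the zeta distribution. The catalogue size and the exponent α = 1.8 are illustrative assumptions, not measured values.

```python
import numpy as np

rng = np.random.default_rng(seed=2024)

n_videos = 100_000   # hypothetical catalogue size
alpha = 1.8          # assumed exponent; must be > 1 for rng.zipf

# One draw per video: its view count (1, 2, 3, ...)
views = rng.zipf(alpha, size=n_videos)

ordered = np.sort(views)[::-1]                 # most-viewed first
top_1_percent = ordered[: n_videos // 100]
top_share = top_1_percent.sum() / views.sum()

print(f"median views : {np.median(views):,.0f}")
print(f"mean views   : {views.mean():,.1f}")
print(f"max views    : {int(ordered[0]):,}")
print(f"videos with fewer than 10 views: {np.mean(views < 10):.1%}")
print(f"top 1% of videos hold {top_share:.1%} of all views")
```

The gap between the median and the mean is the thing to look at: a handful of very large draws pull the mean far above what a typical video receives.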

Summary – Zipf / Zeta Distribution Quick Reference

Property               | Value / Formula
Shape                  | Extremely heavy right tail (power-law)
Defined by             | shape α (exponent), must be > 1, usually 1 < α < 3
Support                | k = 1, 2, 3, … (positive integers)
Mean (α > 2)           | ζ(α−1) / ζ(α); infinite for α ≤ 2
Variance (α > 3)       | [ζ(α)ζ(α−2) − ζ(α−1)²] / ζ(α)²; infinite for α ≤ 3
NumPy / SciPy          | numpy.random.Generator.zipf(a=α, size=…) / scipy.stats.zeta.rvs(a=α, size=…)
Most common use cases  | word frequencies, city sizes, website traffic, video views, sales, citations, followers
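
The moment formulas can be checked numerically. For the zeta distribution the mean is finite only when α > 2 and the variance only when α > 3, so this sketch uses α = 4.5 (an arbitrary choice) and compares the closed forms with sums truncated at one million terms.

```python
import numpy as np
from scipy.special import zeta as riemann_zeta

alpha = 4.5   # > 3, so both the mean and the variance are finite

# Closed forms built from the Riemann zeta function
mean_closed = riemann_zeta(alpha - 1) / riemann_zeta(alpha)
second_moment = riemann_zeta(alpha - 2) / riemann_zeta(alpha)
variance_closed = second_moment - mean_closed ** 2

# Numerical check: sum k * P(X = k) and k^2 * P(X = k) over the first million k
k = np.arange(1, 1_000_001, dtype=float)
pmf = k ** -alpha / riemann_zeta(alpha)
mean_numeric = np.sum(k * pmf)
variance_numeric = np.sum(k ** 2 * pmf) - mean_numeric ** 2

print(f"mean:     closed form {mean_closed:.6f}, truncated sum {mean_numeric:.6f}")
print(f"variance: closed form {variance_closed:.6f}, truncated sum {variance_numeric:.6f}")
```

Lowering alpha toward 3 makes the truncated variance converge more and more slowly, which is a practical sign that the variance is about to become infinite.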

Final teacher messages

  1. Whenever you see “a few items dominate everything, and it keeps going for a very long tail” → think Zipf / power-law.
  2. Log-log plot showing a straight line is the strongest visual signature of Zipf / power-law behavior.
  3. α close to 1 → extremely unequal distributions (a tiny fraction owns almost everything)
  4. α > 2 → tails are still heavy, but the mean is finite; the variance is finite only when α > 3

Would you like to continue with any of these next?

  • How to estimate α from real data (Hill estimator, log-log regression)
  • Zipf vs Pareto — differences and when to use which
  • Realistic mini-project: simulate word frequencies or YouTube views + analyze dominance
  • Zipf’s law in natural language processing (vocabulary size, Heaps’ law connection)
  • Comparing Zipf with log-normal (two main explanations for heavy tails)

Just tell me what you want to explore next! 😊
