{"id":2527,"date":"2026-02-02T09:57:35","date_gmt":"2026-02-02T09:57:35","guid":{"rendered":"https:\/\/demo.materiamedica.net\/demo6\/?p=2527"},"modified":"2026-02-02T09:57:35","modified_gmt":"2026-02-02T09:57:35","slug":"chapter-15-zipf-distribution","status":"publish","type":"post","link":"https:\/\/demo.materiamedica.net\/demo6\/chapter-15-zipf-distribution\/","title":{"rendered":"Chapter 15: Zipf Distribution"},"content":{"rendered":"<h3 dir=\"auto\">1. What is the Zipf distribution really?<\/h3>\n<p dir=\"auto\">The <strong>Zipf distribution<\/strong> is a <strong>discrete power-law distribution<\/strong> that describes phenomena where:<\/p>\n<ul dir=\"auto\">\n<li>A small number of items are <strong>extremely frequent \/ popular \/ large<\/strong><\/li>\n<li>The vast majority of items are <strong>very rare \/ small \/ low-frequency<\/strong><\/li>\n<\/ul>\n<p dir=\"auto\">It is the discrete version of the <strong>Pareto distribution<\/strong> \u2014 but instead of continuous values, we deal with <strong>ranks<\/strong> or <strong>frequencies<\/strong>.<\/p>\n<p dir=\"auto\"><strong>The famous Zipf&#8217;s law<\/strong> (in plain English):<\/p>\n<blockquote dir=\"auto\">\n<p dir=\"auto\">The frequency of the k-th most frequent item is roughly <strong>proportional to 1\/k^s<\/strong> (where s is usually close to 1)<\/p>\n<\/blockquote>\n<p dir=\"auto\">This creates the classic <strong>long-tail<\/strong> pattern:<\/p>\n<ul dir=\"auto\">\n<li>Rank 1 item is enormously popular<\/li>\n<li>Rank 2 is about half as frequent (when s \u2248 1)<\/li>\n<li>Rank 10 is about 1\/10th as frequent<\/li>\n<li>Rank 100 is about 1\/100th as frequent<\/li>\n<li>\u2026 and it keeps going for a very long time<\/li>\n<\/ul>\n<h3 dir=\"auto\">2. 
Classic real-world examples (you will see these everywhere)<\/h3>\n<div>\n<div dir=\"auto\">\n<table dir=\"auto\">\n<thead>\n<tr>\n<th data-col-size=\"lg\">Phenomenon<\/th>\n<th data-col-size=\"xs\">Typical s (exponent)<\/th>\n<th data-col-size=\"lg\">What follows Zipf&#8217;s law<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td data-col-size=\"lg\">Word frequencies in natural language<\/td>\n<td data-col-size=\"xs\">0.9 \u2013 1.2<\/td>\n<td data-col-size=\"lg\">&#8220;the&#8221; is #1, &#8220;of&#8221; #2, very long tail of rare words<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"lg\">City population sizes<\/td>\n<td data-col-size=\"xs\">~1.0<\/td>\n<td data-col-size=\"lg\">Few megacities, many small towns<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"lg\">Web page views \/ website traffic<\/td>\n<td data-col-size=\"xs\">1.0 \u2013 1.5<\/td>\n<td data-col-size=\"lg\">Few extremely popular pages<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"lg\">YouTube video views<\/td>\n<td data-col-size=\"xs\">1.2 \u2013 1.8<\/td>\n<td data-col-size=\"lg\">Few viral videos, millions with almost no views<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"lg\">Twitter \/ X followers<\/td>\n<td data-col-size=\"xs\">1.5 \u2013 2.5<\/td>\n<td data-col-size=\"lg\">Few accounts with millions, most with very few<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"lg\">Book sales \/ music sales<\/td>\n<td data-col-size=\"xs\">1.0 \u2013 2.0<\/td>\n<td data-col-size=\"lg\">Few bestsellers, long tail of niche titles<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"lg\">Company sizes \/ revenues<\/td>\n<td data-col-size=\"xs\">1.0 \u2013 1.5<\/td>\n<td data-col-size=\"lg\">Few giant corporations<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"lg\">Number of links pointing to websites<\/td>\n<td data-col-size=\"xs\">~1.0<\/td>\n<td data-col-size=\"lg\">Few extremely linked sites<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div><\/div>\n<\/div>\n<\/div>\n<h3 dir=\"auto\">3. 
Mathematical definition (two common forms)<\/h3>\n<h4 dir=\"auto\">Form 1 \u2013 Zipf&#8217;s law (approximation used in practice)<\/h4>\n<p dir=\"auto\">P(rank = k) \u221d 1 \/ k^s for k = 1, 2, 3, &#8230;<\/p>\n<p dir=\"auto\">s is called the <strong>Zipf exponent<\/strong> or <strong>scaling parameter<\/strong>.<\/p>\n<h4 dir=\"auto\">Form 2 \u2013 Zeta distribution (exact probability distribution)<\/h4>\n<p dir=\"auto\">The <strong>zeta distribution<\/strong> is the proper normalized version:<\/p>\n<p dir=\"auto\">P(X = k) = 1 \/ (k^s \u00d7 \u03b6(s)) for k = 1, 2, 3, &#8230;<\/p>\n<p dir=\"auto\">where <strong>\u03b6(s)<\/strong> is the Riemann zeta function, which acts as the normalization constant and is finite only for s &gt; 1.<\/p>\n<p dir=\"auto\">SciPy exposes the <strong>zeta distribution<\/strong> as scipy.stats.zipf (there is no scipy.stats.zeta), and NumPy samples from it with Generator.zipf. Use it when you want exact probabilities.<\/p>\n<h3 dir=\"auto\">4. Generating Zipf \/ zeta random numbers<\/h3>\n<div dir=\"auto\">\n<div data-testid=\"code-block\">\n<div>\n<div>Python<\/div>\n<div>\n<pre tabindex=\"0\"><code>from scipy import stats\r\n\r\n# SciPy's zeta distribution (exact Zipf \/ zeta law) is named stats.zipf\r\n# alpha = s (exponent): must be &gt; 1 to normalize, &gt; 2 for a finite mean\r\nalpha = 1.7\r\n\r\nzipf_data = stats.zipf.rvs(a=alpha, size=100000)\r\n\r\nprint(\"First 20 values (ranks):\", zipf_data[:20])\r\nprint(\"Most common value:\", stats.mode(zipf_data)[0])\r\n# With alpha &lt;= 2 the theoretical mean is infinite, so this average varies wildly between runs\r\nprint(\"Sample average:\", zipf_data.mean().round(2))<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p dir=\"auto\"><strong>Important note<\/strong>: The zeta distribution generates <strong>rank values<\/strong> (1, 2, 3, \u2026) with probability decreasing as 1\/k^\u03b1.<\/p>\n<p dir=\"auto\">If you want <strong>frequencies<\/strong> (how many times each rank appears), you need to count them.<\/p>\n<h3 dir=\"auto\">5. 
Visualizing Zipf \/ zeta distribution<\/h3>\n<div dir=\"auto\">\n<div data-testid=\"code-block\">\n<div>\n<div>Python<\/div>\n<div>\n<pre tabindex=\"0\"><code>import numpy as np\r\nimport matplotlib.pyplot as plt\r\nimport seaborn as sns\r\nfrom scipy import special, stats\r\n\r\nfig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5.5))\r\n\r\nalphas = [1.3, 1.7, 2.5, 3.5]\r\n\r\n# Histogram of samples: log-spaced bins, linear probability axis\r\nfor a in alphas:\r\n    data = stats.zipf.rvs(a=a, size=50000)\r\n    sns.histplot(data, bins=np.logspace(0, 5, 60), stat=\"probability\",\r\n                 label=f\"\u03b1 = {a}\", alpha=0.7, ax=ax1)\r\n\r\nax1.set_title(\"Linear probability axis \u2013 very hard to see the tail\", fontsize=13)\r\nax1.set_xlabel(\"Value (rank \/ frequency)\", fontsize=11)\r\nax1.set_ylabel(\"Probability\", fontsize=11)\r\nax1.set_xscale('log')\r\nax1.set_xlim(1, 10000)\r\nax1.legend(title=\"Shape parameter \u03b1\")\r\n\r\n# Log-log survival plot \u2013 the signature view\r\nk = np.unique(np.floor(np.logspace(0, 5, 1000)))  # integer grid from 1 to 100000\r\nfor a in alphas:\r\n    # Exact survival function P(X &gt; k) = \u03b6(a, k+1) \/ \u03b6(a); it decays like k^(-(a-1))\r\n    y = special.zeta(a, k + 1) \/ special.zeta(a)\r\n    ax2.loglog(k, y, lw=2.4, label=f\"\u03b1 = {a}\")\r\n\r\nax2.set_title(\"Log-log survival plot \u2013 straight line = power law\", fontsize=13)\r\nax2.set_xlabel(\"Value k (log)\", fontsize=11)\r\nax2.set_ylabel(\"P(X &gt; k) (log)\", fontsize=11)\r\nax2.legend(title=\"Shape parameter \u03b1\")\r\nax2.grid(True, which=\"both\", ls=\"--\", alpha=0.4)\r\n\r\nplt.tight_layout()\r\nplt.show()<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p dir=\"auto\"><strong>Key observations<\/strong>:<\/p>\n<ul dir=\"auto\">\n<li>With a linear probability axis \u2192 almost all the mass sits on the first few values, so the tail is invisible<\/li>\n<li>On <strong>log-log scale<\/strong> \u2192 the power-law tail becomes a <strong>straight line<\/strong>; for the survival function its slope is \u2212(\u03b1 \u2212 1)<\/li>\n<li>Smaller \u03b1 \u2192 <strong>much heavier tail<\/strong> (more extreme values)<\/li>\n<\/ul>\n<h3 dir=\"auto\">6. 
Realistic code patterns you will actually write<\/h3>\n<p dir=\"auto\"><strong>Pattern 1 \u2013 Simulate word frequencies in a large text corpus<\/strong><\/p>\n<div dir=\"auto\">\n<div data-testid=\"code-block\">\n<div>\n<div>Python<\/div>\n<div>\n<pre tabindex=\"0\"><code>import numpy as np\r\nimport matplotlib.pyplot as plt\r\nfrom scipy import stats\r\n\r\n# Typical Zipf exponent for English text \u2248 1.0\u20131.2\r\nalpha = 1.15\r\n\r\n# Draw 50,000 word tokens; each draw is the rank of the word that occurred\r\ntokens = stats.zipf.rvs(a=alpha, size=50000)\r\n\r\n# Count occurrences of each distinct word (np.unique copes with very large ranks)\r\n_, word_freq = np.unique(tokens, return_counts=True)\r\n\r\n# Sort descending (most frequent first)\r\nword_freq_sorted = np.sort(word_freq)[::-1]\r\n\r\n# Plot rank vs frequency (classic Zipf plot)\r\nplt.loglog(range(1, len(word_freq_sorted)+1), word_freq_sorted, '.', ms=3, alpha=0.7)\r\nplt.title(\"Zipf plot \u2013 word frequency vs rank (log-log)\", fontsize=14)\r\nplt.xlabel(\"Rank (log)\", fontsize=12)\r\nplt.ylabel(\"Frequency (log)\", fontsize=12)\r\nplt.grid(True, which=\"both\", ls=\"--\", alpha=0.4)\r\nplt.show()<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p dir=\"auto\"><strong>Pattern 2 \u2013 Check how much the top-k items dominate<\/strong><\/p>\n<div dir=\"auto\">\n<div data-testid=\"code-block\">\n<div>\n<div>Python<\/div>\n<div>\n<pre tabindex=\"0\"><code># Same simulated frequencies (word_freq and word_freq_sorted from Pattern 1)\r\ntotal = word_freq.sum()\r\n\r\ntop_1_percent = word_freq_sorted[:int(0.01 * len(word_freq_sorted))]\r\ntop_5_percent = word_freq_sorted[:int(0.05 * len(word_freq_sorted))]\r\n\r\nprint(f\"Top 1% of words account for: {top_1_percent.sum() \/ total * 100:.1f}% of total frequency\")\r\nprint(f\"Top 5% of words account for: {top_5_percent.sum() \/ total * 100:.1f}% of total frequency\")<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p dir=\"auto\"><strong>Pattern 3 \u2013 Simulate YouTube video views (classic Zipf-like behavior)<\/strong><\/p>\n<div dir=\"auto\">\n<div data-testid=\"code-block\">\n<div>\n<div>Python<\/div>\n<div>\n<pre tabindex=\"0\"><code># \u03b1 \u2248 1.5\u20131.8 for video views\r\nalpha = 1.6\r\nn_videos = 100000\r\n\r\n# One draw per video: its view count\r\nviews = stats.zipf.rvs(a=alpha, size=n_videos)\r\n\r\n# Sort descending so the first entries really are the most-viewed videos\r\nviews_sorted = np.sort(views)[::-1]\r\n\r\nprint(f\"Top 100 videos have {views_sorted[:100].sum() \/ views.sum() * 100:.1f}% of total views\")\r\nprint(f\"Top 1000 videos have {views_sorted[:1000].sum() \/ views.sum() * 100:.1f}% of total views\")<\/code><\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<h3 dir=\"auto\">Summary \u2013 Zipf \/ Zeta Distribution Quick Reference<\/h3>\n<div>\n<div dir=\"auto\">\n<table dir=\"auto\">\n<thead>\n<tr>\n<th data-col-size=\"md\">Property<\/th>\n<th data-col-size=\"lg\">Value \/ Formula<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td data-col-size=\"md\">Shape<\/td>\n<td data-col-size=\"lg\">Extremely heavy right tail (power-law)<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"md\">Defined by<\/td>\n<td data-col-size=\"lg\">shape \u03b1 (exponent), must be &gt; 1; usually 1 &lt; \u03b1 &lt; 3<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"md\">Support<\/td>\n<td data-col-size=\"lg\">k = 1, 2, 3, \u2026 (positive integers)<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"md\">Mean (\u03b1 &gt; 2)<\/td>\n<td data-col-size=\"lg\">\u03b6(\u03b1\u22121) \/ \u03b6(\u03b1); infinite for \u03b1 \u2264 2<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"md\">Variance (\u03b1 &gt; 3)<\/td>\n<td data-col-size=\"lg\">\u03b6(\u03b1\u22122) \/ \u03b6(\u03b1) \u2212 (\u03b6(\u03b1\u22121) \/ \u03b6(\u03b1))^2; infinite for \u03b1 \u2264 3<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"md\">NumPy \/ SciPy<\/td>\n<td data-col-size=\"lg\">scipy.stats.zipf.rvs(a=\u03b1, size=&#8230;) or numpy.random.default_rng().zipf(a=\u03b1, size=&#8230;)<\/td>\n<\/tr>\n<tr>\n<td data-col-size=\"md\">Most common use cases<\/td>\n<td data-col-size=\"lg\">word frequencies, city sizes, website traffic, video views, sales, citations, followers<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div><\/div>\n<\/div>\n<\/div>\n<h3 dir=\"auto\">Final teacher messages<\/h3>\n<ol dir=\"auto\">\n<li><strong>Whenever you see \u201ca few items dominate everything, and it keeps going for a very long tail\u201d<\/strong> \u2192 think Zipf \/ power-law.<\/li>\n<li><strong>Log-log plot showing a straight line<\/strong> is the strongest visual signature of Zipf \/ power-law behavior.<\/li>\n<li><strong>\u03b1 close to 
1<\/strong> \u2192 extremely unequal distributions (a tiny fraction owns almost everything)<\/li>\n<li><strong>\u03b1 &gt; 2<\/strong> \u2192 tails are still heavy, but the mean is finite; the variance is finite only when \u03b1 &gt; 3<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>1. What is the Zipf distribution really? The Zipf distribution is a discrete power-law distribution that describes phenomena where: A small number of items are extremely frequent \/ popular \/ large The vast 
majority&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[75],"tags":[],"class_list":["post-2527","post","type-post","status-publish","format-standard","hentry","category-numpy"],"_links":{"self":[{"href":"https:\/\/demo.materiamedica.net\/demo6\/wp-json\/wp\/v2\/posts\/2527","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/demo.materiamedica.net\/demo6\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/demo.materiamedica.net\/demo6\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/demo.materiamedica.net\/demo6\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/demo.materiamedica.net\/demo6\/wp-json\/wp\/v2\/comments?post=2527"}],"version-history":[{"count":1,"href":"https:\/\/demo.materiamedica.net\/demo6\/wp-json\/wp\/v2\/posts\/2527\/revisions"}],"predecessor-version":[{"id":2528,"href":"https:\/\/demo.materiamedica.net\/demo6\/wp-json\/wp\/v2\/posts\/2527\/revisions\/2528"}],"wp:attachment":[{"href":"https:\/\/demo.materiamedica.net\/demo6\/wp-json\/wp\/v2\/media?parent=2527"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/demo.materiamedica.net\/demo6\/wp-json\/wp\/v2\/categories?post=2527"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/demo.materiamedica.net\/demo6\/wp-json\/wp\/v2\/tags?post=2527"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}