Chapter 1: NumPy Tutorial
NumPy – The Real Beginning (What You Actually Need to Know First)
NumPy is not just “faster lists”. It is a completely different way of thinking about data.
Most important mindset change:
In normal Python → you think item by item
In NumPy → you think whole array at once
This single change often makes numeric code 10–100× faster and much cleaner, because the loop runs in compiled code instead of the Python interpreter.
```python
import numpy as np
```
(We almost always use np — just accept this convention)
1. Creating arrays – the 5 ways people actually use most
```python
# Way 1: From an existing Python list/tuple
a = np.array([10, 20, 35, 42])
b = np.array([[2, 3, 5], [7, 11, 13]])    # 2D

# Way 2: Special creator functions (very frequent)
zeros = np.zeros((4, 6))    # shape = rows × columns
ones = np.ones((3, 5))
empty = np.empty((2, 4))    # ⚠️ uninitialized memory: very fast, but the values are arbitrary until you fill them

# Way 3: Fill with the same value
full7 = np.full((3, 4), 7)
identity = np.eye(5)        # 5×5 identity matrix

# Way 4: Sequence generators (very common)
ar = np.arange(0, 20, 3)       # 0, 3, 6, 9, 12, 15, 18
lin = np.linspace(0, 1, 11)    # 0.0, 0.1, ..., 1.0  ← very useful for plotting

# Way 5: Random numbers (you will use these EVERY day)
np.random.seed(42)                                   # ← make experiments reproducible
uniform = np.random.rand(4, 3)                       # values in [0.0, 1.0)
normal = np.random.randn(1000)                       # standard normal distribution (mean=0, std=1)
integers = np.random.randint(1, 100, size=(5, 6))    # inclusive low, exclusive high
```
Quick tip people forget:
randn → normal distribution
rand → uniform [0,1)
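Newer NumPy code often uses a seeded Generator object instead of the global np.random.seed state. Here is a minimal sketch of the same three draws as in Way 5, assuming NumPy 1.17 or later. The name rng is only a convention, not something the tutorial above requires.

```python
import numpy as np

rng = np.random.default_rng(42)                 # independent, seeded generator

uniform = rng.random((4, 3))                    # uniform on [0.0, 1.0)
normal = rng.standard_normal(1000)              # mean 0, std 1
integers = rng.integers(1, 100, size=(5, 6))    # low inclusive, high exclusive

print(uniform.shape, normal.shape, integers.shape)
```

The numbers will not match the ones from np.random.seed(42), because the generator uses a different underlying algorithm. Within one approach, the same seed gives the same sequence.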
2. The 4 things you must check every time you create an array
```python
x = np.random.randint(0, 50, size=(3, 4, 5))

print(x.shape)    # (3, 4, 5)  ← most important line!
print(x.ndim)     # 3
print(x.size)     # 60 (= 3×4×5)
print(x.dtype)    # int64 (or int32, float64, uint8, bool...)
```
Axis-order reminder (very useful when debugging; with the default C order, the last axis is the one stored contiguously in memory):
```text
shape = (depth, height, width)   or   (z, y, x)
```
A real-world example:
```python
# 100 RGB images of 256×256 pixels
images = np.zeros((100, 256, 256, 3), dtype=np.uint8)
# shape → (batch, height, width, channels)
```
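Shape and dtype together decide how much memory an array needs, which is one reason to check both. A small sketch using the same image batch (the variable float64_bytes is just an illustrative name):

```python
import numpy as np

images = np.zeros((100, 256, 256, 3), dtype=np.uint8)

# Total bytes = number of elements × bytes per element
print(images.size)        # 19660800 elements
print(images.itemsize)    # 1 byte per uint8 value
print(images.nbytes)      # 19660800 bytes (about 19 MB)

# The same shape stored as float64 would need 8 bytes per element
float64_bytes = images.size * np.dtype(np.float64).itemsize
print(float64_bytes)      # 157286400 bytes (about 150 MB)
```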
3. The golden rule: assignment and basic slicing do not copy data
This is where a lot of beginners get confused and angry.
```python
a = np.array([10, 20, 30, 40])
b = a                # ← this is NOT a copy! b is just a second name for a
b[1] = 999
print(a)             # [ 10 999  30  40]  ← surprise!

# Correct ways to actually copy:
c = a.copy()
d = np.copy(a)

# NOT a copy: a basic slice of an array is always a view
e = a[:]             # unlike list[:], writing to e also changes a
```
View vs Copy cheat sheet (very important)
| Operation | Usually View or Copy? |
|---|---|
| b = a | Neither: same object (just a second name) |
| b = a.copy() | Copy |
| b = a[:] | View (always, for basic slices) |
| b = a[::2] | View |
| b = a.reshape(…) | View (if possible, otherwise copy) |
| b = a.T | View |
| b = a[a > 5] | Copy |
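When in doubt, you can ask NumPy directly instead of memorizing the table. A minimal sketch using np.shares_memory (the dictionary names candidates and shares are only for illustration):

```python
import numpy as np

a = np.arange(10)

candidates = {
    "b = a": a,
    "a[:]": a[:],
    "a[::2]": a[::2],
    "a.reshape(2, 5)": a.reshape(2, 5),
    "a.T": a.T,
    "a[a > 5]": a[a > 5],
    "a.copy()": a.copy(),
}

# True means writing to the result can change a
shares = {label: bool(np.shares_memory(a, result))
          for label, result in candidates.items()}

for label, value in shares.items():
    print(f"{label:16} shares memory with a: {value}")
```

Only the boolean-mask selection and the explicit copy report False here. Everything else still points at the data of a.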
4. Vectorization – why NumPy feels like magic
Classic slow Python:
```python
prices = [120.5, 88.9, 245.0, 19.99, 340.0]
discount = 0.15

final = []
for p in prices:
    final.append(p * (1 - discount))
```
NumPy version (often 10–100× faster on large arrays):
```python
prices = np.array([120.5, 88.9, 245.0, 19.99, 340.0])
discount = 0.15

final = prices * (1 - discount)
# ≈ [102.425, 75.565, 208.25, 16.9915, 289.0]
```
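With five prices you will not notice any difference, so here is a rough way to measure it on a larger array. This is only a sketch: the array size, the repeat count and the helper names with_loop and with_numpy are arbitrary choices, and the ratio depends on your machine and NumPy version.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
prices_arr = rng.uniform(1.0, 500.0, size=100_000)
prices_list = prices_arr.tolist()
discount = 0.15

def with_loop():
    # item-by-item Python
    return [p * (1 - discount) for p in prices_list]

def with_numpy():
    # whole array at once
    return prices_arr * (1 - discount)

# Both versions must produce the same numbers
same_result = np.allclose(with_loop(), with_numpy())

loop_time = timeit.timeit(with_loop, number=20)
numpy_time = timeit.timeit(with_numpy, number=20)

print(f"same result: {same_result}")
print(f"loop:  {loop_time:.4f} s")
print(f"numpy: {numpy_time:.4f} s")
print(f"ratio: {loop_time / numpy_time:.1f}x")
```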
All of these operations are vectorized (applied to the whole array at once):
```python
x = np.array([1, 2, 3, 4, 5])

x + 10
x * 3
x ** 2
x % 2
np.sqrt(x)
np.sin(x)
np.log(x + 1)    # +1 avoids log(0) if the data contains zeros
x > 3            # returns a boolean array
```
5. Broadcasting – the feature that feels like cheating
Broadcasting rule (memorize this):
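NumPy lines up the two shapes from the right (last dimension first). Two arrays can be used together if, for every pair of dimensions:
- the two sizes are equal, or
- one of them is 1 (it gets stretched), or
- one array has no dimension at that position (it is treated as size 1)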
Examples that work:
```text
(5, 3)     + (3,)    → works (row gets repeated 5 times)
(7, 1)     + (1, 6)  → works
(4, 1, 1)  + 10      → works
(100, 784) + (784,)  → very common in machine learning
```
Examples that fail:
```python
np.ones((3, 4)) + np.ones((2, 5))    # ValueError
np.ones((4, 5)) + np.ones((5, 4))    # also fails
```
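You can check a pair of shapes without building any arrays. The sketch below assumes NumPy 1.20 or later, where np.broadcast_shapes is available. It also shows the usual fix for a failing case, which is adding an axis of size 1 yourself.

```python
import numpy as np

# Compatible shapes: the result shape is returned
ok_1 = np.broadcast_shapes((5, 3), (3,))          # (5, 3)
ok_2 = np.broadcast_shapes((7, 1), (1, 6))        # (7, 6)
ok_3 = np.broadcast_shapes((100, 784), (784,))    # (100, 784)
print(ok_1, ok_2, ok_3)

# Incompatible shapes: a ValueError is raised
try:
    np.broadcast_shapes((4, 5), (5, 4))
    failed = False
except ValueError as exc:
    failed = True
    print("cannot broadcast:", exc)

# Typical fix: turn a 1D vector into a column with np.newaxis
col = np.arange(4)                   # shape (4,)
row = np.arange(5)                   # shape (5,)
table = col[:, np.newaxis] * row     # (4, 1) * (5,) → (4, 5)
print(table.shape)
```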
6. Indexing & Slicing – real-world patterns
```python
arr = np.arange(36).reshape(6, 6)

# Classic slices
arr[0, :]        # first row
arr[:, -1]       # last column
arr[2:5, 1:4]    # sub-matrix

# Boolean indexing (very powerful)
mask = arr % 5 == 0
multiples_of_5 = arr[mask]

# Fancy indexing (using lists/arrays of indices)
rows = [0, 2, 5]
cols = [1, 4, 3]
selected = arr[rows, cols]    # gets arr[0,1], arr[2,4], arr[5,3]

# Combined – very common pattern
big = np.random.randint(0, 100, (1000, 50))
good_rows = big[:, 0] > 80       # first column > 80
important = big[good_rows, :]    # keep only those rows
```
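One thing that surprises people with fancy indexing: two index lists are paired up element by element, so they do not select a block. If you want every listed row combined with every listed column, np.ix_ is one way to do it. A small sketch with the same arr, rows and cols (paired and grid are illustrative names):

```python
import numpy as np

arr = np.arange(36).reshape(6, 6)
rows = [0, 2, 5]
cols = [1, 4, 3]

paired = arr[rows, cols]          # 3 values: arr[0,1], arr[2,4], arr[5,3]
grid = arr[np.ix_(rows, cols)]    # 3×3 block: each row combined with each column

print(paired.tolist())    # [1, 16, 33]
print(grid.shape)         # (3, 3)
print(grid.tolist())      # [[1, 4, 3], [13, 16, 15], [31, 34, 33]]
```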
7. Reshaping & stacking – daily bread
```python
a = np.arange(24)

a.reshape(4, 6)     # normal
a.reshape(3, -1)    # -1 = "figure it out" → (3, 8)
a.reshape(-1, 8)    # also fine → (3, 8)

# Very common in ML
images = np.random.rand(100, 784)               # 100 flattened images
images_reshaped = images.reshape(-1, 28, 28)    # → 100 images of 28×28

# Stacking (small example inputs so the calls below run)
a1, a2 = np.ones((2, 3)), np.zeros((2, 3))
vec1, vec2 = np.arange(3), np.arange(3, 6)

np.vstack([a1, a2])                      # vertical → (4, 3)
np.hstack([a1, a2])                      # horizontal → (2, 6)
np.concatenate([a1, a2, a1], axis=0)     # any number of arrays → (6, 3)
np.column_stack([vec1, vec2])            # turns 1D vectors into columns → (3, 2)
```
8. Most useful statistics & reductions
```python
data = np.random.randn(500, 8)

data.sum()                      # everything
data.sum(axis=0)                # column sums
data.mean(axis=1)               # mean of each row
data.std(axis=0)                # standard deviation per column
data.min(), data.max()
data.argmin(), data.argmax()    # index of min/max in the flattened array

np.median(data)
np.percentile(data, [25, 75])
np.quantile(data, 0.9)
```
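The axis argument is easier to trust once you see it on an array small enough to check by hand. A minimal sketch (the tiny 2×3 array and the variable names are made up for illustration):

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])

col_sums = data.sum(axis=0)    # collapses the rows → one value per column
row_sums = data.sum(axis=1)    # collapses the columns → one value per row
print(col_sums.tolist())       # [5, 7, 9]
print(row_sums.tolist())       # [6, 15]

# keepdims=True keeps the reduced axis as size 1, so the result broadcasts back
row_means = data.mean(axis=1, keepdims=True)    # shape (2, 1)
centered = data - row_means
print(centered.tolist())       # [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]

# argmax returns a flat index; unravel_index converts it to (row, col)
flat_idx = data.argmax()
row, col = np.unravel_index(flat_idx, data.shape)
print(int(flat_idx), int(row), int(col))    # 5 1 2
```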
Very common normalization (you will write this 1000 times):
```python
X = np.random.randn(10000, 20)

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```
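One catch with this one-liner: a column whose values are all identical has a standard deviation of 0, and dividing by it produces NaN. Here is a hedged sketch of a safer variant. The helper name standardize and the eps threshold are my own choices, not a NumPy API.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Column-wise z-score; constant columns become all zeros instead of NaN."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    safe_std = np.where(std < eps, 1.0, std)    # avoid dividing by zero
    return (X - mean) / safe_std

rng = np.random.default_rng(0)
X = rng.normal(5.0, 2.0, size=(1000, 3))
X[:, 2] = 7.0                                   # a constant column (std == 0)

X_norm = standardize(X)
print(np.isnan(X_norm).any())                   # False
print(X_norm.mean(axis=0).round(6))
print(X_norm.std(axis=0).round(6))
```

The first two columns end up with mean close to 0 and standard deviation close to 1. The constant column becomes all zeros.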
9. Quick real-life mini-examples you will actually use
Example 1: Remove outliers
```python
scores = np.random.normal(70, 15, 200)

Q1, Q3 = np.percentile(scores, [25, 75])
IQR = Q3 - Q1

clean = scores[(scores >= Q1 - 1.5*IQR) & (scores <= Q3 + 1.5*IQR)]
```
Example 2: Simple moving average
```python
ts = np.random.randn(1000).cumsum()    # random walk

window = 20
sma = np.convolve(ts, np.ones(window)/window, mode='valid')
```
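To convince yourself what mode='valid' does, try it on a series short enough to check by hand. A small sketch with made-up numbers:

```python
import numpy as np

ts = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
window = 3

sma = np.convolve(ts, np.ones(window) / window, mode='valid')

# Each output value is the mean of one full window:
# mean(1,2,3)=2, mean(2,3,4)=3, mean(3,4,5)=4
print(sma)
print(len(sma))    # 3 = len(ts) - window + 1
```

So the result is shorter than the input by window - 1 values. Keep that in mind when you plot the average against the original series.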
Example 3: Distance matrix between points
```python
points = np.random.rand(300, 2)    # 300 points in 2D

diff = points[:, np.newaxis, :] - points
dist = np.sqrt(np.sum(diff**2, axis=2))    # 300×300 distance matrix
```
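This one is broadcasting in action: a (n, 1, 2) array minus a (n, 2) array gives (n, n, 2). Here is a sketch with only 5 points so the result is easy to sanity-check (the seed and the pair of points checked are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((5, 2))                   # 5 points in 2D

diff = points[:, np.newaxis, :] - points      # (5, 1, 2) - (5, 2) → (5, 5, 2)
dist = np.sqrt(np.sum(diff**2, axis=2))       # (5, 5)
print(diff.shape, dist.shape)

# Sanity checks a distance matrix should pass
is_symmetric = np.allclose(dist, dist.T)
zero_diagonal = np.allclose(np.diag(dist), 0.0)
matches_direct = np.isclose(dist[0, 3], np.linalg.norm(points[0] - points[3]))
print(is_symmetric, zero_diagonal, matches_direct)
```

Note that diff holds n × n × 2 floats, so this trick gets memory-hungry for tens of thousands of points.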
Summary Table – Print & Keep
| Operation | Most common syntax |
|---|---|
| Create array | np.array(), zeros(), ones(), arange(), linspace() |
| Random | rand(), randn(), randint() |
| Shape info | .shape, .ndim, .size, .dtype |
| Copy | .copy() |
| Math on arrays | + - * / ** sin exp log sqrt abs |
| Matrix multiply | a @ b or np.dot(a,b) |
| Transpose | a.T |
| Reshape | reshape(), -1 is magic |
| Stack | vstack, hstack, concatenate |
| Boolean select | arr[arr > 5] |
| Where | np.where(cond, value_if_true, value_if_false) |
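Two rows of the table were not shown earlier: np.where and matrix multiplication. A quick sketch with made-up numbers (all names here are illustrative):

```python
import numpy as np

# np.where(cond, value_if_true, value_if_false) picks element by element
scores = np.array([45, 82, 67, 91, 58])
labels = np.where(scores >= 60, "pass", "fail")
capped = np.where(scores > 80, 80, scores)
print(labels.tolist())    # ['fail', 'pass', 'pass', 'pass', 'fail']
print(capped.tolist())    # [45, 80, 67, 80, 58]

# @ is matrix multiplication; * is element-wise (with broadcasting)
A = np.array([[1, 2],
              [3, 4]])
v = np.array([10, 20])
print((A @ v).tolist())   # [50, 110]
print((A * v).tolist())   # [[10, 40], [30, 80]]
```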
Where do you want to go deeper next?
- Matrix operations & np.linalg (eigenvalues, SVD, solve…)
- Advanced indexing & memory views vs copies in detail
- Performance tricks & when vectorization fails
- Common bugs & how to avoid them
- NumPy + matplotlib mini visualization session
- NumPy in machine learning (data preparation patterns)
Tell me what feels most useful for you right now! 🚀
