Chapter 17: NumPy Filter Array
NumPy array filtering, explained like a patient teacher sitting next to you: slowly, with many realistic examples, the most common student mistakes pointed out, and the patterns you will actually use in real data science, machine learning, and scientific code.
Let’s pretend we’re looking at the same screen.
```python
import numpy as np
```
What does “filtering” mean in NumPy?
Filtering = selecting only the elements that satisfy one or more conditions.
In NumPy, filtering is extremely powerful because it is:
- vectorized (very fast — no Python loops)
- clean to read
- memory efficient in most cases
- works naturally with multi-dimensional arrays
The two main ways people filter in NumPy are:
- Boolean indexing (most common & most important)
- np.where() (useful for both filtering and replacement)
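To see what "vectorized" buys you in practice, here is a small sketch comparing a plain-Python loop filter with the equivalent boolean mask (the mask syntax is explained in the next section):

```python
import numpy as np

np.random.seed(0)
values = np.random.randint(0, 100, 100_000)

# Plain-Python loop: readable but slow for large arrays
loop_result = np.array([v for v in values if v >= 70])

# Vectorized boolean indexing: one line, runs in compiled C
mask_result = values[values >= 70]

print(np.array_equal(loop_result, mask_result))  # True
```

Both produce the same elements; the vectorized version avoids the per-element Python interpreter overhead entirely.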
1. Boolean indexing – The #1 most used filtering method
You create a boolean mask (array of True/False) with the same shape as your data, then use that mask inside square brackets.
```python
scores = np.array([78, 92, 65, 88, 71, 95, 82, 59, 67, 91])

# Step 1: create a boolean mask
passed = scores >= 70
print(passed)
# [ True  True False  True  True  True  True False False  True]

# Step 2: use the mask to filter
passed_students = scores[passed]
print(passed_students)
# [78 92 88 71 95 82 91]
```
Even shorter & very common style (one-liner):
```python
high_scores = scores[scores >= 90]
print(high_scores)
# [92 95 91]
```
2. Multiple conditions – very frequent pattern
Use `&` (and), `|` (or), `~` (not), and always wrap each condition in parentheses when combining.
```python
ages = np.array([23, 45, 17, 67, 34, 19, 29, 91, 55, 41])

# Adults between 25 and 60 inclusive
adult_working_age = ages[(ages >= 25) & (ages <= 60)]
print(adult_working_age)
# [45 34 29 55 41]

# Teenagers OR seniors
extremes = ages[(ages < 20) | (ages > 65)]
print(extremes)
# [17 67 19 91]
```
Common mistake students make — forgetting parentheses:
```python
# WRONG - operator precedence bug: & binds tighter than >= and <=,
# so this becomes ages >= (25 & ages) <= 60 and raises ValueError
wrong = ages[ages >= 25 & ages <= 60]

# Correct
correct = ages[(ages >= 25) & (ages <= 60)]
```
3. Filtering 2D arrays (matrices) – very important
```python
data = np.random.randint(0, 100, size=(6, 4))
print(data)
```
Select only rows where first column > 50
```python
good_rows = data[:, 0] > 50          # boolean mask over rows
important_rows = data[good_rows]
print(important_rows.shape)          # (number_of_good_rows, 4)
```
Select only values > 80 (returns 1D array)
```python
high_values = data[data > 80]
print(high_values)
```
Select rows where any value > 90
```python
rows_with_extreme = np.any(data > 90, axis=1)
extreme_rows = data[rows_with_extreme]
```
Select rows where all values > 30
```python
good_quality = np.all(data > 30, axis=1)
good_rows = data[good_quality]
```
4. Combining filtering with assignment (very common pattern)
```python
temperatures = np.random.uniform(-15, 45, 1000)

# Replace unrealistic values
temperatures[temperatures < -5] = -5
temperatures[temperatures > 40] = 40

# Or more elegant with np.clip (but boolean is very readable)
# temperatures = np.clip(temperatures, -5, 40)
```
Replace negatives with zero (classic cleaning)
```python
measurements = np.random.randn(500) * 10
measurements[measurements < 0] = 0
```
Set outliers to missing value (NaN)
```python
sales = np.random.randint(100, 10000, 1000).astype(float)  # float so it can hold NaN

q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1
outliers = (sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)
sales[outliers] = np.nan  # would raise an error on an integer array
```
5. Filtering with np.where() – when you want indices or conditional values
Get indices instead of values
```python
idx = np.where(scores >= 90)[0]  # [0] because np.where returns a tuple
print(idx)           # [1 5 9]
print(scores[idx])   # [92 95 91]
```
Conditional replacement (if-else on whole array)
```python
profit = np.array([-1200, 3400, -800, 5600, -200, 9100])

labeled = np.where(profit > 0, "Profit", "Loss")
print(labeled)
# ['Loss' 'Profit' 'Loss' 'Profit' 'Loss' 'Profit']

# Or replace with different values
adjusted = np.where(profit > 0, profit * 1.1, profit * 0.9)
```
6. Realistic patterns you will use every week
Pattern 1 – Remove missing / invalid values
```python
raw_data = np.array([23.4, np.nan, 19.8, -999, 25.1, np.inf, 22.7])

# np.isfinite drops NaN and inf; the extra check drops the -999 sentinel
valid = raw_data[np.isfinite(raw_data) & (raw_data != -999)]
print(valid)
# [23.4 19.8 25.1 22.7]
```
Pattern 2 – Keep only rows without outliers in any column
```python
X = np.random.randn(1000, 20)
no_outliers = X[~np.any(np.abs(X) > 5, axis=1)]
```
Pattern 3 – Filter time series by date range
```python
dates = np.arange(np.datetime64('2024-01-01'), np.datetime64('2024-12-31'))
values = np.random.rand(len(dates))

mask = (dates >= np.datetime64('2024-06-01')) & (dates <= np.datetime64('2024-08-31'))
summer_values = values[mask]
```
Pattern 4 – Select specific categories
```python
products = np.array(['apple', 'banana', 'carrot', 'apple', 'date', 'banana'])
fruits = products[np.isin(products, ['apple', 'banana', 'date'])]
```
Summary – Quick Reference Table
| You want to… | Recommended way |
|---|---|
| Get elements that match condition | arr[condition] |
| Get indices where condition is true | np.where(condition) or np.nonzero(condition) |
| Replace values if condition is true | arr[condition] = value or np.where(condition, new, arr) |
| Multiple conditions | (cond1) & (cond2), (cond1) \| (cond2), ~cond |
| Filter rows based on one column | arr[arr[:, col] > x] |
| Remove NaN / inf | arr[np.isfinite(arr)] |
| Check membership in list/set | np.isin(arr, allowed_values) |
| Count how many match | np.sum(condition) or condition.sum() |
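Two rows of the table have not been demonstrated yet: counting matches and `np.nonzero`. A small sketch reusing the `scores` array from earlier:

```python
import numpy as np

scores = np.array([78, 92, 65, 88, 71, 95, 82, 59, 67, 91])

# Count how many elements satisfy the condition (True counts as 1)
n_passed = np.sum(scores >= 70)
print(n_passed)  # 7

# np.nonzero returns the matching indices, like np.where with one argument
idx = np.nonzero(scores >= 90)[0]
print(idx)  # [1 5 9]
```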
Most common beginner mistakes
```python
# Mistake 1: Using 'and' instead of '&'
wrong = ages[ages > 20 and ages < 40]  # ValueError: truth value is ambiguous

# Mistake 2: Forgetting parentheses
wrong = ages[ages > 20 & ages < 40]    # wrong precedence -> ValueError

# Mistake 3: Thinking filtering always returns the same shape
high = data[data > 80]                 # returns a 1D array, not the original shape
```
Would you like to go deeper into any of these areas?
- Advanced filtering with multiple columns / complex conditions
- Filtering with string arrays / object dtype
- Filtering + sorting together (very common combo)
- Performance: boolean indexing vs np.where vs isin
- Mini-exercise: clean a realistic messy dataset together
Just tell me what you want to focus on next! 😊
