Chapter 15: NumPy Searching Arrays
This chapter walks through NumPy's searching tools the way a patient teacher sitting next to you would: examples line by line, the logic behind each method, how the methods compare, the common traps, and realistic patterns you will actually use in real work.
Let's go step by step.

    import numpy as np

What does “searching” mean in NumPy?
Searching = finding elements that satisfy a condition, or finding positions (indices) of specific values or properties.
NumPy offers several fast ways to search. Most of them are vectorized: the loop runs inside compiled code instead of in Python, which is why they are usually much faster than an explicit Python loop.
The most important searching tools in NumPy are:
| Method / Function | What it returns | Most common use case |
|---|---|---|
| np.where() | indices or conditional replacement | The most versatile and frequently used |
| Boolean indexing | elements that satisfy condition | Filtering / selecting data |
| np.nonzero() | indices of non-zero / True elements | Finding positions of “events” |
| np.argmax() / argmin() | index of maximum / minimum value | Finding best/worst position |
| np.argsort() | indices that would sort the array | Ranking, top-k, sorting indirectly |
| np.searchsorted() | insertion point for sorted arrays | Fast search in sorted data |
| np.isin() | boolean mask — is element in set? | Membership testing |
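Two tools in this table, np.nonzero() and np.isin(), do not get their own section later, so here is a minimal sketch of both. The array names (`readings`, `allowed`) and their values are made up for illustration.

```python
import numpy as np

readings = np.array([0, 3, 0, 7, 5, 0, 2])

# np.nonzero returns a tuple of index arrays, one per dimension
event_positions = np.nonzero(readings)[0]
print(event_positions.tolist())   # [1, 3, 4, 6]

# np.isin builds a boolean mask: True where the element appears in `allowed`
allowed = [2, 3, 5]               # a list or array, not a Python set
mask = np.isin(readings, allowed)
print(mask.tolist())              # [False, True, False, False, True, False, True]
print(readings[mask].tolist())    # [3, 5, 2]
```

The mask from np.isin() can be used like any other condition, for example inside boolean indexing or np.where().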
1. The most powerful & most used: np.where()
np.where() has two very different behaviors depending on how many arguments you give it.
Behavior 1: np.where(condition) → returns tuple of indices

    scores = np.array([78, 92, 65, 88, 55, 95, 71, 82, 49, 91])

    # Find indices where score >= 90
    high_indices = np.where(scores >= 90)
    print(high_indices)
    # (array([1, 5, 9]),)   ← tuple with one array (because 1D)

    print(scores[high_indices])
    # [92 95 91]

2D example — very common

    mat = np.random.randint(0, 100, size=(6, 5))
    print(mat)

    above_80 = np.where(mat > 80)
    print(above_80)
    # (array([row1, row2, ...]), array([col1, col2, ...]))
    # Example output might be:
    # (array([0, 1, 3, 4, 5]), array([2, 4, 1, 3, 0]))

You can zip them to get (row, col) pairs:

    rows, cols = np.where(mat > 80)
    for r, c in zip(rows, cols):
        print(f"Position ({r},{c}) = {mat[r,c]}")

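If you want the (row, col) pairs as one array rather than two separate index arrays, np.argwhere() is an alternative worth knowing. A small sketch on a fixed matrix (the name `grid` and its values are illustrative, chosen so the output is reproducible):

```python
import numpy as np

grid = np.array([[10, 85, 30],
                 [90, 20, 81],
                 [15, 40, 60]])

# Two index arrays, stacked side by side into (row, col) pairs
rows, cols = np.where(grid > 80)
pairs_from_where = np.column_stack((rows, cols))

# np.argwhere gives the same pairs directly as one 2D array
pairs = np.argwhere(grid > 80)
print(pairs.tolist())   # [[0, 1], [1, 0], [1, 2]]
```

The tuple from np.where() is the better choice when you need to index back into the array (`grid[rows, cols]`); the np.argwhere() form is handier for looping or reporting.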
Behavior 2: np.where(condition, x, y) → conditional replacement (like if-else on arrays)
This is extremely common in data cleaning and feature engineering.

    values = np.array([-3, 7, -1, 12, -8, 0, 5, -4])

    cleaned = np.where(values < 0, 0, values)
    print(cleaned)
    # [ 0  7  0 12  0  0  5  0]

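One trap with the three-argument form: both branches are computed for every element before np.where() picks between them. The sketch below shows this with np.log on an array containing a zero, and one possible way around it using the ufunc's `out=` and `where=` arguments. The array `x` is made up for illustration.

```python
import numpy as np

x = np.array([4.0, 0.0, 1.0])

# Naive: np.log(x) is evaluated for ALL elements, including the 0.0,
# so NumPy would emit a divide-by-zero warning (silenced here on purpose).
with np.errstate(divide="ignore"):
    naive = np.where(x > 0, np.log(x), 0.0)

# Safer: compute the log only where the condition holds,
# leaving the pre-filled zeros untouched everywhere else.
safe = np.zeros_like(x)
np.log(x, out=safe, where=x > 0)

print(np.allclose(naive, safe))   # True
```

The results match, but the second version never evaluates the log of zero, which matters when the "wrong" branch is expensive or raises warnings.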
Realistic example – clipping values

    temperatures = np.random.uniform(-10, 40, 20)

    # Clip to realistic range 0–35 °C
    realistic = np.where(temperatures < 0, 0,
                         np.where(temperatures > 35, 35, temperatures))

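For this specific clipping case, np.clip() expresses the same idea in one call. A quick check that the two agree, using a seeded generator (an assumption added here only so the run is reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, for reproducibility only
temperatures = rng.uniform(-10, 40, 20)

# Nested np.where, as in the example above
nested = np.where(temperatures < 0, 0,
                  np.where(temperatures > 35, 35, temperatures))

# Same result with np.clip(array, lower, upper)
clipped = np.clip(temperatures, 0, 35)

print(np.array_equal(nested, clipped))  # True
```

Nested np.where() is still the tool to reach for when the replacement values are not simply the bounds themselves.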
2. Boolean indexing — the simplest & very powerful way to search & filter

    data = np.random.randn(1000)

    # Get only positive values
    positives = data[data > 0]
    print(len(positives))  # around 500

    # Replace negatives with zero
    data[data < 0] = 0

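A detail that is easy to miss in the example above: selecting with a mask gives you a new array, while assigning through a mask changes the original. A tiny sketch with made-up values:

```python
import numpy as np

data = np.array([1.5, -2.0, 0.3, -0.7])

positives = data[data > 0]   # boolean indexing on the right side: a copy
positives[0] = 99.0          # changes the copy only
print(data.tolist())         # [1.5, -2.0, 0.3, -0.7]

data[data < 0] = 0           # mask on the left side: modifies `data` in place
print(data.tolist())         # [1.5, 0.0, 0.3, 0.0]
```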
Combined conditions — very frequent pattern

    sales = np.random.randint(100, 1000, 500)

    # Sales between 400 and 700
    medium_sales = sales[(sales >= 400) & (sales <= 700)]

    # Outliers (top 5% or bottom 5%)
    q05, q95 = np.percentile(sales, [5, 95])
    outliers = sales[(sales < q05) | (sales > q95)]

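Combined conditions are where most beginners hit their first error, so here is a short sketch of the two rules: use `&`, `|` and `~` (not `and`, `or`, `not`), and wrap every comparison in parentheses. The `sales` values are made up so the output is fixed.

```python
import numpy as np

sales = np.array([150, 420, 700, 910, 399])

# Correct: element-wise & with each comparison in parentheses
medium = sales[(sales >= 400) & (sales <= 700)]
print(medium.tolist())    # [420, 700]

# Trap: Python's `and` asks for one truth value of a whole array
try:
    sales[(sales >= 400) and (sales <= 700)]
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)      # message about an ambiguous truth value

# ~ negates a mask: everything outside the 400-700 band
outside = sales[~((sales >= 400) & (sales <= 700))]
print(outside.tolist())   # [150, 910, 399]
```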
3. Finding position of extreme values — argmax() / argmin()

    scores = np.array([78, 92, 65, 88, 95, 71, 82, 49, 91, 87])

    best_idx = np.argmax(scores)
    worst_idx = np.argmin(scores)

    print(f"Best score: {scores[best_idx]} at position {best_idx}")
    print(f"Worst score: {scores[worst_idx]} at position {worst_idx}")

2D version — very important

    matrix = np.random.randint(0, 100, (5, 6))

    # Index of global maximum
    max_idx_flat = np.argmax(matrix)
    max_row, max_col = np.unravel_index(max_idx_flat, matrix.shape)

    # Per row / per column
    best_per_row = np.argmax(matrix, axis=1)  # one index per row
    best_per_col = np.argmax(matrix, axis=0)  # one index per column

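Because the matrix above is random, the results change on every run. The same calls on a small fixed matrix (values chosen here purely for illustration) make it easier to see what each one returns:

```python
import numpy as np

matrix = np.array([[12, 47,  5],
                   [88,  3, 61],
                   [29, 94, 17]])

# Without axis, argmax works on the flattened array
flat_idx = int(np.argmax(matrix))
row, col = np.unravel_index(flat_idx, matrix.shape)
print(flat_idx, int(row), int(col))   # 7 2 1
print(int(matrix[row, col]))          # 94

# axis=1: best column within each row; axis=0: best row within each column
best_per_row = np.argmax(matrix, axis=1)
best_per_col = np.argmax(matrix, axis=0)
print(best_per_row.tolist())          # [1, 0, 1]
print(best_per_col.tolist())          # [1, 2, 1]
```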
4. Ranking & sorting indirectly — argsort()

    values = np.array([45, 92, 18, 76, 33, 89, 61])

    sorted_indices = np.argsort(values)
    print(sorted_indices)
    # [2 4 0 6 3 5 1]   ← indices in increasing order

    print(values[sorted_indices])
    # [18 33 45 61 76 89 92]   ← sorted values

    # Top 3 highest
    top3_idx = sorted_indices[-3:]
    print(values[top3_idx])
    # [76 89 92]

Descending order trick

    top_indices = np.argsort(values)[-5:][::-1]  # top 5, highest first

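When the array is large and you only need a handful of top items, sorting everything is more work than necessary. One possible alternative is np.argpartition(), which only guarantees that the k largest values land in the last k slots. This is a sketch of that idea, not a claim that it is always faster for small arrays:

```python
import numpy as np

values = np.array([45, 92, 18, 76, 33, 89, 61])
k = 3

# The k largest end up in the last k positions, in no particular order
candidates = np.argpartition(values, -k)[-k:]

# Sort only those k candidates, highest first
top_k_idx = candidates[np.argsort(values[candidates])[::-1]]
print(top_k_idx.tolist())           # [1, 5, 3]
print(values[top_k_idx].tolist())   # [92, 89, 76]
```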
5. Fast search in sorted arrays — np.searchsorted()
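np.searchsorted() uses binary search, so each lookup takes O(log n) steps instead of scanning the whole array. It only gives meaningful results when the array is already sorted.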

    sorted_prices = np.sort(np.random.uniform(10, 500, 1000))

    # Where would these values be inserted?
    new_prices = np.array([45.5, 120.0, 399.9])
    positions = np.searchsorted(sorted_prices, new_prices)
    print(positions)

Use case: count how many values are less than X

    count_less_than_200 = np.searchsorted(sorted_prices, 200)

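The count above relies on the default `side="left"`, which returns the position before any equal values. With duplicates in the data the difference between `"left"` and `"right"` becomes visible. A small sketch on a fixed array (values are illustrative):

```python
import numpy as np

sorted_vals = np.array([10, 20, 20, 20, 30, 40])

left = int(np.searchsorted(sorted_vals, 20, side="left"))    # before the first 20
right = int(np.searchsorted(sorted_vals, 20, side="right"))  # after the last 20

print(left)           # 1  ← how many values are strictly less than 20
print(right)          # 4  ← how many values are less than or equal to 20
print(right - left)   # 3  ← how many times 20 occurs

# Count values in the closed range [20, 30]
lo = int(np.searchsorted(sorted_vals, 20, side="left"))
hi = int(np.searchsorted(sorted_vals, 30, side="right"))
print(hi - lo)        # 4
```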
Summary – Which search method when?
| You want to… | Best tool(s) |
|---|---|
| Find indices where condition is true | np.where(condition), np.nonzero() |
| Filter / select elements | Boolean indexing arr[condition] |
| Replace values conditionally | np.where(condition, new_value, arr) |
| Find position of maximum / minimum | np.argmax(), np.argmin() |
| Get top-k / bottom-k indices | np.argsort() + slicing |
| Check if values exist in a set | np.isin(values, allowed_values), where allowed_values is a list or array (convert a Python set with list() first) |
| Search in already sorted array (insertion point) | np.searchsorted() |
| Count how many values satisfy condition | np.sum(condition) or np.count_nonzero(condition) |
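The last row of the table, counting matches, has no example of its own earlier in the chapter, so here is a minimal sketch. The pass mark of 70 is an assumption made up for this example.

```python
import numpy as np

scores = np.array([78, 92, 65, 88, 55, 95, 71, 82, 49, 91])
passed = scores >= 70                 # boolean mask

n_passed = int(np.count_nonzero(passed))
print(n_passed)                       # 7
print(int(np.sum(passed)))            # 7  (True counts as 1)
print(float(passed.mean()))           # 0.7  ← fraction that passed

# Quick yes/no questions about the whole array
print(bool(passed.any()), bool(passed.all()))   # True False
```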
Realistic patterns you will write again and again

    # Pattern 1: Remove outliers (1.5 × IQR rule)
    data = np.random.normal(100, 15, 2000)
    q25, q75 = np.percentile(data, [25, 75])
    iqr = q75 - q25
    clean = data[~((data < q25 - 1.5*iqr) | (data > q75 + 1.5*iqr))]

    # Pattern 2: Label encoding / binning
    ages = np.random.randint(18, 70, 500)
    age_group = np.where(ages < 30, "Young",
                         np.where(ages < 50, "Adult", "Senior"))

    # Pattern 3: Find first occurrence (scores is the array from section 3)
    # Caution: np.argmax returns 0 when no element matches,
    # so check (scores > 90).any() before trusting the result
    first_above_threshold = np.argmax(scores > 90)

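Two of these patterns have a variant worth sketching. For Pattern 3, np.argmax() on an all-False mask returns 0, which looks exactly like "the first element matched"; the helper `first_index_where` below is a hypothetical name introduced here to guard against that. For Pattern 2, np.select() is one way to avoid deeply nested np.where() calls. The sample arrays are made up.

```python
import numpy as np

def first_index_where(mask):
    """Return the index of the first True in a 1D mask, or -1 if none.

    Hypothetical helper, not a NumPy function.
    """
    mask = np.asarray(mask, dtype=bool)
    return int(np.argmax(mask)) if mask.any() else -1

scores = np.array([78, 92, 65, 88, 95])
print(first_index_where(scores > 90))   # 1
print(first_index_where(scores > 99))   # -1
print(int(np.argmax(scores > 99)))      # 0  ← misleading without the guard

# np.select checks the conditions in order and takes the first that matches
ages = np.array([22, 35, 64, 29, 50])
age_group = np.select([ages < 30, ages < 50], ["Young", "Adult"], default="Senior")
print(age_group.tolist())   # ['Young', 'Adult', 'Senior', 'Young', 'Senior']
```

Returning -1 is only one convention; raising an error or returning None would work just as well, depending on how the caller handles "not found".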
Would you like to go deeper into any of these next?
- More complex np.where chaining and nested conditions
- searchsorted with side='left'/'right' and duplicates
- Combining search with assignment patterns
- Performance: boolean indexing vs where vs nonzero
- Mini-exercise: clean a dataset using search methods
Just tell me what feels most useful right now! 😊
