Chapter 8: NumPy Data Types
1. Why do we even care about data types in NumPy?
NumPy arrays are not like Python lists.
Python lists can hold anything (int, float, string, object, mix of everything) → very flexible, but slow and memory-hungry.
NumPy arrays require all elements to have the same data type → this is a big part of why NumPy is so fast:
- Fixed type → better memory layout (contiguous block)
- Fixed type → CPU can use vectorized instructions
- Fixed type → predictable memory usage
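To make the fixed-type rule concrete, here is a minimal sketch that inspects an array's `dtype`, `itemsize`, and `nbytes`. The values and variable names are made up for illustration.

```python
import numpy as np

# A Python list can mix types freely.
mixed_list = [1, 2.5, 3]

# NumPy coerces every element to one common dtype (here: float64).
arr = np.array(mixed_list)
print(arr.dtype)     # float64
print(arr.itemsize)  # 8 bytes per element
print(arr.nbytes)    # 24 = 3 elements * 8 bytes

# A freshly created array sits in one contiguous block of memory.
print(arr.flags["C_CONTIGUOUS"])  # True
```

Because every element has the same size, the memory footprint is simply the number of elements times `itemsize`.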
So choosing the right dtype is important for:
- Memory usage (especially with large arrays / images / datasets)
- Performance (int8 vs float64 can make a huge difference, because smaller types move less data through memory)
- Precision (float32 vs float64)
- Correctness (integer overflow, unexpected rounding)
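The memory and precision points above can be checked directly. The following is a rough sketch, assuming an arbitrary array length of one million elements; the names `n`, `x32`, and `x64` are illustrative.

```python
import numpy as np

n = 1_000_000

# Memory: same number of elements, very different footprint.
for dt in (np.int8, np.int32, np.float32, np.float64):
    arr = np.zeros(n, dtype=dt)
    print(np.dtype(dt).name, arr.nbytes, "bytes")

# Precision: float32 has a 24-bit significand, so it cannot
# represent every integer above 2**24 = 16,777,216.
x32 = np.float32(16_777_217)
x64 = np.float64(16_777_217)
print(int(x32))  # 16777216
print(int(x64))  # 16777217
```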
2. The most common NumPy data types (you should know these by heart)
| Category | dtype name | Alias(es) | Size | Range / Precision | Typical use cases |
|---|---|---|---|---|---|
| Integer | int8 | i1 | 1 byte | -128 to 127 | Small counters, pixel masks |
| Integer | int16 | i2 | 2 bytes | -32,768 to 32,767 | Image processing (older formats) |
| Integer | int32 | i4, intc | 4 bytes | -2,147,483,648 to 2,147,483,647 | Memory-saving indices; default integer on Windows before NumPy 2.0 |
| Integer | int64 | i8, longlong | 8 bytes | Very large (±9.2 quintillion) | Default integer on most 64-bit systems; safe choice for indices, counters |
| Unsigned int | uint8 | u1 | 1 byte | 0 to 255 | Images (RGB, grayscale) |
| Unsigned int | uint16 | u2 | 2 bytes | 0 to 65,535 | Some image formats, depth maps |
| Unsigned int | uint32 | u4 | 4 bytes | 0 to 4.29 billion | Large counters |
| Unsigned int | uint64 | u8 | 8 bytes | Very large | Rarely needed |
| Float | float16 | f2, half | 2 bytes | ~3 decimal digits | Machine learning (mixed precision) |
| Float | float32 | f4, single | 4 bytes | ~7 decimal digits | Images, ML models, memory saving |
| Float | float64 | f8, double, float | 8 bytes | ~15–16 decimal digits | Default float, scientific computing |
| Boolean | bool | ? | 1 byte | True / False | Masks, flags |
| Complex | complex64 | c8 | 8 bytes | Two float32 | Signal processing |
| Complex | complex128 | c16, complex | 16 bytes | Two float64 | Scientific computing |
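Rather than memorizing the table, you can ask NumPy for the limits of a type. The sketch below uses `np.iinfo` for integers and `np.finfo` for floats, and also shows that the short string codes resolve to the same dtypes.

```python
import numpy as np

# Integer limits
for dt in (np.int8, np.uint8, np.int16, np.int32, np.int64):
    info = np.iinfo(dt)
    print(f"{np.dtype(dt).name:>6}: {info.min} to {info.max}")

# Floating-point characteristics (eps = gap between 1.0 and the next float)
for dt in (np.float16, np.float32, np.float64):
    info = np.finfo(dt)
    print(f"{np.dtype(dt).name:>8}: eps={info.eps}, max={info.max}")

# Short string codes map to the same dtypes
print(np.dtype("i4") == np.int32)    # True
print(np.dtype("f8") == np.float64)  # True
print(np.dtype("?") == np.bool_)     # True
```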
3. How to see and set the data type
```python
import numpy as np

a = np.array([1, 2, 3, 4])
print(a.dtype)   # int64 (on most 64-bit systems)

b = np.array([1.5, 2.7, 3.2])
print(b.dtype)   # float64 (default for floats)

c = np.array([True, False, True])
print(c.dtype)   # bool
```
Explicitly setting dtype (very common and important)
```python
# Memory-efficient image-like array
img = np.zeros((512, 512), dtype=np.uint8)

# Machine learning friendly (often use float32)
X = np.random.randn(10000, 784).astype(np.float32)

# Small integer counters
counts = np.zeros(1000, dtype=np.int16)
```
4. Very common real-world dtype choices
| Situation | Recommended dtype | Why? |
|---|---|---|
| Normal Python-like numbers | float64, int64 | Safe, no surprises |
| Machine learning input data | float32 | Saves memory, most models don’t need float64 |
| Deep learning weights / activations | float32 (or float16 with care) | Memory & speed |
| RGB images | uint8 | 0–255 per channel |
| Grayscale images | uint8 or float32 | uint8 saves memory, float32 for processing |
| Masks / binary data | bool or uint8 | bool already takes 1 byte per element; use uint8 when a library expects 0/255 masks |
| Large integer counters / indices | int64 | Avoid overflow |
| Very memory-constrained environment | int8, uint8, float16 | 2–8× memory reduction |
| Scientific computing, finance | float64 | ~15–16 significant digits; the standard choice when accuracy matters |
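The "uint8 for storage, float32 for processing" pattern from the table might look like the sketch below. The tiny 2x3 image and the 20% brightening step are invented for illustration; real pipelines will differ.

```python
import numpy as np

# A tiny made-up 2x3 grayscale "image" stored as uint8.
img = np.array([[0, 64, 128],
                [192, 230, 255]], dtype=np.uint8)

# Convert to float32 in the range [0, 1] before doing arithmetic.
work = img.astype(np.float32) / 255.0

# Hypothetical processing step: brighten by 20%.
work = work * 1.2

# Clip back to the valid range, rescale, round, and store as uint8.
out = np.round(np.clip(work, 0.0, 1.0) * 255.0).astype(np.uint8)
print(out)
print(out.dtype)  # uint8
```

Doing the arithmetic in float32 and clipping before the final cast avoids the wrap-around that would occur if the multiplication were attempted directly in uint8.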
5. Dangerous / surprising dtype behaviors
Integer overflow (very common bug)
```python
small = np.array([100, 120, 127], dtype=np.int8)   # values must fit in -128..127
print(small + 100)   # [-56 -36 -29] ← overflow! wraps around
```
Unsigned wrap-around
```python
u = np.array([250, 251, 252], dtype=np.uint8)
print(u + 10)   # [4 5 6] ← 250 + 10 = 260 → 4 (mod 256)
```
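One common way to avoid the wrap-around is to upcast to a wider type first and then clip back to the target range. This is a minimal sketch of that idea, not the only option.

```python
import numpy as np

u = np.array([250, 251, 252], dtype=np.uint8)

# Upcast to a wider signed type so the sum cannot wrap around.
wide = u.astype(np.int16) + 10
print(wide)  # [260 261 262]

# Saturate to the uint8 range, then convert back.
saturated = np.clip(wide, 0, 255).astype(np.uint8)
print(saturated)  # [255 255 255]
```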
Float → int conversion truncates
```python
f = np.array([3.9, 4.7, -1.2])
print(f.astype(int))   # [ 3  4 -1] ← truncates toward zero, no rounding!
```
Better way (if you want rounding)
```python
print(np.round(f).astype(int))   # [ 4  5 -1]
```
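One detail worth knowing: `np.round` sends exact halves to the nearest even value, which differs from the "round half up" rule many people expect. The sketch below compares it with the other rounding functions on a few hand-picked values.

```python
import numpy as np

f = np.array([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5])

print(np.round(f))   # ties go to the nearest even value
print(np.floor(f))   # always down
print(np.ceil(f))    # always up
print(np.trunc(f))   # toward zero, same direction as astype(int)
```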
6. Creating arrays with specific dtypes – examples
```python
# Image: classic
photo = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)

# Machine learning dataset: float32 is very common
X_train = np.random.randn(50000, 3072).astype(np.float32)

# Boolean mask: a comparison already returns a bool array
mask = np.random.rand(1000) > 0.7
print(mask.dtype)   # bool, so no .astype(bool) is needed

# Memory-efficient counter
ids = np.arange(100000, dtype=np.int32)   # 4 bytes per element instead of 8

# Mixed data: avoid this!
bad = np.array([1, 3.14, "hello", True])
print(bad.dtype)   # <U32 ← every value was silently converted to a string

# dtype=object keeps the Python objects, but is very slow and not vectorized
bad_obj = np.array([1, 3.14, "hello", True], dtype=object)
print(bad_obj.dtype)   # object
```
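When inputs have different types, NumPy promotes them to a common dtype. The sketch below shows a few cases and uses `np.result_type` to ask for the promoted type without building an array.

```python
import numpy as np

# int + float -> float64
print(np.array([1, 2.5]).dtype)            # float64

# bool + int -> the platform's default integer
print(np.array([True, 2]).dtype)

# numbers + string -> a Unicode string dtype, not object
print(np.array([1, "hello"]).dtype.kind)   # U

# Ask NumPy what two dtypes would promote to
print(np.result_type(np.int32, np.float32))  # float64
print(np.result_type(np.uint8, np.int8))     # int16
```

Checking `.dtype` right after creating an array from mixed input is a cheap way to catch an accidental string or object array early.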
7. Changing dtype – .astype() vs view casting
Safe way – creates copy
```python
a = np.array([1, 2, 3, 4], dtype=np.int32)
b = a.astype(np.float64)   # new array, safe
```
Dangerous but fast – view casting (only when you know what you’re doing)
```python
c = a.view(np.float32)   # same memory, interpreted as float32
# → almost always garbage unless you really understand memory layout
```
Rule of thumb for beginners: Use .astype() 99% of the time. Only use .view() when optimizing very performance-critical code and you fully understand the memory layout.
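To see the difference for yourself, the sketch below puts `.astype()` and `.view()` side by side on the same small array. The variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4], dtype=np.int32)

converted = a.astype(np.float32)     # converts the values, new memory
reinterpreted = a.view(np.float32)   # same bytes, new interpretation

print(converted)                           # [1. 2. 3. 4.]
print(np.shares_memory(a, converted))      # False
print(np.shares_memory(a, reinterpreted))  # True
print(reinterpreted[0])                    # a tiny denormal number, not 1.0

# A view can be handy for inspecting the raw bytes of an array.
raw = a.view(np.uint8)
print(raw.shape)  # (16,) = 4 elements * 4 bytes
```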
Quick Reference – Most Used dtypes in 2024/2025
Use this as your cheat sheet:
| Purpose | dtype you should usually pick |
|---|---|
| General purpose numbers | float64, int64 |
| Machine learning data/weights | float32 |
| Deep learning (memory saving) | float16 (with AMP or mixed precision) |
| Images (input / storage) | uint8 |
| Image processing (internal) | float32 |
| Boolean masks | bool |
| Large safe indices / counts | int64 |
| Very small memory counters | int16, int8, uint8 |
