Why NumPy?
Definition: NumPy
NumPy (Numerical Python) is the foundational library for scientific computing in Python. It provides a high-performance multi-dimensional array object (ndarray) and tools for working with these arrays. NumPy arrays hold elements of a single data type, by default in one contiguous block of memory, which enables vectorized operations that are often 10-100x faster than the equivalent loop over a Python list. The exact speedup depends on the operation, the array size, and the hardware.
NumPy vs Python Lists: Performance Comparison
import numpy as np
import time
size = 1_000_000
# Python list operations
python_list = list(range(size))
start = time.perf_counter()  # monotonic, high-resolution timer suited to benchmarks
python_result = [x * 2 for x in python_list]
list_time = time.perf_counter() - start
# NumPy array operations
numpy_array = np.arange(size)
start = time.perf_counter()
numpy_result = numpy_array * 2
numpy_time = time.perf_counter() - start
print(f"Python list time: {list_time:.4f}s")
print(f"NumPy array time: {numpy_time:.4f}s")
print(f"NumPy is {list_time/numpy_time:.1f}x faster")
Output (example run; timings vary by machine):
Python list time: 0.1234s
NumPy array time: 0.0021s
NumPy is 58.8x faster
Why NumPy Is Fast
NumPy's speed comes from three key design choices:
- Contiguous memory layout: Arrays are stored in continuous memory blocks, enabling CPU cache efficiency
- Vectorization: Operations are applied element-wise without Python loops (loops run in C)
- Broadcasting: Arithmetic operations automatically handle arrays of different but compatible shapes
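The memory-layout point in the list above can be inspected directly. The following is a minimal sketch, assuming a small 3x4 int64 array chosen only for illustration; it reads the array's contiguity flag and strides (the number of bytes to step to reach the next row or column).

```python
import numpy as np

# Illustrative array: 12 int64 values arranged as 3 rows x 4 columns
arr = np.arange(12, dtype=np.int64).reshape(3, 4)

# C-contiguous means rows are laid out one after another in a single block
is_contiguous = arr.flags["C_CONTIGUOUS"]

print(f"Contiguous: {is_contiguous}")
print(f"Item size: {arr.itemsize} bytes")   # 8 bytes per int64
print(f"Strides: {arr.strides}")            # (32, 8): 4 items * 8 bytes per row, 8 bytes per column

# Transposing returns a view on the same buffer with swapped strides,
# so the view itself is no longer C-contiguous
transposed = arr.T
transposed_is_contiguous = transposed.flags["C_CONTIGUOUS"]
print(f"Transpose shares memory: {np.shares_memory(arr, transposed)}")
print(f"Transpose contiguous: {transposed_is_contiguous}")
```

Because the transpose only changes the strides, no data is moved; operations that need a contiguous buffer can request one with np.ascontiguousarray.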
Creating NumPy Arrays
From Python Lists
import numpy as np
# 1D array (vector)
arr1d = np.array([1, 2, 3, 4, 5])
print(f"1D Array: {arr1d}")
print(f"Shape: {arr1d.shape}") # (5,)
print(f"Dimension: {arr1d.ndim}") # 1
print(f"Data type: {arr1d.dtype}") # int64 on most platforms
# 2D array (matrix)
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(f"\n2D Array:\n{arr2d}")
print(f"Shape: {arr2d.shape}") # (2, 3)
print(f"Dimension: {arr2d.ndim}") # 2
# 3D array (tensor)
arr3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(f"\n3D Array shape: {arr3d.shape}") # (2, 2, 2)
Special Arrays
# Array of zeros
zeros = np.zeros((3, 4)) # 3x4 matrix of zeros
# Array of ones
ones = np.ones((2, 3)) # 2x3 matrix of ones
# Identity matrix
identity = np.eye(4) # 4x4 identity matrix
# Array with evenly spaced values
linear = np.linspace(0, 10, 5) # 5 values from 0 to 10
print(f"Linear: {linear}") # [0, 2.5, 5, 7.5, 10]
# Range with step
range_arr = np.arange(0, 20, 3) # 0 to 20, step 3
print(f"Range: {range_arr}") # [0, 3, 6, 9, 12, 15, 18]
# Random arrays
random_uniform = np.random.rand(3, 3) # Uniform [0, 1)
random_normal = np.random.randn(3, 3) # Standard normal
random_int = np.random.randint(0, 100, (3, 3)) # Random integers
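The np.random.rand, randn, and randint functions above belong to NumPy's legacy random interface. Newer NumPy code often uses the Generator interface created by np.random.default_rng instead. The sketch below shows rough equivalents of the three calls; the seed value 42 is an arbitrary assumption used only to make the run reproducible.

```python
import numpy as np

# A seeded generator gives reproducible results (seed value is arbitrary)
rng = np.random.default_rng(seed=42)

uniform = rng.random((3, 3))                   # uniform on [0, 1)
normal = rng.standard_normal((3, 3))           # standard normal
integers = rng.integers(0, 100, size=(3, 3))   # integers in [0, 100)

# A second generator with the same seed reproduces the same first draw
rng_again = np.random.default_rng(seed=42)
same_values = np.array_equal(uniform, rng_again.random((3, 3)))
print(f"Reproducible: {same_values}")
```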
Array Operations (Vectorization)
Element-wise Operations
Definition: Vectorization
Vectorization is the process of applying operations to entire arrays at once, without explicit Python loops. This is possible because NumPy operations are implemented in optimized C code that operates on contiguous memory blocks.
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
# Element-wise operations
print(f"Addition: {a + b}") # [11, 22, 33, 44, 55]
print(f"Subtraction: {b - a}") # [9, 18, 27, 36, 45]
print(f"Multiplication: {a * b}") # [10, 40, 90, 160, 250]
print(f"Division: {b / a}") # [10., 10., 10., 10., 10.] (true division returns floats)
print(f"Power: {a ** 2}") # [1, 4, 9, 16, 25]
# Scalar operations
print(f"\nAdd scalar: {a + 10}") # [11, 12, 13, 14, 15]
print(f"Multiply scalar: {a * 3}") # [3, 6, 9, 12, 15]
Matrix Operations
Matrix Multiplication
$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
Here,
- $A$ = First matrix ($m \times n$)
- $B$ = Second matrix ($n \times p$)
- $C$ = Result matrix ($m \times p$)
- $A_{ik}$ = Element in row i, column k of A
- $B_{kj}$ = Element in row k, column j of B
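To make the definition concrete, each entry of the result is the sum over k of A[i, k] * B[k, j]. The sketch below spells that sum out with explicit loops and checks it against the @ operator. The function name matmul_loops and the two small demo matrices are illustrative assumptions, and the loop version is for understanding only, not for performance.

```python
import numpy as np

def matmul_loops(A, B):
    """Compute C[i, j] = sum over k of A[i, k] * B[k, j] with explicit loops."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError(f"Inner dimensions differ: {n} vs {n_b}")
    C = np.zeros((m, p), dtype=np.result_type(A, B))
    for i in range(m):          # rows of A
        for j in range(p):      # columns of B
            for k in range(n):  # shared inner dimension
                C[i, j] += A[i, k] * B[k, j]
    return C

A_demo = np.array([[1, 2, 3], [4, 5, 6]])       # 2 x 3
B_demo = np.array([[1, 0], [0, 1], [2, 2]])     # 3 x 2
C_loops = matmul_loops(A_demo, B_demo)
print(C_loops)
# [[ 7  8]
#  [16 17]]
print(np.array_equal(C_loops, A_demo @ B_demo))  # True
```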
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
# Matrix multiplication
C = A @ B
print(f"Matrix multiplication:\n{C}")
# [[19, 22],
# [43, 50]]
# Element-wise multiplication (Hadamard product)
D = A * B
print(f"\nElement-wise multiplication:\n{D}")
# [[5, 12],
# [21, 32]]
# Transpose
print(f"\nTranspose of A:\n{A.T}")
# [[1, 3],
# [2, 4]]
# Determinant
det_A = np.linalg.det(A)
print(f"\nDeterminant of A: {det_A:.2f}") # -2.00
# Inverse
inv_A = np.linalg.inv(A)
print(f"\nInverse of A:\n{inv_A}")
# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"\nEigenvalues: {eigenvalues}")
print(f"\nEigenvectors:\n{eigenvectors}")
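The inverse and the eigen-decomposition can be sanity-checked numerically. This is a minimal sketch using the same 2x2 matrix: it checks that A times its inverse is the identity, and that each eigenvector column v satisfies A v = lambda v. np.allclose is used because floating-point results are only approximately exact.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

# A multiplied by its inverse should equal the identity, up to rounding error
inv_A = np.linalg.inv(A)
inverse_ok = np.allclose(A @ inv_A, np.eye(2))

# Column i of eigenvectors pairs with eigenvalues[i]
eigenvalues, eigenvectors = np.linalg.eig(A)
# Multiplying by eigenvalues scales column i by eigenvalues[i] (broadcasting)
eigen_ok = np.allclose(A @ eigenvectors, eigenvectors * eigenvalues)

print(f"A @ inv(A) is identity: {inverse_ok}")
print(f"A v = lambda v holds: {eigen_ok}")
```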
Matrix Multiplication Rules
For matrix multiplication to be defined:
- The number of columns in $A$ must equal the number of rows in $B$
- If $A$ is $m \times n$ and $B$ is $n \times p$, the result is $m \times p$
- Matrix multiplication is not commutative: $AB \neq BA$ in general
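These rules are easy to check with shapes alone. In the sketch below, the names left, right, P, and Q are illustrative: a (2, 3) matrix times a (3, 4) matrix gives a (2, 4) result, reversing the order raises a ValueError because the inner dimensions no longer match, and two square matrices give different products depending on the order.

```python
import numpy as np

left = np.ones((2, 3))
right = np.ones((3, 4))
print((left @ right).shape)  # (2, 4)

# Reversed order: (3, 4) @ (2, 3) has mismatched inner dimensions (4 vs 3)
try:
    right @ left
    reversed_ok = True
except ValueError:
    reversed_ok = False
print(f"Reversed product defined: {reversed_ok}")  # False

# Even when both orders are defined, the results generally differ
P = np.array([[1, 2], [3, 4]])
Q = np.array([[0, 1], [1, 0]])
print(P @ Q)  # [[2 1], [4 3]]: columns of P swapped
print(Q @ P)  # [[3 4], [1 2]]: rows of P swapped
commutes = np.array_equal(P @ Q, Q @ P)
```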
Broadcasting
Definition: Broadcasting
Broadcasting is NumPy's mechanism for performing arithmetic operations on arrays with different shapes. It automatically "stretches" the smaller array across the larger array so that they have compatible shapes, without copying data.
# Broadcasting examples
# Add a scalar to an array
arr = np.array([[1, 2, 3], [4, 5, 6]])
result = arr + 10
print(f"Scalar broadcast:\n{result}")
# [[11, 12, 13],
# [14, 15, 16]]
# Add 1D array to 2D array
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
arr1d = np.array([10, 20, 30])
result = arr2d + arr1d
print(f"\n1D broadcast:\n{result}")
# [[11, 22, 33],
# [14, 25, 36]]
# Visual explanation:
# arr2d: [[1, 2, 3], arr1d: [10, 20, 30]
# [4, 5, 6]]
#
# Result: [[11, 22, 33], # arr2d[0] + arr1d
# [14, 25, 36]] # arr2d[1] + arr1d
Broadcasting Rules
NumPy follows these rules when broadcasting:
- If arrays have different numbers of dimensions, the shape of the one with fewer dimensions is padded with ones on the left
- Arrays with size 1 along a dimension act as if they had the size of the largest array in that dimension
- If shapes are incompatible (neither equal nor 1), broadcasting raises a ValueError
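Each of the three rules can be seen in a few lines. This is a sketch under one assumption worth flagging: np.broadcast_shapes is only available in NumPy 1.20 and later. The arrays column and row are illustrative names.

```python
import numpy as np

# Rule 1: (3,) is padded on the left to (1, 3), then matched against (2, 3)
# (np.broadcast_shapes requires NumPy >= 1.20)
result_shape = np.broadcast_shapes((2, 3), (3,))
print(result_shape)  # (2, 3)

# Rule 2: size-1 dimensions stretch to match the other array
column = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])              # shape (3,), treated as (1, 3)
grid = column + row                    # shape (3, 3)
print(grid)
# [[ 1  2  3]
#  [11 12 13]
#  [21 22 23]]

# Rule 3: trailing dimensions 3 and 2 are neither equal nor 1
try:
    np.ones((2, 3)) + np.ones(2)
    shapes_compatible = True
except ValueError:
    shapes_compatible = False
print(f"(2, 3) and (2,) compatible: {shapes_compatible}")  # False
```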
Indexing and Slicing
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])
# Single element
print(f"Element at (1,2): {arr[1, 2]}") # 7
# Slice: rows 0-1, columns 1-2 (the stop index is exclusive)
print(f"Slice:\n{arr[0:2, 1:3]}")
# [[2, 3],
# [6, 7]]
# All rows, specific column
print(f"Column 2: {arr[:, 2]}") # [3, 7, 11]
# Boolean indexing (filtering)
data = np.array([10, 25, 30, 15, 40])
mask = data > 20
print(f"Boolean mask: {mask}") # [False, True, True, False, True]
print(f"Filtered: {data[mask]}") # [25, 30, 40]
# Fancy indexing
indices = [0, 2, 4]
print(f"Fancy indexing: {data[indices]}") # [10, 30, 40]
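One practical difference between these indexing styles is whether the result shares memory with the original array. Basic slices return views, while boolean and fancy indexing return copies. The sketch below reuses the same values as the data array above; the names sliced, picked, and masked are illustrative.

```python
import numpy as np

data = np.array([10, 25, 30, 15, 40])

# Basic slicing returns a view: writing through it changes the original
sliced = data[1:4]
sliced[0] = -1
print(data)  # [10 -1 30 15 40]

# Fancy indexing returns a copy: the original is untouched
picked = data[[0, 2, 4]]
picked[0] = 0
print(data[0])  # 10

# Boolean indexing also returns a copy
masked = data[data > 20]
masked[:] = 0
print(data)  # [10 -1 30 15 40]

print(np.shares_memory(data, sliced))  # True
print(np.shares_memory(data, picked))  # False
```

When an independent array is needed from a slice, calling .copy() on the slice avoids accidental modification of the original.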
Useful NumPy Functions
arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
# Aggregation functions
print(f"Sum: {np.sum(arr)}") # 39
print(f"Mean: {np.mean(arr)}") # 3.9
print(f"Median: {np.median(arr)}") # 3.5
print(f"Std Dev: {np.std(arr):.2f}") # 2.34
print(f"Min: {np.min(arr)}") # 1
print(f"Max: {np.max(arr)}") # 9
print(f"Argmin: {np.argmin(arr)}") # 1 (index of min)
print(f"Argmax: {np.argmax(arr)}") # 5 (index of max)
# Cumulative operations
print(f"Cumulative sum: {np.cumsum(arr)}") # [3,4,8,9,14,23,25,31,36,39]
# Sorting
print(f"Sorted: {np.sort(arr)}") # [1,1,2,3,3,4,5,5,6,9]
# Unique values
arr2 = np.array([1, 1, 2, 2, 3, 3, 3])
print(f"Unique: {np.unique(arr2)}") # [1, 2, 3]
print(f"Counts: {np.unique(arr2, return_counts=True)}") # ([1,2,3], [2,2,3])
# Where (conditional)
scores = np.array([85, 92, 78, 95, 88])
grades = np.where(scores >= 90, 'A',
                  np.where(scores >= 80, 'B', 'C'))
print(f"Grades: {grades}") # ['B', 'A', 'C', 'A', 'B']
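Two options on the aggregation functions are worth knowing. By default np.std divides by N (the population standard deviation), and passing ddof=1 divides by N - 1 (the sample standard deviation). On multi-dimensional arrays, the axis argument selects the direction to aggregate along. The sketch below is illustrative; the 2x3 matrix is an assumed example.

```python
import numpy as np

arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])

population_std = np.std(arr)        # ddof=0: divide by N
sample_std = np.std(arr, ddof=1)    # ddof=1: divide by N - 1
print(f"Population std: {population_std:.4f}")
print(f"Sample std: {sample_std:.4f}")

matrix = np.array([[1, 2, 3], [4, 5, 6]])
column_sums = matrix.sum(axis=0)    # collapse rows: one sum per column
row_sums = matrix.sum(axis=1)       # collapse columns: one sum per row
print(f"Column sums: {column_sums}")  # [5 7 9]
print(f"Row sums: {row_sums}")        # [ 6 15]
```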
Practical Example: Data Normalization
Different features often have vastly different scales (e.g., age: 0-100, income: 0-1,000,000). Without normalization, features with larger magnitudes dominate distance calculations, gradient descent, and model training.
Min-Max Normalization
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Here,
- $x$ = Original data value
- $x_{\min}$ = Minimum value in the dataset
- $x_{\max}$ = Maximum value in the dataset
- $x'$ = Normalized value in [0, 1]
When to Use Min-Max
Use for neural networks with sigmoid/tanh activations, image pixel values, and algorithms sensitive to magnitude (KNN, K-Means). Min-Max is sensitive to outliers: a single extreme value compresses all other data into a narrow range.
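Min-Max normalization subtracts each feature's minimum and divides by its range, which takes only a few lines of plain NumPy. The sketch below is one possible version, not a reference implementation: the function name min_max_normalize is hypothetical, and mapping constant columns to 0 (instead of dividing by zero) is an assumption of this sketch. The second call illustrates the outlier sensitivity described above.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to [0, 1] (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    span = X.max(axis=0) - col_min
    # Assumption: a constant column (span 0) maps to 0 instead of dividing by zero
    span = np.where(span == 0, 1.0, span)
    return (X - col_min) / span

# Columns: age, income
X = np.array([[25, 50000], [30, 80000], [35, 120000], [40, 95000]])
X_scaled = min_max_normalize(X)
print(X_scaled)

# One extreme value squeezes the remaining values toward 0
with_outlier = min_max_normalize([1, 2, 3, 4, 100])
print(with_outlier)  # the first four values all fall below 0.05
```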
Z-Score Standardization
$$z = \frac{x - \mu}{\sigma}$$
Here,
- $x$ = Original data value
- $\mu$ = Population mean
- $\sigma$ = Population standard deviation
- $z$ = Standardized score (number of std devs from mean)
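When to Use Z-Score
Use for algorithms assuming Gaussian distribution (LDA, Gaussian Naive Bayes), regression with regularization, PCA, and distance-based methods. It centers data at the origin and, unlike Min-Max, does not squeeze values into a fixed range, so a single outlier distorts it less. The mean and standard deviation are still affected by outliers, so it is not fully robust.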
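Z-score standardization subtracts each feature's mean and divides by its population standard deviation, which is also short in plain NumPy. The sketch below is illustrative: the function name z_score_standardize is hypothetical, and leaving constant columns at 0 is an assumption of this sketch. It uses the population standard deviation (ddof=0), consistent with the definition above.

```python
import numpy as np

def z_score_standardize(X):
    """Center each column at 0 and scale to unit variance (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # population standard deviation (ddof=0)
    # Assumption: a constant column (std 0) maps to 0 instead of dividing by zero
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std

# Columns: age, income
X = np.array([[25, 50000], [30, 80000], [35, 120000], [40, 95000]])
X_std = z_score_standardize(X)
print(X_std)
print(f"Column means: {X_std.mean(axis=0)}")  # approximately 0
print(f"Column stds: {X_std.std(axis=0)}")    # approximately 1
```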
When to Choose Which?
| Scenario | Use | Why |
|---|---|---|
| Neural network with ReLU | Min-Max | Keeps values positive, bounded range |
| Linear/Logistic Regression | Z-Score | Assumptions align with standardized data |
| KNN / K-Means | Z-Score | Distance metrics need comparable scales |
| PCA | Z-Score | Variance scaling must be equal |
| Tree-based models | Neither | Trees are scale-invariant |
Implementing Both Normalizations
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Sample data
X = np.array([[25, 50000], [30, 80000], [35, 120000], [40, 95000]])
# Min-Max Normalization
scaler_minmax = MinMaxScaler()
X_minmax = scaler_minmax.fit_transform(X)
print(f"Min-Max normalized:\n{X_minmax}")
# Z-Score Standardization
scaler_standard = StandardScaler()
X_standard = scaler_standard.fit_transform(X)
print(f"\nZ-Score standardized:\n{X_standard}")
print(f"Mean: {X_standard.mean(axis=0)}") # approximately [0, 0] (up to floating-point error)
print(f"Std: {X_standard.std(axis=0)}") # [1, 1]
Key Takeaways
Summary: NumPy Essentials
- NumPy arrays are significantly faster than Python lists due to contiguous memory and vectorized operations
- Vectorization eliminates explicit loops: operations are applied element-wise in optimized C
- Broadcasting enables operations on arrays of different shapes without data copying
- Matrix multiplication uses the @ operator; remember the dimension rule: an $m \times n$ matrix times an $n \times p$ matrix gives an $m \times p$ result
- Master indexing and slicing for efficient data manipulation (position-based, boolean, fancy)
- Normalization is critical for distance-based algorithms: choose Min-Max or Z-Score based on your model's assumptions
- NumPy is the foundation for Pandas, Scikit-learn, TensorFlow, and PyTorch
Practice Exercise
- Create a 5x5 random matrix and compute its eigenvalues and eigenvectors
- Implement row-wise and column-wise normalization (divide each row/column by its L2 norm)
- Find all elements greater than the matrix mean using boolean indexing
- Use broadcasting to compute the pairwise Euclidean distance matrix between two sets of points
- Verify that the determinant of a matrix equals the product of its eigenvalues