Python for Data Science: NumPy Essentials

Why NumPy?

Definition: NumPy

NumPy (Numerical Python) is the foundational library for scientific computing in Python. It provides a high-performance multi-dimensional array object (ndarray) and tools for working with these arrays. By default, a NumPy array stores its elements in a single contiguous block of memory with one fixed data type, which enables vectorized operations that are often 10-100x faster than the equivalent loop over a Python list, depending on the operation and the array size.

NumPy vs Python Lists — Performance Comparison

import numpy as np
import time

size = 1_000_000

# Python list operations
python_list = list(range(size))
start = time.perf_counter()  # perf_counter is a monotonic, high-resolution timer
python_result = [x * 2 for x in python_list]
list_time = time.perf_counter() - start

# NumPy array operations
numpy_array = np.arange(size)
start = time.perf_counter()
numpy_result = numpy_array * 2
numpy_time = time.perf_counter() - start

print(f"Python list time: {list_time:.4f}s")
print(f"NumPy array time: {numpy_time:.4f}s")
print(f"NumPy is {list_time/numpy_time:.1f}x faster")

Output (exact timings vary by machine):

Python list time: 0.1234s
NumPy array time: 0.0021s
NumPy is 58.8x faster

ℹ️ Why NumPy Is Fast

NumPy's speed comes from three key design choices:

Contiguous memory layout: Arrays are stored in contiguous memory blocks, which makes efficient use of the CPU cache
Vectorization: Operations are applied element-wise without Python-level loops (the loops run in compiled C code)
Broadcasting: Arithmetic operations automatically handle arrays of different but compatible shapes
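
The memory-layout point can be inspected directly. The following is a minimal sketch, assuming a small int64 array chosen only for illustration; the byte size of a Python int object is a CPython implementation detail and is printed rather than assumed.

```python
import sys
import numpy as np

# A small 2x3 array with an explicit 8-byte integer type
arr = np.arange(6, dtype=np.int64).reshape(2, 3)

# A freshly created array occupies one contiguous, row-major (C-ordered) block
is_contiguous = arr.flags["C_CONTIGUOUS"]

# Strides: bytes to skip to reach the next row, then the next column
strides = arr.strides

# The data buffer holds only the raw numbers: 6 items * 8 bytes each
buffer_bytes = arr.nbytes

# A Python list stores references to separate int objects instead
py_list = arr.ravel().tolist()
per_object_bytes = sys.getsizeof(py_list[1])  # implementation-specific

print(f"C-contiguous: {is_contiguous}")
print(f"Strides: {strides}")                  # (24, 8)
print(f"Buffer size: {buffer_bytes} bytes")   # 48
print(f"One Python int object: {per_object_bytes} bytes")
```

Because every element sits at a fixed offset from the start of the buffer, compiled loops can walk the data without following a pointer for each element.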

Creating NumPy Arrays

From Python Lists

import numpy as np

# 1D array (vector)
arr1d = np.array([1, 2, 3, 4, 5])
print(f"1D Array: {arr1d}")
print(f"Shape: {arr1d.shape}")        # (5,)
print(f"Dimension: {arr1d.ndim}")     # 1
print(f"Data type: {arr1d.dtype}")    # int64 on most platforms

# 2D array (matrix)
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(f"\n2D Array:\n{arr2d}")
print(f"Shape: {arr2d.shape}")        # (2, 3)
print(f"Dimension: {arr2d.ndim}")     # 2

# 3D array (tensor)
arr3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(f"\n3D Array shape: {arr3d.shape}")  # (2, 2, 2)

Special Arrays

# Array of zeros
zeros = np.zeros((3, 4))  # 3x4 matrix of zeros

# Array of ones
ones = np.ones((2, 3))    # 2x3 matrix of ones

# Identity matrix
identity = np.eye(4)      # 4x4 identity matrix

# Array with evenly spaced values
linear = np.linspace(0, 10, 5)  # 5 values from 0 to 10
print(f"Linear: {linear}")     # [0, 2.5, 5, 7.5, 10]

# Range with step
range_arr = np.arange(0, 20, 3)  # 0 to 20, step 3
print(f"Range: {range_arr}")      # [0, 3, 6, 9, 12, 15, 18]

# Random arrays
random_uniform = np.random.rand(3, 3)      # Uniform [0, 1)
random_normal = np.random.randn(3, 3)      # Standard normal
random_int = np.random.randint(0, 100, (3, 3))  # Random integers

Array Operations (Vectorization)

Element-wise Operations

Definition: Vectorization

Vectorization is the process of applying operations to entire arrays at once, without explicit Python loops. This is possible because NumPy operations are implemented in optimized C code that operates on contiguous memory blocks.

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

# Element-wise operations
print(f"Addition: {a + b}")        # [11, 22, 33, 44, 55]
print(f"Subtraction: {b - a}")     # [9, 18, 27, 36, 45]
print(f"Multiplication: {a * b}")  # [10, 40, 90, 160, 250]
print(f"Division: {b / a}")        # [10., 10., 10., 10., 10.] (true division returns floats)
print(f"Power: {a ** 2}")          # [1, 4, 9, 16, 25]

# Scalar operations
print(f"\nAdd scalar: {a + 10}")   # [11, 12, 13, 14, 15]
print(f"Multiply scalar: {a * 3}") # [3, 6, 9, 12, 15]

Matrix Operations

Matrix Multiplication

$C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj}$

Here,

$A$ = First matrix (m × n)
$B$ = Second matrix (n × p)
$C$ = Result matrix (m × p)
$A_{ik}$ = Element in row i, column k of A
$B_{kj}$ = Element in row k, column j of B

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiplication
C = A @ B
print(f"Matrix multiplication:\n{C}")
# [[19, 22],
#  [43, 50]]

# Element-wise multiplication (Hadamard product)
D = A * B
print(f"\nElement-wise multiplication:\n{D}")
# [[5, 12],
#  [21, 32]]

# Transpose
print(f"\nTranspose of A:\n{A.T}")
# [[1, 3],
#  [2, 4]]

# Determinant
det_A = np.linalg.det(A)
print(f"\nDeterminant of A: {det_A:.2f}")  # -2.00

# Inverse
inv_A = np.linalg.inv(A)
print(f"\nInverse of A:\n{inv_A}")

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"\nEigenvalues: {eigenvalues}")
print(f"\nEigenvectors:\n{eigenvectors}")

💡 Matrix Multiplication Rules

For matrix multiplication $A \times B$ to be defined:

The number of columns in $A$ must equal the number of rows in $B$
If $A$ is $(m \times n)$ and $B$ is $(n \times p)$, the result $C$ is $(m \times p)$
Matrix multiplication is not commutative: $AB \neq BA$ in general
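
The rules above can be checked with a few small arrays. This is an illustrative sketch; the shapes and the matrices P and Q are arbitrary choices, not part of the original example.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)    # shape (2, 3): m=2, n=3
B = np.arange(12).reshape(3, 4)   # shape (3, 4): n=3, p=4

# Inner dimensions match (3 == 3), so the product has shape (m, p)
C = A @ B
print(f"(2, 3) @ (3, 4) -> {C.shape}")   # (2, 4)

# Reversed order: (3, 4) @ (2, 3) has inner dimensions 4 and 2, which differ
try:
    B @ A
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print(f"Shape mismatch: {err}")

# Even for square matrices, the order of the operands matters
P = np.array([[1, 2], [3, 4]])
Q = np.array([[0, 1], [1, 0]])
PQ = P @ Q    # [[2, 1], [4, 3]]  (columns of P swapped)
QP = Q @ P    # [[3, 4], [1, 2]]  (rows of P swapped)
commutes = np.array_equal(PQ, QP)
print(f"PQ == QP? {commutes}")           # False
```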

Broadcasting

Definition: Broadcasting

Broadcasting is NumPy's mechanism for performing arithmetic operations on arrays with different shapes. It automatically "stretches" the smaller array across the larger array so that they have compatible shapes, without copying data.

# Broadcasting examples

# Add a scalar to an array
arr = np.array([[1, 2, 3], [4, 5, 6]])
result = arr + 10
print(f"Scalar broadcast:\n{result}")
# [[11, 12, 13],
#  [14, 15, 16]]

# Add 1D array to 2D array
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
arr1d = np.array([10, 20, 30])
result = arr2d + arr1d
print(f"\n1D broadcast:\n{result}")
# [[11, 22, 33],
#  [14, 25, 36]]

# Visual explanation:
# arr2d:     [[1, 2, 3],     arr1d: [10, 20, 30]
#            [4, 5, 6]]
#
# Result:    [[11, 22, 33],   # arr2d[0] + arr1d
#            [14, 25, 36]]    # arr2d[1] + arr1d

ℹ️ Broadcasting Rules

NumPy follows these rules when broadcasting:

If arrays have different numbers of dimensions, the shape of the one with fewer dimensions is padded with ones on the left
Arrays with size 1 along a dimension act as if they had the size of the largest array in that dimension
If shapes are incompatible (neither equal nor 1), broadcasting raises a ValueError
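
Each rule can be seen in isolation with small arrays. The sketch below uses arbitrary shapes for illustration; the last part uses np.broadcast_to to show that the "stretched" array is a view with a zero stride rather than a copy.

```python
import numpy as np

# Rule 1: shape (3,) is padded on the left to (1, 3), then stretched to (2, 3)
a = np.ones((2, 3))
b = np.arange(3)
rule1_shape = (a + b).shape
print(f"(2, 3) + (3,)   -> {rule1_shape}")    # (2, 3)

# Rule 2: size-1 dimensions stretch to match the other array
col = np.arange(2).reshape(2, 1)
row = np.arange(3).reshape(1, 3)
rule2_shape = (col + row).shape
print(f"(2, 1) + (1, 3) -> {rule2_shape}")    # (2, 3)

# Rule 3: (2,) is padded to (1, 2); trailing sizes 3 and 2 differ and neither is 1
try:
    np.ones((2, 3)) + np.ones((2,))
    rule3_raised = False
except ValueError as err:
    rule3_raised = True
    print(f"Incompatible shapes: {err}")

# No data is copied: the stretched axis has stride 0, so every row
# of the view points at the same underlying elements of b
stretched = np.broadcast_to(b, (2, 3))
print(f"Strides of stretched view: {stretched.strides}")
```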

Indexing and Slicing

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

# Single element
print(f"Element at (1,2): {arr[1, 2]}")  # 7

# Slice: rows 0-1, columns 1-2 (the stop index is exclusive)
print(f"Slice:\n{arr[0:2, 1:3]}")
# [[2, 3],
#  [6, 7]]

# All rows, specific column
print(f"Column 2: {arr[:, 2]}")  # [3, 7, 11]

# Boolean indexing (filtering)
data = np.array([10, 25, 30, 15, 40])
mask = data > 20
print(f"Boolean mask: {mask}")           # [False, True, True, False, True]
print(f"Filtered: {data[mask]}")          # [25, 30, 40]

# Fancy indexing
indices = [0, 2, 4]
print(f"Fancy indexing: {data[indices]}")  # [10, 30, 40]

Useful NumPy Functions

arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])

# Aggregation functions
print(f"Sum: {np.sum(arr)}")           # 39
print(f"Mean: {np.mean(arr)}")         # 3.9
print(f"Median: {np.median(arr)}")     # 3.5
print(f"Std Dev: {np.std(arr):.2f}")   # 2.34
print(f"Min: {np.min(arr)}")           # 1
print(f"Max: {np.max(arr)}")           # 9
print(f"Argmin: {np.argmin(arr)}")     # 1 (index of min)
print(f"Argmax: {np.argmax(arr)}")     # 5 (index of max)

# Cumulative operations
print(f"Cumulative sum: {np.cumsum(arr)}")  # [3,4,8,9,14,23,25,31,36,39]

# Sorting
print(f"Sorted: {np.sort(arr)}")       # [1,1,2,3,3,4,5,5,6,9]

# Unique values
arr2 = np.array([1, 1, 2, 2, 3, 3, 3])
print(f"Unique: {np.unique(arr2)}")    # [1, 2, 3]
print(f"Counts: {np.unique(arr2, return_counts=True)}")  # ([1,2,3], [2,2,3])

# Where (conditional)
scores = np.array([85, 92, 78, 95, 88])
grades = np.where(scores >= 90, 'A', 
         np.where(scores >= 80, 'B', 'C'))
print(f"Grades: {grades}")  # ['B', 'A', 'C', 'A', 'B']

Practical Example: Data Normalization

Different features often have vastly different scales (e.g., age: 0-100, income: 0-1,000,000). Without normalization, features with larger magnitudes dominate distance calculations, gradient descent, and model training.
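
To see how a large-magnitude feature dominates a distance calculation, consider the following sketch. The (age, income) rows are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical (age, income) rows
people = np.array([[25.0, 50_000.0],
                   [60.0, 50_500.0],
                   [26.0, 58_000.0]])

# Differences between person 0 and each of the other two
diffs = people[1:] - people[0]              # [[35, 500], [1, 8000]]

# Euclidean distance over both features
raw_dist = np.linalg.norm(diffs, axis=1)

# Distance using the income column alone
income_only = np.abs(diffs[:, 1])

print(f"Distance (age + income): {raw_dist}")
print(f"Distance (income only):  {income_only}")
```

The two results are nearly identical: a 35-year age gap changes the first distance by well under 1%, so on the raw scale the age feature is effectively ignored.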

Min-Max Normalization

$x_{\text{normalized}} = \frac{x - \min(x)}{\max(x) - \min(x)}$

Here,

$x$ = Original data value
$\min(x)$ = Minimum value in the dataset
$\max(x)$ = Maximum value in the dataset
$x_{\text{normalized}}$ = Normalized value in [0, 1]

💡 When to Use Min-Max

Use for neural networks with sigmoid/tanh activations, image pixel values, and algorithms sensitive to feature magnitude (KNN, K-Means). Min-Max scaling is sensitive to outliers: a single extreme value compresses all other data into a narrow range.
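
The Min-Max formula translates directly into NumPy. The following is a minimal sketch; the helper name min_max_normalize and the choice to map constant columns to 0 are assumptions made for this example, not a standard API.

```python
import numpy as np

def min_max_normalize(x):
    """Scale values to [0, 1] along axis 0 (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_range = x.max(axis=0) - col_min
    # Assumption: a constant column (range 0) maps to 0 instead of dividing by zero
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (x - col_min) / safe_range

ages = np.array([25.0, 30.0, 35.0, 40.0])
scaled = min_max_normalize(ages)
print(f"Scaled: {scaled}")                 # 0, 1/3, 2/3, 1

# One extreme value squeezes the remaining points toward 0
with_outlier = np.array([25.0, 30.0, 35.0, 40.0, 1000.0])
scaled_outlier = min_max_normalize(with_outlier)
print(f"With outlier: {scaled_outlier}")
```

In the second call the four original ages all land below 0.02, which is the compression effect described above.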

Z-Score Standardization

$z = \frac{x - \mu}{\sigma}$

Here,

$x$ = Original data value
$\mu$ = Population mean
$\sigma$ = Population standard deviation
$z$ = Standardized score (number of std devs from mean)

💡 When to Use Z-Score

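Use for algorithms that assume Gaussian-distributed features (LDA, Gaussian Naive Bayes), regression with regularization, PCA, and any distance-based method. Z-Score is less distorted by outliers than Min-Max because the output range is not bounded, but it is not fully robust: the mean and standard deviation are themselves pulled by extreme values. It centers the data at the origin.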
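
The Z-Score formula can also be written in a few lines of NumPy. This is a sketch under stated assumptions: the helper name z_score_standardize is hypothetical, ddof=0 is used so that the standard deviation is the population value from the formula, and constant columns are left at 0.

```python
import numpy as np

def z_score_standardize(x, ddof=0):
    """Standardize along axis 0 to zero mean and unit variance (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean(axis=0)
    sigma = x.std(axis=0, ddof=ddof)   # ddof=0 -> population standard deviation
    # Assumption: a constant column (sigma 0) maps to 0 instead of dividing by zero
    safe_sigma = np.where(sigma == 0, 1.0, sigma)
    return (x - mu) / safe_sigma

# (age, income) rows
X = np.array([[25, 50000], [30, 80000], [35, 120000], [40, 95000]])
Z = z_score_standardize(X)

print(f"Standardized:\n{Z}")
print(f"Column means: {Z.mean(axis=0)}")   # approximately [0, 0]
print(f"Column stds:  {Z.std(axis=0)}")    # approximately [1, 1]
```

Broadcasting does the per-column work here: mu and sigma have shape (2,), and subtracting them from the (4, 2) array applies each column's statistics to every row.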

When to Choose Which?

Scenario	Use	Why
Neural network with ReLU	Min-Max	Keeps values positive, bounded range
Linear/Logistic Regression	Z-Score	Assumptions align with standardized data
KNN / K-Means	Z-Score	Distance metrics need comparable scales
PCA	Z-Score	Variance scaling must be equal
Tree-based models	Neither	Trees are scale-invariant

📝 Implementing Both Normalizations

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample data
X = np.array([[25, 50000], [30, 80000], [35, 120000], [40, 95000]])

# Min-Max Normalization
scaler_minmax = MinMaxScaler()
X_minmax = scaler_minmax.fit_transform(X)
print(f"Min-Max normalized:\n{X_minmax}")

# Z-Score Standardization
scaler_standard = StandardScaler()
X_standard = scaler_standard.fit_transform(X)
print(f"\nZ-Score standardized:\n{X_standard}")
print(f"Mean: {X_standard.mean(axis=0)}")  # [0, 0]
print(f"Std:  {X_standard.std(axis=0)}")   # [1, 1]

Key Takeaways

📋 Summary: NumPy Essentials

NumPy arrays are significantly faster than Python lists due to contiguous memory and vectorized operations
Vectorization eliminates explicit loops — operations are applied element-wise in optimized C
Broadcasting enables operations on arrays of different shapes without data copying
Matrix multiplication uses the @ operator; remember the dimension rule ($m \times n$ times $n \times p$ gives $m \times p$)
Master indexing and slicing for efficient data manipulation (position-based, slices, boolean masks, fancy indexing)
Normalization is critical for distance-based algorithms — choose Min-Max or Z-Score based on your model's assumptions
NumPy is the foundation for Pandas, Scikit-learn, TensorFlow, and PyTorch

Practice Exercise

Create a 5x5 random matrix and compute its eigenvalues and eigenvectors
Implement row-wise and column-wise normalization (divide each row/column by its L2 norm)
Find all elements greater than the matrix mean using boolean indexing
Use broadcasting to compute the pairwise Euclidean distance matrix between two sets of points
Verify that the determinant of a matrix equals the product of its eigenvalues
