Python for Data Science: NumPy Essentials

Module 1: Foundations

Why NumPy?

Definition: NumPy

NumPy (Numerical Python) is the foundational library for scientific computing in Python. It provides a high-performance multi-dimensional array object (ndarray) and tools for working with these arrays. NumPy arrays hold elements of a single data type, typically in a contiguous block of memory, which enables vectorized operations that are often 10-100x faster than the equivalent Python list code.

NumPy vs Python Lists — Performance Comparison

import numpy as np
import time

size = 1_000_000

# Python list operations
python_list = list(range(size))
start = time.perf_counter()  # perf_counter is a monotonic, high-resolution timer
python_result = [x * 2 for x in python_list]
list_time = time.perf_counter() - start

# NumPy array operations
numpy_array = np.arange(size)
start = time.perf_counter()
numpy_result = numpy_array * 2
numpy_time = time.perf_counter() - start

print(f"Python list time: {list_time:.4f}s")
print(f"NumPy array time: {numpy_time:.4f}s")
print(f"NumPy is {list_time/numpy_time:.1f}x faster")

Output (exact timings vary by machine):

Python list time: 0.1234s
NumPy array time: 0.0021s
NumPy is 58.8x faster

â„šī¸ Why NumPy Is Fast

NumPy's speed comes from three key design choices:

  1. Contiguous memory layout: Arrays are stored in contiguous memory blocks, which makes efficient use of the CPU cache
  2. Vectorization: Operations are applied element-wise without explicit Python loops (the loops run in compiled C code)
  3. Broadcasting: Arithmetic operations automatically handle arrays of different but compatible shapes
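
The contiguous layout described in the first point can be inspected directly. The following is a minimal sketch, assuming a small int64 array created only for illustration; it prints the array's memory flags and strides (the number of bytes to step to reach the next element along each axis).

```python
import numpy as np

# Illustrative array: 3 rows x 4 columns of 8-byte integers
arr = np.arange(12, dtype=np.int64).reshape(3, 4)

# C-contiguous means rows are laid out one after another in a single block
print(arr.flags["C_CONTIGUOUS"])     # True
print(arr.itemsize)                  # 8 bytes per int64 element
print(arr.strides)                   # (32, 8): 32 bytes to the next row, 8 to the next column

# A transposed view reuses the same buffer with swapped strides (no copy)
print(arr.T.strides)                 # (8, 32)
print(np.shares_memory(arr, arr.T))  # True
```

Because the elements sit side by side in memory, the compiled loops can walk through them with simple pointer arithmetic rather than following a reference per element as a Python list does.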

Creating NumPy Arrays

From Python Lists

import numpy as np

# 1D array (vector)
arr1d = np.array([1, 2, 3, 4, 5])
print(f"1D Array: {arr1d}")
print(f"Shape: {arr1d.shape}")        # (5,)
print(f"Dimension: {arr1d.ndim}")     # 1
print(f"Data type: {arr1d.dtype}")    # int64 on most platforms (platform-dependent)

# 2D array (matrix)
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(f"\n2D Array:\n{arr2d}")
print(f"Shape: {arr2d.shape}")        # (2, 3)
print(f"Dimension: {arr2d.ndim}")     # 2

# 3D array (tensor)
arr3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(f"\n3D Array shape: {arr3d.shape}")  # (2, 2, 2)

Special Arrays

# Array of zeros
zeros = np.zeros((3, 4))  # 3x4 matrix of zeros

# Array of ones
ones = np.ones((2, 3))    # 2x3 matrix of ones

# Identity matrix
identity = np.eye(4)      # 4x4 identity matrix

# Array with evenly spaced values
linear = np.linspace(0, 10, 5)  # 5 values from 0 to 10
print(f"Linear: {linear}")     # [0, 2.5, 5, 7.5, 10]

# Range with step
range_arr = np.arange(0, 20, 3)  # 0 to 20, step 3
print(f"Range: {range_arr}")      # [0, 3, 6, 9, 12, 15, 18]

# Random arrays
random_uniform = np.random.rand(3, 3)      # Uniform [0, 1)
random_normal = np.random.randn(3, 3)      # Standard normal
random_int = np.random.randint(0, 100, (3, 3))  # Random integers
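
The functions above belong to NumPy's legacy random interface. For new code, the NumPy documentation recommends the Generator interface created with np.random.default_rng. The sketch below shows rough equivalents of the three calls; the seed value 42 is an arbitrary choice for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen arbitrarily for this sketch

uniform = rng.random((3, 3))                  # uniform on [0, 1)
normal = rng.standard_normal((3, 3))          # standard normal
integers = rng.integers(0, 100, size=(3, 3))  # integers in [0, 100)

# Re-creating the generator with the same seed reproduces the same values
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(uniform, rng_again.random((3, 3))))  # True
```

Passing a seed makes experiments reproducible, which helps when debugging or comparing results across runs.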

Array Operations (Vectorization)

Element-wise Operations

Definition: Vectorization

Vectorization is the process of applying operations to entire arrays at once, without explicit Python loops. This is possible because NumPy operations are implemented in optimized C code that operates on contiguous memory blocks.

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

# Element-wise operations
print(f"Addition: {a + b}")        # [11, 22, 33, 44, 55]
print(f"Subtraction: {b - a}")     # [9, 18, 27, 36, 45]
print(f"Multiplication: {a * b}")  # [10, 40, 90, 160, 250]
print(f"Division: {b / a}")        # [10., 10., 10., 10., 10.] (true division returns floats)
print(f"Power: {a ** 2}")          # [1, 4, 9, 16, 25]

# Scalar operations
print(f"\nAdd scalar: {a + 10}")   # [11, 12, 13, 14, 15]
print(f"Multiply scalar: {a * 3}") # [3, 6, 9, 12, 15]

Matrix Operations

Matrix Multiplication

$$C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj}$$

Here,

  • $A$ = First matrix (m × n)
  • $B$ = Second matrix (n × p)
  • $C$ = Result matrix (m × p)
  • $A_{ik}$ = Element in row i, column k of A
  • $B_{kj}$ = Element in row k, column j of B

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix multiplication
C = A @ B
print(f"Matrix multiplication:\n{C}")
# [[19, 22],
#  [43, 50]]

# Element-wise multiplication (Hadamard product)
D = A * B
print(f"\nElement-wise multiplication:\n{D}")
# [[5, 12],
#  [21, 32]]

# Transpose
print(f"\nTranspose of A:\n{A.T}")
# [[1, 3],
#  [2, 4]]

# Determinant
det_A = np.linalg.det(A)
print(f"\nDeterminant of A: {det_A:.2f}")  # -2.00

# Inverse
inv_A = np.linalg.inv(A)
print(f"\nInverse of A:\n{inv_A}")

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"\nEigenvalues: {eigenvalues}")
print(f"\nEigenvectors:\n{eigenvectors}")

💡 Matrix Multiplication Rules

For matrix multiplication $A \times B$ to be defined:

  • The number of columns in $A$ must equal the number of rows in $B$
  • If $A$ is $(m \times n)$ and $B$ is $(n \times p)$, the result $C$ is $(m \times p)$
  • Matrix multiplication is not commutative: $AB \neq BA$ in general
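
These rules can be checked with a few small matrices. The sketch below uses arbitrary example values; the names P and Q are introduced here purely for illustration.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)    # shape (2, 3)
B = np.arange(12).reshape(3, 4)   # shape (3, 4)

C = A @ B
print(C.shape)                    # (2, 4): inner dimensions (3 and 3) match

# Reversing the order fails: B has 4 columns but A has only 2 rows
try:
    B @ A
except ValueError as exc:
    print(f"B @ A is undefined: {exc}")

# Even for square matrices, the order of the operands matters
P = np.array([[1, 2], [3, 4]])
Q = np.array([[0, 1], [1, 0]])
print(np.array_equal(P @ Q, Q @ P))  # False
```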

Broadcasting

Definition: Broadcasting

Broadcasting is NumPy's mechanism for performing arithmetic operations on arrays with different shapes. It automatically "stretches" the smaller array across the larger array so that they have compatible shapes, without copying data.

# Broadcasting examples

# Add a scalar to an array
arr = np.array([[1, 2, 3], [4, 5, 6]])
result = arr + 10
print(f"Scalar broadcast:\n{result}")
# [[11, 12, 13],
#  [14, 15, 16]]

# Add 1D array to 2D array
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
arr1d = np.array([10, 20, 30])
result = arr2d + arr1d
print(f"\n1D broadcast:\n{result}")
# [[11, 22, 33],
#  [14, 25, 36]]

# Visual explanation:
# arr2d:     [[1, 2, 3],     arr1d: [10, 20, 30]
#            [4, 5, 6]]
#
# Result:    [[11, 22, 33],   # arr2d[0] + arr1d
#            [14, 25, 36]]    # arr2d[1] + arr1d

â„šī¸ Broadcasting Rules

NumPy follows these rules when broadcasting:

  1. If arrays have different numbers of dimensions, the shape of the one with fewer dimensions is padded with ones on the left
  2. Arrays with size 1 along a dimension act as if they had the size of the largest array in that dimension
  3. If shapes are incompatible (neither equal nor 1), broadcasting raises a ValueError
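
Each of the three rules can be exercised in a few lines. This is a minimal sketch with made-up shapes; it assumes NumPy 1.20 or later, where np.broadcast_shapes is available.

```python
import numpy as np

# Rule 1: shape (3,) is treated as (1, 3), then stretched to (2, 3)
print(np.broadcast_shapes((2, 3), (3,)))   # (2, 3)

# Rule 2: size-1 dimensions stretch in both operands
col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4).reshape(1, 4)   # shape (1, 4)
print((col + row).shape)           # (3, 4)

# Rule 3: trailing dimensions 3 and 2 are neither equal nor 1
try:
    np.ones((2, 3)) + np.ones(2)
except ValueError as exc:
    print(f"Incompatible shapes: {exc}")

# The "stretch" is virtual: the broadcast view has stride 0 along the new axis
view = np.broadcast_to(np.array([10, 20, 30]), (2, 3))
print(view.strides[0])             # 0
```

A stride of 0 means that every row of the view reads the same three values from memory, which is why broadcasting does not need to copy the smaller array.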

Indexing and Slicing

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

# Single element
print(f"Element at (1,2): {arr[1, 2]}")  # 7

# Slice: rows 0-1, columns 1-2 (the stop index is exclusive)
print(f"Slice:\n{arr[0:2, 1:3]}")
# [[2, 3],
#  [6, 7]]

# All rows, specific column
print(f"Column 2: {arr[:, 2]}")  # [3, 7, 11]

# Boolean indexing (filtering)
data = np.array([10, 25, 30, 15, 40])
mask = data > 20
print(f"Boolean mask: {mask}")           # [False, True, True, False, True]
print(f"Filtered: {data[mask]}")          # [25, 30, 40]

# Fancy indexing
indices = [0, 2, 4]
print(f"Fancy indexing: {data[indices]}")  # [10, 30, 40]

Useful NumPy Functions

arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])

# Aggregation functions
print(f"Sum: {np.sum(arr)}")           # 39
print(f"Mean: {np.mean(arr)}")         # 3.9
print(f"Median: {np.median(arr)}")     # 3.5
print(f"Std Dev: {np.std(arr):.2f}")   # 2.34
print(f"Min: {np.min(arr)}")           # 1
print(f"Max: {np.max(arr)}")           # 9
print(f"Argmin: {np.argmin(arr)}")     # 1 (index of min)
print(f"Argmax: {np.argmax(arr)}")     # 5 (index of max)

# Cumulative operations
print(f"Cumulative sum: {np.cumsum(arr)}")  # [3,4,8,9,14,23,25,31,36,39]

# Sorting
print(f"Sorted: {np.sort(arr)}")       # [1,1,2,3,3,4,5,5,6,9]

# Unique values
arr2 = np.array([1, 1, 2, 2, 3, 3, 3])
print(f"Unique: {np.unique(arr2)}")    # [1, 2, 3]
print(f"Counts: {np.unique(arr2, return_counts=True)}")  # ([1,2,3], [2,2,3])

# Where (conditional)
scores = np.array([85, 92, 78, 95, 88])
grades = np.where(scores >= 90, 'A', 
         np.where(scores >= 80, 'B', 'C'))
print(f"Grades: {grades}")  # ['B', 'A', 'C', 'A', 'B']

Practical Example: Data Normalization

Different features often have vastly different scales (e.g., age: 0-100, income: 0-1,000,000). Without normalization, features with larger magnitudes dominate distance calculations, gradient descent, and model training.
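
To see how a large-magnitude feature can dominate a distance calculation, consider the sketch below. The three (age, income) records are hypothetical values invented for this illustration.

```python
import numpy as np

# Hypothetical people described by (age in years, income in dollars)
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 50_500.0])   # very different age, similar income
c = np.array([26.0, 52_000.0])   # similar age, somewhat different income

dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)

print(f"{dist_ab:.1f}")  # 501.2: the 35-year age gap barely registers
print(f"{dist_ac:.1f}")  # 2000.0: the income difference dominates
```

On the raw scale, b looks closer to a than c does, even though b differs from a by 35 years of age. Rescaling both features to comparable ranges removes this distortion.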

Min-Max Normalization

$$x_{\text{normalized}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Here,

  • $x$ = Original data value
  • $\min(x)$ = Minimum value in the dataset
  • $\max(x)$ = Maximum value in the dataset
  • $x_{\text{normalized}}$ = Normalized value in [0, 1]

💡 When to Use Min-Max

Use for neural networks with sigmoid/tanh activations, image pixel values, and algorithms sensitive to magnitude (KNN, K-Means). Min-Max scaling is sensitive to outliers: a single extreme value compresses all other data into a narrow range.
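
The Min-Max formula translates almost directly into NumPy. The following is a minimal sketch, not a replacement for a library scaler; the helper name min_max_normalize is hypothetical, and it assumes no column is constant (which would cause a division by zero).

```python
import numpy as np

def min_max_normalize(x):
    """Scale each column of x to [0, 1]; assumes no column is constant."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)

ages = np.array([25, 30, 35, 40])
print(min_max_normalize(ages))           # 0, 1/3, 2/3, 1

# One extreme value squeezes the remaining points toward 0
with_outlier = np.array([25, 30, 35, 40, 400])
print(min_max_normalize(with_outlier))   # the first four values all fall below 0.05
```

The second call illustrates the outlier sensitivity: the value 400 stretches the range to 375, so the original four ages occupy only the bottom 4% of the scaled interval.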

Z-Score Standardization

$$z = \frac{x - \mu}{\sigma}$$

Here,

  • $x$ = Original data value
  • $\mu$ = Population mean
  • $\sigma$ = Population standard deviation
  • $z$ = Standardized score (number of standard deviations from the mean)

💡 When to Use Z-Score

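Use for algorithms that assume roughly Gaussian features (LDA, Gaussian Naive Bayes), regularized regression, PCA, and distance-based methods. Z-scores center the data at the origin and are typically less distorted by outliers than Min-Max scaling, although the mean and standard deviation are themselves still affected by extreme values.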
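
The z-score formula can likewise be written in a couple of NumPy lines. This is a minimal sketch; the helper name z_score is hypothetical, and it relies on np.std defaulting to the population standard deviation (ddof=0), which matches the definition of σ given above.

```python
import numpy as np

def z_score(x):
    """Standardize each column of x; uses the population std (ddof=0)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Columns: age, income
X = np.array([[25, 50_000], [30, 80_000], [35, 120_000], [40, 95_000]])
Z = z_score(X)

print(np.allclose(Z.mean(axis=0), 0.0))  # True
print(np.allclose(Z.std(axis=0), 1.0))   # True
```

After standardization, both columns have mean 0 and standard deviation 1 (up to floating-point rounding), so neither feature dominates purely because of its units.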

When to Choose Which?

Scenario                   | Use     | Why
---------------------------|---------|-----------------------------------------
Neural network with ReLU   | Min-Max | Keeps values positive, bounded range
Linear/Logistic Regression | Z-Score | Assumptions align with standardized data
KNN / K-Means              | Z-Score | Distance metrics need comparable scales
PCA                        | Z-Score | Variance scaling must be equal
Tree-based models          | Neither | Trees are scale-invariant

📝 Implementing Both Normalizations

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample data
X = np.array([[25, 50000], [30, 80000], [35, 120000], [40, 95000]])

# Min-Max Normalization
scaler_minmax = MinMaxScaler()
X_minmax = scaler_minmax.fit_transform(X)
print(f"Min-Max normalized:\n{X_minmax}")

# Z-Score Standardization
scaler_standard = StandardScaler()
X_standard = scaler_standard.fit_transform(X)
print(f"\nZ-Score standardized:\n{X_standard}")
print(f"Mean: {X_standard.mean(axis=0)}")  # approximately [0, 0] (floating-point rounding)
print(f"Std:  {X_standard.std(axis=0)}")   # [1, 1]

Key Takeaways

📋 Summary: NumPy Essentials

  1. NumPy arrays are significantly faster than Python lists due to contiguous memory and vectorized operations
  2. Vectorization eliminates explicit loops — operations are applied element-wise in optimized C
  3. Broadcasting enables operations on arrays of different shapes without data copying
  4. Matrix multiplication uses the @ operator; remember the dimension rule ($m \times n$ times $n \times p$ gives $m \times p$)
  5. Master indexing and slicing for efficient data manipulation (position-based slices, boolean masks, fancy indexing)
  6. Normalization is critical for distance-based algorithms — choose Min-Max or Z-Score based on your model's assumptions
  7. NumPy is the foundation for Pandas, Scikit-learn, TensorFlow, and PyTorch

Practice Exercise

  1. Create a 5x5 random matrix and compute its eigenvalues and eigenvectors
  2. Implement row-wise and column-wise normalization (divide each row/column by its L2 norm)
  3. Find all elements greater than the matrix mean using boolean indexing
  4. Use broadcasting to compute the pairwise Euclidean distance matrix between two sets of points
  5. Verify that the determinant of a matrix equals the product of its eigenvalues
