NumPy Arrays for Data Science

Introduction to NumPy for Data Science

NumPy (Numerical Python) is a foundational library for data science in Python. It provides large multi-dimensional arrays and matrices, along with a collection of mathematical functions that operate on them.

Why NumPy for Data Science?

NumPy arrays are:

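Memory efficient: Elements share one dtype and are typically stored in contiguous memory blocks
Vectorized: Operations apply to entire arrays without explicit Python loops
Fast: Core routines are implemented in C, so large numerical operations typically run much faster than equivalent loops over Python lists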
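
The speed claim can be checked with a small, informal timing sketch. The array size, the repeat count, and the helper names square_with_loop and square_vectorized are illustrative choices rather than part of any NumPy API, and the absolute timings will vary by machine.

```python
import timeit

import numpy as np

# The same values stored as a Python list and as a NumPy array.
# dtype is set explicitly so the squares cannot overflow on any platform.
values_list = list(range(100_000))
values_arr = np.arange(100_000, dtype=np.int64)


def square_with_loop():
    """Square every element with a Python-level loop."""
    return [v * v for v in values_list]


def square_vectorized():
    """Square every element with one vectorized expression."""
    return values_arr * values_arr


loop_time = timeit.timeit(square_with_loop, number=20)
vector_time = timeit.timeit(square_vectorized, number=20)

print(f"List comprehension: {loop_time:.4f} s")
print(f"NumPy vectorized:   {vector_time:.4f} s")

# Both approaches produce the same numbers
same = square_with_loop() == square_vectorized().tolist()
print("Same results:", same)
```

On most machines the vectorized version should come out well ahead, but the exact ratio depends on hardware, array size, and NumPy version.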

Creating NumPy Arrays

import numpy as np

# From Python list
data = [1, 2, 3, 4, 5]
arr = np.array(data)

# Using built-in functions
zeros = np.zeros((3, 4))        # 3x4 array of zeros
ones = np.ones((2, 3))          # 2x3 array of ones
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced points

# Random arrays
rand_uniform = np.random.rand(3, 3)    # Uniform distribution over [0, 1)
rand_normal = np.random.randn(1000)    # Standard normal distribution
rand_int = np.random.randint(0, 10, (5, 5))  # Random integers from 0 to 9 (upper bound excluded)

print("Array shape:", arr.shape)
print("Array dtype:", arr.dtype)
print("Array mean:", arr.mean())

Array Indexing and Slicing

# 2D array
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12],
                   [13, 14, 15, 16]])

# Basic indexing
print(matrix[0, 0])      # 1 (first element)
print(matrix[-1, -1])    # 16 (last element)

# Slicing
print(matrix[0, :])      # First row: [1, 2, 3, 4]
print(matrix[:, 0])      # First column: [1, 5, 9, 13]
print(matrix[1:3, 1:3])  # Submatrix [[6, 7], [10, 11]]

# Boolean indexing
print(matrix[matrix > 5])  # [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

# Fancy indexing
print(matrix[[0, 2], [1, 3]])  # [2, 12] - elements at (0,1) and (2,3)
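
One detail worth knowing about the indexing styles above is that basic slicing returns a view onto the original data, while boolean and fancy indexing return copies. The following is a minimal sketch of that difference; the variable names row_view, picked, and safe_row are illustrative.

```python
import numpy as np

matrix = np.arange(1, 17).reshape(4, 4)

# Basic slicing returns a view: it shares memory with the original
row_view = matrix[0, :]
row_view[0] = 100
print(matrix[0, 0])  # 100, the original changed too

# Fancy indexing returns a copy: edits do not reach the original
picked = matrix[[0, 2], [1, 3]]
picked[0] = -1
print(matrix[0, 1])  # 2, the original is unchanged

# np.shares_memory reports whether two arrays overlap in memory
print(np.shares_memory(matrix, row_view))  # True
print(np.shares_memory(matrix, picked))    # False

# Call .copy() when an independent slice is needed
safe_row = matrix[1, :].copy()
safe_row[0] = 0
print(matrix[1, 0])  # 5
```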

Vectorized Operations

arr = np.array([1, 2, 3, 4, 5])

# Element-wise operations
print(arr + 10)        # [11, 12, 13, 14, 15]
print(arr * 2)         # [2, 4, 6, 8, 10]
print(arr ** 2)        # [1, 4, 9, 16, 25]
print(np.sqrt(arr))    # [1. , 1.41, 1.73, 2. , 2.24]

# Universal functions (ufuncs)
print(np.sin(arr))     # Sine of each element
print(np.log(arr))     # Natural log
print(np.exp(arr))     # Exponential

Statistical Functions for Data Science

data = np.array([23, 45, 67, 89, 12, 34, 56, 78, 90, 11])

# Central tendency
print("Mean:", np.mean(data))           # 50.5
print("Median:", np.median(data))       # 50.5
print("Standard Deviation:", np.std(data))  # ~28.64

# Percentiles (important for EDA)
print("25th percentile:", np.percentile(data, 25))  # 25.75
print("75th percentile:", np.percentile(data, 75))  # 75.25

# Descriptive statistics
print("Min:", np.min(data))    # 11
print("Max:", np.max(data))    # 90
print("Sum:", np.sum(data))    # 505
print("Variance:", np.var(data))  # 820.25
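
Two options come up often with these functions: ddof, which switches between population and sample statistics, and axis, which computes a statistic per row or per column. The sketch below reuses the same data; reshaping it into a 2x5 table is only an illustrative way to get a 2D array.

```python
import numpy as np

data = np.array([23, 45, 67, 89, 12, 34, 56, 78, 90, 11])

# Default ddof=0 divides by n (population statistics)
pop_var = np.var(data)
pop_std = np.std(data)

# ddof=1 divides by n - 1 (sample statistics)
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

print("Population variance:", pop_var)
print("Sample variance:", sample_var)
print("Population std:", pop_std)
print("Sample std:", sample_std)

# axis selects the dimension that is collapsed
table = data.reshape(2, 5)
col_means = table.mean(axis=0)  # one mean per column, shape (5,)
row_means = table.mean(axis=1)  # one mean per row, shape (2,)
print("Column means:", col_means)
print("Row means:", row_means)
```

The sample versions are always at least as large as the population ones, and the gap matters most for small datasets like this one.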

Reshaping and Transposing

arr = np.arange(1, 13)  # [1, 2, ..., 12]

# Reshape
print(arr.reshape(3, 4))
# [[1, 2, 3, 4],
#  [5, 6, 7, 8],
#  [9, 10, 11, 12]]

print(arr.reshape(2, 2, 3))
# [[[1, 2, 3], [4, 5, 6]],
#  [[7, 8, 9], [10, 11, 12]]]

# Flatten and ravel
grid = arr.reshape(3, 4)
flat = grid.flatten()   # 1D copy of the data
print(grid.ravel())     # Also 1D, but returns a view when possible instead of a copy

# Transpose
matrix = np.array([[1, 2], [3, 4], [5, 6]])
print(matrix.T)  # Transposed matrix
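
A few reshaping details are easier to see in code: -1 lets NumPy infer one dimension, flatten and ravel differ in whether they copy, and an incompatible shape raises an error. This is a small sketch with illustrative variable names.

```python
import numpy as np

arr = np.arange(1, 13)

# -1 asks NumPy to infer that dimension from the array's size
column = arr.reshape(-1, 1)     # shape (12, 1), a single feature column
rows_of_4 = arr.reshape(-1, 4)  # shape (3, 4)
print(column.shape, rows_of_4.shape)

# flatten always copies; ravel returns a view when it can
grid = arr.reshape(3, 4)
flat_copy = grid.flatten()
flat_view = grid.ravel()
print(np.shares_memory(grid, flat_copy))  # False
print(np.shares_memory(grid, flat_view))  # True

# Transposing swaps the axes, so shape (3, 4) becomes (4, 3)
print(grid.T.shape)

# The new shape must hold exactly the same number of elements
try:
    arr.reshape(5, 3)
    reshape_failed = False
except ValueError as exc:
    reshape_failed = True
    print("Cannot reshape:", exc)
```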

Broadcasting

# Broadcasting allows operations on arrays of different shapes
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])

# b is broadcast to match a's shape
print(a + b)
# [[11, 22, 33],
#  [14, 25, 36]]

# Column vector broadcasting: shape (2, 1) stretches across a's 3 columns
c = np.array([[1], [2]])
print(a + c)
# [[2, 3, 4],
#  [6, 7, 8]]
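
A common data science use of broadcasting is standardizing each feature column. The sketch below assumes a small, made-up feature matrix X (4 samples, 3 features); the values are illustrative only.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples (rows) x 3 features (columns)
X = np.array([[1.0, 200.0, 0.5],
              [2.0, 180.0, 0.7],
              [3.0, 220.0, 0.2],
              [4.0, 240.0, 0.6]])

col_mean = X.mean(axis=0)  # shape (3,)
col_std = X.std(axis=0)    # shape (3,)

# (4, 3) with (3,): the trailing dimensions match, so the
# per-column statistics are applied to every row
X_scaled = (X - col_mean) / col_std
print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column

# For a per-row operation, keepdims=True keeps shape (4, 1),
# which broadcasts across the 3 columns
row_sum = X.sum(axis=1, keepdims=True)
X_share = X / row_sum
print(X_share.sum(axis=1))    # each row sums to approximately 1
```

Without keepdims=True the row sums would have shape (4,), which does not line up with the trailing dimension of a (4, 3) array and would raise a ValueError.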

Linear Algebra for Data Science

# Dot product
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b))  # 32 (1*4 + 2*5 + 3*6)

# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.matmul(A, B))
# [[19, 22],
#  [43, 50]]

# Matrix inverse (appears in the normal equations for linear regression)
A = np.array([[4, 7], [2, 6]])
A_inv = np.linalg.inv(A)
print(A_inv)

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:", eigenvectors)

# Determinant
print(np.linalg.det(A))  # approximately 10.0 (floating-point result)
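
In practice, linear systems and regression fits are usually computed with np.linalg.solve or np.linalg.lstsq rather than by forming an explicit inverse. The sketch below reuses the matrix A from above; the right-hand side b and the tiny noise-free dataset (t, y) are made up for illustration.

```python
import numpy as np

A = np.array([[4.0, 7.0], [2.0, 6.0]])
b = np.array([1.0, 2.0])

# Solve A x = b directly instead of computing the inverse first
x = np.linalg.solve(A, b)
print(x)
print(np.allclose(A @ x, b))  # True

# The explicit inverse gives the same answer here
x_inv = np.linalg.inv(A) @ b
print(np.allclose(x, x_inv))  # True

# Each eigenpair should satisfy A v = lambda v (eigenvectors are columns)
eigenvalues, eigenvectors = np.linalg.eig(A)
eig_ok = all(
    np.allclose(A @ eigenvectors[:, i], eigenvalues[i] * eigenvectors[:, i])
    for i in range(len(eigenvalues))
)
print("Eigenpairs verified:", eig_ok)

# Least-squares line fit on noise-free data generated from y = 2 + 3 t
t = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * t
design = np.column_stack([np.ones_like(t), t])  # columns: intercept, slope
coef, residuals, rank, singular_values = np.linalg.lstsq(design, y, rcond=None)
print(coef)  # approximately [2., 3.]
```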

Practice Exercise: Data Analysis with NumPy

import numpy as np

# Simulate a dataset (e.g., housing prices)
np.random.seed(42)
n_samples = 1000

# Features: size (sq ft), bedrooms, age, distance to downtown
size = np.random.normal(2000, 500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
distance = np.random.uniform(1, 30, n_samples)

# Target: price (in thousands)
price = (150 + 0.15 * size + 20 * bedrooms - 2 * age - 3 * distance + 
         np.random.normal(0, 20, n_samples))

# Create dataset
data = np.column_stack([size, bedrooms, age, distance, price])

# Analyze
print("Dataset shape:", data.shape)
print("Mean price:", np.mean(price))
print("Median price:", np.median(price))
print("Price std:", np.std(price))
print("Correlation with size:", np.corrcoef(size, price)[0, 1])
print("Correlation with bedrooms:", np.corrcoef(bedrooms, price)[0, 1])
print("Correlation with age:", np.corrcoef(age, price)[0, 1])
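
As a possible extension of the exercise, the full correlation matrix and a least-squares fit can be computed on the same simulated data. This sketch regenerates the dataset so it runs on its own; the fitted coefficients should land near the values used in the price formula (150, 0.15, 20, -2, -3), though not exactly because of the added noise.

```python
import numpy as np

# Recreate the simulated housing data from the exercise
np.random.seed(42)
n_samples = 1000
size = np.random.normal(2000, 500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
distance = np.random.uniform(1, 30, n_samples)
price = (150 + 0.15 * size + 20 * bedrooms - 2 * age - 3 * distance
         + np.random.normal(0, 20, n_samples))

data = np.column_stack([size, bedrooms, age, distance, price])

# Full correlation matrix: rowvar=False treats each column as a variable
corr = np.corrcoef(data, rowvar=False)
print("Correlation matrix shape:", corr.shape)
print("Correlations with price:", corr[:-1, -1])

# Least-squares fit: a leading column of ones models the intercept
X = np.column_stack([np.ones(n_samples), size, bedrooms, age, distance])
coef, _, _, _ = np.linalg.lstsq(X, price, rcond=None)

names = ["intercept", "size", "bedrooms", "age", "distance"]
for name, value in zip(names, coef):
    print(f"{name:>10}: {value:.3f}")
```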

Key Takeaways

NumPy is essential for data science because:

Efficient array operations - Much faster than Python lists
Vectorized computations - Avoid explicit loops
Rich mathematical functions - Statistics, linear algebra, etc.
Foundation for Pandas - Pandas is built on NumPy

When to Use NumPy

Numerical computations
Mathematical operations on arrays
Linear algebra (regression, PCA, etc.)
Image processing (as numerical matrices)
Any numerical data manipulation

Next Steps

Learn Pandas for data manipulation
Practice vectorized operations
Explore NumPy's linear algebra capabilities
Combine with Matplotlib for visualizations