NumPy Arrays for Data Science

Introduction to NumPy for Data Science

NumPy (Numerical Python) is the core array library underlying most of Python's data science stack. It provides an n-dimensional array type (ndarray) for large multi-dimensional arrays and matrices, along with a broad collection of mathematical functions that operate on them.

Why NumPy for Data Science?

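NumPy arrays are:

  • Memory efficient: elements share a single data type and are stored in one contiguous block of memory
  • Vectorized: operations apply to entire arrays without explicit Python loops
  • Fast: the core routines are implemented in C, so numerical work typically runs far faster than the equivalent loop over a Python list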
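
The points above can be illustrated with a minimal sketch. It assumes a small array of 1,000 integers (an arbitrary size chosen for illustration) and compares an explicit loop with the vectorized equivalent. It checks only that both give the same answer and makes no timing claim.

```python
import numpy as np

values = list(range(1, 1001))              # plain Python list
arr = np.arange(1, 1001, dtype=np.int64)   # equivalent NumPy array

# Explicit loop over the list
loop_total = 0
for v in values:
    loop_total += v * v

# Vectorized: square every element and sum, with no Python-level loop
vector_total = int(np.sum(arr ** 2))

print(loop_total == vector_total)  # both approaches agree
print(arr.itemsize, arr.nbytes)    # bytes per element, total bytes in the buffer
print(arr.flags["C_CONTIGUOUS"])   # a freshly created array is contiguous
```

To measure the speed difference on your own machine, the standard library's timeit module is one option.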

Creating NumPy Arrays

import numpy as np

# From Python list
data = [1, 2, 3, 4, 5]
arr = np.array(data)

# Using built-in functions
zeros = np.zeros((3, 4))        # 3x4 array of zeros
ones = np.ones((2, 3))          # 2x3 array of ones
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5) # 5 evenly spaced points from 0 to 1, inclusive

# Random arrays
rand_uniform = np.random.rand(3, 3)    # Uniform distribution on [0, 1)
rand_normal = np.random.randn(1000)    # Standard normal distribution
rand_int = np.random.randint(0, 10, (5, 5))  # Random integers from 0 to 9 (upper bound excluded)

print("Array shape:", arr.shape)
print("Array dtype:", arr.dtype)
print("Array mean:", arr.mean())
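
The np.random.rand, randn, and randint functions used above belong to NumPy's legacy random interface. A rough equivalent using the newer Generator interface might look like the sketch below. The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded Generator; the seed is arbitrary

uniform = rng.random((3, 3))                  # floats in [0, 1)
normal = rng.standard_normal(1000)            # standard normal draws
integers = rng.integers(0, 10, size=(5, 5))   # ints in [0, 10); upper bound excluded

print(uniform.shape, normal.shape, integers.shape)
print(integers.min() >= 0, integers.max() < 10)
```

A Generator keeps its own state, so seeding one does not affect random numbers drawn elsewhere in a program.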

Array Indexing and Slicing

# 2D array
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12],
                   [13, 14, 15, 16]])

# Basic indexing
print(matrix[0, 0])      # 1 (first element)
print(matrix[-1, -1])    # 16 (last element)

# Slicing
print(matrix[0, :])      # First row: [1, 2, 3, 4]
print(matrix[:, 0])      # First column: [1, 5, 9, 13]
print(matrix[1:3, 1:3])  # Submatrix [[6, 7], [10, 11]]

# Boolean indexing
print(matrix[matrix > 5])  # [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

# Fancy indexing
print(matrix[[0, 2], [1, 3]])  # [2, 12] - elements at (0,1) and (2,3)
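
One detail worth knowing about the indexing styles above: basic slices return views that share memory with the original array, while boolean and fancy indexing return independent copies. The following sketch illustrates the difference. The variable names are illustrative only.

```python
import numpy as np

matrix = np.arange(1, 17).reshape(4, 4)

row_view = matrix[0, :]          # basic slice: a view onto matrix
row_view[0] = 100                # writes through to matrix
print(matrix[0, 0])              # 100

picked = matrix[matrix > 5]      # boolean indexing: an independent copy
picked[0] = -1                   # matrix is not affected
print(np.shares_memory(matrix, row_view))  # True
print(np.shares_memory(matrix, picked))    # False

safe_row = matrix[1, :].copy()   # explicit copy when a view is not wanted
safe_row[0] = 0
print(matrix[1, 0])              # 5
```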

Vectorized Operations

arr = np.array([1, 2, 3, 4, 5])

# Element-wise operations
print(arr + 10)        # [11, 12, 13, 14, 15]
print(arr * 2)         # [2, 4, 6, 8, 10]
print(arr ** 2)        # [1, 4, 9, 16, 25]
print(np.sqrt(arr))    # [1. , 1.41, 1.73, 2. , 2.24]

# Universal functions (ufuncs)
print(np.sin(arr))     # Sine of each element
print(np.log(arr))     # Natural log
print(np.exp(arr))     # Exponential

Statistical Functions for Data Science

data = np.array([23, 45, 67, 89, 12, 34, 56, 78, 90, 11])

# Central tendency and spread
print("Mean:", np.mean(data))           # 50.5
print("Median:", np.median(data))       # 50.5
print("Standard Deviation:", np.std(data))  # ~28.64 (population std, ddof=0)

# Percentiles (important for EDA)
print("25th percentile:", np.percentile(data, 25))  # 25.75
print("75th percentile:", np.percentile(data, 75))  # 75.25

# Descriptive statistics
print("Min:", np.min(data))    # 11
print("Max:", np.max(data))    # 90
print("Sum:", np.sum(data))    # 505
print("Variance:", np.var(data))  # 820.25
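
Two options come up often with these functions: ddof, which switches between the population and sample formulas, and axis, which reduces along rows or columns of a 2D array. The sketch below reuses the same ten values. Reshaping them into a 2x5 table is purely an illustrative assumption.

```python
import numpy as np

data = np.array([23, 45, 67, 89, 12, 34, 56, 78, 90, 11])

pop_std = np.std(data)             # ddof=0 (default): divides by n
sample_std = np.std(data, ddof=1)  # ddof=1: divides by n - 1
print(round(pop_std, 2), round(sample_std, 2))

# Reshape into 2 rows x 5 columns to show per-axis reductions
table = data.reshape(2, 5)
print(table.mean(axis=0))  # one mean per column
print(table.mean(axis=1))  # one mean per row
```

The sample standard deviation is always slightly larger than the population value for the same data, which matters most when the sample is small.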

Reshaping and Transposing

arr = np.arange(1, 13)  # [1, 2, ..., 12]

# Reshape
print(arr.reshape(3, 4))
# [[1, 2, 3, 4],
#  [5, 6, 7, 8],
#  [9, 10, 11, 12]]

print(arr.reshape(2, 2, 3))
# [[[1, 2, 3], [4, 5, 6]],
#  [[7, 8, 9], [10, 11, 12]]]

# Flatten and ravel
flat = arr.reshape(-1)   # Flatten to 1D (-1 infers the length)
print(flat.ravel())      # ravel() returns a view when possible
print(flat.flatten())    # flatten() always returns a copy

# Transpose
matrix = np.array([[1, 2], [3, 4], [5, 6]])
print(matrix.T)  # Transposed matrix
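
The difference between ravel and flatten, and the use of -1 to infer a dimension, can be checked directly with np.shares_memory. This is a small sketch with illustrative variable names.

```python
import numpy as np

arr = np.arange(1, 13)

grid = arr.reshape(3, -1)     # -1 lets NumPy infer the missing dimension
print(grid.shape)             # (3, 4)

raveled = grid.ravel()        # a view when the memory layout allows it
flattened = grid.flatten()    # always a new copy
print(np.shares_memory(grid, raveled))    # True
print(np.shares_memory(grid, flattened))  # False

# The transpose is a non-contiguous view, so ravel has to copy here
print(np.shares_memory(grid, grid.T.ravel()))  # False
```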

Broadcasting

# Broadcasting allows operations on arrays of different shapes
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])

# b is broadcast to match a's shape
print(a + b)
# [[11, 22, 33],
#  [14, 25, 36]]

# Column-vector broadcasting: shapes (2, 3) + (2, 1)
# c needs one row per row of a; a (3, 1) column would raise a ValueError
c = np.array([[1], [2]])
print(a + c)
# [[2, 3, 4],
#  [6, 7, 8]]
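
A common data science use of broadcasting is standardizing each feature column. The sketch below assumes a tiny made-up feature matrix X (3 samples, 2 features) and also shows what happens when shapes are incompatible.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])      # shape (3, 2): 3 samples, 2 features

col_mean = X.mean(axis=0)         # shape (2,)
col_std = X.std(axis=0)           # shape (2,)

# (3, 2) - (2,) broadcasts the per-column statistics across every row
X_scaled = (X - col_mean) / col_std
print(X_scaled.mean(axis=0))      # approximately [0, 0]
print(X_scaled.std(axis=0))       # approximately [1, 1]

# Trailing dimensions that differ (and are not 1) cannot be broadcast
try:
    X + np.array([1.0, 2.0, 3.0])  # (3, 2) + (3,) is not compatible
except ValueError as exc:
    print("Broadcast failed:", exc)
```

Shapes are compared from the trailing dimension backwards, and two dimensions are compatible when they are equal or one of them is 1.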

Linear Algebra for Data Science

# Dot product
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.dot(a, b))  # 32 (1*4 + 2*5 + 3*6)

# Matrix multiplication
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.matmul(A, B))
# [[19, 22],
#  [43, 50]]

# Matrix inverse (important for linear regression)
A = np.array([[4, 7], [2, 6]])
A_inv = np.linalg.inv(A)
print(A_inv)

# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:", eigenvectors)

# Determinant
print(np.linalg.det(A))  # approximately 10.0 (4*6 - 7*2, up to floating-point rounding)
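
These results can be sanity-checked numerically. The sketch below verifies the inverse and the eigenpairs, and compares solving a linear system via the explicit inverse against np.linalg.solve. The right-hand side b is a hypothetical vector chosen only for illustration.

```python
import numpy as np

A = np.array([[4.0, 7.0], [2.0, 6.0]])
b = np.array([1.0, 2.0])          # hypothetical right-hand side

A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))   # True: A times its inverse is the identity

x_inv = A_inv @ b                 # solve A x = b via the explicit inverse
x_solve = np.linalg.solve(A, b)   # solve directly, without forming the inverse
print(np.allclose(x_inv, x_solve))         # True
print(np.allclose(A @ x_solve, b))         # True

# Each eigenpair satisfies A v = lambda v; eigenvectors are the columns
eigenvalues, eigenvectors = np.linalg.eig(A)
print(np.allclose(A @ eigenvectors, eigenvectors * eigenvalues))  # True
```

In practice, np.linalg.solve is generally preferred over computing an inverse when the goal is only to solve a system, since it avoids an extra step and tends to be more numerically stable.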

Practice Exercise: Data Analysis with NumPy

import numpy as np

# Simulate a dataset (e.g., housing prices)
np.random.seed(42)
n_samples = 1000

# Features: size (sq ft), bedrooms, age, distance to downtown
size = np.random.normal(2000, 500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
distance = np.random.uniform(1, 30, n_samples)

# Target: price (in thousands)
price = (150 + 0.15 * size + 20 * bedrooms - 2 * age - 3 * distance + 
         np.random.normal(0, 20, n_samples))

# Create dataset
data = np.column_stack([size, bedrooms, age, distance, price])

# Analyze
print("Dataset shape:", data.shape)
print("Mean price:", np.mean(price))
print("Median price:", np.median(price))
print("Price std:", np.std(price))
print("Correlation with size:", np.corrcoef(size, price)[0, 1])
print("Correlation with bedrooms:", np.corrcoef(bedrooms, price)[0, 1])
print("Correlation with age:", np.corrcoef(age, price)[0, 1])
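
As one possible extension of the exercise, the coefficients used to simulate price can be estimated back from the data with ordinary least squares. The sketch below regenerates the same simulated data and uses np.linalg.lstsq. The design matrix X and the names list are additions for illustration, not part of the original exercise.

```python
import numpy as np

# Regenerate the simulated housing data from the exercise
np.random.seed(42)
n_samples = 1000
size = np.random.normal(2000, 500, n_samples)
bedrooms = np.random.randint(1, 6, n_samples)
age = np.random.randint(0, 50, n_samples)
distance = np.random.uniform(1, 30, n_samples)
price = (150 + 0.15 * size + 20 * bedrooms - 2 * age - 3 * distance
         + np.random.normal(0, 20, n_samples))

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n_samples), size, bedrooms, age, distance])
coef, residuals, rank, singular_values = np.linalg.lstsq(X, price, rcond=None)

names = ["intercept", "size", "bedrooms", "age", "distance"]
for name, value in zip(names, coef):
    print(f"{name:>10}: {value:8.3f}")
```

The estimates should land near the values used in the simulation (150, 0.15, 20, -2, -3), with small differences caused by the added noise.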

Key Takeaways

NumPy is essential for data science because:

  1. Efficient array operations - Much faster than Python lists
  2. Vectorized computations - Avoid explicit loops
  3. Rich mathematical functions - Statistics, linear algebra, etc.
  4. Foundation for Pandas - Pandas is built on NumPy

When to Use NumPy

  • Numerical computations
  • Mathematical operations on arrays
  • Linear algebra (regression, PCA, etc.)
  • Image processing (as numerical matrices)
  • Any numerical data manipulation

Next Steps

  • Learn Pandas for data manipulation
  • Practice vectorized operations
  • Explore NumPy's linear algebra capabilities
  • Combine with Matplotlib for visualizations
