Python Data Analysis — Pandas & NumPy

Pandas is the core library for data analysis in Python. NumPy provides fast numerical computing. Together they form the foundation of the Python data science ecosystem.

Learning Objectives

Load and explore datasets with pandas
Filter, group, and aggregate data
Handle missing data effectively
Create basic visualizations

What is Pandas?

Pandas provides two main data structures:

# Series — one-dimensional labeled array
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# a    10
# b    20
# c    30
# d    40

# DataFrame — two-dimensional labeled table (like a spreadsheet)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})
#      name  age     city
# 0    Alice   25      NYC
# 1      Bob   30       LA
# 2  Charlie   35  Chicago

Loading Data

import pandas as pd

# From CSV (most common)
df = pd.read_csv('data.csv')

# From Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# From JSON
df = pd.read_json('data.json')

# From database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM users', conn)

# Quick overview of your data
print(df.head())       # First 5 rows
print(df.info())       # Column types and missing values
print(df.describe())   # Statistical summary
print(df.shape)        # (rows, columns)
print(df.columns)      # Column names

Selection and Filtering

# Select a single column
names = df['name']

# Select multiple columns
subset = df[['name', 'age']]

# Filter rows with boolean indexing
adults = df[df['age'] >= 18]
ny_residents = df[df['city'] == 'NYC']

# Multiple conditions (use & for AND, | for OR)
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]
any_condition = df[(df['age'] < 25) | (df['city'] == 'LA')]

# Query method (SQL-like syntax)
result = df.query('age > 25 and city == "NYC"')

# loc — label-based selection
row = df.loc[0]                    # First row
subset = df.loc[0:5, 'name':'age'] # Rows 0-5, columns name to age

# iloc — integer-based selection
first_row = df.iloc[0]
first_three = df.iloc[:3]

GroupBy and Aggregation

# Group by a column and aggregate
city_stats = df.groupby('city').agg({
    'salary': ['mean', 'median', 'std', 'min', 'max'],
    'age': ['count', 'mean']
})

# Multiple groupby
dept_city = df.groupby(['department', 'city']).agg({
    'salary': 'mean',
    'employee_id': 'count'
}).rename(columns={'employee_id': 'headcount'})

# Apply custom aggregation
def salary_range(group):
    return group.max() - group.min()

df.groupby('department')['salary'].agg(salary_range)

# Pivot tables (cross-tabulation)
pivot = df.pivot_table(
    values='salary',
    index='department',
    columns='city',
    aggfunc='mean',
    fill_value=0  # Fill missing with 0
)

Handling Missing Data

# Check for missing values
print(df.isnull().sum())           # Count per column
print(df.isnull().sum().sum())     # Total missing
print(df.isnull().mean() * 100)    # Percentage missing

# Drop missing values
df_clean = df.dropna()                    # Drop any row with missing
df_clean = df.dropna(subset=['age'])      # Drop only if 'age' is missing
df_clean = df.dropna(thresh=3)            # Keep rows with at least 3 non-null

# Fill missing values
df['age'].fillna(df['age'].median(), inplace=True)   # Fill with median
df['city'].fillna('Unknown', inplace=True)            # Fill with constant
df['salary'].fillna(method='ffill', inplace=True)    # Forward fill

# Interpolate (for time series)
df['temperature'].interpolate(method='linear', inplace=True)

Data Transformation

# Apply function to column
df['salary_monthly'] = df['salary'] / 12

# Apply function to rows
df['full_name'] = df.apply(lambda row: f"{row['first_name']} {row['last_name']}", axis=1)

# Map values
grade_map = {'A': 4, 'B': 3, 'C': 2, 'D': 1, 'F': 0}
df['gpa'] = df['grade'].map(grade_map)

# Cut into bins
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100],
                         labels=['Youth', 'Young Adult', 'Middle Age', 'Senior'])

# Get dummies (one-hot encoding)
dummies = pd.get_dummies(df['city'], prefix='city')
df = pd.concat([df, dummies], axis=1)

Basic Visualization

import matplotlib.pyplot as plt

# Histogram
df['salary'].hist(bins=20, edgecolor='black')
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Count')
plt.savefig('salary_dist.png')
plt.close()

# Bar chart
city_counts = df['city'].value_counts()
city_counts.plot(kind='bar')
plt.title('Employees by City')
plt.tight_layout()
plt.savefig('city_counts.png')
plt.close()

# Scatter plot
df.plot.scatter(x='age', y='salary', alpha=0.5)
plt.title('Age vs Salary')
plt.savefig('age_salary.png')
plt.close()

# Correlation heatmap
import seaborn as sns
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.savefig('correlation.png')
plt.close()

Key Takeaways

pd.read_csv() is the most common data loading method
Use boolean indexing for filtering: df[df['col'] > value]
groupby() is powerful for aggregation
Always check for missing data first with df.isnull().sum()
Use describe() for quick statistical overview
pandas integrates well with matplotlib and seaborn
Use inplace=True to modify DataFrame directly
Method chaining keeps transformations readable

Python Data Analysis — Pandas & NumPy

Python Data Analysis — Pandas & NumPy

Learning Objectives

What is Pandas?

Loading Data

Selection and Filtering

GroupBy and Aggregation

Handling Missing Data

Data Transformation

Basic Visualization

Key Takeaways

Need Expert Python Help?