Python Data Analysis — Pandas & NumPy
Pandas is the core library for data analysis in Python. NumPy provides fast numerical computing. Together they form the foundation of the Python data science ecosystem.
Learning Objectives
- Load and explore datasets with pandas
- Filter, group, and aggregate data
- Handle missing data effectively
- Create basic visualizations
What is Pandas?
Pandas provides two main data structures:
# Series — one-dimensional labeled array
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# a 10
# b 20
# c 30
# d 40
# DataFrame — two-dimensional labeled table (like a spreadsheet)
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['NYC', 'LA', 'Chicago']
})
# name age city
# 0 Alice 25 NYC
# 1 Bob 30 LA
# 2 Charlie 35 Chicago
Loading Data
import pandas as pd
# From CSV (most common)
df = pd.read_csv('data.csv')
# From Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# From JSON
df = pd.read_json('data.json')
# From database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM users', conn)
# Quick overview of your data
print(df.head()) # First 5 rows
print(df.info()) # Column types and missing values
print(df.describe()) # Statistical summary
print(df.shape) # (rows, columns)
print(df.columns) # Column names
Selection and Filtering
# Select a single column
names = df['name']
# Select multiple columns
subset = df[['name', 'age']]
# Filter rows with boolean indexing
adults = df[df['age'] >= 18]
ny_residents = df[df['city'] == 'NYC']
# Multiple conditions (use & for AND, | for OR)
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]
any_condition = df[(df['age'] < 25) | (df['city'] == 'LA')]
# Query method (SQL-like syntax)
result = df.query('age > 25 and city == "NYC"')
# loc — label-based selection
row = df.loc[0] # First row
subset = df.loc[0:5, 'name':'age'] # Rows 0-5, columns name to age
# iloc — integer-based selection
first_row = df.iloc[0]
first_three = df.iloc[:3]
GroupBy and Aggregation
# Group by a column and aggregate
city_stats = df.groupby('city').agg({
'salary': ['mean', 'median', 'std', 'min', 'max'],
'age': ['count', 'mean']
})
# Multiple groupby
dept_city = df.groupby(['department', 'city']).agg({
'salary': 'mean',
'employee_id': 'count'
}).rename(columns={'employee_id': 'headcount'})
# Apply custom aggregation
def salary_range(group):
return group.max() - group.min()
df.groupby('department')['salary'].agg(salary_range)
# Pivot tables (cross-tabulation)
pivot = df.pivot_table(
values='salary',
index='department',
columns='city',
aggfunc='mean',
fill_value=0 # Fill missing with 0
)
Handling Missing Data
# Check for missing values
print(df.isnull().sum()) # Count per column
print(df.isnull().sum().sum()) # Total missing
print(df.isnull().mean() * 100) # Percentage missing
# Drop missing values
df_clean = df.dropna() # Drop any row with missing
df_clean = df.dropna(subset=['age']) # Drop only if 'age' is missing
df_clean = df.dropna(thresh=3) # Keep rows with at least 3 non-null
# Fill missing values
df['age'].fillna(df['age'].median(), inplace=True) # Fill with median
df['city'].fillna('Unknown', inplace=True) # Fill with constant
df['salary'].fillna(method='ffill', inplace=True) # Forward fill
# Interpolate (for time series)
df['temperature'].interpolate(method='linear', inplace=True)
Data Transformation
# Apply function to column
df['salary_monthly'] = df['salary'] / 12
# Apply function to rows
df['full_name'] = df.apply(lambda row: f"{row['first_name']} {row['last_name']}", axis=1)
# Map values
grade_map = {'A': 4, 'B': 3, 'C': 2, 'D': 1, 'F': 0}
df['gpa'] = df['grade'].map(grade_map)
# Cut into bins
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100],
labels=['Youth', 'Young Adult', 'Middle Age', 'Senior'])
# Get dummies (one-hot encoding)
dummies = pd.get_dummies(df['city'], prefix='city')
df = pd.concat([df, dummies], axis=1)
Basic Visualization
import matplotlib.pyplot as plt
# Histogram
df['salary'].hist(bins=20, edgecolor='black')
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Count')
plt.savefig('salary_dist.png')
plt.close()
# Bar chart
city_counts = df['city'].value_counts()
city_counts.plot(kind='bar')
plt.title('Employees by City')
plt.tight_layout()
plt.savefig('city_counts.png')
plt.close()
# Scatter plot
df.plot.scatter(x='age', y='salary', alpha=0.5)
plt.title('Age vs Salary')
plt.savefig('age_salary.png')
plt.close()
# Correlation heatmap
import seaborn as sns
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.savefig('correlation.png')
plt.close()
Key Takeaways
pd.read_csv()is the most common data loading method- Use boolean indexing for filtering:
df[df['col'] > value] groupby()is powerful for aggregation- Always check for missing data first with
df.isnull().sum() - Use
describe()for quick statistical overview - pandas integrates well with matplotlib and seaborn
- Use
inplace=Trueto modify DataFrame directly - Method chaining keeps transformations readable