🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Python Data Analysis — Pandas & NumPy

Python Data ScienceData Analysis🟢 Free Lesson

Advertisement

Python Data Analysis — Pandas & NumPy

Pandas is the core library for data analysis in Python. NumPy provides fast numerical computing. Together they form the foundation of the Python data science ecosystem.

Learning Objectives

  • Load and explore datasets with pandas
  • Filter, group, and aggregate data
  • Handle missing data effectively
  • Create visualizations with matplotlib and seaborn
  • Apply real-world data processing patterns

What is Pandas?

Pandas provides two main data structures:

# Series — one-dimensional labeled array
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# a    10
# b    20
# c    30
# d    40

# DataFrame — two-dimensional labeled table (like a spreadsheet)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})
#      name  age     city
# 0    Alice   25      NYC
# 1      Bob   30       LA
# 2  Charlie   35  Chicago

Pandas Data Structures

StructureDimensionsDescriptionUse Case
Series1DLabeled arraySingle column, time series
DataFrame2DLabeled tableTabular data, spreadsheets
Panel3DLabeled cubeMulti-dimensional (deprecated)

Loading Data

import pandas as pd

# From CSV (most common)
df = pd.read_csv('data.csv')

# From CSV with options
df = pd.read_csv('data.csv',
    index_col='id',           # Set index column
    parse_dates=['date'],     # Parse dates
    na_values=['N/A', ''],    # Custom NA values
    dtype={'age': int},       # Column types
    usecols=['name', 'age']   # Load only specific columns
)

# From Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# From JSON
df = pd.read_json('data.json')

# From database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM users', conn)

# From parquet (fast, columnar format)
df = pd.read_parquet('data.parquet')

# Quick overview of your data
print(df.head())       # First 5 rows
print(df.info())       # Column types and missing values
print(df.describe())   # Statistical summary
print(df.shape)        # (rows, columns)
print(df.columns)      # Column names
print(df.dtypes)       # Data types
print(df.memory_usage(deep=True))  # Memory usage per column

Selection and Filtering

# Select a single column
names = df['name']

# Select multiple columns
subset = df[['name', 'age']]

# Filter rows with boolean indexing
adults = df[df['age'] >= 18]
ny_residents = df[df['city'] == 'NYC']

# Multiple conditions (use & for AND, | for OR)
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]
any_condition = df[(df['age'] < 25) | (df['city'] == 'LA')]

# Query method (SQL-like syntax)
result = df.query('age > 25 and city == "NYC"')

# loc — label-based selection
row = df.loc[0]                    # First row
subset = df.loc[0:5, 'name':'age'] # Rows 0-5, columns name to age

# iloc — integer-based selection
first_row = df.iloc[0]
first_three = df.iloc[:3]

# Filter with isin
ny_or_la = df[df['city'].isin(['NYC', 'LA'])]

# Filter with string methods
a_names = df[df['name'].str.startswith('A')]
contains_a = df[df['name'].str.contains('a', case=False)]

# Filter with between
mid_age = df[df['age'].between(25, 35)]

GroupBy and Aggregation

# Group by a column and aggregate
city_stats = df.groupby('city').agg({
    'salary': ['mean', 'median', 'std', 'min', 'max'],
    'age': ['count', 'mean']
})

# Multiple groupby
dept_city = df.groupby(['department', 'city']).agg({
    'salary': 'mean',
    'employee_id': 'count'
}).rename(columns={'employee_id': 'headcount'})

# Apply custom aggregation
def salary_range(group):
    return group.max() - group.min()

df.groupby('department')['salary'].agg(salary_range)

# Pivot tables (cross-tabulation)
pivot = df.pivot_table(
    values='salary',
    index='department',
    columns='city',
    aggfunc='mean',
    fill_value=0  # Fill missing with 0
)

# Value counts
city_counts = df['city'].value_counts()
print(city_counts)

# Cumulative operations
df['salary_cumsum'] = df.groupby('department')['salary'].cumsum()

# Rolling window
df['salary_rolling_avg'] = df['salary'].rolling(window=3).mean()

Merging and Joining

# Two DataFrames
employees = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'dept_id': [101, 102, 101]
})

departments = pd.DataFrame({
    'dept_id': [101, 102, 103],
    'dept_name': ['Engineering', 'Marketing', 'Sales']
})

# Inner join (only matching rows)
merged = pd.merge(employees, departments, on='dept_id', how='inner')

# Left join (keep all from left)
merged = pd.merge(employees, departments, on='dept_id', how='left')

# Right join (keep all from right)
merged = pd.merge(employees, departments, on='dept_id', how='right')

# Outer join (keep all)
merged = pd.merge(employees, departments, on='dept_id', how='outer')

# Concatenate DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Stack vertically
combined = pd.concat([df1, df2], ignore_index=True)

# Stack horizontally
combined = pd.concat([df1, df2], axis=1)

Handling Missing Data

# Check for missing values
print(df.isnull().sum())           # Count per column
print(df.isnull().sum().sum())     # Total missing
print(df.isnull().mean() * 100)    # Percentage missing

# Drop missing values
df_clean = df.dropna()                    # Drop any row with missing
df_clean = df.dropna(subset=['age'])      # Drop only if 'age' is missing
df_clean = df.dropna(thresh=3)            # Keep rows with at least 3 non-null

# Fill missing values
df['age'].fillna(df['age'].median(), inplace=True)   # Fill with median
df['city'].fillna('Unknown', inplace=True)            # Fill with constant
df['salary'].fillna(method='ffill', inplace=True)    # Forward fill
df['salary'].fillna(method='bfill', inplace=True)    # Backward fill

# Interpolate (for time series)
df['temperature'].interpolate(method='linear', inplace=True)

# Drop columns with too many missing values
threshold = len(df) * 0.5  # 50% threshold
df = df.dropna(axis=1, thresh=threshold)

Data Transformation

# Apply function to column
df['salary_monthly'] = df['salary'] / 12

# Apply function to rows
df['full_name'] = df.apply(
    lambda row: f"{row['first_name']} {row['last_name']}", axis=1
)

# Map values
grade_map = {'A': 4, 'B': 3, 'C': 2, 'D': 1, 'F': 0}
df['gpa'] = df['grade'].map(grade_map)

# Cut into bins
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 50, 100],
    labels=['Youth', 'Young Adult', 'Middle Age', 'Senior']
)

# Get dummies (one-hot encoding)
dummies = pd.get_dummies(df['city'], prefix='city')
df = pd.concat([df, dummies], axis=1)

# Rename columns
df = df.rename(columns={'old_name': 'new_name'})

# Type conversion
df['date'] = pd.to_datetime(df['date'])
df['price'] = df['price'].astype(float)

# String operations
df['name_upper'] = df['name'].str.upper()
df['name_clean'] = df['name'].str.strip().str.lower()

Data Cleaning

# Remove duplicates
df = df.drop_duplicates()
df = df.drop_duplicates(subset=['name', 'email'])  # Based on specific columns

# Fix data types
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # Invalid -> NaN
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Handle outliers with IQR
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df_clean = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]

# Or cap outliers
df['salary'] = df['salary'].clip(lower=lower_bound, upper=upper_bound)

# Clean text data
df['email'] = df['email'].str.strip().str.lower()
df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)

Pivot Tables and Reshaping

# Pivot table
pivot = df.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    margins=True,           # Add row/column totals
    fill_value=0
)

# Melt (wide to long)
melted = pd.melt(df, id_vars=['name'], value_vars=['math', 'science', 'english'])

# Stack/unstack
stacked = df.set_index(['region', 'product']).unstack(level='product')

# Cross tabulation
cross_tab = pd.crosstab(df['region'], df['product'], margins=True)

Basic Visualization

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram
df['salary'].hist(bins=20, edgecolor='black', figsize=(10, 6))
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Count')
plt.savefig('salary_dist.png')
plt.close()

# Bar chart
city_counts = df['city'].value_counts()
city_counts.plot(kind='bar', figsize=(10, 6))
plt.title('Employees by City')
plt.xlabel('City')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('city_counts.png')
plt.close()

# Scatter plot
df.plot.scatter(x='age', y='salary', alpha=0.5, figsize=(10, 6))
plt.title('Age vs Salary')
plt.savefig('age_salary.png')
plt.close()

# Line plot (time series)
df.set_index('date')['sales'].plot(figsize=(12, 6))
plt.title('Sales Over Time')
plt.savefig('sales_trend.png')
plt.close()

# Correlation heatmap
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation.png')
plt.close()

# Box plot
df.boxplot(column='salary', by='department', figsize=(10, 6))
plt.title('Salary by Department')
plt.savefig('salary_boxplot.png')
plt.close()

# Subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
df['salary'].hist(ax=axes[0, 0], bins=20)
axes[0, 0].set_title('Salary Distribution')
df['age'].hist(ax=axes[0, 1], bins=20)
axes[0, 1].set_title('Age Distribution')
df.plot.scatter(x='age', y='salary', ax=axes[1, 0])
axes[1, 0].set_title('Age vs Salary')
df['city'].value_counts().plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_title('Employees by City')
plt.tight_layout()
plt.savefig('dashboard.png')
plt.close()

Real-World Example: Analyzing Sales Data

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('sales_data.csv')

# Quick exploration
print(f"Dataset shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nBasic stats:\n{df.describe()}")

# Clean data
df['date'] = pd.to_datetime(df['date'])
df['revenue'] = df['quantity'] * df['price']
df = df.drop_duplicates()

# Feature engineering
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter

# Analysis 1: Revenue by product
product_revenue = df.groupby('product')['revenue'].sum().sort_values(ascending=False)
print(f"\nRevenue by Product:\n{product_revenue}")

# Analysis 2: Monthly trend
monthly = df.groupby('month')['revenue'].sum()

# Analysis 3: Top customers
top_customers = df.groupby('customer_id')['revenue'].sum().nlargest(10)

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Product revenue
product_revenue.plot(kind='bar', ax=axes[0, 0], color='steelblue')
axes[0, 0].set_title('Revenue by Product')
axes[0, 0].set_ylabel('Revenue ($)')

# Monthly trend
monthly.plot(kind='line', ax=axes[0, 1], marker='o', color='green')
axes[0, 1].set_title('Monthly Revenue Trend')
axes[0, 1].set_ylabel('Revenue ($)')

# Top customers
top_customers.plot(kind='barh', ax=axes[1, 0], color='coral')
axes[1, 0].set_title('Top 10 Customers')

# Day of week
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
day_sales = df.groupby('day_of_week')['revenue'].sum().reindex(day_order)
day_sales.plot(kind='bar', ax=axes[1, 1], color='purple')
axes[1, 1].set_title('Revenue by Day of Week')

plt.tight_layout()
plt.savefig('sales_analysis.png', dpi=150)
plt.show()

Common Mistakes

MistakeProblemSolution
Not checking dtypesUnexpected behaviorAlways check df.dtypes
Modifying original dataUnexpected side effectsUse .copy() or method chaining
Not handling missing valuesErrors in calculationsCheck df.isnull().sum() first
Using inplace=True everywhereHard to chain operationsUse method chaining instead
Forgetting reset_index()Misaligned index after groupbyReset index when needed
Ignoring duplicatesSkewed resultsCheck and remove duplicates

Key Takeaways

  1. pd.read_csv() is the most common data loading method
  2. Use boolean indexing for filtering: df[df['col'] > value]
  3. groupby() is powerful for aggregation
  4. Always check for missing data first with df.isnull().sum()
  5. Use describe() for quick statistical overview
  6. pandas integrates well with matplotlib and seaborn
  7. Use inplace=True to modify DataFrame directly
  8. Method chaining keeps transformations readable
  9. Use .copy() to avoid SettingWithCopyWarning
  10. Prefer vectorized operations over apply() for performance

Premium Content

Python Data Analysis — Pandas & NumPy

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert Python Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement