Python Data Analysis — Pandas & NumPy

Python Data ScienceData AnalysisFree Lesson

Advertisement

Python Data Analysis — Pandas & NumPy

Pandas is the core library for data analysis in Python. NumPy provides fast numerical computing. Together they form the foundation of the Python data science ecosystem.

Learning Objectives

  • Load and explore datasets with pandas
  • Filter, group, and aggregate data
  • Handle missing data effectively
  • Create basic visualizations

What is Pandas?

Pandas provides two main data structures:

# Series — one-dimensional labeled array
import pandas as pd
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
# a    10
# b    20
# c    30
# d    40

# DataFrame — two-dimensional labeled table (like a spreadsheet)
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})
#      name  age     city
# 0    Alice   25      NYC
# 1      Bob   30       LA
# 2  Charlie   35  Chicago

Loading Data

import pandas as pd

# From CSV (most common)
df = pd.read_csv('data.csv')

# From Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# From JSON
df = pd.read_json('data.json')

# From database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM users', conn)

# Quick overview of your data
print(df.head())       # First 5 rows
print(df.info())       # Column types and missing values
print(df.describe())   # Statistical summary
print(df.shape)        # (rows, columns)
print(df.columns)      # Column names

Selection and Filtering

# Select a single column
names = df['name']

# Select multiple columns
subset = df[['name', 'age']]

# Filter rows with boolean indexing
adults = df[df['age'] >= 18]
ny_residents = df[df['city'] == 'NYC']

# Multiple conditions (use & for AND, | for OR)
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]
any_condition = df[(df['age'] < 25) | (df['city'] == 'LA')]

# Query method (SQL-like syntax)
result = df.query('age > 25 and city == "NYC"')

# loc — label-based selection
row = df.loc[0]                    # First row
subset = df.loc[0:5, 'name':'age'] # Rows 0-5, columns name to age

# iloc — integer-based selection
first_row = df.iloc[0]
first_three = df.iloc[:3]

GroupBy and Aggregation

# Group by a column and aggregate
city_stats = df.groupby('city').agg({
    'salary': ['mean', 'median', 'std', 'min', 'max'],
    'age': ['count', 'mean']
})

# Multiple groupby
dept_city = df.groupby(['department', 'city']).agg({
    'salary': 'mean',
    'employee_id': 'count'
}).rename(columns={'employee_id': 'headcount'})

# Apply custom aggregation
def salary_range(group):
    return group.max() - group.min()

df.groupby('department')['salary'].agg(salary_range)

# Pivot tables (cross-tabulation)
pivot = df.pivot_table(
    values='salary',
    index='department',
    columns='city',
    aggfunc='mean',
    fill_value=0  # Fill missing with 0
)

Handling Missing Data

# Check for missing values
print(df.isnull().sum())           # Count per column
print(df.isnull().sum().sum())     # Total missing
print(df.isnull().mean() * 100)    # Percentage missing

# Drop missing values
df_clean = df.dropna()                    # Drop any row with missing
df_clean = df.dropna(subset=['age'])      # Drop only if 'age' is missing
df_clean = df.dropna(thresh=3)            # Keep rows with at least 3 non-null

# Fill missing values
df['age'].fillna(df['age'].median(), inplace=True)   # Fill with median
df['city'].fillna('Unknown', inplace=True)            # Fill with constant
df['salary'].fillna(method='ffill', inplace=True)    # Forward fill

# Interpolate (for time series)
df['temperature'].interpolate(method='linear', inplace=True)

Data Transformation

# Apply function to column
df['salary_monthly'] = df['salary'] / 12

# Apply function to rows
df['full_name'] = df.apply(lambda row: f"{row['first_name']} {row['last_name']}", axis=1)

# Map values
grade_map = {'A': 4, 'B': 3, 'C': 2, 'D': 1, 'F': 0}
df['gpa'] = df['grade'].map(grade_map)

# Cut into bins
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100],
                         labels=['Youth', 'Young Adult', 'Middle Age', 'Senior'])

# Get dummies (one-hot encoding)
dummies = pd.get_dummies(df['city'], prefix='city')
df = pd.concat([df, dummies], axis=1)

Basic Visualization

import matplotlib.pyplot as plt

# Histogram
df['salary'].hist(bins=20, edgecolor='black')
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Count')
plt.savefig('salary_dist.png')
plt.close()

# Bar chart
city_counts = df['city'].value_counts()
city_counts.plot(kind='bar')
plt.title('Employees by City')
plt.tight_layout()
plt.savefig('city_counts.png')
plt.close()

# Scatter plot
df.plot.scatter(x='age', y='salary', alpha=0.5)
plt.title('Age vs Salary')
plt.savefig('age_salary.png')
plt.close()

# Correlation heatmap
import seaborn as sns
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.savefig('correlation.png')
plt.close()

Key Takeaways

  1. pd.read_csv() is the most common data loading method
  2. Use boolean indexing for filtering: df[df['col'] > value]
  3. groupby() is powerful for aggregation
  4. Always check for missing data first with df.isnull().sum()
  5. Use describe() for quick statistical overview
  6. pandas integrates well with matplotlib and seaborn
  7. Use inplace=True to modify DataFrame directly
  8. Method chaining keeps transformations readable

Advertisement

Need Expert Python Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement