Why Pandas?
Pandas is the backbone of data science in Python — it provides DataFrame (2D table) and Series (1D array) structures that make data manipulation fast and expressive.
Feature Pandas NumPy Python list
Labelled columns ✅ ❌ ❌
Mixed data types ✅ ❌ ✅
Missing data (NaN) ✅ built-in limited ❌
GroupBy / Agg ✅ one-liner manual manual
SQL-like joins ✅ ❌ ❌
Creating DataFrames
import pandas as pd
import numpy as np
# From dict
df = pd.DataFrame({
"name": ["Alice", "Bob", "Carol", "Dave"],
"age": [28, 34, 29, 45],
"salary": [72000, 85000, 68000, 120000],
"dept": ["Eng", "DS", "DS", "Eng"],
})
# From CSV / Excel / SQL
df_csv = pd.read_csv("data.csv", parse_dates=["date"])
df_excel = pd.read_excel("data.xlsx", sheet_name="Sales")
df_sql = pd.read_sql("SELECT * FROM users WHERE active=1", con=engine)
# Quick overview
print(df.head(3)) # first 3 rows
print(df.info()) # dtypes, nulls
print(df.describe()) # summary stats
print(df.shape) # (rows, cols)
print(df.dtypes) # column types
Indexing & Selection
# Column selection
df["salary"] # Series
df[["name", "salary"]] # DataFrame (multi-col)
# Row selection
df.loc[0] # by label
df.iloc[0] # by integer position
df.loc[0:2, "name":"salary"] # label slice
# Boolean filtering
high_earners = df[df["salary"] > 80000]
ds_team = df[df["dept"] == "DS"]
senior_ds = df[(df["dept"] == "DS") & (df["age"] > 30)]
# .query() — cleaner syntax
result = df.query("dept == 'DS' and salary > 70000")
print(result)
Core Operations
# ── Add / Transform columns ──────────────────────────────────────────
df["salary_k"] = df["salary"] / 1000
df["senior"] = df["age"] > 35
df["name_upper"] = df["name"].str.upper()
df["log_salary"] = np.log(df["salary"])
# apply — row or column function
df["salary_grade"] = df["salary"].apply(
lambda x: "A" if x >= 100000 else ("B" if x >= 75000 else "C")
)
# ── GroupBy — split-apply-combine ────────────────────────────────────
agg = df.groupby("dept").agg(
avg_salary=("salary", "mean"),
max_salary=("salary", "max"),
count=("name", "count"),
avg_age=("age", "mean"),
).round(2)
print(agg)
# Multiple groupby keys
df.groupby(["dept", "senior"])["salary"].mean().unstack()
# ── Sorting ──────────────────────────────────────────────────────────
df.sort_values("salary", ascending=False, inplace=True)
df.sort_values(["dept", "salary"], ascending=[True, False])
# ── Missing Data ─────────────────────────────────────────────────────
df["salary"].isna().sum() # count NaNs
df.dropna(subset=["salary"]) # drop rows with NaN salary
df["salary"].fillna(df["salary"].median()) # fill with median
df.fillna({"age": df["age"].mean(), "salary": 0})
# ── String operations ─────────────────────────────────────────────────
df["name"].str.lower()
df["name"].str.contains("al", case=False)
df["name"].str.replace("Carol", "Caroline")
df["name"].str.split(" ", expand=True) # split into columns
Merging & Joining
orders = pd.DataFrame({
"name": ["Alice", "Bob", "Alice", "Dave"],
"product": ["Laptop", "Phone", "Mouse", "Tablet"],
"amount": [1200, 800, 25, 600],
})
# SQL-style merges
inner = pd.merge(df, orders, on="name", how="inner")
left = pd.merge(df, orders, on="name", how="left") # keep all employees
# Concatenate
df_combined = pd.concat([df, df2], axis=0, ignore_index=True) # stack rows
df_wide = pd.concat([df, features], axis=1) # add columns
Reshaping
# Wide → Long (melt)
monthly = pd.DataFrame({
"product": ["A", "B", "C"],
"Jan": [100, 200, 150],
"Feb": [120, 180, 160],
"Mar": [110, 210, 140],
})
long = pd.melt(monthly, id_vars="product",
var_name="month", value_name="sales")
# Long → Wide (pivot)
wide = long.pivot(index="product", columns="month", values="sales")
# Pivot table (like Excel)
pivot = pd.pivot_table(orders, values="amount",
index="name", aggfunc=["mean", "sum", "count"])
Time Series
# Parse dates
ts = pd.read_csv("stock.csv", parse_dates=["date"], index_col="date")
# Resample
ts.resample("M").mean() # monthly average
ts.resample("W").agg({"open": "first", "close": "last", "volume": "sum"})
# Rolling statistics
ts["close"].rolling(window=30).mean() # 30-day moving average
ts["close"].rolling(window=30).std() # 30-day volatility
# Shift / lag
ts["prev_close"] = ts["close"].shift(1)
ts["pct_change"] = ts["close"].pct_change()
Performance Tips
| Tip | Speedup | Code |
|---|---|---|
| Use vectorised ops | 100x | df["x"] * 2 not df.apply(lambda r: r*2) |
| Category dtype | 10x mem | df["dept"].astype("category") |
query() | 2x | df.query("age > 30") |
eval() | 2x | df.eval("profit = revenue - cost") |
| Read parquet | 10x | pd.read_parquet("data.parquet") |
| Chunk large CSVs | memory | pd.read_csv("big.csv", chunksize=10000) |
Key Takeaways
loc/iloc— uselocfor labels,ilocfor positionsgroupby().agg()— one-liner for split-apply-combine operationsmerge()covers all SQL join types: inner, left, right, outermelt/pivotreshape between wide and long format- Always profile — use
df.info()anddf.describe()before any analysis