Pandas DataFrames and Series
Master Pandas — Series, DataFrame, indexing, selection, and basic operations.
What is Pandas?
DfPandas
Pandas is the most important library for data manipulation and analysis in Python. It provides labeled, indexed data structures optimized for tabular, heterogeneous data. Built on NumPy, Pandas adds label-based indexing, handling of missing data, and powerful group-by operations. Named after "panel data" (econometrics term for multidimensional structured datasets).
Core Data Structures
Series (1D) DataFrame (2D)
+-------------+ +------+------+------+
| Index | | | Name | Age |
+------+------+ +------+------+------+
| 0 | Alice| | 0 |Alice | 25 |
| 1 | Bob | | 1 | Bob | 30 |
| 2 | Char | | 2 |Charl | 35 |
+------+------+ +------+------+------+
↑ ↑ ↑
index + values index + columns + values
Series: The 1D Building Block
DfSeries
A Series is a one-dimensional labeled array capable of holding any data type. It consists of two aligned arrays: an index (labels) and values (data). The index provides automatic alignment for arithmetic operations — elements are matched by label, not position.
import pandas as pd
import numpy as np
# Creating Series
s = pd.Series([10, 20, 30, 40, 50])
print(s)
# 0 10
# 1 20
# 2 30
# 3 40
# 4 50
# Custom index
s_named = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='scores')
print(s_named)
# a 10
# b 20
# c 30
# From dictionary
s_dict = pd.Series({'Alice': 95, 'Bob': 87, 'Charlie': 92})
print(s_dict)
Series Attributes
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.index) # Index(['a', 'b', 'c'], dtype='object')
print(s.values) # [10 20 30]
print(s.dtype) # int64
print(s.name) # None
print(s.nbytes) # 24
print(s.shape) # (3,)
print(s.ndim) # 1
print(s.is_unique) # True
Series Alignment
Index Alignment
When performing operations on two Series, Pandas aligns values by index label. Missing indices produce NaN. This is fundamentally different from NumPy's position-based alignment.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
# Aligned by index
result = s1 + s2
print(result)
# a NaN (s1 has 'a', s2 doesn't)
# b 22.0 (2 + 20)
# c 33.0 (3 + 30)
# d NaN (s2 has 'd', s1 doesn't)
DataFrame: The 2D Workhorse
DfDataFrame
A DataFrame is a 2D labeled data structure with columns of potentially different types. Think of it as a dictionary of Series sharing a common index. Internally, it stores columns as separate Series objects in a dict-like structure, enabling efficient column operations.
# From dictionary
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'city': ['NYC', 'LA', 'Chicago', 'NYC'],
'salary': [70000, 85000, 92000, 78000]
})
print(df)
# name age city salary
# 0 Alice 25 NYC 70000
# 1 Bob 30 LA 85000
# 2 Charlie 35 Chicago 92000
# 3 Diana 28 NYC 78000
# From 2D array
df_arr = pd.DataFrame(
np.random.randn(4, 3),
columns=['A', 'B', 'C'],
index=['w', 'x', 'y', 'z']
)
# From list of dicts
df_records = pd.DataFrame([
{'name': 'Alice', 'score': 95},
{'name': 'Bob', 'score': 87}
])
DataFrame Attributes
print(df.shape) # (4, 4)
print(df.columns) # Index(['name', 'age', 'city', 'salary'])
print(df.dtypes)
# name object
# age int64
# city object
# salary int64
print(df.index) # RangeIndex(start=0, stop=4, step=1)
print(df.values) # 2D numpy array
print(df.memory_usage(deep=True)) # per-column memory
print(df.info()) # concise summary
print(df.describe()) # statistical summary
Indexing and Selection
Bracket Selection ([])
# Column selection
df['name'] # Returns Series
df[['name', 'age']] # Returns DataFrame (list of cols)
# Boolean filtering
df[df['age'] > 28]
df[df['city'] == 'NYC']
df[(df['age'] > 25) & (df['salary'] > 80000)]
# Row slicing (only for explicit slices, NOT single labels)
df[1:3] # Rows 1 and 2
Label-based Selection (.loc[])
# Single row
df.loc[0] # Returns Series
# Row and column
df.loc[0, 'name'] # 'Alice'
# Range of rows and columns (INCLUSIVE end)
df.loc[0:2, 'name':'city']
# Conditional
df.loc[df['age'] > 28, ['name', 'salary']]
# Setting values
df.loc[df['city'] == 'NYC', 'bonus'] = 5000
Position-based Selection (.iloc[])
# Single row
df.iloc[0] # Returns Series
# Row and column
df.iloc[0, 1] # 25 (age of first row)
# Range (EXCLUSIVE end — standard Python slicing)
df.iloc[0:2, 0:2] # First 2 rows, first 2 columns
# Negative indexing
df.iloc[-1] # Last row
df.iloc[:, -1] # Last column
# Boolean (via .iloc with boolean array)
df.iloc[df['age'].values > 28]
loc vs iloc — Key Differences
.loc[]uses labels and is inclusive on both ends.iloc[]uses integer positions and is exclusive on the end (like Python slicing)- When the index is default
RangeIndex(0, n),df.loc[0]anddf.iloc[0]return the same row — but for different reasons - After
df.set_index('name'),df.loc['Alice']works butdf.iloc['Alice']raisesTypeError
Filtering and Boolean Operations
# Complex conditions
mask = (df['age'] > 25) & (df['city'].isin(['NYC', 'LA']))
filtered = df[mask]
# Query method (string-based filtering)
filtered = df.query('age > 25 and city in ["NYC", "LA"]')
# isin for categorical filtering
df[df['city'].isin(['NYC', 'Chicago'])]
# Between for range
df[df['age'].between(25, 30)]
# String methods
df[df['name'].str.startswith('A')]
df[df['name'].str.contains('li')]
# Where (returns original values where condition is True, NaN otherwise)
df['age'].where(df['age'] > 25, other=0) # Replace <25 with 0
# Mask (opposite of where — replaces True conditions)
df['age'].mask(df['age'] > 25, other=-1) # Replace >25 with -1
Adding, Modifying, and Deleting Columns
# Add new column
df['country'] = 'USA'
df['tax'] = df['salary'] * 0.3
# Conditional column
df['senior'] = np.where(df['age'] > 30, 'Yes', 'No')
# Insert at specific position
df.insert(2, 'department', ['Engineering', 'Sales', 'Marketing', 'Engineering'])
# Rename columns
df_renamed = df.rename(columns={'name': 'employee_name', 'age': 'employee_age'})
# Drop columns
df_dropped = df.drop(columns=['country', 'tax'])
df.drop(columns=['country'], inplace=True) # modifies in place
# Delete
del df['tax']
Basic Operations
Aggregation
# Column-level
print(df['salary'].mean())
print(df['salary'].median())
print(df['salary'].std())
print(df['salary'].min(), df['salary'].max())
print(df['salary'].quantile(0.75))
# DataFrame-level
print(df.describe())
print(df['city'].value_counts())
print(df['salary'].sum())
# Custom aggregation
print(df.agg({
'age': ['mean', 'min', 'max'],
'salary': ['sum', 'mean']
}))
Sorting
# Sort by values
df.sort_values('salary', ascending=False)
df.sort_values(['city', 'salary'], ascending=[True, False])
# Sort by index
df.sort_index(ascending=False)
# Sort by column with key
df.sort_values('name', key=lambda x: x.str.len())
apply and map
# apply on Series (element-wise function)
df['salary_k'] = df['salary'].apply(lambda x: x / 1000)
# apply on DataFrame (row or column-wise)
df['age_salary_ratio'] = df.apply(
lambda row: row['salary'] / row['age'], axis=1
)
# map on Series (element-wise transformation)
df['city_code'] = df['city'].map({
'NYC': 'NYC',
'LA': 'LA',
'Chicago': 'CHI'
})
# applymap for element-wise on entire DataFrame
df_numeric = df[['age', 'salary']].applymap(lambda x: x * 1.1) # 10% raise
Reading and Writing Data
# CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', index_col=0, parse_dates=['date'])
df.to_csv('output.csv', index=False)
# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df.to_excel('output.xlsx', index=False)
# JSON
df = pd.read_json('data.json')
df.to_json('output.json', orient='records')
# Parquet (efficient columnar format)
df = pd.read_parquet('data.parquet')
df.to_parquet('output.parquet', index=False)
# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table WHERE age > 25', conn)
Performance Tip — Parquet vs CSV
Parquet stores data in columnar format with compression. For a 1GB CSV file, Parquet is typically 5-10x smaller and 10-100x faster to read because it only loads selected columns (column pruning) and supports predicate pushdown.
Missing Data
df_with_nan = pd.DataFrame({
'A': [1, np.nan, 3, np.nan, 5],
'B': [np.nan, 2, 3, 4, np.nan],
'C': [1, 2, np.nan, 4, 5]
})
# Detection
df_with_nan.isnull() # Boolean DataFrame
df_with_nan.isnull().sum() # Count per column
df_with_nan.isnull().sum().sum() # Total missing
# Filtering
df_with_nan.dropna() # Drop any row with NaN
df_with_nan.dropna(subset=['A']) # Drop if NaN in column A
df_with_nan.dropna(thresh=2) # Keep rows with ≥2 non-NaN
# Imputation
df_with_nan.fillna(0) # Fill with constant
df_with_nan.fillna(df.mean()) # Fill with column means
df_with_nan.fillna(method='ffill') # Forward fill
df_with_nan.fillna(method='bfill') # Backward fill
df_with_nan.interpolate() # Linear interpolation
Key Takeaways
Summary: Pandas DataFrames and Series
- Series is a 1D labeled array; DataFrame is a 2D labeled structure (dict of Series)
- Use
.loc[]for label-based indexing (inclusive) and.iloc[]for position-based indexing (exclusive) - Index alignment means operations on Series/DataFrames match by label, not position — producing NaN for mismatches
- Filtering supports boolean masks,
.query(),.isin(),.between(), and string methods - Missing data is handled via
isnull(),dropna(),fillna(), andinterpolate() apply()is flexible but slow — prefer vectorized operations for performance- Choose Parquet over CSV for large datasets (columnar storage, compression, fast reads)
Practice Exercise
- Create a DataFrame of 100 students with random names, ages (18-30), GPAs (2.0-4.0), and majors. Find the top 5 students by GPA in each major using
groupby()andnlargest(). - Load a CSV file with missing values. Quantify missingness per column, identify rows with more than 30% missing data, and impute numerical columns with median, categorical with mode.
- Implement a function that takes a DataFrame and returns a report: shape, dtypes, missing counts, duplicate rows, and memory usage.
- Merge two DataFrames on a common key with different column names using
left_onandright_on. Verify the merge preserved all rows from the left DataFrame. - Use method chaining to: load data → drop duplicates → fill NaN → filter rows → sort → export to Parquet, all in a single expression using
.pipe().