CW

Pandas DataFrames and Series

Module 2: NumPy & PandasFree Lesson

Advertisement

Pandas DataFrames and Series

Master Pandas — Series, DataFrame, indexing, selection, and basic operations.

Pandas Data StructuresSeries (1D)IndexValuesnameAliceage25cityNYC1D labeled array with axis labelsDataFrame (2D)nameagecity0Alice25NYC1Bob30LA2Charlie35Chicago2D labeled structure (dict of Series)

What is Pandas?

DfPandas

Pandas is the most important library for data manipulation and analysis in Python. It provides labeled, indexed data structures optimized for tabular, heterogeneous data. Built on NumPy, Pandas adds label-based indexing, handling of missing data, and powerful group-by operations. Named after "panel data" (econometrics term for multidimensional structured datasets).

Core Data Structures

Architecture Diagram
Series (1D)                    DataFrame (2D)
+-------------+               +------+------+------+
|    Index    |               |      | Name | Age  |
+------+------+               +------+------+------+
|  0   | Alice|               |  0   |Alice |  25  |
|  1   | Bob  |               |  1   | Bob  |  30  |
|  2   | Char |               |  2   |Charl |  35  |
+------+------+               +------+------+------+
  ↑                           ↑             ↑
  index + values              index + columns + values

Series: The 1D Building Block

DfSeries

A Series is a one-dimensional labeled array capable of holding any data type. It consists of two aligned arrays: an index (labels) and values (data). The index provides automatic alignment for arithmetic operations — elements are matched by label, not position.

import pandas as pd
import numpy as np

# Creating Series
s = pd.Series([10, 20, 30, 40, 50])
print(s)
# 0    10
# 1    20
# 2    30
# 3    40
# 4    50

# Custom index
s_named = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='scores')
print(s_named)
# a    10
# b    20
# c    30

# From dictionary
s_dict = pd.Series({'Alice': 95, 'Bob': 87, 'Charlie': 92})
print(s_dict)

Series Attributes

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s.index)       # Index(['a', 'b', 'c'], dtype='object')
print(s.values)      # [10 20 30]
print(s.dtype)       # int64
print(s.name)        # None
print(s.nbytes)      # 24
print(s.shape)       # (3,)
print(s.ndim)        # 1
print(s.is_unique)   # True

Series Alignment

Index Alignment

When performing operations on two Series, Pandas aligns values by index label. Missing indices produce NaN. This is fundamentally different from NumPy's position-based alignment.

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Aligned by index
result = s1 + s2
print(result)
# a     NaN    (s1 has 'a', s2 doesn't)
# b    22.0    (2 + 20)
# c    33.0    (3 + 30)
# d     NaN    (s2 has 'd', s1 doesn't)

DataFrame: The 2D Workhorse

DfDataFrame

A DataFrame is a 2D labeled data structure with columns of potentially different types. Think of it as a dictionary of Series sharing a common index. Internally, it stores columns as separate Series objects in a dict-like structure, enabling efficient column operations.

# From dictionary
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'Chicago', 'NYC'],
    'salary': [70000, 85000, 92000, 78000]
})
print(df)
#       name  age     city  salary
# 0    Alice   25      NYC   70000
# 1      Bob   30       LA   85000
# 2  Charlie   35  Chicago   92000
# 3    Diana   28      NYC   78000

# From 2D array
df_arr = pd.DataFrame(
    np.random.randn(4, 3),
    columns=['A', 'B', 'C'],
    index=['w', 'x', 'y', 'z']
)

# From list of dicts
df_records = pd.DataFrame([
    {'name': 'Alice', 'score': 95},
    {'name': 'Bob', 'score': 87}
])

DataFrame Attributes

print(df.shape)       # (4, 4)
print(df.columns)     # Index(['name', 'age', 'city', 'salary'])
print(df.dtypes)
# name      object
# age        int64
# city      object
# salary     int64

print(df.index)       # RangeIndex(start=0, stop=4, step=1)
print(df.values)      # 2D numpy array
print(df.memory_usage(deep=True))  # per-column memory
print(df.info())      # concise summary
print(df.describe())  # statistical summary

Indexing and Selection

Pandas Indexing Methods[] (Bracket)Column selectionBoolean filteringSlice (row-only)df['col'] / df[condition]Ambiguous for rows!.loc[] (Label)Row by labelColumn by labelLabel-based slicingdf.loc['row', 'col']Inclusive end slicing.iloc[] (Integer)Row by positionColumn by positionStandard Python slicingdf.iloc[0, 1]Exclusive end slicing

Bracket Selection ([])

# Column selection
df['name']           # Returns Series
df[['name', 'age']]  # Returns DataFrame (list of cols)

# Boolean filtering
df[df['age'] > 28]
df[df['city'] == 'NYC']
df[(df['age'] > 25) & (df['salary'] > 80000)]

# Row slicing (only for explicit slices, NOT single labels)
df[1:3]  # Rows 1 and 2

Label-based Selection (.loc[])

# Single row
df.loc[0]           # Returns Series

# Row and column
df.loc[0, 'name']   # 'Alice'

# Range of rows and columns (INCLUSIVE end)
df.loc[0:2, 'name':'city']

# Conditional
df.loc[df['age'] > 28, ['name', 'salary']]

# Setting values
df.loc[df['city'] == 'NYC', 'bonus'] = 5000

Position-based Selection (.iloc[])

# Single row
df.iloc[0]          # Returns Series

# Row and column
df.iloc[0, 1]       # 25 (age of first row)

# Range (EXCLUSIVE end — standard Python slicing)
df.iloc[0:2, 0:2]   # First 2 rows, first 2 columns

# Negative indexing
df.iloc[-1]          # Last row
df.iloc[:, -1]       # Last column

# Boolean (via .iloc with boolean array)
df.iloc[df['age'].values > 28]

loc vs iloc — Key Differences

  • .loc[] uses labels and is inclusive on both ends
  • .iloc[] uses integer positions and is exclusive on the end (like Python slicing)
  • When the index is default RangeIndex(0, n), df.loc[0] and df.iloc[0] return the same row — but for different reasons
  • After df.set_index('name'), df.loc['Alice'] works but df.iloc['Alice'] raises TypeError

Filtering and Boolean Operations

# Complex conditions
mask = (df['age'] > 25) & (df['city'].isin(['NYC', 'LA']))
filtered = df[mask]

# Query method (string-based filtering)
filtered = df.query('age > 25 and city in ["NYC", "LA"]')

# isin for categorical filtering
df[df['city'].isin(['NYC', 'Chicago'])]

# Between for range
df[df['age'].between(25, 30)]

# String methods
df[df['name'].str.startswith('A')]
df[df['name'].str.contains('li')]

# Where (returns original values where condition is True, NaN otherwise)
df['age'].where(df['age'] > 25, other=0)  # Replace <25 with 0

# Mask (opposite of where — replaces True conditions)
df['age'].mask(df['age'] > 25, other=-1)  # Replace >25 with -1

Adding, Modifying, and Deleting Columns

# Add new column
df['country'] = 'USA'
df['tax'] = df['salary'] * 0.3

# Conditional column
df['senior'] = np.where(df['age'] > 30, 'Yes', 'No')

# Insert at specific position
df.insert(2, 'department', ['Engineering', 'Sales', 'Marketing', 'Engineering'])

# Rename columns
df_renamed = df.rename(columns={'name': 'employee_name', 'age': 'employee_age'})

# Drop columns
df_dropped = df.drop(columns=['country', 'tax'])
df.drop(columns=['country'], inplace=True)  # modifies in place

# Delete
del df['tax']

Basic Operations

Aggregation

# Column-level
print(df['salary'].mean())
print(df['salary'].median())
print(df['salary'].std())
print(df['salary'].min(), df['salary'].max())
print(df['salary'].quantile(0.75))

# DataFrame-level
print(df.describe())
print(df['city'].value_counts())
print(df['salary'].sum())

# Custom aggregation
print(df.agg({
    'age': ['mean', 'min', 'max'],
    'salary': ['sum', 'mean']
}))

Sorting

# Sort by values
df.sort_values('salary', ascending=False)
df.sort_values(['city', 'salary'], ascending=[True, False])

# Sort by index
df.sort_index(ascending=False)

# Sort by column with key
df.sort_values('name', key=lambda x: x.str.len())

apply and map

# apply on Series (element-wise function)
df['salary_k'] = df['salary'].apply(lambda x: x / 1000)

# apply on DataFrame (row or column-wise)
df['age_salary_ratio'] = df.apply(
    lambda row: row['salary'] / row['age'], axis=1
)

# map on Series (element-wise transformation)
df['city_code'] = df['city'].map({
    'NYC': 'NYC',
    'LA': 'LA',
    'Chicago': 'CHI'
})

# applymap for element-wise on entire DataFrame
df_numeric = df[['age', 'salary']].applymap(lambda x: x * 1.1)  # 10% raise

Reading and Writing Data

# CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', index_col=0, parse_dates=['date'])
df.to_csv('output.csv', index=False)

# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df.to_excel('output.xlsx', index=False)

# JSON
df = pd.read_json('data.json')
df.to_json('output.json', orient='records')

# Parquet (efficient columnar format)
df = pd.read_parquet('data.parquet')
df.to_parquet('output.parquet', index=False)

# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table WHERE age > 25', conn)

Performance Tip — Parquet vs CSV

Parquet stores data in columnar format with compression. For a 1GB CSV file, Parquet is typically 5-10x smaller and 10-100x faster to read because it only loads selected columns (column pruning) and supports predicate pushdown.

Missing Data

df_with_nan = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan, 5],
    'B': [np.nan, 2, 3, 4, np.nan],
    'C': [1, 2, np.nan, 4, 5]
})

# Detection
df_with_nan.isnull()          # Boolean DataFrame
df_with_nan.isnull().sum()    # Count per column
df_with_nan.isnull().sum().sum()  # Total missing

# Filtering
df_with_nan.dropna()                    # Drop any row with NaN
df_with_nan.dropna(subset=['A'])        # Drop if NaN in column A
df_with_nan.dropna(thresh=2)            # Keep rows with ≥2 non-NaN

# Imputation
df_with_nan.fillna(0)                   # Fill with constant
df_with_nan.fillna(df.mean())           # Fill with column means
df_with_nan.fillna(method='ffill')      # Forward fill
df_with_nan.fillna(method='bfill')      # Backward fill
df_with_nan.interpolate()               # Linear interpolation

Key Takeaways

Summary: Pandas DataFrames and Series

  1. Series is a 1D labeled array; DataFrame is a 2D labeled structure (dict of Series)
  2. Use .loc[] for label-based indexing (inclusive) and .iloc[] for position-based indexing (exclusive)
  3. Index alignment means operations on Series/DataFrames match by label, not position — producing NaN for mismatches
  4. Filtering supports boolean masks, .query(), .isin(), .between(), and string methods
  5. Missing data is handled via isnull(), dropna(), fillna(), and interpolate()
  6. apply() is flexible but slow — prefer vectorized operations for performance
  7. Choose Parquet over CSV for large datasets (columnar storage, compression, fast reads)

Practice Exercise

  1. Create a DataFrame of 100 students with random names, ages (18-30), GPAs (2.0-4.0), and majors. Find the top 5 students by GPA in each major using groupby() and nlargest().
  2. Load a CSV file with missing values. Quantify missingness per column, identify rows with more than 30% missing data, and impute numerical columns with median, categorical with mode.
  3. Implement a function that takes a DataFrame and returns a report: shape, dtypes, missing counts, duplicate rows, and memory usage.
  4. Merge two DataFrames on a common key with different column names using left_on and right_on. Verify the merge preserved all rows from the left DataFrame.
  5. Use method chaining to: load data → drop duplicates → fill NaN → filter rows → sort → export to Parquet, all in a single expression using .pipe().

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement