Introduction
A pandas DataFrame is a two-dimensional, labeled data structure whose columns can each hold a different data type. It is the central data structure in pandas for tabular work, offering spreadsheet-like functionality together with label-based and position-based indexing and a broad set of manipulation methods. DataFrames can be created from many sources, including CSV files, SQL databases, and Python dictionaries.
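As a rough illustration of the three sources named above, the sketch below builds the same small table from a dictionary, from CSV text, and from a SQL query. The table name people, the sample rows, and the use of io.StringIO and an in-memory SQLite database in place of real files are assumptions made only to keep the example self-contained.

```python
import io
import sqlite3

import pandas as pd

# From a Python dictionary: keys become column names.
df_dict = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# From CSV text. io.StringIO stands in for a file on disk (illustrative data).
csv_text = "name,age\nAlice,25\nBob,30\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# From a SQL database. An in-memory SQLite database with a hypothetical
# "people" table is used here purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [("Alice", 25), ("Bob", 30)])
conn.commit()
df_sql = pd.read_sql_query("SELECT name, age FROM people ORDER BY age", conn)
conn.close()

print(df_dict)
print(df_csv)
print(df_sql)
```

In real use, pd.read_csv would normally receive a file path and pd.read_sql_query a connection to an existing database.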
Key Concepts
- Two-dimensional structure: Rows and columns with labels
- Heterogeneous columns: Different data types per column
- Indexing: Several accessors for selecting data: loc (by label), iloc (by integer position), at (a single value by label)
- Column operations: Adding, removing, and modifying columns
- Data types: Support for numeric, string, datetime, categorical
- Mutability: DataFrames can be modified in place, although most methods return a new object by default
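The concepts above can be sketched on a small frame with a string index, where the difference between label-based and position-based access becomes visible. The column names joined, team, and passed and all sample values are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [25, 30, 35],
        "score": [85.5, 90.0, 78.5],
        "joined": pd.to_datetime(["2021-01-15", "2020-06-01", "2022-03-20"]),
        "team": pd.Categorical(["red", "blue", "red"]),
    },
    index=["alice", "bob", "charlie"],  # string labels instead of 0, 1, 2
)

# Heterogeneous columns: each column keeps its own dtype.
print(df.dtypes)

row_by_label = df.loc["bob"]        # label-based row lookup
row_by_position = df.iloc[1]        # position-based row lookup (second row)
single_value = df.at["bob", "age"]  # scalar access by row label and column label

# Mutability: add a column, then overwrite a single cell in place.
df["passed"] = df["score"] >= 80
df.at["charlie", "score"] = 81.0
```

With the default integer index, loc[0] and iloc[0] happen to return the same row, which can hide the distinction; a non-integer index like the one here makes it explicit.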
Python Implementation
import pandas as pd
import numpy as np
# Creating DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "score": [85.5, 90.0, 78.5]
})
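# From CSV (requires a data.csv file in the working directory;
# stored under a separate name so the examples below keep using df)
df_csv = pd.read_csv("data.csv")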
# Accessing columns
ages = df["age"]
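name_col = df.name # Attribute syntax; works only if the column name is a valid identifier that does not clash with a DataFrame attribute or method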
# Accessing rows
first_row = df.iloc[0] # By integer position
by_index = df.loc[0] # By index label (same row here because the default index is 0, 1, 2)
# Adding columns
df["grade"] = ["A", "B", "C"]
df["pass"] = df["score"] >= 60
# Removing columns
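df_dropped = df.drop(columns=["grade"]) # Returns a new DataFrame; df itself is unchanged
grade_col = df.pop("grade") # Removes "grade" from df in place and returns it as a Series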
# Selecting subset
subset = df[["name", "score"]]
# Statistics
df.describe() # Summary statistics
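df.mean(numeric_only=True) # Mean of each numeric column (the string column "name" raises an error otherwise in pandas 2.0+)
df.corr(numeric_only=True) # Correlation matrix of the numeric columns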
# Sorting
df_sorted = df.sort_values(by="score", ascending=False)
df_sorted_index = df.sort_index()
# Iteration
for col in df.columns:
    print(col)
for index, row in df.iterrows():
    print(index, row["name"])
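As a possible complement to the iteration loops above, the sketch below compares iterrows() with itertuples() and with a vectorized expression. The passed column and the threshold of 80 are illustrative assumptions. Row-wise loops are generally slower than column-wise operations, so the vectorized form is usually preferable when it can express the computation.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"], "score": [85.5, 90.0, 78.5]}
)

# iterrows() yields (index, Series) pairs; values are read by column label.
labels_iterrows = [f"{index}:{row['name']}" for index, row in df.iterrows()]

# itertuples() yields namedtuples; values are read as attributes,
# and the index is exposed as the "Index" field.
labels_itertuples = [f"{row.Index}:{row.name}" for row in df.itertuples()]

# A vectorized expression works on the whole column without a Python loop.
df["passed"] = df["score"] >= 80

print(labels_iterrows)
print(labels_itertuples)
print(df)
```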
When to Use
- Tabular data analysis and manipulation
- Loading and processing datasets
- Data cleaning and transformation
- Statistical analysis
- Building ML feature matrices
- Exporting data to various formats
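A minimal sketch of two of the uses listed above, cleaning and exporting, is shown below. The messy input values are invented for illustration, and the particular cleaning steps are one reasonable choice rather than a prescribed recipe.

```python
import pandas as pd

# Hypothetical raw data: stray whitespace, a duplicate row,
# a missing name, and numbers stored as text.
raw = pd.DataFrame(
    {
        "name": [" Alice", "Bob ", "Bob ", None],
        "score": ["85.5", "90", "90", "n/a"],
    }
)

clean = (
    raw.dropna(subset=["name"])  # drop rows that have no name
    .assign(
        name=lambda d: d["name"].str.strip(),  # trim surrounding whitespace
        # convert text to numbers; unparseable values become NaN
        score=lambda d: pd.to_numeric(d["score"], errors="coerce"),
    )
    .drop_duplicates()  # remove repeated rows
    .reset_index(drop=True)
)

# Export: called without a path, to_csv and to_json return strings.
csv_text = clean.to_csv(index=False)
json_text = clean.to_json(orient="records")

print(clean)
print(csv_text)
print(json_text)
```

Passing a file path as the first argument to to_csv or to_json writes to disk instead of returning a string.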
Key Takeaways
- DataFrames are the primary data structure for pandas operations
- The .loc[] and .iloc[] indexers select by label and by integer position, respectively
- Method chaining enables readable data transformation pipelines
- Understanding copy vs view is crucial to avoid unintended modifications
- The inplace parameter, when set to True, makes a method modify the DataFrame directly and return None instead of a new object
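The last three takeaways can be sketched together. The score_pct column and the threshold of 80 are illustrative assumptions, and the example uses an explicit copy() so that it does not depend on whether a given operation returns a view or a copy.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "score": [85.5, 90.0, 78.5],
    }
)

# Method chaining: each step returns a new DataFrame, so the
# pipeline reads top to bottom and df itself is left untouched.
top_scorers = (
    df[df["score"] >= 80]
    .assign(score_pct=lambda d: d["score"] / 100)
    .sort_values(by="score", ascending=False)
    .reset_index(drop=True)
)

# Explicit copy: changes to backup never reach df.
backup = df.copy()
backup.loc[0, "age"] = 99

# inplace=True mutates df and returns None, so the result
# should not be assigned or chained.
result = df.sort_values(by="age", ascending=False, inplace=True)

print(top_scorers)
print(df)
print(result)
```

Because an inplace call returns None, it cannot take part in a method chain; reassigning the returned object (df = df.sort_values(...)) keeps both styles compatible.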