Pandas DataFrames

Introduction

A pandas DataFrame is a two-dimensional, labeled data structure whose columns can hold different data types. It is the primary data structure in pandas for tabular data and offers spreadsheet-like functionality with label-based and position-based indexing. DataFrames can be created from many sources, including CSV files, SQL databases, and Python dictionaries.
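
As a minimal sketch of two of the creation paths mentioned above, the following builds the same small table from a Python dictionary and from CSV text. The CSV is read from an in-memory string through io.StringIO so the example runs without a file on disk; the column names and values are illustrative assumptions.

```python
import io

import pandas as pd

# Build a DataFrame from a dictionary: keys become column labels.
from_dict = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30],
})

# Build the same table from CSV text (illustrative data, not a real file).
csv_text = "name,age\nAlice,25\nBob,30\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_dict.shape)    # (rows, columns)
print(from_csv.dtypes)    # dtype inferred for each column
```

Passing a file path instead of the StringIO object reads from disk in the same way.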

Key Concepts

  • Two-dimensional structure: Rows and columns with labels
  • Heterogeneous columns: Different data types per column
  • Indexing: Multiple ways to access data (loc by label, iloc by integer position, at for a single value)
  • Column operations: Adding, removing, and modifying columns
  • Data types: Support for numeric, string, datetime, and categorical data
  • Mutability: DataFrames can be modified in place
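
The concepts above can be sketched in a few lines. The example below uses string row labels (an assumption made for illustration) so that label-based and position-based access are visibly different, then adds, modifies, and removes columns.

```python
import pandas as pd

# Illustrative data with string row labels so label and position differ.
df = pd.DataFrame(
    {"age": [25, 30, 35], "score": [85.5, 90.0, 78.5]},
    index=["alice", "bob", "charlie"],
)

by_label = df.loc["bob"]             # row selected by index label
by_position = df.iloc[1]             # row selected by integer position
single_value = df.at["bob", "age"]   # scalar access by row and column label

# Each column keeps its own dtype.
print(df.dtypes)

# Add, modify, and remove a column.
df["passed"] = df["score"] >= 80     # new boolean column
df["score"] = df["score"] / 100      # overwrite an existing column
df = df.drop(columns=["age"])        # drop returns a new DataFrame
```

With the default integer index, a label and a position can be the same number, which is why loc and iloc are easy to confuse until the index is something else.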

Python Implementation

import pandas as pd
import numpy as np

# Creating DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "score": [85.5, 90.0, 78.5]
})

# From CSV (assumes data.csv exists; stored separately so df above is not overwritten)
df_csv = pd.read_csv("data.csv")

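# Accessing columns
ages = df["age"]
# Attribute access works only when the column name is a valid Python
# identifier and does not clash with an existing DataFrame attribute
name_col = df.name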

# Accessing rows
first_row = df.iloc[0]  # By integer position
by_index = df.loc[0]    # By index label (0 is a label in the default index)

# Adding columns
df["grade"] = ["A", "B", "C"]
df["pass"] = df["score"] >= 60

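# Removing columns
df_dropped = df.drop(columns=["grade"])  # Returns a new DataFrame; df is unchanged
grade_col = df.pop("grade")              # Removes "grade" from df in place and returns it as a Series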

# Selecting subset
subset = df[["name", "score"]]

# Statistics
df.describe()                 # Summary statistics for numeric columns
df.mean(numeric_only=True)    # Mean of each numeric column (skips "name")
df.corr(numeric_only=True)    # Correlation matrix of numeric columns

# Sorting
df_sorted = df.sort_values(by="score", ascending=False)
df_sorted_index = df.sort_index()

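# Iteration
for col in df.columns:
    print(col)
# iterrows() yields (index, row) pairs where row is a Series; it is slow on
# large DataFrames, so prefer vectorized column operations where possible
for index, row in df.iterrows():
    print(index, row["name"])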

When to Use

  • Tabular data analysis and manipulation
  • Loading and processing datasets
  • Data cleaning and transformation
  • Statistical analysis
  • Building ML feature matrices
  • Exporting data to various formats
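
Two of the use cases above, building a feature matrix and exporting data, can be sketched as follows. The column choice is an illustrative assumption; a real pipeline would pick and encode features according to the model.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "score": [85.5, 90.0, 78.5],
})

# Feature matrix: keep the numeric columns and convert to a NumPy array.
features = df[["age", "score"]].to_numpy()
print(features.shape)

# Export: with no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a file path to to_csv writes the same text to disk instead of returning it.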

Key Takeaways

  1. DataFrames are the primary data structure for pandas operations
  2. The .loc[] and .iloc[] indexers provide distinct access patterns: .loc[] selects by label, .iloc[] by integer position
  3. Method chaining enables readable data transformation pipelines
  4. Understanding copy vs view is crucial to avoid unintended modifications
  5. The inplace parameter, when set to True, modifies the DataFrame directly and makes the method return None
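
Takeaways 3 to 5 can be illustrated with a short sketch. The data and the pass threshold of 80 are assumptions chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85.5, 90.0, 78.5],
})

# Method chaining: each step returns a new DataFrame, so df is unchanged.
top = (
    df.assign(passed=lambda d: d["score"] >= 80)   # add a boolean column
      .loc[lambda d: d["passed"]]                  # keep passing rows only
      .sort_values(by="score", ascending=False)    # highest score first
      .reset_index(drop=True)                      # renumber rows from 0
)

# Explicit copy: changes to the copy do not reach the original.
backup = df.copy()
backup.loc[0, "score"] = 0.0

# inplace=True mutates df itself and the call returns None.
result = df.sort_values(by="score", inplace=True)
print(result)
```

Assigning the returned DataFrame (df = df.sort_values(by="score")) keeps the style consistent with method chaining and avoids accidentally working with None.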
