Python Data Types and Structures

Python's built-in data types are the foundation of every data science project. Choosing the right structure affects performance, readability, and correctness. This lesson covers every major type you will use daily.

Numeric Types

Integers

Integers are whole numbers with no decimal point. Python handles arbitrarily large integers without overflow.

# Integers in Python
population = 1_400_000_000  # Underscores for readability
negative = -42
binary = 0b1010  # Binary literal = 10
hex_val = 0xFF   # Hex literal = 255

print(type(population))  # <class 'int'>
print(population.bit_length())  # 31
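To make the "no overflow" claim concrete, here is a minimal sketch that computes values well beyond the 64-bit range. The variable names `big` and `factorial_25` are illustrative only.

```python
# Python integers grow as needed; there is no fixed 64-bit ceiling.
big = 2 ** 100
print(big)               # 1267650600228229401496703205376
print(big.bit_length())  # 101

# Arithmetic on large integers stays exact.
factorial_25 = 1
for n in range(1, 26):
    factorial_25 *= n
print(factorial_25 > 2 ** 64)  # True
```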

Floats

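Floats represent real numbers with a fractional part. Python stores them as IEEE 754 double-precision (64-bit binary) values, which gives roughly 15-17 significant decimal digits. Because the storage is binary, many decimal fractions such as 0.1 cannot be represented exactly.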

# Floats
pi = 3.14159265358979
avogadro = 6.022e23  # Scientific notation

# Beware of floating point precision
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# For precise decimals (financial), use the decimal module
from decimal import Decimal
a = Decimal("0.1") + Decimal("0.2")
print(a == Decimal("0.3"))  # True
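When exact decimal arithmetic is not required, a common alternative is to compare floats within a tolerance. The sketch below uses the standard library's `math.isclose`; the variable name `total` is illustrative.

```python
import math

total = 0.1 + 0.2

# Direct equality fails because of binary rounding error.
print(total == 0.3)              # False

# math.isclose compares within a relative tolerance (default rel_tol=1e-09).
print(math.isclose(total, 0.3))  # True

# When comparing against zero, supply an absolute tolerance,
# because a relative tolerance scaled by zero is always zero.
print(math.isclose(1e-12, 0.0))                # False
print(math.isclose(1e-12, 0.0, abs_tol=1e-9))  # True
```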

Complex Numbers

Used in signal processing and physics simulations.

z = 3 + 4j
print(z.real)       # 3.0
print(z.imag)       # 4.0
print(abs(z))       # 5.0  (magnitude)
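For signal-processing style work, the polar form (magnitude and phase) is often more useful than the real and imaginary parts. A small sketch using the standard library's `cmath` module, with illustrative names `r` and `phi`:

```python
import cmath

z = 3 + 4j

# Polar form: magnitude r and phase angle phi (in radians).
r, phi = cmath.polar(z)
print(r)              # 5.0
print(round(phi, 4))  # 0.9273

# The conjugate flips the sign of the imaginary part.
print(z.conjugate())      # (3-4j)
print(z * z.conjugate())  # (25+0j)
```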

Booleans

Booleans are a subclass of integers. True equals 1 and False equals 0.

is_active = True
has_data = False

# Boolean operations
print(is_active and has_data)   # False
print(is_active or has_data)    # True
print(not is_active)            # False

# Truthy and Falsy values
print(bool(0))       # False
print(bool(""))      # False
print(bool([]))      # False
print(bool(None))    # False
print(bool(42))      # True
print(bool("hello")) # True
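Because booleans are integers, they can be summed directly. The sketch below shows one way this is commonly used to count matches or compute a proportion; the list `flags` is made-up example data.

```python
flags = [True, False, True, True]

print(isinstance(True, int))  # True
print(True + True)            # 2

# Summing booleans counts the True values.
true_count = sum(flags)
print(true_count)             # 3

# Dividing by the length gives the fraction that are True.
true_share = true_count / len(flags)
print(true_share)             # 0.75
```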

Strings

Strings are immutable sequences of Unicode characters. You will use them extensively for text processing in data science.

name = "Data Science"
print(len(name))        # 12
print(name[0])          # 'D'
print(name[-1])         # 'e'
print(name[0:4])        # 'Data'

# Strings are immutable
# name[0] = "d"  # TypeError!

# Common methods
text = "  Hello, World!  "
print(text.strip())        # "Hello, World!"
print(text.lower())        # "  hello, world!  "
print(text.upper())        # "  HELLO, WORLD!  "
print(text.replace("World", "Python"))  # "  Hello, Python!  "
print(text.split(","))     # ['  Hello', ' World!  ']
print("-".join(["a", "b", "c"]))  # "a-b-c"
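Since strings are immutable, every method above returns a new string and leaves the original untouched. The following sketch chains two of those methods to normalize inconsistent labels; `raw_labels` is made-up example data.

```python
original = "  Hello  "
stripped = original.strip()
print(repr(original))  # '  Hello  '  (unchanged)
print(repr(stripped))  # 'Hello'

# Chaining methods to normalize messy category labels.
raw_labels = ["  New York ", "new york", "NEW YORK  "]
cleaned = [label.strip().lower() for label in raw_labels]
print(cleaned)  # ['new york', 'new york', 'new york']
```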

Lists

Lists are ordered, mutable sequences. They are the most versatile data structure in Python.

# Creating lists
numbers = [1, 2, 3, 4, 5]
mixed = [1, "hello", 3.14, True, None]
nested = [[1, 2], [3, 4], [5, 6]]
empty = []

# Accessing elements
print(numbers[0])    # 1 (first)
print(numbers[-1])   # 5 (last)
print(numbers[1:3])  # [2, 3] (slice)

# Modifying
numbers.append(6)         # Add to end
numbers.insert(0, 0)     # Insert at index
numbers.extend([7, 8])   # Add multiple
numbers.remove(3)        # Remove first occurrence
popped = numbers.pop()   # Remove and return last
del numbers[0]           # Delete by index

# List operations
a = [1, 2, 3]
b = [4, 5, 6]
print(a + b)       # [1, 2, 3, 4, 5, 6]
print(a * 2)       # [1, 2, 3, 1, 2, 3]
print(3 in a)      # True
print(len(a))      # 3

# Sorting
nums = [3, 1, 4, 1, 5, 9, 2, 6]
nums.sort()               # In-place sort
print(nums)               # [1, 1, 2, 3, 4, 5, 6, 9]
sorted_nums = sorted(nums, reverse=True)  # New sorted list
print(sorted_nums)        # [9, 6, 5, 4, 3, 2, 1, 1]
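Two details of sorting are worth seeing in code: `sorted()` accepts a `key` function, and `list.sort()` returns `None` because it modifies the list in place. A short sketch with made-up data:

```python
words = ["banana", "kiwi", "apple", "fig"]

# key=len sorts by string length instead of alphabetically.
by_length = sorted(words, key=len)
print(by_length)  # ['fig', 'kiwi', 'apple', 'banana']

# sort() works in place and returns None, so do not assign its result.
result = words.sort()
print(result)  # None
print(words)   # ['apple', 'banana', 'fig', 'kiwi']
```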

When to Use Lists

You need an ordered collection that changes over time.
You want to append, insert, or remove elements frequently.
You need duplicate values.
You want to iterate in insertion order.

Tuples

Tuples are ordered, immutable sequences. They are slightly cheaper to create and store than lists, and a tuple can be used as a dictionary key as long as every element it contains is hashable.

# Creating tuples
point = (3, 4)
color = (255, 128, 0)
single = (42,)    # Note the trailing comma for single-element tuple
not_a_tuple = (42)  # This is just the integer 42

# Accessing (same as lists)
print(point[0])      # 3
print(point[-1])     # 4

# Unpacking
x, y = point
print(f"x={x}, y={y}")  # x=3, y=4

# Multiple assignment
a, b, c = 1, 2, 3

# Swap variables
a, b = b, a

# Tuple methods
nums = (1, 2, 2, 3, 3, 3)
print(nums.count(3))   # 3
print(nums.index(2))   # 1

# Tuples as dictionary keys (lists cannot be keys)
location = {(40.7128, -74.0060): "New York", (51.5074, -0.1278): "London"}

When to Use Tuples

Data should not change (coordinates, RGB colors, database rows).
You need a hashable type (dictionary keys, set elements).
Performance matters – tuples are slightly faster than lists.
You want to enforce immutability as a design constraint.
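The hashability point can be checked directly. In this sketch, the dictionary `visits` and the coordinates are illustrative; the exact wording of the error message depends on the Python version, so it is printed rather than assumed.

```python
visits = {}

# A tuple of floats is hashable, so it works as a key.
visits[(40.7128, -74.0060)] = 1
print(visits)  # {(40.7128, -74.006): 1}

# A list is mutable and therefore unhashable.
try:
    visits[[40.7128, -74.0060]] = 1
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# A tuple is only hashable if all of its elements are.
try:
    hash((1, [2, 3]))
    nested_ok = True
except TypeError:
    nested_ok = False
print(nested_ok)  # False
```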

Dictionaries

Dictionaries store key-value pairs. They are the most important data structure for structured data work.

# Creating dictionaries
person = {"name": "Alice", "age": 30, "city": "New York"}
from_keys = dict.fromkeys(["a", "b", "c"], 0)  # {'a': 0, 'b': 0, 'c': 0}
empty_dict = {}

# Accessing
print(person["name"])           # "Alice"
print(person.get("salary", 0))  # 0 (default if key missing)

# Modifying
person["age"] = 31              # Update
person["email"] = "a@b.com"    # Add new key
del person["city"]              # Delete key

# Iterating
for key in person:
    print(key, person[key])

for key, value in person.items():
    print(f"{key}: {value}")

# Useful methods
print(person.keys())    # dict_keys(['name', 'age', 'email'])
print(person.values())  # dict_values(['Alice', 31, 'a@b.com'])
print("name" in person) # True

# Dictionary comprehension
squares = {x: x**2 for x in range(10)}
evens = {x: x**2 for x in range(10) if x % 2 == 0}
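A frequent use of `get()` with a default is counting occurrences. The sketch below assumes a small made-up list of labels:

```python
labels = ["cat", "dog", "cat", "bird", "dog", "cat"]

counts = {}
for label in labels:
    # get() returns 0 the first time a label is seen.
    counts[label] = counts.get(label, 0) + 1

print(counts)  # {'cat': 3, 'dog': 2, 'bird': 1}
```

The standard library's `collections.Counter` does the same job in one call; the explicit loop is shown here only to illustrate `get()`.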

Nested Dictionaries

students = {
    "alice": {"age": 22, "grades": [90, 85, 92]},
    "bob": {"age": 23, "grades": [78, 82, 88]},
}

# Access nested values
print(students["alice"]["grades"][0])  # 90
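Nested dictionaries combine naturally with `items()` and comprehensions. As a sketch, the code below computes each student's average grade and shows one way to look up a missing record without raising `KeyError`. The name "carol" is a hypothetical missing key.

```python
students = {
    "alice": {"age": 22, "grades": [90, 85, 92]},
    "bob": {"age": 23, "grades": [78, 82, 88]},
}

# Average grade per student.
averages = {
    name: sum(info["grades"]) / len(info["grades"])
    for name, info in students.items()
}
print(round(averages["alice"], 1))  # 89.0
print(round(averages["bob"], 1))    # 82.7

# Chained get() calls avoid a KeyError for a missing student.
missing_age = students.get("carol", {}).get("age")
print(missing_age)  # None
```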

When to Use Dictionaries

You need fast lookups by a unique key (O(1) average).
You represent structured records or JSON-like data.
You need to map one set of values to another.
Data science: column-based data, feature dictionaries, configuration.

Sets

Sets are unordered collections of unique elements. They are optimized for membership testing and set operations.

# Creating sets
fruits = {"apple", "banana", "cherry"}
numbers = set([1, 2, 2, 3, 3, 3])  # {1, 2, 3}
empty_set = set()  # NOT {} (that creates a dict)

# Adding and removing
fruits.add("date")
fruits.remove("banana")
fruits.discard("fig")  # No error if missing

# Set operations
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

print(a | b)   # Union: {1, 2, 3, 4, 5, 6}
print(a & b)   # Intersection: {3, 4}
print(a - b)   # Difference: {1, 2}
print(a ^ b)   # Symmetric difference: {1, 2, 5, 6}

# Membership testing (very fast)
print(3 in a)  # True

# Practical use: finding unique values
df_column = ["cat", "dog", "cat", "bird", "dog", "cat"]
unique_values = set(df_column)
print(unique_values)  # e.g. {'cat', 'dog', 'bird'} (set order is not guaranteed)

When to Use Sets

You need to remove duplicates quickly.
You need fast membership testing.
You need mathematical set operations (union, intersection, difference).
You are comparing two collections for overlap.
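As one example of comparing collections for overlap, the sketch below checks whether any record IDs appear in both a training and a test split. The ID values and variable names are made up for illustration.

```python
train_ids = {101, 102, 103, 104}
test_ids = {104, 105, 106}

# Intersection: IDs present in both splits.
leaked = train_ids & test_ids
print(leaked)  # {104}

# isdisjoint() is True only when there is no overlap at all.
print(train_ids.isdisjoint(test_ids))  # False

# Difference: IDs that appear only in the training split.
only_train = train_ids - test_ids
print(sorted(only_train))  # [101, 102, 103]
```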

Type Conversions

# Explicit conversion (casting)
int("42")         # 42
float("3.14")     # 3.14
str(100)          # "100"
list("abc")       # ['a', 'b', 'c']
tuple([1, 2, 3])  # (1, 2, 3)
set([1, 1, 2])    # {1, 2}
dict([("a", 1), ("b", 2)])  # {'a': 1, 'b': 2}

# Common pitfalls
# int("3.14")         # ValueError! Use int(float("3.14")) instead
print(float("3.14"))  # 3.14
print(int(3.14))      # 3 (truncates, does not round)
print(int(3.7))       # 3
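The pitfall above can be wrapped in a small helper. `to_int` is a hypothetical function written for this lesson, not a built-in; it falls back to `float()` when `int()` rejects the string, and it truncates rather than rounds.

```python
def to_int(text):
    """Convert a numeric string to int, truncating any fractional part."""
    try:
        return int(text)
    except ValueError:
        # Strings such as "3.14" must pass through float() first.
        return int(float(text))


print(to_int("42"))    # 42
print(to_int("3.14"))  # 3
print(to_int("-2.9"))  # -2 (truncates toward zero)
```

A non-numeric string such as "abc" still raises `ValueError`, which is usually the desired behavior when cleaning data.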

Type Conversion Rules

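Python converts between numeric types implicitly in mixed arithmetic and explicitly when you call a constructor. The rules can be summarized as follows.

Implicit Promotion Order:

bool → int → float → complex

In a mixed expression, the result takes the widest type among the operands.

Explicit Casting Rules:

int(x) truncates toward zero: it equals floor(x) for x ≥ 0 and ceil(x) for x < 0.
float(n) returns the nearest double to the integer n; the result is guaranteed exact for |n| ≤ 2^53.
bool(x) is False for zero, None, and empty built-in containers, and True otherwise.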
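Implicit promotion and explicit casting can both be observed directly in the interpreter. The sketch below mixes numeric types in arithmetic and contrasts `int()` with `round()`; all values are arbitrary examples.

```python
# Implicit promotion: the result takes the wider operand type.
print(type(True + 1))   # <class 'int'>
print(type(1 + 2.0))    # <class 'float'>
print(type(2.0 + 1j))   # <class 'complex'>

# Explicit casting: int() truncates toward zero.
print(int(3.7))    # 3
print(int(-3.7))   # -3

# round() rounds to the nearest integer, with ties going to the even neighbor.
print(round(2.5))  # 2
print(round(3.5))  # 4
```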

Choosing the Right Structure

Need ordered + mutable?        → List
Need ordered + immutable?      → Tuple
Need key-value pairs?          → Dict
Need unique values?            → Set
Need fast lookup by key?       → Dict
Need fast membership testing?  → Set
Need to enforce no duplicates? → Set

Performance Comparison

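import time

# Membership testing speed
large_list = list(range(1_000_000))
large_set = set(range(1_000_000))

# perf_counter() is a high-resolution clock intended for timing short intervals
start = time.perf_counter()
999_999 in large_list      # Worst case: scans the whole list
list_time = time.perf_counter() - start

start = time.perf_counter()
999_999 in large_set       # Hash lookup, constant time on average
set_time = time.perf_counter() - start

print(f"List: {list_time:.6f}s")   # Typically milliseconds (machine-dependent)
print(f"Set:  {set_time:.6f}s")    # Typically microseconds or less
# Sets are orders of magnitude faster for membership testing on large collections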
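A single timed lookup is noisy, so a more stable comparison repeats the operation many times. The sketch below uses the standard library's `timeit` module for that purpose; the collection size and repeat count are arbitrary choices, and absolute timings will vary by machine.

```python
import timeit

data_list = list(range(100_000))
data_set = set(data_list)
target = 99_999  # Last element: the worst case for a list scan

# Run each membership test 100 times and report the total elapsed time.
list_seconds = timeit.timeit(lambda: target in data_list, number=100)
set_seconds = timeit.timeit(lambda: target in data_set, number=100)

print(f"List: {list_seconds:.6f}s for 100 lookups")
print(f"Set:  {set_seconds:.6f}s for 100 lookups")
```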

Key Takeaways

Lists are your default ordered collection; use them when you need mutability.
Tuples protect data from accidental modification and work as dict keys.
Dictionaries are essential for structured data and fast lookups.
Sets are irreplaceable for deduplication and membership testing.
Always choose the structure that best matches your data's constraints and access patterns.
