Python Data Types & Structures Deep Dive

Module 1: FoundationsFree Lesson

Advertisement

Python Data Types & Structures Deep Dive

Why This Matters for Data Science

Before building machine learning models or analyzing datasets, you must fundamentally understand how Python organizes and stores data. Every data structure choice has computational consequencesβ€”a naive list operation might take 1000x longer than the equivalent NumPy array operation. This tutorial provides a rigorous exploration of Python's type system and data structures with focus on their performance characteristics and practical applications.


1. Primitive Data Types

1.1 The Type Hierarchy

Python's type system forms a hierarchy where every value is an object:

Architecture Diagram
object
β”œβ”€β”€ numbers.Number
β”‚   β”œβ”€β”€ int (arbitrary precision)
β”‚   β”œβ”€β”€ float (64-bit IEEE 754)
β”‚   └── complex
β”œβ”€β”€ bool (subclass of int)
β”œβ”€β”€ str (immutable sequence of Unicode codepoints)
β”œβ”€β”€ NoneType
└── collections.abc
    β”œβ”€β”€ Sequence β†’ list, tuple, range
    β”œβ”€β”€ Set β†’ set, frozenset
    └── Mapping β†’ dict

DfArbitrary Precision Integers

A number representation that can express any integer with exact precision, limited only by available memory. Unlike fixed-width integers (e.g., 32-bit or 64-bit), arbitrary precision integers grow dynamically to accommodate larger values, eliminating overflow errors in cryptographic and financial computations.

Python's int type uses a base-2³⁰ representation internally (on 64-bit systems), where each "digit" stores 30 bits. This means memory usage scales as O(log n) bits for an integer n, making it efficient for very large numbers compared to naive string-based representations.

1.2 Integer (int)

Python integers have arbitrary precisionβ€”they grow as large as memory allows.

# Memory representation
import sys

small_int = 42
big_int = 2 ** 1000

print(f"Small int: {small_int}")
print(f"  Bytes: {sys.getsizeof(small_int)}")
print(f"  Type: {type(small_int)}")
print()
print(f"Big int: {big_int}")
print(f"  Bytes: {sys.getsizeof(big_int)}")
print(f"  Digits: {len(str(big_int))}")

Output:

Architecture Diagram
Small int: 42
  Bytes: 28
  Type: <class 'int'>

Big int: 10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
  Digits: 302

Integer Memory Complexity

O(log⁑2n) bits for integer nO(\log_2 n) \text{ bits for integer } n

Here,

  • =The integer value being stored
  • =Total bits required for storage

Why arbitrary precision matters: Financial calculations, cryptography, and scientific computing require exact integer arithmetic without overflow.

1.3 Float (float)

Python floats are 64-bit IEEE 754 double-precision numbers.

DfIEEE 754 Double Precision

A binary floating-point format that uses 64 bits: 1 sign bit, 11 exponent bits (biased by 1023), and 52 mantissa bits (with an implicit leading 1). This gives approximately 15-17 significant decimal digits of precision.

IEEE 754 Value Formula

(βˆ’1)sΓ—2eβˆ’1023Γ—(1+m252)(-1)^s \times 2^{e - 1023} \times (1 + \frac{m}{2^{52}})

Here,

  • =Sign bit (0 = positive, 1 = negative)
  • =Exponent (11 bits, range 0-2047)
  • =Mantissa (52 bits, fractional part)
import struct

value = 3.14159

# IEEE 754 representation
bytes_repr = struct.pack('d', value)
bits = ''.join(format(b, '08b') for b in bytes_repr)

print(f"Value: {value}")
print(f"IEEE 754 bits: {bits}")
print(f"  Sign:     {bits[0]} ({'negative' if bits[0] == '1' else 'positive'})")
print(f"  Exponent: {bits[1:12]} = {int(bits[1:12], 2) - 1023}")
print(f"  Mantissa: {bits[12:]}")
print(f"\nFloat precision limit:")
print(f"  0.1 + 0.2 = {0.1 + 0.2}")
print(f"  0.1 + 0.2 == 0.3? {0.1 + 0.2 == 0.3}")
print(f"  Decimal comparison: {round(0.1 + 0.2, 10) == 0.3}")

Output:

Architecture Diagram
Value: 3.14159
IEEE 754 bits: 0100000000001001001000011111101101010100010001000010110111111000
  Sign:     0 (positive)
  Exponent: 10000000000 = 1
  Mantissa: 1001001000011111101101010100010001000010110111111000

Float precision limit:
  0.1 + 0.2 = 0.30000000000000004
  0.1 + 0.2 == 0.3? False
  Decimal comparison: True

The floating-point error in 0.1 + 0.2 arises because 0.1 and 0.2 cannot be exactly represented in binary. The value 0.1 in binary is a repeating fraction (0.0001100110011...), similar to how 1/3 = 0.333... in decimal. This is why financial applications should use the decimal module or integer arithmetic with cents.

Critical insight: Floating-point arithmetic is inherently approximate. Always use math.isclose() for float comparisons:

Floating-Point Comparison

∣aβˆ’bβˆ£β‰€max⁑(rel_tolΓ—max⁑(∣a∣,∣b∣),abs_tol)|a - b| \leq \max(\text{rel\_tol} \times \max(|a|, |b|), \text{abs\_tol})

Here,

  • =Values to compare
  • =Relative tolerance (default 1e-9)
  • =Absolute tolerance (default 0.0)
import math

a = 0.1 + 0.2
b = 0.3

# WRONG
print(f"Direct comparison: {a == b}")

# RIGHT
print(f"Close comparison: {math.isclose(a, b)}")
print(f"With tolerance:    {math.isclose(a, b, rel_tol=1e-9)}")

1.4 Boolean (bool)

Booleans are subclasses of integers with two values: True (1) and False (0).

# Boolean arithmetic
print(f"True + True = {True + True}")
print(f"True * 10 = {True * 10}")
print(f"False * 100 = {False * 100}")
print(f"True == 1: {True == 1}")
print(f"True is 1: {True is 1}")
print(f"\nIdentity vs Equality:")
print(f"  bool(1) == True: {bool(1) == True}")
print(f"  bool(1) is True: {bool(1) is True}")
print(f"  bool(0) == False: {bool(0) == False}")
print(f"  bool(0) is False: {bool(0) is False}")

1.5 String (str)

Strings in Python 3 are immutable sequences of Unicode codepoints.

DfUnicode Codepoint

A unique number assigned to each character in the Unicode standard, ranging from U+0000 to U+10FFFF. Python strings store these codepoints as integers, with memory usage varying by the maximum codepoint in the string (1-4 bytes per character in CPython's flexible representation).

s = "Hello, Data Science!"

print(f"String: {s}")
print(f"Length: {len(s)}")
print(f"First char: {s[0]}")
print(f"Last char: {s[-1]}")
print(f"Substring: {s[7:21]}")
print(f"Reversed: {s[::-1]}")
print(f"Memory per char: {sys.getsizeof('a')} bytes")
print(f"Total string memory: {sys.getsizeof(s)} bytes (includes overhead)")
print(f"\nImmutability demonstration:")
try:
    s[0] = 'h'
except TypeError as e:
    print(f"  Cannot modify: {e}")
print(f"  Must create new: {'h' + s[1:]}")

2. Type Casting and Type Hints

2.1 Why Type Casting Matters

Data scientists frequently encounter type mismatches when loading data:

# Common data type issues in data science
raw_data = ["1", "2.5", "3", "invalid", "4.0"]

# These will fail or give unexpected results
print("String to int (with validation):")
clean_ints = []
for x in raw_data:
    try:
        clean_ints.append(int(x))
    except ValueError:
        print(f"  Skipping '{x}' - not convertible to int")

print(f"Result: {clean_ints}")

# Float conversion is more forgiving
print("\nString to float (with validation):")
clean_floats = []
for x in raw_data:
    try:
        clean_floats.append(float(x))
    except ValueError:
        print(f"  Skipping '{x}' - not convertible to float")

print(f"Result: {clean_floats}")

Output:

Architecture Diagram
String to int (with validation):
  Skipping '2.5' - not convertible to int
  Skipping 'invalid' - not convertible to int
  Skipping '4.0' - not convertible to int
Result: [1, 3]

String to float (with validation):
  Skipping 'invalid' - not convertible to float
Result: [1.0, 2.5, 3.0, 4.0]

2.2 Type Hints for Data Science

Type hints improve code readability and enable static analysis:

Type hints don't affect runtime performance in Pythonβ€”they are metadata for static analysis tools like mypy. However, they are critical for large data science codebases where understanding data flow between functions is essential for debugging and collaboration.

from typing import List, Dict, Optional, Union, Tuple
import numpy as np

# Function with proper type hints
def compute_statistics(
    data: List[float],
    weights: Optional[List[float]] = None
) -> Dict[str, float]:
    """
    Compute weighted mean and standard deviation.
    
    Args:
        data: List of numerical values
        weights: Optional weights for each value
        
    Returns:
        Dictionary with 'mean' and 'std' keys
    """
    if weights is None:
        weights = [1.0] * len(data)
    
    mean = sum(d * w for d, w in zip(data, weights)) / sum(weights)
    variance = sum(w * (d - mean) ** 2 for d, w in zip(data, weights)) / sum(weights)
    
    return {
        "mean": mean,
        "std": variance ** 0.5
    }

# Usage with type checking
data = [1.0, 2.0, 3.0, 4.0, 5.0]
result = compute_statistics(data)
print(f"Statistics: {result}")
print(f"Type: {type(result)}")

3. Lists vs Tuples vs Arrays

3.1 The Fundamental Difference

Architecture Diagram
Memory Layout Comparison:

LIST (mutable):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ pointer  β”‚ pointer  β”‚ pointer  β”‚  ← List object (PyObject*)
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
     β”‚          β”‚          β”‚
     β–Ό          β–Ό          β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ int(1) β”‚ β”‚ int(2) β”‚ β”‚ int(3) β”‚  ← Individual int objects
β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜

TUPLE (immutable):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ int(1)   β”‚ int(2)   β”‚ int(3)   β”‚  ← Values stored inline
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

NUMPY ARRAY (contiguous):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ int(1) β”‚ int(2) β”‚ int(3)    β”‚  ← Contiguous C array
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

DfCache Locality

The tendency of a processor to access the same set of memory locations repetitively over a short period. Arrays have high spatial locality because elements are stored contiguously, allowing the CPU to prefetch data efficiently. Lists have poor locality due to pointer indirection.

3.2 Performance Comparison

import time
import numpy as np

def benchmark_list(n):
    start = time.perf_counter()
    lst = list(range(n))
    for i in range(n):
        lst[i] = lst[i] * 2
    return time.perf_counter() - start

def benchmark_tuple(n):
    start = time.perf_counter()
    tup = tuple(range(n))
    tup = tuple(x * 2 for x in tup)  # Must create new tuple
    return time.perf_counter() - start

def benchmark_numpy(n):
    start = time.perf_counter()
    arr = np.arange(n)
    arr = arr * 2  # Vectorized operation
    return time.perf_counter() - start

n = 1_000_000
print(f"Benchmarking with n={n:,}:")
print(f"  List:   {benchmark_list(n):.4f}s")
print(f"  Tuple:  {benchmark_tuple(n):.4f}s")
print(f"  NumPy:  {benchmark_numpy(n):.4f}s")

Output (typical):

Architecture Diagram
Benchmarking with n=1,000,000:
  List:   0.0892s
  Tuple:  0.1245s
  NumPy:  0.0043s

Vectorized Operation Speedup

Tvectorizedβ‰ˆTloopcβ‹…nT_{\text{vectorized}} \approx \frac{T_{\text{loop}}}{c \cdot n}

Here,

  • =Time for vectorized NumPy operation
  • =Time for equivalent Python loop
  • =CPU cache and SIMD optimization factor (typically 10-100)
  • =Number of elements

3.3 Time Complexity Table

OperationListTupleNumPy ArrayDictSet
Index access x[i]O(1)O(1)O(1)O(1)N/A
AppendO(1)N/AO(n)*O(1)O(1)
Insert x.insert(i,v)O(n)N/AO(n)N/AN/A
Delete del x[i]O(n)N/AO(n)O(1)O(1)
Search v in xO(n)O(n)O(n)O(1)O(1)
Pop from endO(1)O(1)O(1)O(1)N/A
Pop from frontO(n)O(n)O(n)N/AN/A
ConcatenationO(k)O(k)N/AN/AN/A
Slice x[a:b]O(k)O(k)O(k)N/AN/A

*NumPy arrays have amortized O(1) append when pre-allocated

The O(1) amortized append for lists and sets relies on over-allocation: Python allocates more memory than immediately needed, so most appends don't trigger reallocation. The amortized cost is O(1) because the occasional O(n) reallocation is spread across O(n) appends.

3.4 When to Use Each

import numpy as np

# USE LIST when:
# - You need mixed types
# - You need to append frequently
# - You're building data incrementally
data_list = [1, "hello", 3.14, [1, 2, 3]]

# USE TUPLE when:
# - Data should not change (coordinates, RGB values)
# - Dictionary keys (hashable)
# - Function returns with multiple values
point = (3.0, 4.0)
rgb = (255, 128, 0)

# USE NUMPY ARRAY when:
# - Numerical computation on large datasets
# - Vectorized operations needed
# - Memory efficiency matters
data_array = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Demonstrate hashability
try:
    hash(point)
    print(f"Tuples are hashable: hash({point}) = {hash(point)}")
except TypeError as e:
    print(f"Lists are NOT hashable: {e}")

# Demonstrate vectorization
print(f"\nVectorized operations:")
print(f"  data_array * 2 = {data_array * 2}")
print(f"  data_array ** 2 = {data_array ** 2}")
print(f"  np.mean(data_array) = {np.mean(data_array)}")

4. Dictionaries: Hash Tables

4.1 How Hash Tables Work

DfHash Function

A deterministic function h: K β†’ {0, 1, ..., m-1} that maps keys to array indices. A good hash function distributes keys uniformly across the index space, minimizing collisions. Python's hash() function uses SipHash on CPython 3.7+ for security against hash-flooding attacks.

Hash Table Load Factor

Ξ±=nm\alpha = \frac{n}{m}

Here,

  • =Load factor (ratio of entries to buckets)
  • =Number of entries in the table
  • =Number of buckets (table size)

Python maintains a load factor Ξ± ≀ 2/3 for dictionaries. When exceeded, the table is resized (typically doubled) and all entries are rehashed. This ensures O(1) average-case lookup with a worst-case of O(n) during resizing.

Architecture Diagram
Dictionary Structure (simplified):

Key: "name" β†’ hash("name") = 946783568
     ↓
Index: 946783568 % 8 = 0

β”Œβ”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Index β”‚ Bucket                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   0   β”‚ ("name", "Alice") ──────────┼──→ ("age", 30)
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   1   β”‚ None                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   2   β”‚ ("city", "NYC")             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   3   β”‚ None                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   4   β”‚ ("email", "a@b.com") ───────┼──→ ("phone", "555-0123")
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   5   β”‚ None                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   6   β”‚ None                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   7   β”‚ None                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Hash collisions are resolved by chaining (linked lists).

4.2 Dictionary Operations

import sys

# Create dictionary
person = {
    "name": "Alice Johnson",
    "age": 30,
    "city": "New York",
    "skills": ["Python", "ML", "Statistics"]
}

print(f"Dictionary: {person}")
print(f"Memory: {sys.getsizeof(person)} bytes")
print()

# O(1) lookup
print("O(1) Lookups:")
print(f"  person['name'] = {person['name']}")
print(f"  person.get('age', 'N/A') = {person.get('age', 'N/A')}")
print(f"  person.get('salary', 'N/A') = {person.get('salary', 'N/A')}")
print()

# Dictionary comprehension
squares = {x: x**2 for x in range(10)}
print(f"Comprehension: {squares}")
print()

# Nested dictionary access
nested = {
    "user": {
        "profile": {
            "name": "Bob",
            "scores": [95, 87, 92]
        }
    }
}

# Safe nested access
def get_nested(d, keys, default=None):
    """Safely access nested dictionary keys."""
    current = d
    for key in keys:
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current

print(f"Nested access: {get_nested(nested, ['user', 'profile', 'name'])}")
print(f"Missing key:   {get_nested(nested, ['user', 'address', 'city'])}")

4.3 Performance Characteristics

import time

def lookup_benchmark(n):
    """Benchmark dictionary vs list lookup."""
    # Create data
    data_dict = {f"key_{i}": i for i in range(n)}
    data_list = [(f"key_{i}", i) for i in range(n)]
    
    search_key = f"key_{n-1}"
    
    # Dict lookup
    start = time.perf_counter()
    for _ in range(1000):
        _ = data_dict[search_key]
    dict_time = time.perf_counter() - start
    
    # List search
    start = time.perf_counter()
    for _ in range(1000):
        _ = [x for x in data_list if x[0] == search_key]
    list_time = time.perf_counter() - start
    
    return dict_time, list_time

n = 100_000
dict_t, list_t = lookup_benchmark(n)
print(f"Lookup benchmark (n={n:,}, 1000 iterations):")
print(f"  Dict: {dict_t:.4f}s")
print(f"  List: {list_t:.4f}s")
print(f"  Speedup: {list_t/dict_t:.1f}x faster with dict")
O(1)Β average,Β O(n)Β worstΒ caseΒ (allΒ collisions)O(1) \text{ average, } O(n) \text{ worst case (all collisions)}

The dictionary's O(1) average lookup time is achieved through hash-based indexing with collision resolution, making it the foundation of Python's data model.


5. Sets: Membership Testing and Deduplication

5.1 Set Operations

# Create sets
skills_ml = {"Python", "TensorFlow", "Scikit-learn", "Pandas"}
skills_data = {"Python", "SQL", "Pandas", "Tableau", "Excel"}

# Set operations
print("Set Operations:")
print(f"  Union:        {skills_ml | skills_data}")
print(f"  Intersection: {skills_ml & skills_data}")
print(f"  Difference:   {skills_ml - skills_data}")
print(f"  Symmetric:    {skills_ml ^ skills_data}")
print()

# Deduplication
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
unique = list(set(data))
print(f"Original: {data}")
print(f"Unique:   {sorted(unique)}")
print(f"Count:    {len(data)} β†’ {len(unique)}")
print()

# Membership testing benchmark
import time

large_list = list(range(1_000_000))
large_set = set(range(1_000_000))

# Test membership
test_value = 999_999

start = time.perf_counter()
for _ in range(100):
    _ = test_value in large_list
list_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(100):
    _ = test_value in large_set
set_time = time.perf_counter() - start

print(f"Membership test (1M elements, 100 iterations):")
print(f"  List: {list_time:.4f}s")
print(f"  Set:  {set_time:.4f}s")
print(f"  Speedup: {list_time/set_time:.1f}x faster with set")

Set Membership Complexity

O(1)Β forΒ sets,Β O(n)Β forΒ listsO(1) \text{ for sets, } O(n) \text{ for lists}

Here,

  • =Hash-based, O(1) average lookup
  • =Linear scan, O(n) worst case

6. Mutable vs Immutable

6.1 Why Immutability Matters

Architecture Diagram
Mutable Object (list):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ id: 14023456 │──→ [1, 2, 3, 4, 5]
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     ↓ modified in place
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ id: 14023456 │──→ [1, 2, 3, 4, 5, 6]  ← Same object
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Immutable Object (tuple):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ id: 14023456 │──→ (1, 2, 3, 4, 5)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     ↓ "modification" creates new object
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ id: 14023520 │──→ (1, 2, 3, 4, 5, 6)  ← Different object
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Immutability enables Python to optimize memory through object interning (caching). Small integers (-5 to 256) and some strings are cached, meaning multiple variables can reference the same object in memory. This is why is and == can behave differently for immutable types.

import sys

# Mutable: in-place modification
list_a = [1, 2, 3]
list_b = list_a  # Both point to same object

print("Mutable (list):")
print(f"  list_a id: {id(list_a)}")
print(f"  list_b id: {id(list_b)}")
print(f"  Same object: {list_a is list_b}")

list_b.append(4)
print(f"  After append(4) to list_b:")
print(f"  list_a: {list_a}")  # Also changed!
print(f"  list_b: {list_b}")
print()

# Immutable: new object created
tuple_a = (1, 2, 3)
tuple_b = tuple_a  # Both point to same object

print("Immutable (tuple):")
print(f"  tuple_a id: {id(tuple_a)}")
print(f"  tuple_b id: {id(tuple_b)}")
print(f"  Same object: {tuple_a is tuple_b}")

tuple_b = tuple_b + (4,)  # Creates new tuple
print(f"  After concatenation:")
print(f"  tuple_a: {tuple_a}")  # Unchanged!
print(f"  tuple_b: {tuple_b}")
print(f"  Same object now: {tuple_a is tuple_b}")
print()

# Immutability enables safe dictionary keys
location_data = {
    (40.7128, -74.0060): "New York",
    (34.0522, -118.2437): "Los Angeles",
    (41.8781, -87.6298): "Chicago"
}

print("Dictionary with tuple keys:")
for coords, city in location_data.items():
    print(f"  {coords}: {city}")

6.2 String Interning (Caching)

# Python caches small integers and some strings
a = "hello"
b = "hello"
c = "".join(["h", "e", "l", "l", "o"])

print(f"a is b: {a is b}")  # True (interned)
print(f"a is c: {a is c}")  # False (created dynamically)
print(f"a == c: {a == c}")  # True (value equality)
print()

# Integer caching (-5 to 256)
x = 256
y = 256
print(f"256 is 256: {x is y}")  # True (cached)

x = 257
y = 257
print(f"257 is 257: {x is y}")  # False (not cached)

7. Memory Layout Visualization

7.1 Stack vs Heap

Architecture Diagram
Memory Layout:

STACK (fast, limited size):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Frame: main()                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚  β”‚ x = 5   β”‚ β”‚ y = 3.14β”‚ β”‚ s = ────┼──┐    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚    β”‚
β”‚  β”‚ flag=Trueβ”‚ β”‚ temp = ─┼────┐        β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚        β”‚    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”‚β”€β”€β”€β”€β”€β”€β”€β”€β”‚β”€β”€β”€β”€β”˜
                               β”‚        β”‚
HEAP (slow, large):            β”‚        β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”‚β”€β”€β”€β”€β”€β”€β”€β”€β”‚β”€β”€β”€β”€β”
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                 β”‚        β”‚    β”‚
β”‚  β”‚ "Hello" β”‚ β†β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                          β”‚    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                          β”‚    β”‚
β”‚  β”‚ [1,2,3] β”‚ β†β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                               β”‚
β”‚  β”‚ {"a": 1}β”‚                               β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

7.2 Reference Counting

Reference Counting

refcount(obj)=βˆ‘allΒ references1\text{refcount}(obj) = \sum_{\text{all references}} 1

Here,

  • =Number of references to an object
  • =The Python object being referenced
import sys

# Reference counting
a = [1, 2, 3]
print(f"Initial refcount: {sys.getrefcount(a) - 1}")  # -1 for getrefcount arg

b = a  # Increases refcount
print(f"After b = a: {sys.getrefcount(a) - 1}")

c = a  # Increases refcount
print(f"After c = a: {sys.getrefcount(a) - 1}")

del b  # Decreases refcount
print(f"After del b: {sys.getrefcount(a) - 1}")

del c  # Decreases refcount
print(f"After del c: {sys.getrefcount(a) - 1}")

# Circular reference detection
class Node:
    def __init__(self):
        self.reference = None

x = Node()
y = Node()
x.reference = y
y.reference = x  # Circular reference

print(f"\nCircular reference: x→y→x")
print(f"  x refcount (includes getrefcount): {sys.getrefcount(x) - 1}")
print(f"  y refcount (includes getrefcount): {sys.getrefcount(y) - 1}")
print(f"  Garbage collector handles circular refs")

8. Complete Code Example: Data Science Utilities

import numpy as np
from typing import List, Dict, Any, Optional
from collections import defaultdict

class DataScienceToolkit:
    """Practical data structure utilities for data science."""
    
    @staticmethod
    def safe_convert(value: Any, target_type: type, default: Any = None) -> Any:
        """Safely convert value to target type."""
        try:
            return target_type(value)
        except (ValueError, TypeError):
            return default
    
    @staticmethod
    def flatten_dict(d: Dict, parent_key: str = '', sep: str = '.') -> Dict:
        """Flatten nested dictionary."""
        items = []
        for k, v in d.items():
            new_key = f"{parent_key}{sep}{k}" if parent_key else k
            if isinstance(v, dict):
                items.extend(DataScienceToolkit.flatten_dict(v, new_key, sep).items())
            else:
                items.append((new_key, v))
        return dict(items)
    
    @staticmethod
    def chunk_list(lst: List, chunk_size: int) -> List[List]:
        """Split list into chunks."""
        return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]
    
    @staticmethod
    def weighted_average(values: List[float], weights: List[float]) -> float:
        """Compute weighted average."""
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)
    
    @staticmethod
    def detect_outliers_iqr(data: List[float], factor: float = 1.5) -> List[int]:
        """Detect outliers using IQR method."""
        sorted_data = sorted(data)
        n = len(sorted_data)
        q1 = sorted_data[n // 4]
        q3 = sorted_data[3 * n // 4]
        iqr = q3 - q1
        lower = q1 - factor * iqr
        upper = q3 + factor * iqr
        return [i for i, x in enumerate(data) if x < lower or x > upper]

# Usage example
toolkit = DataScienceToolkit()

# Type conversion
print("Safe type conversion:")
print(f"  '42' β†’ int: {toolkit.safe_convert('42', int)}")
print(f"  'abc' β†’ int: {toolkit.safe_convert('abc', int, default=0)}")
print(f"  '3.14' β†’ float: {toolkit.safe_convert('3.14', float)}")
print()

# Dictionary flattening
nested = {
    "user": {
        "profile": {
            "name": "Alice",
            "age": 30
        },
        "settings": {
            "theme": "dark",
            "notifications": True
        }
    }
}
flattened = toolkit.flatten_dict(nested)
print("Flattened dictionary:")
for key, value in flattened.items():
    print(f"  {key}: {value}")
print()

# List chunking
data = list(range(1, 11))
chunks = toolkit.chunk_list(data, 3)
print(f"Chunked list ({data}):")
for i, chunk in enumerate(chunks):
    print(f"  Chunk {i}: {chunk}")
print()

# Outlier detection
np.random.seed(42)
normal_data = np.random.normal(0, 1, 100).tolist()
normal_data.extend([10, -10, 15])  # Add outliers
outliers = toolkit.detect_outliers_iqr(normal_data)
print(f"Outlier detection: found {len(outliers)} outliers")
print(f"  Outlier indices: {outliers}")
print(f"  Outlier values: {[normal_data[i] for i in outliers]}")

Key Takeaways

πŸ“‹Summary: Python Data Types & Structures

  1. Type Precision: Python integers have arbitrary precision; floats are approximate (IEEE 754). Always use math.isclose() for float comparisons.
  2. Data Structure Choice Matters: Lists for mutable collections, tuples for fixed data, NumPy arrays for numerical work, dicts for O(1) lookup, sets for deduplication.
  3. Memory Efficiency: Tuples are more memory-efficient than lists; NumPy arrays store data contiguously (cache-friendly).
  4. Immutability Benefits: Immutable objects are hashable (can be dict keys/set members), thread-safe, and enable Python's internal optimizations.
  5. Type Hints: Improve code clarity and enable static analysis tools like mypy.

Practice Exercises

Exercise 1: Performance Analysis

Write a function that benchmarks list vs dictionary lookup for finding an element in a collection of 1 million items. Include timing and discuss the results.

Exercise 2: Nested Data Processing

Given a nested dictionary representing a student database, write a function to:

  1. Flatten the dictionary
  2. Extract all grades
  3. Calculate statistics (mean, median, std)
  4. Identify students with grades below threshold

Exercise 3: Memory Optimization

Create a data structure that stores 10,000 records with name, age, and email. Compare memory usage between:

  • List of dictionaries
  • List of tuples
  • NumPy structured array
  • Pandas DataFrame

Discuss which approach is most memory-efficient and why.

Exercise 4: Type Conversion Pipeline

Build a robust type conversion function that handles:

  • Mixed types in input
  • Missing values (None, NaN)
  • Invalid conversions
  • Logging of conversion failures

Test with a dataset containing 1000 mixed-type values.

Exercise 5: Set Operations for Data Analysis

Using set operations, write a function to find:

  • Common features between two datasets
  • Features unique to each dataset
  • The symmetric difference Discuss why set operations are preferred over list operations for this use case.

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement