Python Data Types & Structures Deep Dive
Why This Matters for Data Science
Before building machine learning models or analyzing datasets, you must fundamentally understand how Python organizes and stores data. Every data structure choice has computational consequencesβa naive list operation might take 1000x longer than the equivalent NumPy array operation. This tutorial provides a rigorous exploration of Python's type system and data structures with focus on their performance characteristics and practical applications.
1. Primitive Data Types
1.1 The Type Hierarchy
Python's type system forms a hierarchy where every value is an object:
object
βββ numbers.Number
β βββ int (arbitrary precision)
β βββ float (64-bit IEEE 754)
β βββ complex
βββ bool (subclass of int)
βββ str (immutable sequence of Unicode codepoints)
βββ NoneType
βββ collections.abc
βββ Sequence β list, tuple, range
βββ Set β set, frozenset
βββ Mapping β dict
DfArbitrary Precision Integers
A number representation that can express any integer with exact precision, limited only by available memory. Unlike fixed-width integers (e.g., 32-bit or 64-bit), arbitrary precision integers grow dynamically to accommodate larger values, eliminating overflow errors in cryptographic and financial computations.
Python's int type uses a base-2Β³β° representation internally (on 64-bit systems), where each "digit" stores 30 bits. This means memory usage scales as O(log n) bits for an integer n, making it efficient for very large numbers compared to naive string-based representations.
1.2 Integer (int)
Python integers have arbitrary precisionβthey grow as large as memory allows.
# Memory representation
import sys
small_int = 42
big_int = 2 ** 1000
print(f"Small int: {small_int}")
print(f" Bytes: {sys.getsizeof(small_int)}")
print(f" Type: {type(small_int)}")
print()
print(f"Big int: {big_int}")
print(f" Bytes: {sys.getsizeof(big_int)}")
print(f" Digits: {len(str(big_int))}")
Output:
Small int: 42
Bytes: 28
Type: <class 'int'>
Big int: 10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
Digits: 302
Integer Memory Complexity
Here,
- =The integer value being stored
- =Total bits required for storage
Why arbitrary precision matters: Financial calculations, cryptography, and scientific computing require exact integer arithmetic without overflow.
1.3 Float (float)
Python floats are 64-bit IEEE 754 double-precision numbers.
DfIEEE 754 Double Precision
A binary floating-point format that uses 64 bits: 1 sign bit, 11 exponent bits (biased by 1023), and 52 mantissa bits (with an implicit leading 1). This gives approximately 15-17 significant decimal digits of precision.
IEEE 754 Value Formula
Here,
- =Sign bit (0 = positive, 1 = negative)
- =Exponent (11 bits, range 0-2047)
- =Mantissa (52 bits, fractional part)
import struct
value = 3.14159
# IEEE 754 representation
bytes_repr = struct.pack('d', value)
bits = ''.join(format(b, '08b') for b in bytes_repr)
print(f"Value: {value}")
print(f"IEEE 754 bits: {bits}")
print(f" Sign: {bits[0]} ({'negative' if bits[0] == '1' else 'positive'})")
print(f" Exponent: {bits[1:12]} = {int(bits[1:12], 2) - 1023}")
print(f" Mantissa: {bits[12:]}")
print(f"\nFloat precision limit:")
print(f" 0.1 + 0.2 = {0.1 + 0.2}")
print(f" 0.1 + 0.2 == 0.3? {0.1 + 0.2 == 0.3}")
print(f" Decimal comparison: {round(0.1 + 0.2, 10) == 0.3}")
Output:
Value: 3.14159
IEEE 754 bits: 0100000000001001001000011111101101010100010001000010110111111000
Sign: 0 (positive)
Exponent: 10000000000 = 1
Mantissa: 1001001000011111101101010100010001000010110111111000
Float precision limit:
0.1 + 0.2 = 0.30000000000000004
0.1 + 0.2 == 0.3? False
Decimal comparison: True
The floating-point error in 0.1 + 0.2 arises because 0.1 and 0.2 cannot be exactly represented in binary. The value 0.1 in binary is a repeating fraction (0.0001100110011...), similar to how 1/3 = 0.333... in decimal. This is why financial applications should use the decimal module or integer arithmetic with cents.
Critical insight: Floating-point arithmetic is inherently approximate. Always use math.isclose() for float comparisons:
Floating-Point Comparison
Here,
- =Values to compare
- =Relative tolerance (default 1e-9)
- =Absolute tolerance (default 0.0)
import math
a = 0.1 + 0.2
b = 0.3
# WRONG
print(f"Direct comparison: {a == b}")
# RIGHT
print(f"Close comparison: {math.isclose(a, b)}")
print(f"With tolerance: {math.isclose(a, b, rel_tol=1e-9)}")
1.4 Boolean (bool)
Booleans are subclasses of integers with two values: True (1) and False (0).
# Boolean arithmetic
print(f"True + True = {True + True}")
print(f"True * 10 = {True * 10}")
print(f"False * 100 = {False * 100}")
print(f"True == 1: {True == 1}")
print(f"True is 1: {True is 1}")
print(f"\nIdentity vs Equality:")
print(f" bool(1) == True: {bool(1) == True}")
print(f" bool(1) is True: {bool(1) is True}")
print(f" bool(0) == False: {bool(0) == False}")
print(f" bool(0) is False: {bool(0) is False}")
1.5 String (str)
Strings in Python 3 are immutable sequences of Unicode codepoints.
DfUnicode Codepoint
A unique number assigned to each character in the Unicode standard, ranging from U+0000 to U+10FFFF. Python strings store these codepoints as integers, with memory usage varying by the maximum codepoint in the string (1-4 bytes per character in CPython's flexible representation).
s = "Hello, Data Science!"
print(f"String: {s}")
print(f"Length: {len(s)}")
print(f"First char: {s[0]}")
print(f"Last char: {s[-1]}")
print(f"Substring: {s[7:21]}")
print(f"Reversed: {s[::-1]}")
print(f"Memory per char: {sys.getsizeof('a')} bytes")
print(f"Total string memory: {sys.getsizeof(s)} bytes (includes overhead)")
print(f"\nImmutability demonstration:")
try:
s[0] = 'h'
except TypeError as e:
print(f" Cannot modify: {e}")
print(f" Must create new: {'h' + s[1:]}")
2. Type Casting and Type Hints
2.1 Why Type Casting Matters
Data scientists frequently encounter type mismatches when loading data:
# Common data type issues in data science
raw_data = ["1", "2.5", "3", "invalid", "4.0"]
# These will fail or give unexpected results
print("String to int (with validation):")
clean_ints = []
for x in raw_data:
try:
clean_ints.append(int(x))
except ValueError:
print(f" Skipping '{x}' - not convertible to int")
print(f"Result: {clean_ints}")
# Float conversion is more forgiving
print("\nString to float (with validation):")
clean_floats = []
for x in raw_data:
try:
clean_floats.append(float(x))
except ValueError:
print(f" Skipping '{x}' - not convertible to float")
print(f"Result: {clean_floats}")
Output:
String to int (with validation):
Skipping '2.5' - not convertible to int
Skipping 'invalid' - not convertible to int
Skipping '4.0' - not convertible to int
Result: [1, 3]
String to float (with validation):
Skipping 'invalid' - not convertible to float
Result: [1.0, 2.5, 3.0, 4.0]
2.2 Type Hints for Data Science
Type hints improve code readability and enable static analysis:
Type hints don't affect runtime performance in Pythonβthey are metadata for static analysis tools like mypy. However, they are critical for large data science codebases where understanding data flow between functions is essential for debugging and collaboration.
from typing import List, Dict, Optional, Union, Tuple
import numpy as np
# Function with proper type hints
def compute_statistics(
data: List[float],
weights: Optional[List[float]] = None
) -> Dict[str, float]:
"""
Compute weighted mean and standard deviation.
Args:
data: List of numerical values
weights: Optional weights for each value
Returns:
Dictionary with 'mean' and 'std' keys
"""
if weights is None:
weights = [1.0] * len(data)
mean = sum(d * w for d, w in zip(data, weights)) / sum(weights)
variance = sum(w * (d - mean) ** 2 for d, w in zip(data, weights)) / sum(weights)
return {
"mean": mean,
"std": variance ** 0.5
}
# Usage with type checking
data = [1.0, 2.0, 3.0, 4.0, 5.0]
result = compute_statistics(data)
print(f"Statistics: {result}")
print(f"Type: {type(result)}")
3. Lists vs Tuples vs Arrays
3.1 The Fundamental Difference
Memory Layout Comparison:
LIST (mutable):
ββββββββββββ¬βββββββββββ¬βββββββββββ
β pointer β pointer β pointer β β List object (PyObject*)
ββββββ¬ββββββ΄βββββ¬ββββββ΄βββββ¬ββββββ
β β β
βΌ βΌ βΌ
ββββββββββ ββββββββββ ββββββββββ
β int(1) β β int(2) β β int(3) β β Individual int objects
ββββββββββ ββββββββββ ββββββββββ
TUPLE (immutable):
ββββββββββββ¬βββββββββββ¬βββββββββββ
β int(1) β int(2) β int(3) β β Values stored inline
ββββββββββββ΄βββββββββββ΄βββββββββββ
NUMPY ARRAY (contiguous):
ββββββββββββββββββββββββββββββββ
β int(1) β int(2) β int(3) β β Contiguous C array
ββββββββββββββββββββββββββββββββ
DfCache Locality
The tendency of a processor to access the same set of memory locations repetitively over a short period. Arrays have high spatial locality because elements are stored contiguously, allowing the CPU to prefetch data efficiently. Lists have poor locality due to pointer indirection.
3.2 Performance Comparison
import time
import numpy as np
def benchmark_list(n):
start = time.perf_counter()
lst = list(range(n))
for i in range(n):
lst[i] = lst[i] * 2
return time.perf_counter() - start
def benchmark_tuple(n):
start = time.perf_counter()
tup = tuple(range(n))
tup = tuple(x * 2 for x in tup) # Must create new tuple
return time.perf_counter() - start
def benchmark_numpy(n):
start = time.perf_counter()
arr = np.arange(n)
arr = arr * 2 # Vectorized operation
return time.perf_counter() - start
n = 1_000_000
print(f"Benchmarking with n={n:,}:")
print(f" List: {benchmark_list(n):.4f}s")
print(f" Tuple: {benchmark_tuple(n):.4f}s")
print(f" NumPy: {benchmark_numpy(n):.4f}s")
Output (typical):
Benchmarking with n=1,000,000:
List: 0.0892s
Tuple: 0.1245s
NumPy: 0.0043s
Vectorized Operation Speedup
Here,
- =Time for vectorized NumPy operation
- =Time for equivalent Python loop
- =CPU cache and SIMD optimization factor (typically 10-100)
- =Number of elements
3.3 Time Complexity Table
| Operation | List | Tuple | NumPy Array | Dict | Set |
|---|---|---|---|---|---|
Index access x[i] | O(1) | O(1) | O(1) | O(1) | N/A |
| Append | O(1) | N/A | O(n)* | O(1) | O(1) |
Insert x.insert(i,v) | O(n) | N/A | O(n) | N/A | N/A |
Delete del x[i] | O(n) | N/A | O(n) | O(1) | O(1) |
Search v in x | O(n) | O(n) | O(n) | O(1) | O(1) |
| Pop from end | O(1) | O(1) | O(1) | O(1) | N/A |
| Pop from front | O(n) | O(n) | O(n) | N/A | N/A |
| Concatenation | O(k) | O(k) | N/A | N/A | N/A |
Slice x[a:b] | O(k) | O(k) | O(k) | N/A | N/A |
*NumPy arrays have amortized O(1) append when pre-allocated
The O(1) amortized append for lists and sets relies on over-allocation: Python allocates more memory than immediately needed, so most appends don't trigger reallocation. The amortized cost is O(1) because the occasional O(n) reallocation is spread across O(n) appends.
3.4 When to Use Each
import numpy as np
# USE LIST when:
# - You need mixed types
# - You need to append frequently
# - You're building data incrementally
data_list = [1, "hello", 3.14, [1, 2, 3]]
# USE TUPLE when:
# - Data should not change (coordinates, RGB values)
# - Dictionary keys (hashable)
# - Function returns with multiple values
point = (3.0, 4.0)
rgb = (255, 128, 0)
# USE NUMPY ARRAY when:
# - Numerical computation on large datasets
# - Vectorized operations needed
# - Memory efficiency matters
data_array = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Demonstrate hashability
try:
hash(point)
print(f"Tuples are hashable: hash({point}) = {hash(point)}")
except TypeError as e:
print(f"Lists are NOT hashable: {e}")
# Demonstrate vectorization
print(f"\nVectorized operations:")
print(f" data_array * 2 = {data_array * 2}")
print(f" data_array ** 2 = {data_array ** 2}")
print(f" np.mean(data_array) = {np.mean(data_array)}")
4. Dictionaries: Hash Tables
4.1 How Hash Tables Work
DfHash Function
A deterministic function h: K β {0, 1, ..., m-1} that maps keys to array indices. A good hash function distributes keys uniformly across the index space, minimizing collisions. Python's hash() function uses SipHash on CPython 3.7+ for security against hash-flooding attacks.
Hash Table Load Factor
Here,
- =Load factor (ratio of entries to buckets)
- =Number of entries in the table
- =Number of buckets (table size)
Python maintains a load factor Ξ± β€ 2/3 for dictionaries. When exceeded, the table is resized (typically doubled) and all entries are rehashed. This ensures O(1) average-case lookup with a worst-case of O(n) during resizing.
Dictionary Structure (simplified):
Key: "name" β hash("name") = 946783568
β
Index: 946783568 % 8 = 0
βββββββββ¬ββββββββββββββββββββββββββββββ
β Index β Bucket β
βββββββββΌββββββββββββββββββββββββββββββ€
β 0 β ("name", "Alice") βββββββββββΌβββ ("age", 30)
βββββββββΌββββββββββββββββββββββββββββββ€
β 1 β None β
βββββββββΌββββββββββββββββββββββββββββββ€
β 2 β ("city", "NYC") β
βββββββββΌββββββββββββββββββββββββββββββ€
β 3 β None β
βββββββββΌββββββββββββββββββββββββββββββ€
β 4 β ("email", "a@b.com") ββββββββΌβββ ("phone", "555-0123")
βββββββββΌββββββββββββββββββββββββββββββ€
β 5 β None β
βββββββββΌββββββββββββββββββββββββββββββ€
β 6 β None β
βββββββββΌββββββββββββββββββββββββββββββ€
β 7 β None β
βββββββββ΄ββββββββββββββββββββββββββββββ
Hash collisions are resolved by chaining (linked lists).
4.2 Dictionary Operations
import sys
# Create dictionary
person = {
"name": "Alice Johnson",
"age": 30,
"city": "New York",
"skills": ["Python", "ML", "Statistics"]
}
print(f"Dictionary: {person}")
print(f"Memory: {sys.getsizeof(person)} bytes")
print()
# O(1) lookup
print("O(1) Lookups:")
print(f" person['name'] = {person['name']}")
print(f" person.get('age', 'N/A') = {person.get('age', 'N/A')}")
print(f" person.get('salary', 'N/A') = {person.get('salary', 'N/A')}")
print()
# Dictionary comprehension
squares = {x: x**2 for x in range(10)}
print(f"Comprehension: {squares}")
print()
# Nested dictionary access
nested = {
"user": {
"profile": {
"name": "Bob",
"scores": [95, 87, 92]
}
}
}
# Safe nested access
def get_nested(d, keys, default=None):
"""Safely access nested dictionary keys."""
current = d
for key in keys:
if isinstance(current, dict) and key in current:
current = current[key]
else:
return default
return current
print(f"Nested access: {get_nested(nested, ['user', 'profile', 'name'])}")
print(f"Missing key: {get_nested(nested, ['user', 'address', 'city'])}")
4.3 Performance Characteristics
import time
def lookup_benchmark(n):
"""Benchmark dictionary vs list lookup."""
# Create data
data_dict = {f"key_{i}": i for i in range(n)}
data_list = [(f"key_{i}", i) for i in range(n)]
search_key = f"key_{n-1}"
# Dict lookup
start = time.perf_counter()
for _ in range(1000):
_ = data_dict[search_key]
dict_time = time.perf_counter() - start
# List search
start = time.perf_counter()
for _ in range(1000):
_ = [x for x in data_list if x[0] == search_key]
list_time = time.perf_counter() - start
return dict_time, list_time
n = 100_000
dict_t, list_t = lookup_benchmark(n)
print(f"Lookup benchmark (n={n:,}, 1000 iterations):")
print(f" Dict: {dict_t:.4f}s")
print(f" List: {list_t:.4f}s")
print(f" Speedup: {list_t/dict_t:.1f}x faster with dict")
The dictionary's O(1) average lookup time is achieved through hash-based indexing with collision resolution, making it the foundation of Python's data model.
5. Sets: Membership Testing and Deduplication
5.1 Set Operations
# Create sets
skills_ml = {"Python", "TensorFlow", "Scikit-learn", "Pandas"}
skills_data = {"Python", "SQL", "Pandas", "Tableau", "Excel"}
# Set operations
print("Set Operations:")
print(f" Union: {skills_ml | skills_data}")
print(f" Intersection: {skills_ml & skills_data}")
print(f" Difference: {skills_ml - skills_data}")
print(f" Symmetric: {skills_ml ^ skills_data}")
print()
# Deduplication
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
unique = list(set(data))
print(f"Original: {data}")
print(f"Unique: {sorted(unique)}")
print(f"Count: {len(data)} β {len(unique)}")
print()
# Membership testing benchmark
import time
large_list = list(range(1_000_000))
large_set = set(range(1_000_000))
# Test membership
test_value = 999_999
start = time.perf_counter()
for _ in range(100):
_ = test_value in large_list
list_time = time.perf_counter() - start
start = time.perf_counter()
for _ in range(100):
_ = test_value in large_set
set_time = time.perf_counter() - start
print(f"Membership test (1M elements, 100 iterations):")
print(f" List: {list_time:.4f}s")
print(f" Set: {set_time:.4f}s")
print(f" Speedup: {list_time/set_time:.1f}x faster with set")
Set Membership Complexity
Here,
- =Hash-based, O(1) average lookup
- =Linear scan, O(n) worst case
6. Mutable vs Immutable
6.1 Why Immutability Matters
Mutable Object (list):
ββββββββββββββββ
β id: 14023456 ββββ [1, 2, 3, 4, 5]
ββββββββββββββββ
β modified in place
ββββββββββββββββ
β id: 14023456 ββββ [1, 2, 3, 4, 5, 6] β Same object
ββββββββββββββββ
Immutable Object (tuple):
ββββββββββββββββ
β id: 14023456 ββββ (1, 2, 3, 4, 5)
ββββββββββββββββ
β "modification" creates new object
ββββββββββββββββ
β id: 14023520 ββββ (1, 2, 3, 4, 5, 6) β Different object
ββββββββββββββββ
Immutability enables Python to optimize memory through object interning (caching). Small integers (-5 to 256) and some strings are cached, meaning multiple variables can reference the same object in memory. This is why is and == can behave differently for immutable types.
import sys
# Mutable: in-place modification
list_a = [1, 2, 3]
list_b = list_a # Both point to same object
print("Mutable (list):")
print(f" list_a id: {id(list_a)}")
print(f" list_b id: {id(list_b)}")
print(f" Same object: {list_a is list_b}")
list_b.append(4)
print(f" After append(4) to list_b:")
print(f" list_a: {list_a}") # Also changed!
print(f" list_b: {list_b}")
print()
# Immutable: new object created
tuple_a = (1, 2, 3)
tuple_b = tuple_a # Both point to same object
print("Immutable (tuple):")
print(f" tuple_a id: {id(tuple_a)}")
print(f" tuple_b id: {id(tuple_b)}")
print(f" Same object: {tuple_a is tuple_b}")
tuple_b = tuple_b + (4,) # Creates new tuple
print(f" After concatenation:")
print(f" tuple_a: {tuple_a}") # Unchanged!
print(f" tuple_b: {tuple_b}")
print(f" Same object now: {tuple_a is tuple_b}")
print()
# Immutability enables safe dictionary keys
location_data = {
(40.7128, -74.0060): "New York",
(34.0522, -118.2437): "Los Angeles",
(41.8781, -87.6298): "Chicago"
}
print("Dictionary with tuple keys:")
for coords, city in location_data.items():
print(f" {coords}: {city}")
6.2 String Interning (Caching)
# Python caches small integers and some strings
a = "hello"
b = "hello"
c = "".join(["h", "e", "l", "l", "o"])
print(f"a is b: {a is b}") # True (interned)
print(f"a is c: {a is c}") # False (created dynamically)
print(f"a == c: {a == c}") # True (value equality)
print()
# Integer caching (-5 to 256)
x = 256
y = 256
print(f"256 is 256: {x is y}") # True (cached)
x = 257
y = 257
print(f"257 is 257: {x is y}") # False (not cached)
7. Memory Layout Visualization
7.1 Stack vs Heap
Memory Layout:
STACK (fast, limited size):
βββββββββββββββββββββββββββββββββββββββββββββββ
β Frame: main() β
β βββββββββββ βββββββββββ βββββββββββ β
β β x = 5 β β y = 3.14β β s = βββββΌβββ β
β βββββββββββ βββββββββββ βββββββββββ β β
β βββββββββββ βββββββββββ β β
β β flag=Trueβ β temp = ββΌβββββ β β
β βββββββββββ βββββββββββ β β β
ββββββββββββββββββββββββββββββββββββββββββββββ
β β
HEAP (slow, large): β β
ββββββββββββββββββββββββββββββββββββββββββββββ
β βββββββββββ β β β
β β "Hello" β βββββββββββββββββ β β
β βββββββββββ β β
β βββββββββββ β β
β β [1,2,3] β ββββββββββββββββββββββββββ β
β βββββββββββ β
β βββββββββββ β
β β {"a": 1}β β
β βββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββ
7.2 Reference Counting
Reference Counting
Here,
- =Number of references to an object
- =The Python object being referenced
import sys
# Reference counting
a = [1, 2, 3]
print(f"Initial refcount: {sys.getrefcount(a) - 1}") # -1 for getrefcount arg
b = a # Increases refcount
print(f"After b = a: {sys.getrefcount(a) - 1}")
c = a # Increases refcount
print(f"After c = a: {sys.getrefcount(a) - 1}")
del b # Decreases refcount
print(f"After del b: {sys.getrefcount(a) - 1}")
del c # Decreases refcount
print(f"After del c: {sys.getrefcount(a) - 1}")
# Circular reference detection
class Node:
def __init__(self):
self.reference = None
x = Node()
y = Node()
x.reference = y
y.reference = x # Circular reference
print(f"\nCircular reference: xβyβx")
print(f" x refcount (includes getrefcount): {sys.getrefcount(x) - 1}")
print(f" y refcount (includes getrefcount): {sys.getrefcount(y) - 1}")
print(f" Garbage collector handles circular refs")
8. Complete Code Example: Data Science Utilities
import numpy as np
from typing import List, Dict, Any, Optional
from collections import defaultdict
class DataScienceToolkit:
"""Practical data structure utilities for data science."""
@staticmethod
def safe_convert(value: Any, target_type: type, default: Any = None) -> Any:
"""Safely convert value to target type."""
try:
return target_type(value)
except (ValueError, TypeError):
return default
@staticmethod
def flatten_dict(d: Dict, parent_key: str = '', sep: str = '.') -> Dict:
"""Flatten nested dictionary."""
items = []
for k, v in d.items():
new_key = f"{parent_key}{sep}{k}" if parent_key else k
if isinstance(v, dict):
items.extend(DataScienceToolkit.flatten_dict(v, new_key, sep).items())
else:
items.append((new_key, v))
return dict(items)
@staticmethod
def chunk_list(lst: List, chunk_size: int) -> List[List]:
"""Split list into chunks."""
return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]
@staticmethod
def weighted_average(values: List[float], weights: List[float]) -> float:
"""Compute weighted average."""
return sum(v * w for v, w in zip(values, weights)) / sum(weights)
@staticmethod
def detect_outliers_iqr(data: List[float], factor: float = 1.5) -> List[int]:
"""Detect outliers using IQR method."""
sorted_data = sorted(data)
n = len(sorted_data)
q1 = sorted_data[n // 4]
q3 = sorted_data[3 * n // 4]
iqr = q3 - q1
lower = q1 - factor * iqr
upper = q3 + factor * iqr
return [i for i, x in enumerate(data) if x < lower or x > upper]
# Usage example
toolkit = DataScienceToolkit()
# Type conversion
print("Safe type conversion:")
print(f" '42' β int: {toolkit.safe_convert('42', int)}")
print(f" 'abc' β int: {toolkit.safe_convert('abc', int, default=0)}")
print(f" '3.14' β float: {toolkit.safe_convert('3.14', float)}")
print()
# Dictionary flattening
nested = {
"user": {
"profile": {
"name": "Alice",
"age": 30
},
"settings": {
"theme": "dark",
"notifications": True
}
}
}
flattened = toolkit.flatten_dict(nested)
print("Flattened dictionary:")
for key, value in flattened.items():
print(f" {key}: {value}")
print()
# List chunking
data = list(range(1, 11))
chunks = toolkit.chunk_list(data, 3)
print(f"Chunked list ({data}):")
for i, chunk in enumerate(chunks):
print(f" Chunk {i}: {chunk}")
print()
# Outlier detection
np.random.seed(42)
normal_data = np.random.normal(0, 1, 100).tolist()
normal_data.extend([10, -10, 15]) # Add outliers
outliers = toolkit.detect_outliers_iqr(normal_data)
print(f"Outlier detection: found {len(outliers)} outliers")
print(f" Outlier indices: {outliers}")
print(f" Outlier values: {[normal_data[i] for i in outliers]}")
Key Takeaways
πSummary: Python Data Types & Structures
- Type Precision: Python integers have arbitrary precision; floats are approximate (IEEE 754). Always use
math.isclose()for float comparisons. - Data Structure Choice Matters: Lists for mutable collections, tuples for fixed data, NumPy arrays for numerical work, dicts for O(1) lookup, sets for deduplication.
- Memory Efficiency: Tuples are more memory-efficient than lists; NumPy arrays store data contiguously (cache-friendly).
- Immutability Benefits: Immutable objects are hashable (can be dict keys/set members), thread-safe, and enable Python's internal optimizations.
- Type Hints: Improve code clarity and enable static analysis tools like mypy.
Practice Exercises
Exercise 1: Performance Analysis
Write a function that benchmarks list vs dictionary lookup for finding an element in a collection of 1 million items. Include timing and discuss the results.
Exercise 2: Nested Data Processing
Given a nested dictionary representing a student database, write a function to:
- Flatten the dictionary
- Extract all grades
- Calculate statistics (mean, median, std)
- Identify students with grades below threshold
Exercise 3: Memory Optimization
Create a data structure that stores 10,000 records with name, age, and email. Compare memory usage between:
- List of dictionaries
- List of tuples
- NumPy structured array
- Pandas DataFrame
Discuss which approach is most memory-efficient and why.
Exercise 4: Type Conversion Pipeline
Build a robust type conversion function that handles:
- Mixed types in input
- Missing values (None, NaN)
- Invalid conversions
- Logging of conversion failures
Test with a dataset containing 1000 mixed-type values.
Exercise 5: Set Operations for Data Analysis
Using set operations, write a function to find:
- Common features between two datasets
- Features unique to each dataset
- The symmetric difference Discuss why set operations are preferred over list operations for this use case.