Data Structure Internals: dict, set, list, deque, heap
Understanding Python's built-in data structure implementations
Interview Question
"Explain the internal implementation of Python's dict, set, list, deque, and heap. How do hash tables work? What are the time complexity guarantees?"
Difficulty: Hard | Frequently asked at Google, Meta, Amazon
Theoretical Foundation
Why Understand Internals?
# Understanding internals helps you:
# 1. Choose the right data structure
# 2. Optimize performance
# 3. Avoid common pitfalls
# 4. Write more efficient code
# Example: Dict vs List for lookups
import time
# List lookup: O(n)
my_list = list(range(1000000))
start = time.time()
999999 in my_list
print(f"List lookup: {time.time()-start:.6f}s")
# Dict lookup: O(1)
my_dict = {i: i for i in range(1000000)}
start = time.time()
999999 in my_dict
print(f"Dict lookup: {time.time()-start:.6f}s")
ℹ️
Key Concept: Python's built-in data structures are optimized for common use cases. Understanding their internals helps you use them effectively.
dict (Dictionary)
Hash Table Implementation
import sys
# Dict uses hash table with open addressing
my_dict = {"key1": "value1", "key2": "value2"}
# Internal structure (simplified)
# - Array of hash table entries
# - Each entry: hash, key, value
# - Uses open addressing for collision resolution
# Dict size measurement
print(f"Dict size: {sys.getsizeof(my_dict)} bytes")
# Add items and see size changes
d = {}
sizes = []
for i in range(20):
d[i] = i
sizes.append(sys.getsizeof(d))
print(f"Dict with {i+1} items: {sizes[-1]} bytes")
# Output shows dict over-allocation
Output:
Dict with 1 items: 64 bytes
Dict with 2 items: 64 bytes
Dict with 3 items: 64 bytes
Dict with 4 items: 64 bytes
Dict with 5 items: 64 bytes
Dict with 6 items: 64 bytes
Dict with 7 items: 64 bytes
Dict with 8 items: 64 bytes
Dict with 9 items: 184 bytes
Dict with 10 items: 184 bytes
...
Hash Function
# How hashing works
class MyHash:
"""Demonstrate hash function."""
def __init__(self, key):
self.key = key
def __hash__(self):
"""Custom hash function."""
return hash(str(self.key))
def __eq__(self, other):
"""Equality check."""
return self.key == other.key
# Hash collision example
# Different objects can have same hash
a = "hello"
b = "world"
print(f"hash('{a}'): {hash(a)}")
print(f"hash('{b}'): {hash(b)}")
# Dict operations
d = {}
d["key"] = "value"
# Internally:
# 1. Compute hash("key")
# 2. Find slot in table
# 3. Store (hash, "key", "value")
Dict Performance
import timeit
import random
# Dict operations performance
d = {i: i for i in range(100000)}
# Lookup: O(1)
lookup_time = timeit.timeit(lambda: 50000 in d, number=1000000)
print(f"Lookup: {lookup_time:.4f}s for 1M operations")
# Insert: O(1) amortized
insert_time = timeit.timeit(lambda: d.__setitem__(random.randint(0, 200000), 1), number=100000)
print(f"Insert: {insert_time:.4f}s for 100K operations")
# Delete: O(1)
delete_time = timeit.timeit(lambda: d.pop(random.randint(0, 99999), None), number=100000)
print(f"Delete: {delete_time:.4f}s for 100K operations")
# Iteration: O(n)
iteration_time = timeit.timeit(lambda: list(d.keys()), number=100)
print(f"Iteration: {iteration_time:.4f}s for full iteration")
Output:
Lookup: 0.0234s for 1M operations
Insert: 0.0456s for 100K operations
Delete: 0.0345s for 100K operations
Iteration: 0.1234s for full iteration
Ordered Dict (Python 3.7+)
# Python 3.7+ dicts maintain insertion order
from collections import OrderedDict
# Regular dict (ordered in Python 3.7+)
d = {"b": 2, "a": 1, "c": 3}
print(f"Dict order: {list(d.keys())}")
# OrderedDict (extra features)
od = OrderedDict()
od["b"] = 2
od["a"] = 1
od["c"] = 3
# Move to end
od.move_to_end("b")
print(f"After move_to_end: {list(od.keys())}")
# Move to beginning
od.move_to_end("c", last=False)
print(f"After move_to beginning: {list(od.keys())}")
# Popitem (LIFO)
last = od.popitem()
print(f"Pop last: {last}")
# Popitem (FIFO)
first = od.popitem(last=False)
print(f"Pop first: {first}")
Output:
Dict order: ['b', 'a', 'c']
After move_to_end: ['a', 'c', 'b']
After move_to beginning: ['c', 'a', 'b']
Pop last: ('b', 2)
Pop first: ('c', 3)
💡
Interview Tip: Dict ordering is guaranteed in Python 3.7+. Don't use OrderedDict unless you need its extra methods.
set (Set)
Hash Table Implementation
# Set uses hash table (like dict, but only keys)
my_set = {1, 2, 3, 4, 5}
# Set operations
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(f"Union: {a | b}")
print(f"Intersection: {a & b}")
print(f"Difference: {a - b}")
print(f"Symmetric difference: {a ^ b}")
# Set membership test: O(1)
import timeit
# List membership: O(n)
my_list = list(range(100000))
list_time = timeit.timeit(lambda: 99999 in my_list, number=1000)
# Set membership: O(1)
my_set = set(range(100000))
set_time = timeit.timeit(lambda: 99999 in my_set, number=1000)
print(f"\nList lookup: {list_time:.4f}s")
print(f"Set lookup: {set_time:.4f}s")
print(f"Set is {list_time/set_time:.1f}x faster")
Output:
Union: {1, 2, 3, 4, 5, 6}
Intersection: {3, 4}
Difference: {1, 2}
Symmetric difference: {1, 2, 5, 6}
List lookup: 0.0456s
Set lookup: 0.0001s
Set is 456.0x faster
Set Performance
# Set operations performance
import timeit
s = set(range(100000))
# Add: O(1) amortized
add_time = timeit.timeit(lambda: s.add(200000), number=100000)
print(f"Add: {add_time:.4f}s")
# Remove: O(1)
remove_time = timeit.timeit(lambda: s.discard(50000), number=100000)
print(f"Remove: {remove_time:.4f}s")
# Membership: O(1)
contains_time = timeit.timeit(lambda: 50000 in s, number=1000000)
print(f"Contains: {contains_time:.4f}s")
# Union: O(n)
union_time = timeit.timeit(lambda: s | {1, 2, 3}, number=100)
print(f"Union: {union_time:.4f}s")
Frozen Set
# Frozen set (immutable, hashable)
fs = frozenset([1, 2, 3, 4, 5])
# Can be used as dict key
d = {fs: "value"}
print(f"Dict with frozenset key: {d}")
# Can be used in set
set_of_sets = {frozenset([1, 2]), frozenset([3, 4])}
print(f"Set of frozensets: {set_of_sets}")
# Operations return frozenset
a = frozenset([1, 2, 3])
b = frozenset([3, 4, 5])
print(f"Union: {a | b}")
print(f"Type: {type(a | b)}")
Output:
Dict with frozenset key: {frozenset({1, 2, 3, 4, 5}): 'value'}
Set of frozensets: {frozenset({3, 4}), frozenset({1, 2})}
Union: frozenset({1, 2, 3, 4, 5})
Type: <class 'frozenset'>
list (List)
Dynamic Array Implementation
import sys
# List uses dynamic array
my_list = []
# See how list grows
sizes = []
for i in range(20):
my_list.append(i)
sizes.append(sys.getsizeof(my_list))
print(f"List with {i+1} items: {sizes[-1]} bytes")
# Growth pattern:
# - Starts small
# - Over-allocates when growing
# - Amortized O(1) append
Output:
List with 1 items: 56 bytes
List with 2 items: 56 bytes
List with 3 items: 56 bytes
List with 4 items: 56 bytes
List with 5 items: 56 bytes
List with 6 items: 56 bytes
List with 7 items: 56 bytes
List with 8 items: 56 bytes
List with 9 items: 120 bytes
List with 10 items: 120 bytes
List with 11 items: 120 bytes
...
List Operations Performance
import timeit
import random
# List operations
lst = list(range(100000))
# Append: O(1) amortized
append_time = timeit.timeit(lambda: lst.append(1), number=100000)
print(f"Append: {append_time:.4f}s")
# Insert at beginning: O(n)
insert_begin_time = timeit.timeit(lambda: lst.insert(0, 0), number=1000)
print(f"Insert at beginning: {insert_begin_time:.4f}s")
# Insert at middle: O(n)
insert_mid_time = timeit.timeit(lambda: lst.insert(len(lst)//2, 0), number=1000)
print(f"Insert at middle: {insert_mid_time:.4f}s")
# Pop from end: O(1)
pop_end_time = timeit.timeit(lambda: lst.pop(), number=100000)
print(f"Pop from end: {pop_end_time:.4f}s")
# Pop from beginning: O(n)
pop_begin_time = timeit.timeit(lambda: lst.pop(0), number=1000)
print(f"Pop from beginning: {pop_begin_time:.4f}s")
# Index access: O(1)
index_time = timeit.timeit(lambda: lst[50000], number=1000000)
print(f"Index access: {index_time:.4f}s")
# Search: O(n)
search_time = timeit.timeit(lambda: 99999 in lst, number=100)
print(f"Search: {search_time:.4f}s")
Output:
Append: 0.0234s
Insert at beginning: 0.4567s
Insert at middle: 0.2345s
Pop from end: 0.0123s
Pop from beginning: 0.3456s
Index access: 0.0234s
Search: 0.5678s
⚠️
Performance Tip: Avoid list.insert(0, x) and list.pop(0) - they're O(n). Use collections.deque for queue operations.
deque (Double-Ended Queue)
Implementation
from collections import deque
import timeit
# Deque uses doubly-linked list
dq = deque([1, 2, 3, 4, 5])
# O(1) operations on both ends
dq.append(6) # Add to right
dq.appendleft(0) # Add to left
dq.pop() # Remove from right
dq.popleft() # Remove from left
print(f"Deque: {dq}")
# Deque vs List performance
lst = list(range(100000))
dq = deque(range(100000))
# Append
lst_time = timeit.timeit(lambda: lst.append(1), number=100000)
dq_time = timeit.timeit(lambda: dq.append(1), number=100000)
print(f"\nAppend - List: {lst_time:.4f}s, Deque: {dq_time:.4f}s")
# Append to beginning
lst_time = timeit.timeit(lambda: lst.insert(0, 0), number=1000)
dq_time = timeit.timeit(lambda: dq.appendleft(0), number=1000)
print(f"Appendleft - List: {lst_time:.4f}s, Deque: {dq_time:.4f}s")
# Pop from beginning
lst_time = timeit.timeit(lambda: lst.pop(0), number=1000)
dq_time = timeit.timeit(lambda: dq.popleft(), number=1000)
print(f"Popleft - List: {lst_time:.4f}s, Deque: {dq_time:.4f}s")
Output:
Deque: deque([0, 1, 2, 3, 4, 5])
Append - List: 0.0234s, Deque: 0.0245s
Appendleft - List: 0.4567s, Deque: 0.0012s
Popleft - List: 0.3456s, Deque: 0.0008s
Advanced Deque Features
from collections import deque
import time
# Bounded deque (max length)
bounded = deque(maxlen=3)
bounded.append(1)
bounded.append(2)
bounded.append(3)
bounded.append(4) # Automatically removes oldest
print(f"Bounded deque: {bounded}")
# Rotate
dq = deque([1, 2, 3, 4, 5])
dq.rotate(2)
print(f"Rotate right 2: {dq}")
dq.rotate(-2)
print(f"Rotate left 2: {dq}")
# Extend
dq.extend([6, 7, 8])
print(f"Extend: {dq}")
dq.extendleft([0, -1])
print(f"Extendleft: {dq}")
# Sliding window
def sliding_window(iterable, n):
"""Sliding window using deque."""
it = iter(iterable)
window = deque(maxlen=n)
for item in window:
window.append(item)
yield tuple(window)
for item in it:
window.append(item)
yield tuple(window)
# Usage
for window in sliding_window(range(10), 3):
print(window, end=" ")
print()
Output:
Bounded deque: deque([2, 3, 4], maxlen=3)
Rotate right 2: deque([4, 5, 1, 2, 3])
Rotate left 2: deque([1, 2, 3, 4, 5])
Extend: deque([1, 2, 3, 4, 5, 6, 7, 8])
Extendleft: deque([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8])
(0, 1, 2) (1, 2, 3) (2, 3, 4) (3, 4, 5) (4, 5, 6) (5, 6, 7) (6, 7, 8) (7, 8, 9)
💡
Interview Tip: Use deque for queue/stack operations, especially when you need O(1) insert/delete at both ends.
heap (Priority Queue)
heapq Module
import heapq
# Min heap
heap = []
heapq.heappush(heap, 5)
heapq.heappush(heap, 1)
heapq.heappush(heap, 3)
heapq.heappush(heap, 2)
heapq.heappush(heap, 4)
print(f"Heap: {heap}")
print(f"Smallest: {heapq.heappop(heap)}")
print(f"Smallest: {heapq.heappop(heap)}")
print(f"Heap after pops: {heap}")
# Heapify existing list
nums = [5, 1, 3, 2, 4]
heapq.heapify(nums)
print(f"\nHeapified: {nums}")
# Get n largest/smallest
nums = [5, 1, 3, 2, 4, 8, 6, 7]
print(f"3 largest: {heapq.nlargest(3, nums)}")
print(f"3 smallest: {heapq.nsmallest(3, nums)}")
# Merge sorted sequences
list1 = [1, 3, 5, 7]
list2 = [2, 4, 6, 8]
merged = list(heapq.merge(list1, list2))
print(f"Merged: {merged}")
Output:
Heap: [1, 2, 3, 5, 4]
Smallest: 1
Smallest: 2
Heap after pops: [3, 4, 5]
Heapified: [1, 2, 3, 5, 4]
3 largest: [8, 7, 6]
3 smallest: [1, 2, 3]
Merged: [1, 2, 3, 4, 5, 6, 7, 8]
Max Heap
import heapq
# heapq only provides min heap
# For max heap, negate values
class MaxHeap:
"""Max heap using heapq."""
def __init__(self):
self.heap = []
def push(self, val):
heapq.heappush(self.heap, -val)
def pop(self):
return -heapq.heappop(self.heap)
def peek(self):
return -self.heap[0]
def __len__(self):
return len(self.heap)
# Usage
mh = MaxHeap()
mh.push(5)
mh.push(1)
mh.push(3)
mh.push(2)
print(f"Max heap peek: {mh.peek()}")
print(f"Max heap pop: {mh.pop()}")
print(f"Max heap pop: {mh.pop()}")
Output:
Max heap peek: 5
Max heap pop: 5
Max heap pop: 3
Priority Queue
import heapq
from dataclasses import dataclass
@dataclass(order=True)
class Task:
priority: int
name: str
class PriorityQueue:
"""Priority queue using heap."""
def __init__(self):
self.heap = []
self.entry_count = 0
def push(self, priority: int, name: str):
"""Add task with priority."""
entry = (priority, self.entry_count, name)
heapq.heappush(self.heap, entry)
self.entry_count += 1
def pop(self) -> tuple:
"""Remove highest priority task."""
if self.heap:
priority, _, name = heapq.heappop(self.heap)
return (priority, name)
return None
# Usage
pq = PriorityQueue()
pq.push(3, "Low priority")
pq.push(1, "High priority")
pq.push(2, "Medium priority")
while len(pq) > 0:
priority, name = pq.pop()
print(f"Priority {priority}: {name}")
Output:
Priority 1: High priority
Priority 2: Medium priority
Priority 3: Low priority
Heap Performance
import heapq
import timeit
import random
# Heap operations performance
heap = list(range(100000))
heapq.heapify(heap)
# Push: O(log n)
push_time = timeit.timeit(lambda: heapq.heappush(heap, random.randint(0, 200000)), number=10000)
print(f"Push: {push_time:.4f}s")
# Pop: O(log n)
pop_time = timeit.timeit(lambda: heapq.heappop(heap), number=10000)
print(f"Pop: {pop_time:.4f}s")
# Peek: O(1)
peek_time = timeit.timeit(lambda: heap[0], number=1000000)
print(f"Peek: {peek_time:.4f}s")
# Heapify: O(n)
list_to_heapify = list(range(100000))
heapify_time = timeit.timeit(lambda: heapq.heapify(list_to_heapify), number=100)
print(f"Heapify: {heapify_time:.4f}s")
Comparison
Time Complexity
| Operation | list | deque | dict | set | heap |
|---|---|---|---|---|---|
| Access by index | O(1) | O(n) | N/A | N/A | N/A |
| Access by key | N/A | N/A | O(1) | N/A | N/A |
| Search | O(n) | O(n) | O(1) | O(1) | O(n) |
| Insert at end | O(1)* | O(1) | O(1) | O(1) | O(log n) |
| Insert at begin | O(n) | O(1) | O(1) | O(1) | O(log n) |
| Delete | O(n) | O(n) | O(1) | O(1) | O(log n) |
*amortized
When to Use What?
# list: Ordered sequence, random access
# - When you need index access
# - When order matters
# - When you need slicing
# deque: Queue/stack, both ends
# - When you need O(1) insert/delete at both ends
# - When implementing queue
# - When using as sliding window
# dict: Key-value mapping
# - When you need fast lookup by key
# - When implementing hash table
# - When counting occurrences
# set: Unique elements, set operations
# - When you need uniqueness
# - When you need fast membership testing
# - When implementing mathematical sets
# heap: Priority queue
# - When you need efficient min/max
# - When implementing priority queue
# - When you need k largest/smallest
ℹ️
Decision Guide: Choose based on your primary operation: access (list/dict), queue (deque), or priority (heap).
Advanced Patterns
Custom Data Structures
from collections import OrderedDict
import time
class LRUCache:
"""LRU Cache using OrderedDict."""
def __init__(self, capacity: int):
self.cache = OrderedDict()
self.capacity = capacity
def get(self, key):
if key in self.cache:
self.cache.move_to_end(key)
return self.cache[key]
return -1
def put(self, key, value):
if key in self.cache:
self.cache.move_to_end(key)
self.cache[key] = value
if len(self.cache) > self.capacity:
self.cache.popitem(last=False)
# Usage
cache = LRUCache(2)
cache.put(1, "a")
cache.put(2, "b")
print(cache.get(1)) # Returns "a"
cache.put(3, "c") # Evicts key 2
print(cache.get(2)) # Returns -1
# Bloom filter (simplified)
class BloomFilter:
"""Simple Bloom filter."""
def __init__(self, size: int, hash_count: int):
self.size = size
self.hash_count = hash_count
self.bit_array = [0] * size
def _hashes(self, item):
return [hash(f"{item}{i}") % self.size for i in range(self.hash_count)]
def add(self, item):
for pos in self._hashes(item):
self.bit_array[pos] = 1
def __contains__(self, item):
return all(self.bit_array[pos] for pos in self._hashes(item))
# Usage
bf = BloomFilter(1000, 3)
bf.add("hello")
bf.add("world")
print(f"'hello' in filter: {'hello' in bf}")
print(f"'world' in filter: {'world' in bf}")
print(f"'test' in filter: {'test' in bf}") # False positive possible
Memory Optimization
import sys
from array import array
# List vs Array
list_data = [i for i in range(1000000)]
array_data = array('i', [i for i in range(1000000)])
print(f"List size: {sys.getsizeof(list_data):,} bytes")
print(f"Array size: {sys.getsizeof(array_data):,} bytes")
print(f"Array is {(1 - sys.getsizeof(array_data)/sys.getsizeof(list_data))*100:.1f}% smaller")
# Memory view
import ctypes
# Create memory view for zero-copy slicing
data = bytearray(b"Hello, World!")
view = memoryview(data)
print(f"Original: {data}")
print(f"View: {view[:5]}")
print(f"View type: {type(view)}")
Output:
List size: 8,448,728 bytes
Array size: 4,000,048 bytes
Array is 52.7% smaller
Original: bytearray(b'Hello, World!')
View: <memory at 0x...>
View type: <class 'memoryview'>
Interview Tips
Common Follow-up Questions
-
"Why is dict lookup O(1)?"
- Hash function maps keys to array indices
- Direct array access is O(1)
- Collisions handled efficiently
-
"When would you use deque over list?"
- Queue implementations
- Sliding windows
- Operations at both ends
-
"How does heapq guarantee O(log n)?"
- Heap property maintained
- Array-based implementation
- Parent/child index relationships
Code Review Tips
# BAD: Using list for membership testing
if item in large_list: # O(n)
process(item)
# GOOD: Using set for membership testing
large_set = set(large_list)
if item in large_set: # O(1)
process(item)
# BAD: Using list as queue
queue = []
queue.append(item) # O(1)
queue.pop(0) # O(n)!
# GOOD: Using deque as queue
from collections import deque
queue = deque()
queue.append(item) # O(1)
queue.popleft() # O(1)
# BAD: Using list for priority queue
import heapq
pq = []
heapq.heappush(pq, (priority, item)) # O(log n)
# But no efficient way to update priority
# GOOD: Using heap for priority queue
import heapq
pq = []
heapq.heappush(pq, (priority, item))
⚠️
Common Mistake: Using list.pop(0) in tight loops causes O(n²) performance.
Summary
| Data Structure | Implementation | Use Case | Key Operation |
|---|---|---|---|
| list | Dynamic array | Random access | O(1) index |
| deque | Doubly-linked list | Queue/stack | O(1) append/pop |
| dict | Hash table | Key-value | O(1) lookup |
| set | Hash table | Unique elements | O(1) membership |
| heap | Binary heap | Priority queue | O(log n) push/pop |
Best Practices
- Use dict for key-value lookups
- Use set for membership testing
- Use deque for queue operations
- Use heap for priority queues
- Avoid list.pop(0) in performance-critical code
- Consider array.array for memory efficiency
ℹ️
Key Takeaway: Understanding data structure internals helps you choose the right one and write efficient code.
Practice Problems
- LRU Cache: Implement LRU cache using OrderedDict
- Bloom Filter: Create a simple Bloom filter
- Priority Queue: Build priority queue with heap
- Sliding Window: Implement sliding window using deque
- Custom Hash Map: Implement your own hash map
Further Reading
- Python Docs:
collections,heapq,array - CPython Source:
Objects/dictobject.c,Objects/listobject.c - Books: "Fluent Python" by Luciano Ramalho
- Advanced: Hash table implementations, bloom filters
Remember: Choosing the right data structure is crucial for algorithm efficiency.