🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Data Structure Internals: dict, set, list, deque, heap

PythonData Structure Internals⭐ Premium

Advertisement

Google, Meta & Amazon Interview

Data Structure Internals: dict, set, list, deque, heap

Understanding Python's built-in data structure implementations

Interview Question

"Explain the internal implementation of Python's dict, set, list, deque, and heap. How do hash tables work? What are the time complexity guarantees?"

Difficulty: Hard | Frequently asked at Google, Meta, Amazon


Theoretical Foundation

Why Understand Internals?

# Understanding internals helps you:
# 1. Choose the right data structure
# 2. Optimize performance
# 3. Avoid common pitfalls
# 4. Write more efficient code

# Example: Dict vs List for lookups
import time

# List lookup: O(n)
my_list = list(range(1000000))
start = time.time()
999999 in my_list
print(f"List lookup: {time.time()-start:.6f}s")

# Dict lookup: O(1)
my_dict = {i: i for i in range(1000000)}
start = time.time()
999999 in my_dict
print(f"Dict lookup: {time.time()-start:.6f}s")

ℹ️

Key Concept: Python's built-in data structures are optimized for common use cases. Understanding their internals helps you use them effectively.


dict (Dictionary)

Hash Table Implementation

import sys

# Dict uses hash table with open addressing
my_dict = {"key1": "value1", "key2": "value2"}

# Internal structure (simplified)
# - Array of hash table entries
# - Each entry: hash, key, value
# - Uses open addressing for collision resolution

# Dict size measurement
print(f"Dict size: {sys.getsizeof(my_dict)} bytes")

# Add items and see size changes
d = {}
sizes = []
for i in range(20):
    d[i] = i
    sizes.append(sys.getsizeof(d))
    print(f"Dict with {i+1} items: {sizes[-1]} bytes")

# Output shows dict over-allocation

Output:

Architecture Diagram
Dict with 1 items: 64 bytes
Dict with 2 items: 64 bytes
Dict with 3 items: 64 bytes
Dict with 4 items: 64 bytes
Dict with 5 items: 64 bytes
Dict with 6 items: 64 bytes
Dict with 7 items: 64 bytes
Dict with 8 items: 64 bytes
Dict with 9 items: 184 bytes
Dict with 10 items: 184 bytes
...

Hash Function

# How hashing works
class MyHash:
    """Demonstrate hash function."""
    
    def __init__(self, key):
        self.key = key
    
    def __hash__(self):
        """Custom hash function."""
        return hash(str(self.key))
    
    def __eq__(self, other):
        """Equality check."""
        return self.key == other.key

# Hash collision example
# Different objects can have same hash
a = "hello"
b = "world"
print(f"hash('{a}'): {hash(a)}")
print(f"hash('{b}'): {hash(b)}")

# Dict operations
d = {}
d["key"] = "value"
# Internally:
# 1. Compute hash("key")
# 2. Find slot in table
# 3. Store (hash, "key", "value")

Dict Performance

import timeit
import random

# Dict operations performance
d = {i: i for i in range(100000)}

# Lookup: O(1)
lookup_time = timeit.timeit(lambda: 50000 in d, number=1000000)
print(f"Lookup: {lookup_time:.4f}s for 1M operations")

# Insert: O(1) amortized
insert_time = timeit.timeit(lambda: d.__setitem__(random.randint(0, 200000), 1), number=100000)
print(f"Insert: {insert_time:.4f}s for 100K operations")

# Delete: O(1)
delete_time = timeit.timeit(lambda: d.pop(random.randint(0, 99999), None), number=100000)
print(f"Delete: {delete_time:.4f}s for 100K operations")

# Iteration: O(n)
iteration_time = timeit.timeit(lambda: list(d.keys()), number=100)
print(f"Iteration: {iteration_time:.4f}s for full iteration")

Output:

Architecture Diagram
Lookup: 0.0234s for 1M operations
Insert: 0.0456s for 100K operations
Delete: 0.0345s for 100K operations
Iteration: 0.1234s for full iteration

Ordered Dict (Python 3.7+)

# Python 3.7+ dicts maintain insertion order
from collections import OrderedDict

# Regular dict (ordered in Python 3.7+)
d = {"b": 2, "a": 1, "c": 3}
print(f"Dict order: {list(d.keys())}")

# OrderedDict (extra features)
od = OrderedDict()
od["b"] = 2
od["a"] = 1
od["c"] = 3

# Move to end
od.move_to_end("b")
print(f"After move_to_end: {list(od.keys())}")

# Move to beginning
od.move_to_end("c", last=False)
print(f"After move_to beginning: {list(od.keys())}")

# Popitem (LIFO)
last = od.popitem()
print(f"Pop last: {last}")

# Popitem (FIFO)
first = od.popitem(last=False)
print(f"Pop first: {first}")

Output:

Architecture Diagram
Dict order: ['b', 'a', 'c']
After move_to_end: ['a', 'c', 'b']
After move_to beginning: ['c', 'a', 'b']
Pop last: ('b', 2)
Pop first: ('c', 3)

💡

Interview Tip: Dict ordering is guaranteed in Python 3.7+. Don't use OrderedDict unless you need its extra methods.


set (Set)

Hash Table Implementation

# Set uses hash table (like dict, but only keys)
my_set = {1, 2, 3, 4, 5}

# Set operations
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

print(f"Union: {a | b}")
print(f"Intersection: {a & b}")
print(f"Difference: {a - b}")
print(f"Symmetric difference: {a ^ b}")

# Set membership test: O(1)
import timeit

# List membership: O(n)
my_list = list(range(100000))
list_time = timeit.timeit(lambda: 99999 in my_list, number=1000)

# Set membership: O(1)
my_set = set(range(100000))
set_time = timeit.timeit(lambda: 99999 in my_set, number=1000)

print(f"\nList lookup: {list_time:.4f}s")
print(f"Set lookup: {set_time:.4f}s")
print(f"Set is {list_time/set_time:.1f}x faster")

Output:

Architecture Diagram
Union: {1, 2, 3, 4, 5, 6}
Intersection: {3, 4}
Difference: {1, 2}
Symmetric difference: {1, 2, 5, 6}

List lookup: 0.0456s
Set lookup: 0.0001s
Set is 456.0x faster

Set Performance

# Set operations performance
import timeit

s = set(range(100000))

# Add: O(1) amortized
add_time = timeit.timeit(lambda: s.add(200000), number=100000)
print(f"Add: {add_time:.4f}s")

# Remove: O(1)
remove_time = timeit.timeit(lambda: s.discard(50000), number=100000)
print(f"Remove: {remove_time:.4f}s")

# Membership: O(1)
contains_time = timeit.timeit(lambda: 50000 in s, number=1000000)
print(f"Contains: {contains_time:.4f}s")

# Union: O(n)
union_time = timeit.timeit(lambda: s | {1, 2, 3}, number=100)
print(f"Union: {union_time:.4f}s")

Frozen Set

# Frozen set (immutable, hashable)
fs = frozenset([1, 2, 3, 4, 5])

# Can be used as dict key
d = {fs: "value"}
print(f"Dict with frozenset key: {d}")

# Can be used in set
set_of_sets = {frozenset([1, 2]), frozenset([3, 4])}
print(f"Set of frozensets: {set_of_sets}")

# Operations return frozenset
a = frozenset([1, 2, 3])
b = frozenset([3, 4, 5])
print(f"Union: {a | b}")
print(f"Type: {type(a | b)}")

Output:

Architecture Diagram
Dict with frozenset key: {frozenset({1, 2, 3, 4, 5}): 'value'}
Set of frozensets: {frozenset({3, 4}), frozenset({1, 2})}
Union: frozenset({1, 2, 3, 4, 5})
Type: <class 'frozenset'>

list (List)

Dynamic Array Implementation

import sys

# List uses dynamic array
my_list = []

# See how list grows
sizes = []
for i in range(20):
    my_list.append(i)
    sizes.append(sys.getsizeof(my_list))
    print(f"List with {i+1} items: {sizes[-1]} bytes")

# Growth pattern:
# - Starts small
# - Over-allocates when growing
# - Amortized O(1) append

Output:

Architecture Diagram
List with 1 items: 56 bytes
List with 2 items: 56 bytes
List with 3 items: 56 bytes
List with 4 items: 56 bytes
List with 5 items: 56 bytes
List with 6 items: 56 bytes
List with 7 items: 56 bytes
List with 8 items: 56 bytes
List with 9 items: 120 bytes
List with 10 items: 120 bytes
List with 11 items: 120 bytes
...

List Operations Performance

import timeit
import random

# List operations
lst = list(range(100000))

# Append: O(1) amortized
append_time = timeit.timeit(lambda: lst.append(1), number=100000)
print(f"Append: {append_time:.4f}s")

# Insert at beginning: O(n)
insert_begin_time = timeit.timeit(lambda: lst.insert(0, 0), number=1000)
print(f"Insert at beginning: {insert_begin_time:.4f}s")

# Insert at middle: O(n)
insert_mid_time = timeit.timeit(lambda: lst.insert(len(lst)//2, 0), number=1000)
print(f"Insert at middle: {insert_mid_time:.4f}s")

# Pop from end: O(1)
pop_end_time = timeit.timeit(lambda: lst.pop(), number=100000)
print(f"Pop from end: {pop_end_time:.4f}s")

# Pop from beginning: O(n)
pop_begin_time = timeit.timeit(lambda: lst.pop(0), number=1000)
print(f"Pop from beginning: {pop_begin_time:.4f}s")

# Index access: O(1)
index_time = timeit.timeit(lambda: lst[50000], number=1000000)
print(f"Index access: {index_time:.4f}s")

# Search: O(n)
search_time = timeit.timeit(lambda: 99999 in lst, number=100)
print(f"Search: {search_time:.4f}s")

Output:

Architecture Diagram
Append: 0.0234s
Insert at beginning: 0.4567s
Insert at middle: 0.2345s
Pop from end: 0.0123s
Pop from beginning: 0.3456s
Index access: 0.0234s
Search: 0.5678s

⚠️

Performance Tip: Avoid list.insert(0, x) and list.pop(0) - they're O(n). Use collections.deque for queue operations.


deque (Double-Ended Queue)

Implementation

from collections import deque
import timeit

# Deque uses doubly-linked list
dq = deque([1, 2, 3, 4, 5])

# O(1) operations on both ends
dq.append(6)      # Add to right
dq.appendleft(0)  # Add to left
dq.pop()          # Remove from right
dq.popleft()      # Remove from left

print(f"Deque: {dq}")

# Deque vs List performance
lst = list(range(100000))
dq = deque(range(100000))

# Append
lst_time = timeit.timeit(lambda: lst.append(1), number=100000)
dq_time = timeit.timeit(lambda: dq.append(1), number=100000)
print(f"\nAppend - List: {lst_time:.4f}s, Deque: {dq_time:.4f}s")

# Append to beginning
lst_time = timeit.timeit(lambda: lst.insert(0, 0), number=1000)
dq_time = timeit.timeit(lambda: dq.appendleft(0), number=1000)
print(f"Appendleft - List: {lst_time:.4f}s, Deque: {dq_time:.4f}s")

# Pop from beginning
lst_time = timeit.timeit(lambda: lst.pop(0), number=1000)
dq_time = timeit.timeit(lambda: dq.popleft(), number=1000)
print(f"Popleft - List: {lst_time:.4f}s, Deque: {dq_time:.4f}s")

Output:

Architecture Diagram
Deque: deque([0, 1, 2, 3, 4, 5])

Append - List: 0.0234s, Deque: 0.0245s
Appendleft - List: 0.4567s, Deque: 0.0012s
Popleft - List: 0.3456s, Deque: 0.0008s

Advanced Deque Features

from collections import deque
import time

# Bounded deque (max length)
bounded = deque(maxlen=3)
bounded.append(1)
bounded.append(2)
bounded.append(3)
bounded.append(4)  # Automatically removes oldest

print(f"Bounded deque: {bounded}")

# Rotate
dq = deque([1, 2, 3, 4, 5])
dq.rotate(2)
print(f"Rotate right 2: {dq}")

dq.rotate(-2)
print(f"Rotate left 2: {dq}")

# Extend
dq.extend([6, 7, 8])
print(f"Extend: {dq}")

dq.extendleft([0, -1])
print(f"Extendleft: {dq}")

# Sliding window
def sliding_window(iterable, n):
    """Sliding window using deque."""
    it = iter(iterable)
    window = deque(maxlen=n)
    
    for item in window:
        window.append(item)
    
    yield tuple(window)
    
    for item in it:
        window.append(item)
        yield tuple(window)

# Usage
for window in sliding_window(range(10), 3):
    print(window, end=" ")
print()

Output:

Architecture Diagram
Bounded deque: deque([2, 3, 4], maxlen=3)
Rotate right 2: deque([4, 5, 1, 2, 3])
Rotate left 2: deque([1, 2, 3, 4, 5])
Extend: deque([1, 2, 3, 4, 5, 6, 7, 8])
Extendleft: deque([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8])
(0, 1, 2) (1, 2, 3) (2, 3, 4) (3, 4, 5) (4, 5, 6) (5, 6, 7) (6, 7, 8) (7, 8, 9) 

💡

Interview Tip: Use deque for queue/stack operations, especially when you need O(1) insert/delete at both ends.


heap (Priority Queue)

heapq Module

import heapq

# Min heap
heap = []
heapq.heappush(heap, 5)
heapq.heappush(heap, 1)
heapq.heappush(heap, 3)
heapq.heappush(heap, 2)
heapq.heappush(heap, 4)

print(f"Heap: {heap}")
print(f"Smallest: {heapq.heappop(heap)}")
print(f"Smallest: {heapq.heappop(heap)}")
print(f"Heap after pops: {heap}")

# Heapify existing list
nums = [5, 1, 3, 2, 4]
heapq.heapify(nums)
print(f"\nHeapified: {nums}")

# Get n largest/smallest
nums = [5, 1, 3, 2, 4, 8, 6, 7]
print(f"3 largest: {heapq.nlargest(3, nums)}")
print(f"3 smallest: {heapq.nsmallest(3, nums)}")

# Merge sorted sequences
list1 = [1, 3, 5, 7]
list2 = [2, 4, 6, 8]
merged = list(heapq.merge(list1, list2))
print(f"Merged: {merged}")

Output:

Architecture Diagram
Heap: [1, 2, 3, 5, 4]
Smallest: 1
Smallest: 2
Heap after pops: [3, 4, 5]

Heapified: [1, 2, 3, 5, 4]
3 largest: [8, 7, 6]
3 smallest: [1, 2, 3]
Merged: [1, 2, 3, 4, 5, 6, 7, 8]

Max Heap

import heapq

# heapq only provides min heap
# For max heap, negate values

class MaxHeap:
    """Max heap using heapq."""
    
    def __init__(self):
        self.heap = []
    
    def push(self, val):
        heapq.heappush(self.heap, -val)
    
    def pop(self):
        return -heapq.heappop(self.heap)
    
    def peek(self):
        return -self.heap[0]
    
    def __len__(self):
        return len(self.heap)

# Usage
mh = MaxHeap()
mh.push(5)
mh.push(1)
mh.push(3)
mh.push(2)

print(f"Max heap peek: {mh.peek()}")
print(f"Max heap pop: {mh.pop()}")
print(f"Max heap pop: {mh.pop()}")

Output:

Architecture Diagram
Max heap peek: 5
Max heap pop: 5
Max heap pop: 3

Priority Queue

import heapq
from dataclasses import dataclass

@dataclass(order=True)
class Task:
    priority: int
    name: str

class PriorityQueue:
    """Priority queue using heap."""
    
    def __init__(self):
        self.heap = []
        self.entry_count = 0
    
    def push(self, priority: int, name: str):
        """Add task with priority."""
        entry = (priority, self.entry_count, name)
        heapq.heappush(self.heap, entry)
        self.entry_count += 1
    
    def pop(self) -> tuple:
        """Remove highest priority task."""
        if self.heap:
            priority, _, name = heapq.heappop(self.heap)
            return (priority, name)
        return None

# Usage
pq = PriorityQueue()
pq.push(3, "Low priority")
pq.push(1, "High priority")
pq.push(2, "Medium priority")

while len(pq) > 0:
    priority, name = pq.pop()
    print(f"Priority {priority}: {name}")

Output:

Architecture Diagram
Priority 1: High priority
Priority 2: Medium priority
Priority 3: Low priority

Heap Performance

import heapq
import timeit
import random

# Heap operations performance
heap = list(range(100000))
heapq.heapify(heap)

# Push: O(log n)
push_time = timeit.timeit(lambda: heapq.heappush(heap, random.randint(0, 200000)), number=10000)
print(f"Push: {push_time:.4f}s")

# Pop: O(log n)
pop_time = timeit.timeit(lambda: heapq.heappop(heap), number=10000)
print(f"Pop: {pop_time:.4f}s")

# Peek: O(1)
peek_time = timeit.timeit(lambda: heap[0], number=1000000)
print(f"Peek: {peek_time:.4f}s")

# Heapify: O(n)
list_to_heapify = list(range(100000))
heapify_time = timeit.timeit(lambda: heapq.heapify(list_to_heapify), number=100)
print(f"Heapify: {heapify_time:.4f}s")

Comparison

Time Complexity

Operationlistdequedictsetheap
Access by indexO(1)O(n)N/AN/AN/A
Access by keyN/AN/AO(1)N/AN/A
SearchO(n)O(n)O(1)O(1)O(n)
Insert at endO(1)*O(1)O(1)O(1)O(log n)
Insert at beginO(n)O(1)O(1)O(1)O(log n)
DeleteO(n)O(n)O(1)O(1)O(log n)

*amortized

When to Use What?

# list: Ordered sequence, random access
# - When you need index access
# - When order matters
# - When you need slicing

# deque: Queue/stack, both ends
# - When you need O(1) insert/delete at both ends
# - When implementing queue
# - When using as sliding window

# dict: Key-value mapping
# - When you need fast lookup by key
# - When implementing hash table
# - When counting occurrences

# set: Unique elements, set operations
# - When you need uniqueness
# - When you need fast membership testing
# - When implementing mathematical sets

# heap: Priority queue
# - When you need efficient min/max
# - When implementing priority queue
# - When you need k largest/smallest

ℹ️

Decision Guide: Choose based on your primary operation: access (list/dict), queue (deque), or priority (heap).


Advanced Patterns

Custom Data Structures

from collections import OrderedDict
import time

class LRUCache:
    """LRU Cache using OrderedDict."""
    
    def __init__(self, capacity: int):
        self.cache = OrderedDict()
        self.capacity = capacity
    
    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        return -1
    
    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

# Usage
cache = LRUCache(2)
cache.put(1, "a")
cache.put(2, "b")
print(cache.get(1))  # Returns "a"
cache.put(3, "c")    # Evicts key 2
print(cache.get(2))  # Returns -1

# Bloom filter (simplified)
class BloomFilter:
    """Simple Bloom filter."""
    
    def __init__(self, size: int, hash_count: int):
        self.size = size
        self.hash_count = hash_count
        self.bit_array = [0] * size
    
    def _hashes(self, item):
        return [hash(f"{item}{i}") % self.size for i in range(self.hash_count)]
    
    def add(self, item):
        for pos in self._hashes(item):
            self.bit_array[pos] = 1
    
    def __contains__(self, item):
        return all(self.bit_array[pos] for pos in self._hashes(item))

# Usage
bf = BloomFilter(1000, 3)
bf.add("hello")
bf.add("world")

print(f"'hello' in filter: {'hello' in bf}")
print(f"'world' in filter: {'world' in bf}")
print(f"'test' in filter: {'test' in bf}")  # False positive possible

Memory Optimization

import sys
from array import array

# List vs Array
list_data = [i for i in range(1000000)]
array_data = array('i', [i for i in range(1000000)])

print(f"List size: {sys.getsizeof(list_data):,} bytes")
print(f"Array size: {sys.getsizeof(array_data):,} bytes")
print(f"Array is {(1 - sys.getsizeof(array_data)/sys.getsizeof(list_data))*100:.1f}% smaller")

# Memory view
import ctypes

# Create memory view for zero-copy slicing
data = bytearray(b"Hello, World!")
view = memoryview(data)
print(f"Original: {data}")
print(f"View: {view[:5]}")
print(f"View type: {type(view)}")

Output:

Architecture Diagram
List size: 8,448,728 bytes
Array size: 4,000,048 bytes
Array is 52.7% smaller
Original: bytearray(b'Hello, World!')
View: <memory at 0x...>
View type: <class 'memoryview'>

Interview Tips

Common Follow-up Questions

  1. "Why is dict lookup O(1)?"

    • Hash function maps keys to array indices
    • Direct array access is O(1)
    • Collisions handled efficiently
  2. "When would you use deque over list?"

    • Queue implementations
    • Sliding windows
    • Operations at both ends
  3. "How does heapq guarantee O(log n)?"

    • Heap property maintained
    • Array-based implementation
    • Parent/child index relationships

Code Review Tips

# BAD: Using list for membership testing
if item in large_list:  # O(n)
    process(item)

# GOOD: Using set for membership testing
large_set = set(large_list)
if item in large_set:  # O(1)
    process(item)

# BAD: Using list as queue
queue = []
queue.append(item)  # O(1)
queue.pop(0)  # O(n)!

# GOOD: Using deque as queue
from collections import deque
queue = deque()
queue.append(item)  # O(1)
queue.popleft()  # O(1)

# BAD: Using list for priority queue
import heapq
pq = []
heapq.heappush(pq, (priority, item))  # O(log n)
# But no efficient way to update priority

# GOOD: Using heap for priority queue
import heapq
pq = []
heapq.heappush(pq, (priority, item))

⚠️

Common Mistake: Using list.pop(0) in tight loops causes O(n²) performance.


Summary

Data StructureImplementationUse CaseKey Operation
listDynamic arrayRandom accessO(1) index
dequeDoubly-linked listQueue/stackO(1) append/pop
dictHash tableKey-valueO(1) lookup
setHash tableUnique elementsO(1) membership
heapBinary heapPriority queueO(log n) push/pop

Best Practices

  1. Use dict for key-value lookups
  2. Use set for membership testing
  3. Use deque for queue operations
  4. Use heap for priority queues
  5. Avoid list.pop(0) in performance-critical code
  6. Consider array.array for memory efficiency

ℹ️

Key Takeaway: Understanding data structure internals helps you choose the right one and write efficient code.


Practice Problems

  1. LRU Cache: Implement LRU cache using OrderedDict
  2. Bloom Filter: Create a simple Bloom filter
  3. Priority Queue: Build priority queue with heap
  4. Sliding Window: Implement sliding window using deque
  5. Custom Hash Map: Implement your own hash map

Further Reading

  • Python Docs: collections, heapq, array
  • CPython Source: Objects/dictobject.c, Objects/listobject.c
  • Books: "Fluent Python" by Luciano Ramalho
  • Advanced: Hash table implementations, bloom filters

Remember: Choosing the right data structure is crucial for algorithm efficiency.

Advertisement