Bigtable Deep Dive

Master Bigtable schema design, garbage collection, replication, and time-series patterns.

18 min read · Advanced

Schema Design Patterns

⚡ Bigtable Architecture for Data Engineering

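Interview Tip: Bigtable is well suited to time-series and IoT workloads with high write throughput. Design composite row keys carefully to avoid hotspots: lead with a high-cardinality field such as a device ID rather than a timestamp, and store domain names in reverse notation (com.example.www) so related rows stay contiguous. Column families group related columns, and each family has its own GC policy. Bigtable is not Firestore: choose Bigtable for massive scale and key-based access, Firestore for document queries.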

Implementation

from google.cloud import bigtable
from google.cloud.bigtable import column_family
import datetime

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")

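# Create a table with three column families, each with its own GC rule.
# The client library names these classes *GCRule (not *GCPolicy).
table = instance.table("iot_data")
table.create(column_families={
    "readings": column_family.MaxVersionsGCRule(1),  # keep only the latest version
    "metadata": column_family.MaxAgeGCRule(datetime.timedelta(days=365)),  # expire after one year
    "alerts": column_family.MaxVersionsGCRule(5),  # keep the five latest versions
})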

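# Write time-series data. Subtracting the millisecond timestamp from a fixed
# 13-digit maximum makes newer rows sort first within each sensor's key range.
timestamp = datetime.datetime.now(datetime.timezone.utc)
max_ts = 9999999999999
reversed_ts = max_ts - int(timestamp.timestamp() * 1000)
row_key = f"sensor_123#{reversed_ts:013d}"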

row = table.direct_row(row_key)
row.set_cell("readings", "temperature", "23.5")
row.set_cell("readings", "humidity", "65.2")
row.set_cell("metadata", "location", "warehouse_a")
row.commit()

# Read the last hour for one sensor. Because timestamps are reversed, the
# newest time gives the start key and the oldest time gives the end key.
end_time = datetime.datetime.now(datetime.timezone.utc)
start_time = end_time - datetime.timedelta(hours=1)
start_reversed = max_ts - int(end_time.timestamp() * 1000)
end_reversed = max_ts - int(start_time.timestamp() * 1000)

# read_rows accepts the key range directly; end_key is exclusive by default.
rows = table.read_rows(
    start_key=f"sensor_123#{start_reversed:013d}".encode(),
    end_key=f"sensor_123#{end_reversed:013d}".encode(),
)
for row in rows:
    print(f"Row: {row.row_key.decode()}")
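
Why the reversed timestamp works: Bigtable stores rows in lexicographic order of their keys, so a zero-padded, reversed timestamp makes the newest reading sort first. The sketch below checks that property without touching Bigtable at all. It is illustrative only: `reversed_key` and `MAX_TS` are hypothetical names that mirror the key format used in the implementation above.

```python
import datetime

MAX_TS = 9_999_999_999_999  # 13-digit ceiling for millisecond timestamps


def reversed_key(sensor_id: str, ts: datetime.datetime) -> str:
    """Build '<sensor>#<reversed millis>' so newer readings sort first."""
    millis = int(ts.timestamp() * 1000)
    return f"{sensor_id}#{MAX_TS - millis:013d}"


# Three readings taken 0, 5, and 10 minutes after a fixed base time.
base = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)
readings = [base + datetime.timedelta(minutes=m) for m in (0, 5, 10)]

# Plain string sorting stands in for Bigtable's lexicographic row order.
keys = sorted(reversed_key("sensor_123", ts) for ts in readings)
newest_key = reversed_key("sensor_123", readings[-1])

print(keys[0] == newest_key)  # the newest reading comes first
```

The zero padding (`:013d`) matters: without a fixed width, string order and numeric order would diverge as soon as two keys had different digit counts.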

Best Practice: Design row keys so that reads and writes spread evenly across nodes. Use reversed timestamps when queries mostly want the most recent data first. Keep the number of column families small (well under 100 per table). Set GC policies to match your data retention requirements. Replicate across clusters in different zones or regions for disaster recovery.
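
To reason about GC policies before applying them, it can help to model which cell versions a rule would make eligible for collection. The following is a toy model in plain Python, not the Bigtable client API: `surviving_versions` is a hypothetical function, and it assumes the usual semantics that a union rule collects a cell matching either condition while an intersection rule collects only cells matching both.

```python
import datetime
from typing import List


def surviving_versions(timestamps: List[datetime.datetime],
                       now: datetime.datetime,
                       max_versions: int,
                       max_age: datetime.timedelta,
                       mode: str = "union") -> List[datetime.datetime]:
    """Return the cell versions that are NOT eligible for collection.

    mode="union": collect a cell if it breaks either rule.
    mode="intersection": collect a cell only if it breaks both rules.
    """
    kept = []
    for rank, ts in enumerate(sorted(timestamps, reverse=True)):
        too_many = rank >= max_versions   # beyond the N newest versions
        too_old = (now - ts) > max_age    # older than the age limit
        if mode == "union":
            collect = too_many or too_old
        else:
            collect = too_many and too_old
        if not collect:
            kept.append(ts)
    return kept


now = datetime.datetime(2024, 6, 1)
cells = [now - datetime.timedelta(days=d) for d in (1, 10, 20, 400)]
limit = datetime.timedelta(days=30)

print(len(surviving_versions(cells, now, 2, limit, "union")))
print(len(surviving_versions(cells, now, 2, limit, "intersection")))
```

In this model the union rule is the more aggressive of the two, which is usually what you want for cost control. Keep in mind that real garbage collection runs in the background, so eligible cells can remain readable for a while unless reads filter them out.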

Common Interview Questions

Q1: How do you prevent hotspots in Bigtable?

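Answer: 1) Use hash prefixes on row keys, 2) Avoid monotonically increasing keys, such as a raw timestamp or sequential ID at the start of the key, 3) Spread writes across the key space by leading with a high-cardinality field (for example, a device ID), 4) Use salted row keys for high-throughput entities.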
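
The hash-prefix idea can be sketched as follows. This is a minimal illustration, not a Bigtable API: `salted_row_key` and `NUM_BUCKETS` are hypothetical names, and the bucket count of 16 is an arbitrary assumption. The prefix is derived deterministically from the sensor ID, so a reader can recompute it and still issue a single range scan per sensor.

```python
import hashlib

NUM_BUCKETS = 16  # assumption: tune to cluster size and write rate


def salted_row_key(sensor_id: str, reversed_ts: int) -> str:
    """Prefix the key with a stable bucket derived from the sensor ID."""
    digest = hashlib.sha256(sensor_id.encode()).digest()
    bucket = digest[0] % NUM_BUCKETS
    return f"{bucket:02d}#{sensor_id}#{reversed_ts:013d}"


key = salted_row_key("sensor_123", 8295932799999)
print(key)
```

The trade-off is on the read side: a scan across all sensors now has to fan out over every bucket prefix. If a single entity is itself the hotspot, the salt would need to come from something other than the entity ID, and reads for that entity would then fan out too.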

Q2: What is the difference between SSD and HDD in Bigtable?

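Answer: SSD storage provides low, predictable latency (typically under 10 ms) and suits hot, latency-sensitive data. HDD storage costs less per GB but has higher latency, especially for random reads, so it fits large datasets that are rarely accessed or mostly scanned in batch. The storage type is chosen when the instance is created and cannot be switched in place, so changing it means moving the data to a new instance.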

Q3: How does Bigtable replication work?

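Answer: When an instance has more than one cluster, Bigtable automatically replicates data between the clusters for high availability. Replication is eventually consistent: changes usually reach the other clusters within seconds, although the lag can grow under heavy load. Use an app profile with multi-cluster routing for automatic failover, or single-cluster routing when you need read-your-writes consistency.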

Q4: When should you use Bigtable vs. Firestore?

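Answer: Choose Bigtable for high-throughput time-series or IoT data (up to millions of operations per second) that is accessed by row key or key range. Choose Firestore for document-shaped data that needs richer queries and real-time updates to clients. As a rule of thumb, Bigtable serves large-scale analytical and operational backends, while Firestore serves mobile and web applications.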

Q5: How do you optimize Bigtable costs?

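Answer: 1) Use SSD for hot data and HDD for cold data, 2) Configure GC policies aggressively so expired data is not stored longer than needed, 3) Right-size the node count based on throughput, or enable autoscaling, 4) Add replica clusters only where you need the availability, since each cluster adds node and storage cost, 5) Monitor and tune row key design.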
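
Right-sizing can be approximated with simple arithmetic: the cluster needs enough nodes to cover both the throughput target and the stored data. The sketch below is a back-of-the-envelope estimate under stated assumptions. `estimate_nodes` is a hypothetical helper, and the per-node figures passed in are placeholders, not published limits, so substitute measured or documented values for your storage type.

```python
import math


def estimate_nodes(target_qps: float, qps_per_node: float,
                   storage_tb: float, storage_tb_per_node: float) -> int:
    """Return the node count that satisfies both throughput and storage."""
    if qps_per_node <= 0 or storage_tb_per_node <= 0:
        raise ValueError("per-node capacities must be positive")
    by_throughput = math.ceil(target_qps / qps_per_node)
    by_storage = math.ceil(storage_tb / storage_tb_per_node)
    return max(1, by_throughput, by_storage)


# Placeholder numbers only: 45k QPS target, 12 TB stored.
nodes = estimate_nodes(target_qps=45_000, qps_per_node=10_000,
                       storage_tb=12, storage_tb_per_node=5)
print(nodes)
```

Whichever constraint yields the larger count drives the cost, which is why trimming stored data with GC policies and fixing hot row keys can both reduce the node count.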
