
Bigtable Deep Dive: Schema Design, GC Policies & Replication



Bigtable Deep Dive

Master Bigtable schema design, garbage collection, replication, and time-series patterns.

18 min read · Advanced

Schema Design Patterns

Bigtable Architecture for Data Engineering

Figure: Cloud Bigtable: Cluster, Storage & Data Model

Cluster architecture
    Master (cluster management): handles admin operations
    Node 1: Table A, tablets 1-3 (CPU, memory, SSD)
    Node 2: Table A, tablets 4-6 (CPU, memory, SSD)
    Node 3: Table B, tablets 1-2 (CPU, memory, SSD)

Tablets (data partitioning)
    Tablet 1: row_a → row_d
    Tablet 2: row_d → row_g
    Tablet 3: row_g → row_m
    Tablet 4: row_m → row_s
    Tablet 5: row_s → row_z

Data model
    Row key: device#1234#2024-01 (composite key; reverse for hotspot avoidance)
    Column family "metadata": device_name, location | GC: 7 days
    Column family "readings": temp, humidity, pressure | GC: 30 days

Storage types
    SSD: low latency, high IOPS; real-time analytics
    HDD: high throughput, lower cost; batch analytics, logs
Interview Tip: Bigtable is ideal for time-series and IoT data with high write throughput. Design composite row keys carefully to avoid hotspots and keep related rows adjacent (for example, reverse domain notation such as com.example.maps). Column families group related data, and each has its own GC policy. Bigtable ≠ Firestore: choose Bigtable for massive scale, Firestore for document queries.
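
The reverse domain notation mentioned in the tip can be sketched in a few lines of plain Python. This is only an illustration: the `page_row_key` layout (reversed domain, then `#`, then a path) and the hostnames are hypothetical and not taken from any particular schema.

```python
def reverse_domain(hostname: str) -> str:
    """Reverse the dot-separated labels of a hostname."""
    return ".".join(reversed(hostname.lower().split(".")))


def page_row_key(hostname: str, path: str) -> str:
    """Hypothetical key layout: reversed domain, then '#', then the path."""
    return f"{reverse_domain(hostname)}#{path}"


# Bigtable stores rows in lexicographic key order, so sorting the keys
# shows which rows would end up next to each other.
keys = sorted(
    page_row_key(host, "/index")
    for host in ["maps.example.com", "mail.example.com", "www.other.org"]
)
for key in keys:
    print(key)
```

With reversed labels, every subdomain of example.com shares the prefix `com.example.`, so a single prefix scan can cover all of them.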

Implementation

from google.cloud import bigtable
from google.cloud.bigtable import column_family
import datetime

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")

# Create table with multiple column families
table = instance.table("iot_data")
table.create(column_families={
    "readings": column_family.MaxVersionsGCPolicy(1),
    "metadata": column_family.MaxAgeGCPolicy(datetime.timedelta(days=365)),
    "alerts": column_family.MaxVersionsGCPolicy(5)
})

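# Write time-series data. Subtracting the timestamp from a fixed maximum
# makes newer readings sort first within each sensor's key range.
timestamp = datetime.datetime.now()
max_ts = 9999999999999  # 13-digit ceiling for millisecond timestamps
reversed_ts = max_ts - int(timestamp.timestamp() * 1000)
row_key = f"sensor_123#{reversed_ts:013d}"  # zero-padded so keys sort lexicographically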

row = table.direct_row(row_key)
row.set_cell("readings", "temperature", "23.5")
row.set_cell("readings", "humidity", "65.2")
row.set_cell("metadata", "location", "warehouse_a")
row.commit()

# Read the last hour. Keys are reversed, so the later time gives the
# smaller key and becomes the start of the scan.
start_time = datetime.datetime.now() - datetime.timedelta(hours=1)
end_time = datetime.datetime.now()
start_reversed = max_ts - int(end_time.timestamp() * 1000)
end_reversed = max_ts - int(start_time.timestamp() * 1000)

# read_rows takes the key bounds directly (it has no row_range argument);
# end_key is exclusive by default.
rows = table.read_rows(
    start_key=f"sensor_123#{start_reversed:013d}".encode(),
    end_key=f"sensor_123#{end_reversed:013d}".encode(),
)
for row in rows:
    print(f"Row: {row.row_key.decode()}")
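
The reversed-timestamp arithmetic in the example above is easy to get backwards, so it can help to isolate it in small helpers that run without a Bigtable connection. The following is a minimal sketch: `reversed_ts_key` and `scan_bounds` are hypothetical helper names, and the 13-digit ceiling simply mirrors `max_ts` from the example.

```python
import datetime

MAX_TS_MS = 9_999_999_999_999  # same 13-digit ceiling as max_ts in the example


def reversed_ts_key(sensor_id: str, when: datetime.datetime) -> bytes:
    """Build '<sensor_id>#<reversed millis>' so newer readings sort first."""
    millis = int(when.timestamp() * 1000)
    return f"{sensor_id}#{MAX_TS_MS - millis:013d}".encode()


def scan_bounds(sensor_id: str, start: datetime.datetime, end: datetime.datetime):
    """Return (start_key, end_key) for readings after `start`, up to `end`.

    The later time yields the smaller key, so it becomes the scan start.
    With an exclusive end_key, a reading exactly at `start` is left out.
    """
    return reversed_ts_key(sensor_id, end), reversed_ts_key(sensor_id, start)


utc = datetime.timezone.utc
older = datetime.datetime(2024, 1, 1, 12, 0, tzinfo=utc)
newer = datetime.datetime(2024, 1, 1, 13, 0, tzinfo=utc)

older_key = reversed_ts_key("sensor_123", older)
newer_key = reversed_ts_key("sensor_123", newer)
start_key, end_key = scan_bounds("sensor_123", older, newer)
print(newer_key < older_key)  # newer readings sort ahead of older ones
```

The two keys returned by `scan_bounds` can be passed as `start_key` and `end_key` to `table.read_rows`.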


Best Practice: Design row keys so reads and writes spread evenly across nodes. Use reversed timestamps when queries mostly read the most recent data. Keep the number of column families small (well under 100 per table). Set GC policies to match data retention requirements. Use replication for high availability and disaster recovery.
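
To make the effect of GC policies concrete, here is a small pure-Python simulation of which cells in a single column would remain after a max-versions limit, a max-age limit, or both. It is a sketch of the semantics only, not the client library: `apply_gc` is a hypothetical function, and combining both limits here assumes union behaviour (a cell is dropped if it violates either limit).

```python
import datetime


def apply_gc(cells, now, max_versions=None, max_age=None):
    """Return the (timestamp, value) cells that survive, newest first.

    A cell is dropped if it falls outside the newest `max_versions` cells
    OR is older than `max_age` (union semantics, assumed for this sketch).
    """
    survivors = sorted(cells, key=lambda cell: cell[0], reverse=True)
    if max_versions is not None:
        survivors = survivors[:max_versions]
    if max_age is not None:
        survivors = [cell for cell in survivors if now - cell[0] <= max_age]
    return survivors


now = datetime.datetime(2024, 1, 31)
# Three versions of one column, written 0, 10 and 40 days ago.
cells = [(now - datetime.timedelta(days=d), f"v{d}") for d in (0, 10, 40)]

latest_only = apply_gc(cells, now, max_versions=1)
last_30_days = apply_gc(cells, now, max_age=datetime.timedelta(days=30))
both = apply_gc(cells, now, max_versions=5, max_age=datetime.timedelta(days=30))
print([value for _, value in last_30_days])
```

In the real service, garbage collection runs in the background, so expired cells can still show up in reads for a while; a read filter is the usual way to hide them if that matters.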


Common Interview Questions

Q1: How do you prevent hotspots in Bigtable?

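Answer: 1) Use hash prefixes on row keys, 2) Avoid monotonically increasing keys such as a leading timestamp or sequential ID, 3) Lead the key with a high-cardinality field (for example, a device ID) so concurrent writes land on different tablets, 4) Use salted row keys for high-throughput entities.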
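
A hash prefix of the kind listed in the answer can be sketched as follows. This is an illustrative layout under stated assumptions: the bucket count of 8, the two-digit prefix format, and the function names are all hypothetical, and the prefix is derived from the entity ID so that every row for one entity stays in the same bucket.

```python
import hashlib

NUM_BUCKETS = 8  # hypothetical bucket count; tune to the cluster size


def bucket_for(entity_id: str, buckets: int = NUM_BUCKETS) -> int:
    """Map an entity ID to a stable bucket number using SHA-256."""
    digest = hashlib.sha256(entity_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % buckets


def prefixed_row_key(entity_id: str, suffix: str, buckets: int = NUM_BUCKETS) -> str:
    """Build '<bucket>#<entity_id>#<suffix>' with a zero-padded bucket prefix."""
    return f"{bucket_for(entity_id, buckets):02d}#{entity_id}#{suffix}"


def all_bucket_prefixes(buckets: int = NUM_BUCKETS) -> list:
    """Prefixes a reader must scan to cover every entity."""
    return [f"{b:02d}#" for b in range(buckets)]


used = {prefixed_row_key(f"sensor_{i}", "t")[:3] for i in range(1000)}
print(len(used), "of", NUM_BUCKETS, "prefixes in use")
```

The trade-off is on the read side: looking up one entity still touches a single prefix, but a scan across all entities needs one range per prefix from `all_bucket_prefixes`.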

Q2: What is the difference between SSD and HDD in Bigtable?

Answer: SSD provides low-latency access (<10ms) for hot data. HDD provides lower cost for cold data with higher latency (>10ms). Use SSD for frequently accessed data, HDD for archival.

Q3: How does Bigtable replication work?

Answer: When an instance has more than one cluster, Bigtable automatically replicates data between the clusters for high availability. Replication is eventually consistent (typically <10 seconds). Use an app profile with multi-cluster routing for automatic failover.

Q4: When should you use Bigtable vs. Firestore?

Answer: Bigtable for high-throughput time-series/IoT data (millions of ops/sec). Firestore for document-based data with complex queries and real-time updates. Bigtable for analytics; Firestore for applications.

Q5: How do you optimize Bigtable costs?

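Answer: 1) Use SSD for hot data, HDD for cold, 2) Configure aggressive GC policies so expired data does not accumulate in storage, 3) Right-size nodes based on throughput, 4) Add replicated clusters only where you need the availability, since every cluster adds its own node and storage cost, 5) Monitor row key design and adjust it so load stays balanced.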
