Firestore vs. Datastore
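Cloud Firestore is the successor to Cloud Datastore. It runs in one of two modes: Native mode, which adds real-time synchronization and offline support for mobile and web clients, and Datastore mode, which keeps the Datastore API and server-side behavior. The original Cloud Datastore is the legacy NoSQL document database that Firestore replaces.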
Comparison
Data Modeling for Data Engineering
Document Structure
from google.cloud import firestore

# Initialize Firestore client
db = firestore.Client()

# User profile document
user_ref = db.collection("users").document("user_123")
user_ref.set({
    "user_id": "user_123",
    "email": "user@example.com",
    "profile": {
        "name": "John Doe",
        "age": 30,
        "preferences": {
            "theme": "dark",
            "notifications": True
        }
    },
    "events": [
        {"event_type": "login", "timestamp": "2025-01-15T10:00:00Z"},
        {"event_type": "purchase", "timestamp": "2025-01-15T10:05:00Z"}
    ],
    "created_at": firestore.SERVER_TIMESTAMP,
    "updated_at": firestore.SERVER_TIMESTAMP
})

# Subcollection for event history
events_ref = db.collection("users").document("user_123").collection("events")
events_ref.add({
    "event_type": "login",
    "timestamp": firestore.SERVER_TIMESTAMP,
    "metadata": {
        "device": "mobile",
        "ip": "192.168.1.1"
    }
})
Schema Design Patterns
Dataflow Integration
Reading from Firestore
import apache_beam as beam
# NOTE: confirm this import against your installed Beam version. FirestoreIO is
# documented for the Beam Java SDK; the Python SDK documents a Datastore
# connector (apache_beam.io.gcp.datastore.v1new.datastoreio) instead.
from apache_beam.io.gcp.firestoreio import ReadFromFirestore
from apache_beam.options.pipeline_options import PipelineOptions


def run_firestore_to_bigquery():
    """Read from Firestore and write to BigQuery."""
    pipeline_options = PipelineOptions([
        '--project', 'my-project',
        '--runner', 'DataflowRunner',
        '--region', 'us-central1',
        '--temp_location', 'gs://my-bucket/temp/'
    ])

    with beam.Pipeline(options=pipeline_options) as pipeline:
        # Read from Firestore
        users = (
            pipeline
            | 'Read Firestore' >> ReadFromFirestore(
                database='(default)',
                collection='users'
            )
        )

        # Transform and write to BigQuery
        (
            users
            # Flatten each document into one row matching the BigQuery schema
            | 'Format' >> beam.Map(lambda x: {
                'user_id': x['user_id'],
                'name': x.get('profile', {}).get('name'),
                'email': x.get('email'),
                'created_at': str(x.get('created_at'))
            })
            | 'Write to BQ' >> beam.io.WriteToBigQuery(
                'my-project:analytics.users',
                schema='user_id:STRING,name:STRING,email:STRING,created_at:TIMESTAMP',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
            )
        )


if __name__ == '__main__':
    run_firestore_to_bigquery()
Writing to Firestore
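# NOTE: confirm this import against your installed Beam version before relying on it.
from apache_beam.io.gcp.firestoreio import WriteToFirestore

# Write Dataflow results to Firestore.
# processed_data is assumed to be a PCollection produced by upstream
# transforms in the same pipeline.
(
    processed_data
    | 'Write to Firestore' >> WriteToFirestore(
        database='(default)',
        collection='processed_events'
    )
)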
Best Practice: For data engineering pipelines, prefer Firestore in Datastore mode over Native mode. Datastore mode is designed for backend data workloads rather than real-time applications, which makes it the better fit for batch operations and high-throughput server-side processing.
Common Interview Questions
Q1: When would you use Firestore vs. Bigtable?
Answer: Firestore is for document-based data with complex queries and real-time updates. Bigtable is for high-throughput time-series data (IoT, metrics, logs). Firestore is better for user profiles, session data, and moderate-throughput workloads. Bigtable is better for high-volume, low-latency analytics workloads.
Q2: How do you export Firestore data to BigQuery?
Answer: Use the managed export service to write the data to Cloud Storage (GCS), then load it into BigQuery with the bq load command, using the DATASTORE_BACKUP source format and pointing at the collection's .export_metadata file. The export must name specific collection IDs for BigQuery to accept it. For continuous syncing, use Dataflow with FirestoreIO.
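The export-then-load flow above can be sketched as two command lines. The helper below only builds the argument lists; it does not run anything. The bucket, prefix, and table names are hypothetical placeholders, and the metadata path follows the layout documented for collection-filtered exports, so check it against the files your export actually produces.

```python
from typing import List


def export_command(bucket: str, prefix: str, collection_ids: List[str]) -> List[str]:
    """Build the gcloud command that exports the named collections to GCS."""
    if not collection_ids:
        # BigQuery only accepts exports that were filtered by collection ID.
        raise ValueError("name at least one collection to export")
    return [
        "gcloud", "firestore", "export", f"gs://{bucket}/{prefix}",
        f"--collection-ids={','.join(collection_ids)}",
    ]


def load_command(bucket: str, prefix: str, collection_id: str, table: str) -> List[str]:
    """Build the bq command that loads one exported collection into a table."""
    # Assumed layout of a collection-filtered export; verify against your bucket.
    metadata_uri = (
        f"gs://{bucket}/{prefix}/all_namespaces/kind_{collection_id}/"
        f"all_namespaces_kind_{collection_id}.export_metadata"
    )
    return ["bq", "load", "--source_format=DATASTORE_BACKUP", table, metadata_uri]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(" ".join(export_command("my-bucket", "exports/users", ["users"])))
    print(" ".join(load_command("my-bucket", "exports/users", "users", "analytics.users")))
```

Returning argument lists rather than shell strings keeps the commands safe to pass to subprocess.run without shell quoting issues.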
Q3: What are the query limitations of Firestore?
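Answer: Firestore has several limits: a maximum document size of 1 MiB; a cap of 30 comparison values in an 'in' filter (this is not a cap on the number of results, which can be paginated with query cursors); historically, range and inequality filters restricted to a single field per query; and aggregation limited to count, sum, and average. For complex analytics, export data to BigQuery. For high-throughput reads/writes, consider Bigtable.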
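The pagination mentioned above is usually done with query cursors. The following is a minimal sketch, assuming an ordered query object from the google-cloud-firestore client (for example db.collection("users").order_by("user_id")); the function name iter_pages is hypothetical.

```python
from typing import Any, Iterator, List


def iter_pages(query: Any, page_size: int) -> Iterator[List[Any]]:
    """Yield lists of document snapshots, at most page_size per list.

    `query` is assumed to expose limit(), start_after(), and stream(),
    as Firestore query objects do.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    cursor = None
    while True:
        page_query = query.limit(page_size)
        if cursor is not None:
            # Resume strictly after the last snapshot of the previous page.
            page_query = page_query.start_after(cursor)
        page = list(page_query.stream())
        if not page:
            return
        yield page
        if len(page) < page_size:
            # A short page means the collection is exhausted.
            return
        cursor = page[-1]
```

Each page is a separate bounded query, so a pipeline can checkpoint the last snapshot it processed and resume from there after a failure.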
Q4: How do you optimize Firestore performance for data pipelines?
Answer: 1) Denormalize data to reduce document reads, 2) Use composite indexes for complex queries, 3) Batch writes instead of individual writes, 4) Use parallel reads for large collections, 5) Consider Datastore mode for backend data workloads.
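Point 3 above (batch writes) can be sketched as follows. This assumes a google-cloud-firestore Client passed in as db and uses its batch(), set(), and commit() calls; the helper names and the default batch size of 500 are assumptions for illustration, so check the current per-batch limits before relying on that number.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def chunked(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Split an iterable into lists of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def write_in_batches(
    db: Any,
    collection: str,
    docs: Iterable[Tuple[str, Dict[str, Any]]],
    batch_size: int = 500,  # assumed default; verify against current limits
) -> int:
    """Write (doc_id, data) pairs using one commit per chunk.

    Returns the number of documents written.
    """
    written = 0
    for chunk in chunked(docs, batch_size):
        batch = db.batch()
        for doc_id, data in chunk:
            # Queue the write; nothing is sent until commit().
            batch.set(db.collection(collection).document(doc_id), data)
        batch.commit()
        written += len(chunk)
    return written
```

A call might look like write_in_batches(firestore.Client(), "processed_events", rows), where rows yields (doc_id, data) pairs. Each commit is one round trip, so larger chunks trade memory for fewer requests.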
Q5: What is the difference between Firestore Native and Datastore mode?
Answer: Firestore Native mode supports real-time listeners and offline sync, designed for mobile/web apps. Datastore mode provides traditional server-side operations with higher throughput, designed for backend/data engineering workloads. Datastore mode is recommended for data pipelines.