Firestore & Datastore for Data Engineering

Master Cloud Firestore and Datastore for data engineering including schema design, indexing, Dataflow integration, and pipeline patterns.

15 min read · Intermediate

Firestore vs. Datastore

Cloud Firestore is the successor to Cloud Datastore and adds real-time synchronization and offline support for client apps. Datastore is the older NoSQL document database; it remains available as Firestore in Datastore mode.

Comparison

🏗️ GCP Data Engineering Reference Architecture

Interview Tip: GCP's data engineering stack is serverless-first. Dataflow (Apache Beam) handles both streaming and batch. BigQuery is the flagship analytics service.

Data Modeling for Data Engineering

Document Structure

from google.cloud import firestore

# Initialize Firestore client
db = firestore.Client()

# User profile document
user_ref = db.collection("users").document("user_123")
user_ref.set({
    "user_id": "user_123",
    "email": "user@example.com",
    "profile": {
        "name": "John Doe",
        "age": 30,
        "preferences": {
            "theme": "dark",
            "notifications": True
        }
    },
    "events": [
        {"event_type": "login", "timestamp": "2025-01-15T10:00:00Z"},
        {"event_type": "purchase", "timestamp": "2025-01-15T10:05:00Z"}
    ],
    "created_at": firestore.SERVER_TIMESTAMP,
    "updated_at": firestore.SERVER_TIMESTAMP
})

# Subcollection for event history: unlike the embedded "events" array above,
# it can grow without pushing the user document toward the 1 MiB size limit
events_ref = user_ref.collection("events")
events_ref.add({
    "event_type": "login",
    "timestamp": firestore.SERVER_TIMESTAMP,
    "metadata": {
        "device": "mobile",
        "ip": "192.168.1.1"
    }
})
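
Nested maps such as profile.preferences usually have to be flattened before they land in a columnar warehouse table. The following is a minimal sketch of that step, assuming a hypothetical helper named flatten_document and underscore-joined column names; both are illustrative choices and not part of the Firestore client.

```python
def flatten_document(doc, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict of column-like keys.

    Hypothetical helper for illustration. Lists are left untouched so a
    later step can decide how to handle repeated values.
    """
    flat = {}
    for key, value in doc.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested maps such as profile.preferences
            flat.update(flatten_document(value, column, sep))
        else:
            flat[column] = value
    return flat


user = {
    "user_id": "user_123",
    "profile": {"name": "John Doe", "preferences": {"theme": "dark"}},
}
row = flatten_document(user)
print(row)
```

Keeping the separator configurable makes it easy to match whatever column naming the target table already uses.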

Schema Design Patterns

Dataflow Integration

Reading from Firestore

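import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def read_collection(collection_name):
    """Yield every document in a Firestore collection as a dict.

    The Beam Python SDK does not appear to ship a Firestore connector
    (FirestoreIO belongs to the Java SDK), so this sketch reads through the
    google-cloud-firestore client on a single worker.
    """
    # Imported here so the dependency resolves on the Dataflow worker
    from google.cloud import firestore

    db = firestore.Client()
    for snapshot in db.collection(collection_name).stream():
        yield snapshot.to_dict()

def format_user_row(doc):
    """Map a user document to a BigQuery row, tolerating missing fields."""
    created_at = doc.get('created_at')
    return {
        'user_id': doc.get('user_id'),
        'name': (doc.get('profile') or {}).get('name'),
        'email': doc.get('email'),
        # Firestore returns timestamps as datetime objects
        'created_at': created_at.isoformat() if created_at else None
    }

def run_firestore_to_bigquery():
    """Read from Firestore and write to BigQuery."""
    pipeline_options = PipelineOptions([
        '--project', 'my-project',
        '--runner', 'DataflowRunner',
        '--region', 'us-central1',
        '--temp_location', 'gs://my-bucket/temp/'
    ])

    with beam.Pipeline(options=pipeline_options) as pipeline:
        # Read from Firestore
        users = (
            pipeline
            | 'Collection name' >> beam.Create(['users'])
            | 'Read Firestore' >> beam.FlatMap(read_collection)
        )

        # Transform and write to BigQuery
        (
            users
            | 'Format' >> beam.Map(format_user_row)
            | 'Write to BQ' >> beam.io.WriteToBigQuery(
                'my-project:analytics.users',
                schema='user_id:STRING,name:STRING,email:STRING,created_at:TIMESTAMP',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
            )
        )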

Writing to Firestore

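import apache_beam as beam

class WriteToFirestoreFn(beam.DoFn):
    """Hand-rolled sink; the Beam Python SDK does not appear to include one."""

    def __init__(self, collection):
        self.collection = collection

    def setup(self):
        from google.cloud import firestore  # imported on the worker
        self.db = firestore.Client()

    def process(self, element):
        # add() stores the dict under an auto-generated document ID
        self.db.collection(self.collection).add(element)

# Write Dataflow results to Firestore (processed_data is a PCollection of dicts)
(
    processed_data
    | 'Write to Firestore' >> beam.ParDo(WriteToFirestoreFn('processed_events'))
)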

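Best Practice: For data engineering pipelines, prefer Datastore mode over Firestore Native mode. Datastore mode omits real-time listeners and offline sync, which server-side batch jobs do not need, and it is designed for backend data workloads rather than real-time applications. The Beam Python SDK also ships a Datastore connector (apache_beam.io.gcp.datastore.v1new.datastoreio). The mode is chosen when the database is created, so decide before loading data.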

Common Interview Questions

Q1: When would you use Firestore vs. Bigtable?

Answer: Firestore is for document-based data with complex queries and real-time updates. Bigtable is for high-throughput time-series data (IoT, metrics, logs). Firestore is better for user profiles, session data, and moderate-throughput workloads. Bigtable is better for high-volume, low-latency analytics workloads.

Q2: How do you export Firestore data to BigQuery?

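Answer: Use the managed export service to write the collections you need to Cloud Storage, then load them into BigQuery. The export must name specific collection IDs, because BigQuery cannot load an export of the whole database. Each exported collection gets an .export_metadata file, which you load with bq load --source_format=DATASTORE_BACKUP. For continuous syncing, use Dataflow with FirestoreIO (available in the Beam Java SDK).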
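
The load step in that answer can be sketched as follows. The bucket, export prefix, and table names are placeholders, and the helper names export_metadata_uri and bq_load_args are hypothetical. The path layout assumes the export was run with a collection ID filter.

```python
def export_metadata_uri(bucket, export_prefix, collection_id):
    """Build the GCS URI of the per-collection metadata file in an export.

    Assumes the layout written by a managed export filtered by collection ID:
    <prefix>/all_namespaces/kind_<id>/all_namespaces_kind_<id>.export_metadata
    """
    return (
        f"gs://{bucket}/{export_prefix}/all_namespaces/kind_{collection_id}/"
        f"all_namespaces_kind_{collection_id}.export_metadata"
    )


def bq_load_args(table, metadata_uri):
    """Return the argument list for a `bq load` call on an export file."""
    return ["bq", "load", "--source_format=DATASTORE_BACKUP", table, metadata_uri]


uri = export_metadata_uri("my-bucket", "exports/2025-01-15", "users")
print(" ".join(bq_load_args("analytics.users", uri)))
```

Returning the arguments as a list keeps the command safe to pass to subprocess.run without shell quoting.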

Q3: What are the query limitations of Firestore?

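Answer: Firestore has several limits: a maximum document size of 1 MiB, at most 30 values in an in or array-contains-any filter, inequality filters that were historically restricted to a single field per query, and limited aggregation options (count, sum, and average). Large result sets should be paginated with query cursors. For complex analytics, export data to BigQuery. For high-throughput reads/writes, consider Bigtable.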
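
One limit that pipelines often hit is the cap on how many values an in filter accepts (30 at the time of writing; treat the exact number as an assumption and check current quotas). A minimal sketch of splitting a long ID list into query-sized groups, using a hypothetical helper named chunk_values:

```python
def chunk_values(values, max_per_query=30):
    """Split values into lists small enough for one `in` filter each."""
    if max_per_query < 1:
        raise ValueError("max_per_query must be at least 1")
    return [
        values[start:start + max_per_query]
        for start in range(0, len(values), max_per_query)
    ]


user_ids = [f"user_{n}" for n in range(65)]
groups = chunk_values(user_ids)
print([len(group) for group in groups])
```

Each group can then drive its own query, with the results merged downstream.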

Q4: How do you optimize Firestore performance for data pipelines?

Answer: 1) Denormalize data to reduce document reads, 2) Use composite indexes for complex queries, 3) Batch writes instead of individual writes, 4) Use parallel reads for large collections, 5) Consider Datastore mode for backend data workloads.
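
Point 3 can be sketched as below. The function name write_in_batches is hypothetical, and the db argument is assumed to behave like google.cloud.firestore.Client, meaning it offers collection(...).document(...) and batch(). The default of 500 writes per commit is a conservative assumption, not a documented requirement.

```python
def write_in_batches(db, collection_name, documents, batch_size=500):
    """Write (doc_id, data) pairs with batched commits.

    Returns the number of commits issued. `db` is any object exposing the
    collection()/batch() methods of a Firestore client.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    collection = db.collection(collection_name)
    batch = db.batch()
    pending = 0
    commits = 0

    for doc_id, data in documents:
        batch.set(collection.document(doc_id), data)
        pending += 1
        if pending == batch_size:
            # Flush a full batch and start a fresh one
            batch.commit()
            commits += 1
            batch = db.batch()
            pending = 0

    if pending:
        # Flush the final, partially filled batch
        batch.commit()
        commits += 1
    return commits
```

With a real client the call would look like write_in_batches(firestore.Client(), "processed_events", rows), where rows yields (doc_id, data) tuples. Passing explicit document IDs keeps retries from creating duplicates.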

Q5: What is the difference between Firestore Native and Datastore mode?

Answer: Firestore Native mode supports real-time listeners and offline sync, designed for mobile/web apps. Datastore mode provides traditional server-side operations with higher throughput, designed for backend/data engineering workloads. Datastore mode is recommended for data pipelines.
