
Glue Data Catalog Deep Dive



📚 Glue Data Catalog

Deep dive into Glue Data Catalog partitions, indexes, and Lake Formation.

Module: AWS Data Engineering • Topic 39 of 65 • Premium Content

Catalog Structure

Architecture Diagram
┌─────────────────────────────────────────────────────────────────────────────┐
│                    GLUE DATA CATALOG DEEP DIVE                              │
│                                                                             │
│  Account → Region → Catalog → Database → Table → Columns/Partitions         │
│                                                                             │
│  Partition Management:                                                      │
│  • Partition by date (year/month/day) for time-series data                  │
│  • Partition pruning reduces data scanned in queries                        │
│  • Batch create/delete partitions for efficiency                            │
│                                                                             │
│  Statistics & Indexes:                                                      │
│  • Table statistics for query optimization                                  │
│  • Column statistics for data profiling                                     │
│  • Stored in catalog for Athena/Spectrum                                    │
└─────────────────────────────────────────────────────────────────────────────┘
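
The Database → Table → Columns/Partitions part of the hierarchy can be read back with a single `get_table` call. The following is a minimal sketch, not a definitive implementation: `describe_table` is a hypothetical helper name, and the client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def describe_table(glue_client, database, table):
    """Return the data columns and partition keys of one catalog table.

    `glue_client` is expected to behave like boto3.client("glue").
    """
    response = glue_client.get_table(DatabaseName=database, Name=table)
    meta = response["Table"]
    # Regular columns live in the storage descriptor ...
    columns = [c["Name"] for c in meta.get("StorageDescriptor", {}).get("Columns", [])]
    # ... while partition keys are listed separately on the table itself.
    partition_keys = [k["Name"] for k in meta.get("PartitionKeys", [])]
    return {
        "database": database,
        "table": table,
        "columns": columns,
        "partition_keys": partition_keys,
    }
```

With a real client this would be called as `describe_table(boto3.client("glue"), "analytics_db", "events")`, reusing the database and table names from this page.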

Partition Management

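import boto3

glue = boto3.client('glue')

# Batch create partitions (the API accepts at most 100 partitions per call)
response = glue.batch_create_partition(
    DatabaseName='analytics_db',
    TableName='events',
    PartitionInputList=[
        {'Values': ['2024', '01', '15'],  # one value per partition key: year, month, day
         'StorageDescriptor': {
             'Columns': [{'Name': 'event_id', 'Type': 'string'}],
             'Location': 's3://data/events/year=2024/month=01/day=15/',
             'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
             'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
             'SerdeInfo': {'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'}
         }}
    ]
)

# Per-partition failures (for example, a partition that already exists)
# are returned in 'Errors', so check the response
for error in response.get('Errors', []):
    print(error['PartitionValues'], error['ErrorDetail']['ErrorMessage'])

# Get partitions with a filter expression.
# Assumes the partition keys are declared as strings, so the values are quoted;
# for integer keys, compare against unquoted numbers instead.
# Results are paged: follow 'NextToken' (or use a paginator) for large tables.
partitions = glue.get_partitions(
    DatabaseName='analytics_db',
    TableName='events',
    Expression="year='2024' AND month='01'"
)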
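
Registering many days at once usually means generating the partition inputs and sending them in batches, because `batch_create_partition` accepts at most 100 partitions per request. The sketch below shows one way to do that under the same assumptions as the example above (Parquet data, a `year=/month=/day=` layout). The helper names `daily_partition_inputs` and `create_partitions` are hypothetical, and the client is passed in rather than created inside the functions.

```python
from datetime import date, timedelta

PARQUET_INPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
PARQUET_OUTPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
PARQUET_SERDE = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"


def daily_partition_inputs(start, days, base_location, columns):
    """Build one PartitionInput dict per day, starting at `start`."""
    base = base_location.rstrip("/")
    inputs = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        year, month, day = f"{d.year:04d}", f"{d.month:02d}", f"{d.day:02d}"
        inputs.append({
            "Values": [year, month, day],
            "StorageDescriptor": {
                "Columns": columns,
                "Location": f"{base}/year={year}/month={month}/day={day}/",
                "InputFormat": PARQUET_INPUT,
                "OutputFormat": PARQUET_OUTPUT,
                "SerdeInfo": {"SerializationLibrary": PARQUET_SERDE},
            },
        })
    return inputs


def create_partitions(glue_client, database, table, partition_inputs, batch_size=100):
    """Send partitions in batches and collect any per-partition errors."""
    errors = []
    for i in range(0, len(partition_inputs), batch_size):
        response = glue_client.batch_create_partition(
            DatabaseName=database,
            TableName=table,
            PartitionInputList=partition_inputs[i:i + batch_size],
        )
        errors.extend(response.get("Errors", []))
    return errors
```

A typical call would build the inputs with `daily_partition_inputs(date(2024, 1, 15), 30, "s3://data/events", [{"Name": "event_id", "Type": "string"}])` and pass them to `create_partitions(boto3.client("glue"), "analytics_db", "events", inputs)`. An empty return value means no partition was rejected.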

Interview Q&A

Q1: How does partition pruning work?

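Answer: When a query filters on a partition key, for example WHERE year=2024, Athena or Redshift Spectrum looks up the matching partitions in the catalog and reads only their S3 locations. Partitions for other years are skipped entirely, which cuts the data scanned and, for Athena, the per-query cost.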
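
The effect can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how Athena or Spectrum implement pruning, and the partition sizes are invented for illustration.

```python
def prune(partitions, **filters):
    """Keep only partitions whose key values match every filter."""
    return [
        p for p in partitions
        if all(p["keys"].get(key) == value for key, value in filters.items())
    ]


# Hypothetical partitions with made-up sizes in bytes.
partitions = [
    {"keys": {"year": "2023", "month": "12"}, "bytes": 400},
    {"keys": {"year": "2024", "month": "01"}, "bytes": 250},
    {"keys": {"year": "2024", "month": "02"}, "bytes": 350},
]

selected = prune(partitions, year="2024")
scanned = sum(p["bytes"] for p in selected)
total = sum(p["bytes"] for p in partitions)
print(f"scanned {scanned} of {total} bytes")  # scanned 600 of 1000 bytes
```

Only the two 2024 partitions are read, so the 2023 data never contributes to the scan.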

Q2: What statistics does the catalog store?

Answer: Table-level statistics such as row count and table size, plus column statistics such as minimum, maximum, and number of nulls. Athena and Redshift Spectrum can use these for query optimization.
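
Column statistics that have been computed for a table can be read back with `get_column_statistics_for_table`. The sketch below condenses the response into the min/max/num-nulls view mentioned in the answer. `summarize_column_stats` is a hypothetical helper name, and the code assumes statistics already exist for the requested columns.

```python
def summarize_column_stats(glue_client, database, table, column_names):
    """Return {column: {"type", "min", "max", "nulls"}} from catalog statistics.

    `glue_client` is expected to behave like boto3.client("glue").
    """
    response = glue_client.get_column_statistics_for_table(
        DatabaseName=database,
        TableName=table,
        ColumnNames=column_names,
    )
    summary = {}
    for stat in response.get("ColumnStatisticsList", []):
        data = stat["StatisticsData"]
        # The typed payload sits under a key derived from the type,
        # e.g. Type "LONG" -> "LongColumnStatisticsData".
        typed = data.get(f"{data['Type'].capitalize()}ColumnStatisticsData", {})
        summary[stat["ColumnName"]] = {
            "type": data["Type"],
            # Not every type carries min/max (strings do not), so these may be None.
            "min": typed.get("MinimumValue"),
            "max": typed.get("MaximumValue"),
            "nulls": typed.get("NumberOfNulls"),
        }
    return summary
```

A call such as `summarize_column_stats(boto3.client("glue"), "analytics_db", "events", ["event_id"])` would return one entry per column that has statistics.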

Q3: How many partitions can a table have?

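Answer: By default, Glue allows up to 10 million partitions per table and 20 million per account. Well before those quotas, a very large partition count slows query planning and partition listing, so choose a partition granularity that matches how the data is queried.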
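
A quick back-of-the-envelope calculation shows how fast the partition count grows with finer granularity. The function below is a simple arithmetic sketch: it ignores leap days, and the `extra_key_cardinality` parameter (for an additional partition key such as a hypothetical tenant or region column) is an assumption for illustration.

```python
def partition_count(years, granularity, extra_key_cardinality=1):
    """Estimate partitions for time-based partitioning, ignoring leap days."""
    per_year = {"year": 1, "month": 12, "day": 365, "hour": 365 * 24}
    return years * per_year[granularity] * extra_key_cardinality


print(partition_count(5, "day"))        # 1825
print(partition_count(5, "hour"))       # 43800
print(partition_count(5, "hour", 100))  # 4380000
```

Five years of daily partitions is small, but hourly partitions combined with a second key of 100 values already reaches several million.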

Summary

  • Partitions: Organize data for efficient querying, up to 10M per table by default
  • Statistics: Row counts, column stats for query optimization
  • Crawlers: Automatic schema and partition discovery
  • Lake Formation: Fine-grained permissions on catalog objects
  • Performance: Partition pruning reduces data scanned
