Metadata Management: Catalogs, Tags & Discovery
Difficulty: Senior Level | Companies: LinkedIn, Netflix, Uber, Airbnb, Spotify
1. Types of Metadata
Architecture Diagram
Metadata Types:
├── Technical Metadata (schema, types, partitions)
├── Business Metadata (descriptions, owners, glossary)
├── Operational Metadata (run times, row counts, errors)
├── Social Metadata (tags, ratings, reviews)
└── Lineage Metadata (upstream/downstream dependencies)
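To make the five types concrete, here is a minimal sketch of how one dataset's metadata could be grouped. The `DatasetMetadata` class, its field names, and the sample values are hypothetical and only illustrate the taxonomy above; a real catalog would likely model each type in more detail.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetMetadata:
    """Hypothetical record grouping the five metadata types for one dataset."""
    technical: Dict[str, str] = field(default_factory=dict)      # schema, types, partitions
    business: Dict[str, str] = field(default_factory=dict)       # descriptions, owners, glossary
    operational: Dict[str, float] = field(default_factory=dict)  # run times, row counts, errors
    social: List[str] = field(default_factory=list)              # tags, ratings, reviews
    lineage: Dict[str, List[str]] = field(default_factory=dict)  # upstream/downstream


# Illustrative values only; none of these names come from a real system.
orders = DatasetMetadata(
    technical={"order_id": "bigint", "amount": "decimal(10,2)"},
    business={"owner": "payments-team", "description": "One row per order"},
    operational={"row_count": 1200000, "last_run_seconds": 84.5},
    social=["certified", "finance"],
    lineage={"upstream": ["raw.orders"], "downstream": ["marts.revenue"]},
)
```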
2. Data Catalog Implementation
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class CatalogEntry:
    name: str
    domain: str
    description: str
    owner: str
    schema: Dict[str, str]  # column name -> column type
    tags: List[str] = field(default_factory=list)
    quality_score: float = 0.0
    last_updated: datetime = field(default_factory=datetime.now)
    upstream: List[str] = field(default_factory=list)
    downstream: List[str] = field(default_factory=list)


class DataCatalog:
    def __init__(self):
        self.entries: Dict[str, CatalogEntry] = {}
        self.lineage_graph = {}

    def register(self, entry: CatalogEntry) -> None:
        # Re-registering an existing name overwrites the previous entry
        self.entries[entry.name] = entry

    def search(self, query: str = "", tags: Optional[List[str]] = None,
               domain: Optional[str] = None) -> List[CatalogEntry]:
        results = list(self.entries.values())
        if query:
            # Case-insensitive substring match on name or description
            q = query.lower()
            results = [e for e in results if q in e.description.lower() or q in e.name.lower()]
        if tags:
            # Keep entries that carry at least one of the requested tags
            results = [e for e in results if any(t in e.tags for t in tags)]
        if domain:
            results = [e for e in results if e.domain == domain]
        # Highest-quality datasets first
        return sorted(results, key=lambda e: e.quality_score, reverse=True)

    def impact_analysis(self, dataset_name: str) -> Dict[str, List[str]]:
        # Returns direct (one-hop) neighbors only; empty dict if the dataset is unknown
        entry = self.entries.get(dataset_name)
        return {"upstream": entry.upstream, "downstream": entry.downstream} if entry else {}

    def add_tag(self, dataset_name: str, tag: str) -> None:
        entry = self.entries.get(dataset_name)
        # Skip unknown datasets and avoid storing the same tag twice
        if entry is not None and tag not in entry.tags:
            entry.tags.append(tag)
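Note that `impact_analysis` reports only the direct upstream and downstream lists. A fuller impact analysis usually needs every dataset reachable downstream. The sketch below shows one way to do that with a breadth-first traversal; the function name `transitive_downstream` and the sample lineage are hypothetical, and the edges are assumed to be a plain dict of name to direct downstream names.

```python
from collections import deque
from typing import Dict, List


def transitive_downstream(edges: Dict[str, List[str]], start: str) -> List[str]:
    """Return every dataset reachable downstream of `start`, in BFS order."""
    seen = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:  # guards against cycles and diamond-shaped lineage
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Hypothetical lineage: dataset name -> direct downstream datasets
lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "marts.customer_ltv"],
    "marts.revenue": ["dash.finance"],
}

print(transitive_downstream(lineage, "raw.orders"))
# ['staging.orders', 'marts.revenue', 'marts.customer_ltv', 'dash.finance']
```

The `seen` set is what keeps the traversal safe on real lineage graphs, where the same table often feeds several paths.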
3. Auto-Classification
import re
from typing import List


class AutoClassifier:
    PII_PATTERNS = {
        "email": r"^[\w\.-]+@[\w\.-]+\.\w+$",
        "phone": r"^\+?[\d\-\(\)\s]{10,}$",
        "ssn": r"^\d{3}-\d{2}-\d{4}$",
        "credit_card": r"^\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$",
    }

    def classify_column(self, column_name: str, sample_values: list) -> List[str]:
        tags = []
        # Name-based classification
        name_lower = column_name.lower()
        if any(pii in name_lower for pii in ["email", "ssn", "phone", "credit"]):
            tags.append("pii")
        # Value-based classification: tag a type when more than 80% of samples match
        if sample_values:  # an empty sample would otherwise divide by zero
            for pii_type, pattern in self.PII_PATTERNS.items():
                matches = sum(1 for v in sample_values if re.match(pattern, str(v)))
                if matches / len(sample_values) > 0.8:
                    tags.append(pii_type)
                    tags.append("pii")
        # Deduplicate and return in a stable order
        return sorted(set(tags))
Best Practice: Auto-classify on ingestion, not as an afterthought. Tag PII columns at write time and enforce policies at read time.
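The read-time half of that practice can be sketched as follows. This is a minimal illustration, not a production policy engine: the `COLUMN_TAGS` store, the `read_row` function, and the `pii_reader` grant name are all hypothetical, and the tags are assumed to have been written at ingestion.

```python
from typing import Dict, List, Set

# Hypothetical tag store populated at write time: column name -> tags
COLUMN_TAGS: Dict[str, List[str]] = {
    "user_id": [],
    "email": ["pii", "email"],
    "country": [],
}


def read_row(row: Dict[str, str], column_tags: Dict[str, List[str]],
             grants: Set[str]) -> Dict[str, str]:
    """Mask any column tagged 'pii' unless the reader holds the 'pii_reader' grant."""
    can_see_pii = "pii_reader" in grants
    result = {}
    for column, value in row.items():
        is_pii = "pii" in column_tags.get(column, [])
        result[column] = "***" if is_pii and not can_see_pii else value
    return result


row = {"user_id": "42", "email": "ana@example.com", "country": "PT"}
print(read_row(row, COLUMN_TAGS, grants=set()))
# {'user_id': '42', 'email': '***', 'country': 'PT'}
```

Because the policy keys off tags rather than column names, a newly ingested PII column is protected as soon as it is classified, without any change to the read path.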
Follow-Up Questions
- How would you build a data catalog for 100,000+ datasets?
- Design a metadata-driven pipeline that auto-generates documentation.
- How do you keep metadata fresh as schemas evolve?
- Design a data discovery system for non-technical users.
- How would you implement a business glossary with automated linking?