Data Governance Framework on GCP
Policy Tags for Column-Level Security
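from google.cloud import datacatalog_v1

# Policy tag taxonomies are managed through Data Catalog's PolicyTagManager
client = datacatalog_v1.PolicyTagManagerClient()

# Create policy tag taxonomy
taxonomy = client.create_taxonomy(
    request={
        "parent": "projects/my-project/locations/us-central1",
        "taxonomy": {
            "display_name": "Data Classification",
            "description": "Policy tags for data classification",
            # Required for the tags to enforce column-level access control
            "activated_policy_types": [
                datacatalog_v1.Taxonomy.PolicyType.FINE_GRAINED_ACCESS_CONTROL
            ],
        },
    }
)

# A taxonomy does not take a nested tag tree; create each policy tag under it
classification_levels = [
    ("Public", "Non-sensitive data"),
    ("Internal", "Internal use only"),
    ("Confidential", "Sensitive data requiring protection"),
    ("Restricted", "Highly sensitive PII/PHI"),
]
for display_name, description in classification_levels:
    client.create_policy_tag(
        request={
            "parent": taxonomy.name,
            "policy_tag": {
                "display_name": display_name,
                "description": description,
            },
        }
    )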
-- Apply policy tag to BigQuery column
ALTER TABLE `project.dataset.users`
  ALTER COLUMN email
  SET OPTIONS (
    policy_tag = 'projects/my-project/locations/us-central1/taxonomies/123/policyTags/456'
  );
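BigQuery's documented routes for attaching a policy tag include updating the table schema through the API or a client library, which is a useful fallback if the DDL option above is rejected in your environment. A minimal sketch, assuming the `google-cloud-bigquery` package and working on the schema in its REST (dict) form; the helper names `with_policy_tag` and `apply_policy_tag` are hypothetical, and only top-level columns are handled.

```python
import copy


def with_policy_tag(schema_fields, column_name, policy_tag):
    """Return a copy of a REST-style schema with policy_tag set on one column.

    schema_fields is a list of dicts such as SchemaField.to_api_repr() returns.
    Only top-level columns are searched (nested RECORD fields are not).
    """
    updated = copy.deepcopy(schema_fields)
    for field in updated:
        if field["name"] == column_name:
            field["policyTags"] = {"names": [policy_tag]}
            return updated
    raise KeyError(f"column {column_name!r} not found in schema")


def apply_policy_tag(table_id, column_name, policy_tag):
    """Patch the table schema so the column carries the policy tag."""
    # Imported here so the pure helper above works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    table = client.get_table(table_id)
    current = [field.to_api_repr() for field in table.schema]
    tagged = with_policy_tag(current, column_name, policy_tag)
    table.schema = [bigquery.SchemaField.from_api_repr(f) for f in tagged]
    # Send only the schema property in the update
    return client.update_table(table, ["schema"])
```

Keeping the schema manipulation in a pure function makes it possible to check the change before any API call is made.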
Cloud DLP Integration
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

# Inspect data for sensitive information
def inspect_data(project_id, content):
    """Inspect data for PII."""
    parent = f"projects/{project_id}"
    inspect_config = {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},
            {"name": "PHONE_NUMBER"},
            {"name": "CREDIT_CARD_NUMBER"},
            {"name": "US_SOCIAL_SECURITY_NUMBER"},
        ],
        "min_likelihood": dlp_v2.Likelihood.LIKELY,
    }
    response = client.inspect_content(
        request={
            "parent": parent,
            "inspect_config": inspect_config,
            "item": {"value": content},
        }
    )
    # Each finding carries the infoType, likelihood, and location of a match
    return response.result.findings
# De-identify sensitive data
def deidentify_data(project_id, content):
    """De-identify sensitive data."""
    parent = f"projects/{project_id}"
    # Tell the inspector which infoTypes to find before transforming them
    inspect_config = {"info_types": [{"name": "EMAIL_ADDRESS"}]}
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {
                    "info_types": [{"name": "EMAIL_ADDRESS"}],
                    "primitive_transformation": {
                        "character_mask_config": {
                            "masking_character": "*",
                            # 0 (unset) masks every character of the match
                            "number_to_mask": 0,
                            "reverse_order": False,
                        }
                    },
                }
            ]
        }
    }
    response = client.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": {"value": content},
        }
    )
    # The de-identified item sits directly on the response
    return response.item.value
Best Practice: Implement a data classification framework: Public, Internal, Confidential, Restricted. Use policy tags for column-level security in BigQuery. Apply Cloud DLP for automated PII detection. Enable audit logging for compliance. Review access controls quarterly.
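One way to connect automated PII detection to the four classification levels is a lookup from detected infoType names to a level, taking the most restrictive match. A minimal sketch; the mapping and the function name `classify_column` are illustrative assumptions, not a Google-defined scheme.

```python
# Ordered from least to most restrictive.
LEVELS = ["Public", "Internal", "Confidential", "Restricted"]

# Hypothetical mapping from DLP infoType names to a classification level.
INFO_TYPE_LEVEL = {
    "EMAIL_ADDRESS": "Confidential",
    "PHONE_NUMBER": "Confidential",
    "CREDIT_CARD_NUMBER": "Restricted",
    "US_SOCIAL_SECURITY_NUMBER": "Restricted",
}


def classify_column(info_type_names, default="Internal"):
    """Return the most restrictive level implied by the detected infoTypes.

    infoTypes missing from the mapping fall back to `default`, as does a
    column with no detections at all.
    """
    levels = [INFO_TYPE_LEVEL.get(name, default) for name in info_type_names]
    if not levels:
        return default
    return max(levels, key=LEVELS.index)


print(classify_column(["EMAIL_ADDRESS", "CREDIT_CARD_NUMBER"]))  # Restricted
print(classify_column([]))  # Internal
```

The resulting level can then be used to pick which policy tag to attach to the column.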
Common Interview Questions
Q1: What are the key components of data governance?
Answer: 1) Data quality management, 2) Data security and access control, 3) Data lineage tracking, 4) Data cataloging and discovery, 5) Compliance management, 6) Data retention policies, 7) Privacy protection.
Q2: How do you implement column-level security in BigQuery?
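Answer: Use policy tags to classify columns by sensitivity level, then grant the Fine-Grained Reader role (roles/datacatalog.categoryFineGrainedReader) on each policy tag to the principals allowed to read those columns. A query that references a tagged column fails with an access-denied error for users who lack that role; they see masked values such as NULL only when dynamic data masking is configured for the tag. The taxonomy is user-defined, so the four levels used here (Public, Internal, Confidential, Restricted) are a convention, not a fixed set.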
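Access to a policy tag is controlled through its IAM policy. A minimal sketch of granting the Fine-Grained Reader role, assuming the `google-cloud-datacatalog` package; the helper names `add_member` and `grant_fine_grained_reader` are hypothetical, and bindings that carry IAM conditions are not handled.

```python
FINE_GRAINED_READER = "roles/datacatalog.categoryFineGrainedReader"


def add_member(bindings, role, member):
    """Return new bindings (list of dicts) with member added under role."""
    result = [{"role": b["role"], "members": list(b["members"])} for b in bindings]
    for binding in result:
        if binding["role"] == role:
            if member not in binding["members"]:
                binding["members"].append(member)
            return result
    result.append({"role": role, "members": [member]})
    return result


def grant_fine_grained_reader(policy_tag_name, member):
    """Read the tag's IAM policy, add the member, and write it back."""
    # Imported here so add_member can be used without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.PolicyTagManagerClient()
    policy = client.get_iam_policy(request={"resource": policy_tag_name})
    current = [{"role": b.role, "members": list(b.members)} for b in policy.bindings]
    updated = add_member(current, FINE_GRAINED_READER, member)
    return client.set_iam_policy(
        request={
            "resource": policy_tag_name,
            # Passing the etag back guards against concurrent policy edits
            "policy": {"bindings": updated, "etag": policy.etag},
        }
    )
```

A call might look like `grant_fine_grained_reader(tag_name, "group:analysts@example.com")`, where both arguments are placeholders.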
Q3: What is Cloud DLP and when should you use it?
Answer: Cloud DLP detects, classifies, and de-identifies sensitive data. Use it for: 1) PII detection in data lakes, 2) Data masking for non-production environments, 3) Compliance auditing, 4) Automated classification of sensitive data.
Q4: How do you handle GDPR data deletion requests?
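Answer: 1) Identify all data stores containing the user's data, 2) Delete or anonymize the matching rows, keeping in mind that BigQuery time travel keeps deleted data recoverable until its window (seven days by default) expires, 3) If soft deletes are used, pair them with a retention policy that hard-deletes within the required deadline, 4) Use Cloud DLP to scan for residual PII, 5) Document the deletion for compliance auditing.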
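The deletion step can be sketched as a parameterized DML statement. This assumes the `google-cloud-bigquery` package and a string key column; the table ID, column name, and function names are hypothetical, and rows deleted this way stay reachable through time travel until that window expires.

```python
def build_delete_statement(table_id, key_column):
    """Return a parameterized DELETE statement for one user's rows."""
    # Identifiers cannot be bound as query parameters, so validate them.
    for part in table_id.split("."):
        if not part.replace("_", "").replace("-", "").isalnum():
            raise ValueError(f"unexpected table identifier: {table_id!r}")
    if not key_column.replace("_", "").isalnum():
        raise ValueError(f"unexpected column name: {key_column!r}")
    return f"DELETE FROM `{table_id}` WHERE {key_column} = @user_id"


def delete_user_rows(table_id, key_column, user_id):
    """Run the DELETE and return the number of rows removed."""
    # Imported here so the statement builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("user_id", "STRING", user_id)
        ]
    )
    job = client.query(
        build_delete_statement(table_id, key_column), job_config=job_config
    )
    job.result()  # Wait for the DML job to finish
    return job.num_dml_affected_rows
```

Recording the returned row count alongside the request ID gives a simple artifact for the compliance record.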
Q5: What is the purpose of audit logs in data governance?
Answer: Audit logs track who accessed what data and when. They're essential for: 1) Compliance auditing (HIPAA, GDPR), 2) Security incident investigation, 3) Access pattern analysis, 4) Data usage tracking, 5) Policy enforcement verification.
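To answer "who accessed what data and when" in practice, the Data Access audit log can be filtered by service and principal. A minimal sketch, assuming the `google-cloud-logging` package; the function names and the example project ID are hypothetical.

```python
def data_access_filter(project_id, principal_email=None, since=None):
    """Build a Cloud Logging filter for BigQuery Data Access audit entries."""
    clauses = [
        f'logName="projects/{project_id}/logs/'
        'cloudaudit.googleapis.com%2Fdata_access"',
        'protoPayload.serviceName="bigquery.googleapis.com"',
    ]
    if principal_email:
        clauses.append(
            f'protoPayload.authenticationInfo.principalEmail="{principal_email}"'
        )
    if since:
        # Expects an RFC 3339 timestamp such as 2024-01-01T00:00:00Z
        clauses.append(f'timestamp>="{since}"')
    return " AND ".join(clauses)


def list_data_access_entries(project_id, principal_email=None, since=None):
    """Fetch matching audit log entries as a list."""
    # Imported here so the filter builder works without the client library.
    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client(project=project_id)
    log_filter = data_access_filter(project_id, principal_email, since)
    return list(client.list_entries(filter_=log_filter))


print(data_access_filter("my-project", since="2024-01-01T00:00:00Z"))
```

The same filter string can be pasted into the Logs Explorer to inspect entries interactively.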