# Data Quality & Governance
A comprehensive reference covering best-practice data quality and governance principles for an Analytics Engineering consultancy. Grounded in DAMA-DMBOK, dbt best practices, and Google's data quality framework.
## 1. Data Quality Principles

> *"Data quality is not an event; it is a continuous process."* - DAMA-DMBOK

Data quality is the foundation upon which all analytics value is built. Without quality, trust erodes and adoption collapses. These principles draw from the DAMA-DMBOK six dimensions of data quality and practical analytics engineering (AE) experience.

### 1.1 Completeness

**Principle:** All required data is present; no critical records or attributes are missing beyond an acceptable threshold.

- **Define thresholds per use case:** A customer table with 95% email population may be acceptable for marketing but not for billing. Establish completeness SLAs per use case.
- **Null checks as first-class tests:** Every staging model should include `not_null` tests on business-critical columns. Use dbt's `not_null` tests (alongside `accepted_values` for validity) as minimum viable quality gates.
- **Monitor completeness trends:** Track completeness ratios over time in a metrics dashboard. Sudden drops often indicate upstream breakage before anyone reports it.
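
The per-use-case threshold idea above can be sketched in a few lines of Python. This is a minimal illustration, not a dbt feature: the function names, the `EMAIL_SLAS` thresholds, and the sample rows are all hypothetical.

```python
def completeness_ratio(rows: list[dict], column: str) -> float:
    """Share of rows where `column` is populated (not None and not empty)."""
    if not rows:
        return 0.0
    populated = sum(1 for row in rows if row.get(column) not in (None, ""))
    return populated / len(rows)


# Hypothetical SLAs: the same column, judged differently per use case.
EMAIL_SLAS = {"marketing": 0.95, "billing": 1.00}


def check_slas(rows: list[dict], column: str, slas: dict[str, float]) -> dict[str, bool]:
    """Return, per use case, whether the observed ratio meets its threshold."""
    ratio = completeness_ratio(rows, column)
    return {use_case: ratio >= threshold for use_case, threshold in slas.items()}


# 19 of 20 customers have an email address.
customers = [{"id": i, "email": f"user{i}@example.com"} for i in range(19)]
customers.append({"id": 19, "email": None})

results = check_slas(customers, "email", EMAIL_SLAS)
```

With this sample, the marketing SLA is met while the billing SLA is not, which is the distinction the first bullet describes. Storing the ratio per run would give the trend line mentioned in the last bullet.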

### 1.2 Accuracy

**Principle:** Data correctly represents the real-world entity or event it describes.

- **Source-system authority:** The system of record owns accuracy. Never "fix" data downstream without a clear, documented reason and, ideally, a correction back to the source.
- **Reconciliation tests:** Build dbt tests that compare mart aggregates back to source counts. `dbt_utils.equal_rowcount` and custom reconciliation queries catch drift early.
- **Data contracts enforce accuracy:** Agree with source-system owners on what valid values look like. Contracts prevent "soft" schema changes that silently corrupt accuracy.
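
As a rough illustration of a reconciliation check, the sketch below compares row counts and a summed amount between a source extract and a mart. It is plain Python rather than a dbt test; `reconcile`, the `amount_cents` column, and the sample data are assumptions made for the example.

```python
def reconcile(source_rows: list[dict], mart_rows: list[dict],
              amount_key: str = "amount_cents", tolerance: int = 0) -> dict:
    """Compare row counts and summed amounts between source and mart."""
    source_total = sum(row[amount_key] for row in source_rows)
    mart_total = sum(row[amount_key] for row in mart_rows)
    result = {
        "row_count_match": len(source_rows) == len(mart_rows),
        "amount_diff": mart_total - source_total,
    }
    # Pass only if counts match and totals agree within the tolerance.
    result["passed"] = (
        result["row_count_match"] and abs(result["amount_diff"]) <= tolerance
    )
    return result


source = [{"amount_cents": 10000}, {"amount_cents": 25000}, {"amount_cents": 7500}]
# A join fan-out has duplicated the second row in the mart.
mart = source + [{"amount_cents": 25000}]

report = reconcile(source, mart)
```

Amounts are held as integer cents here so that the comparison is exact; with floating-point amounts a non-zero `tolerance` would be the safer choice.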

### 1.3 Timeliness

**Principle:** Data is available within the timeframe required by its consumers.

- **Define freshness SLAs:** Explicitly state expectations (e.g., "financial data refreshed by 07:00 UTC daily"). dbt source freshness checks operationalise this.
- **Right-size latency:** Not all data needs real-time. Batch is simpler, cheaper, and often sufficient. Choose timeliness based on decision cadence, not vanity.
- **Model dependency awareness:** A mart model is only as fresh as its slowest upstream dependency. Map critical paths and optimise bottleneck models.
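
The two ideas above, a warn/error freshness threshold and "only as fresh as the slowest upstream", can be sketched as follows. The function names, thresholds, and model names are hypothetical; dbt's own source freshness check is configured in YAML rather than written like this.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(max_loaded_at: datetime, now: datetime,
                     warn_after: timedelta, error_after: timedelta) -> str:
    """Classify data age against warn and error thresholds."""
    age = now - max_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"


def stalest_upstream(loaded_at_by_model: dict[str, datetime]) -> tuple[str, datetime]:
    """A mart is only as fresh as the upstream model with the oldest load time."""
    name = min(loaded_at_by_model, key=loaded_at_by_model.get)
    return name, loaded_at_by_model[name]


now = datetime(2024, 1, 1, 7, 0, tzinfo=timezone.utc)
upstreams = {
    "stg_orders": datetime(2024, 1, 1, 6, 30, tzinfo=timezone.utc),
    "stg_payments": datetime(2023, 12, 31, 23, 0, tzinfo=timezone.utc),
}

bottleneck, loaded_at = stalest_upstream(upstreams)
status = freshness_status(loaded_at, now,
                          warn_after=timedelta(hours=1),
                          error_after=timedelta(hours=2))
```

In this sample, `stg_payments` is the bottleneck and, at eight hours old, breaches the error threshold even though `stg_orders` is fresh.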

### 1.4 Consistency

**Principle:** Data values are uniform and non-contradictory across systems and time.

- **Canonical definitions once, referenced everywhere:** Define business metrics in a single semantic layer. If "active customer" means different things in different dashboards, you have a consistency problem.
- **Referential integrity tests:** Use `relationships` tests in dbt to enforce foreign keys. Broken foreign keys are a common source of inconsistency in dimensional models.
- **Type consistency across sources:** When combining sources, cast and standardise types in staging. Never let inconsistent type handling (e.g., VARCHAR vs INTEGER customer IDs) propagate to marts.
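
A minimal sketch of a referential integrity check with type standardisation is shown below. It is an illustration in plain Python, not the implementation of dbt's `relationships` test; `normalise_id`, `orphaned_keys`, and the sample tables are hypothetical.

```python
def normalise_id(value) -> str:
    """Standardise IDs to trimmed strings so 42 and ' 42' compare equal."""
    return str(value).strip()


def orphaned_keys(child_rows: list[dict], parent_rows: list[dict],
                  fk: str, pk: str) -> list[str]:
    """Return child foreign-key values with no matching parent key."""
    parent_keys = {normalise_id(row[pk]) for row in parent_rows}
    child_keys = {
        normalise_id(row[fk]) for row in child_rows if row[fk] is not None
    }
    return sorted(child_keys - parent_keys)


# One source stores customer IDs as integers, the other as strings.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": "A", "customer_id": "1"},
    {"order_id": "B", "customer_id": " 2"},
    {"order_id": "C", "customer_id": "4"},   # no such customer
    {"order_id": "D", "customer_id": None},  # nulls are skipped, not flagged
]

orphans = orphaned_keys(orders, customers, fk="customer_id", pk="customer_id")
```

Without the `normalise_id` step every order would look orphaned, because the string `"1"` never equals the integer `1`. That is the failure mode the type-consistency bullet warns about.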

### 1.5 Validity

**Principle:** Data conforms to defined business rules, formats, and value ranges.

- **Assert business rules as dbt tests:** Every business rule (positive amounts, date ordering, status transitions) should have a corresponding test. Custom singular tests are ideal for complex rules.
- **Use `accepted_values` pragmatically:** Over-constraining accepted values creates breakage when legitimate new values arrive. Balance safety with flexibility.
- **Valid and complete are different:** Checking that every email address contains "@" is a validity check; checking that 99% of records have an email address is a completeness check. Measure both.
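
The "one test per business rule" idea can be sketched as a small rule table that returns failing rows, similar in spirit to how a singular test returns the rows that violate a rule. The rule names, the accepted status set, and the order data are all hypothetical.

```python
from datetime import date

# Hypothetical accepted values; in practice this list would be agreed with the source owner.
ACCEPTED_STATUSES = {"placed", "shipped", "returned"}

RULES = {
    "amount_is_positive": lambda o: o["amount"] > 0,
    "shipped_on_or_after_ordered": lambda o: (
        o["shipped_on"] is None or o["shipped_on"] >= o["ordered_on"]
    ),
    "status_is_accepted": lambda o: o["status"] in ACCEPTED_STATUSES,
}


def failing_rows(rows: list[dict], rules: dict) -> dict[str, list[int]]:
    """Map each rule name to the order IDs that violate it."""
    failures = {name: [] for name in rules}
    for row in rows:
        for name, rule in rules.items():
            if not rule(row):
                failures[name].append(row["order_id"])
    return failures


orders = [
    {"order_id": 1, "amount": 50, "status": "shipped",
     "ordered_on": date(2024, 1, 1), "shipped_on": date(2024, 1, 3)},
    {"order_id": 2, "amount": -5, "status": "placed",
     "ordered_on": date(2024, 1, 2), "shipped_on": None},
    {"order_id": 3, "amount": 20, "status": "teleported",
     "ordered_on": date(2024, 1, 5), "shipped_on": date(2024, 1, 4)},
]

failures = failing_rows(orders, RULES)
```

Keeping each rule separately named makes it clear which rule a row broke, and a new legitimate status only requires a change to `ACCEPTED_STATUSES`.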

### 1.6 Uniqueness

**Principle:** Each real-world entity is represented exactly once; no unintended duplicates exist.

- **Primary key uniqueness as non-negotiable:** Every model must have a tested unique primary key. `unique` tests in dbt are table stakes, not optional.
- **Surrogate keys over natural keys:** Generate deterministic surrogate keys (e.g., `md5(concat(business_key))`) to handle source key collisions and changes gracefully.
- **Dedup in staging, not marts:** Handle deduplication as early as possible in the transformation layer. Staging models should produce one row per grain.
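
The surrogate-key and deduplication bullets can be illustrated together. The sketch below hashes the business key parts with a delimiter and a null placeholder, then keeps the latest row per key. The delimiter, the `_null_` placeholder, and the function names are assumptions for this example and are not guaranteed to match any particular dbt macro's output.

```python
import hashlib


def surrogate_key(*parts) -> str:
    """Deterministic MD5 key over business key parts.

    A delimiter keeps ("a", "bc") distinct from ("ab", "c"); a placeholder
    keeps None distinct from an empty string.
    """
    normalised = ["_null_" if part is None else str(part) for part in parts]
    return hashlib.md5("|".join(normalised).encode("utf-8")).hexdigest()


def dedupe_latest(rows: list[dict], key_columns: list[str],
                  order_column: str) -> list[dict]:
    """Keep one row per business key: the one with the highest order_column."""
    latest: dict[str, dict] = {}
    for row in rows:
        key = surrogate_key(*(row[column] for column in key_columns))
        if key not in latest or row[order_column] > latest[key][order_column]:
            latest[key] = {**row, "customer_sk": key}
    return list(latest.values())


raw_customers = [
    {"source": "crm", "customer_id": 42, "email": "old@example.com", "updated_at": 1},
    {"source": "crm", "customer_id": 42, "email": "new@example.com", "updated_at": 2},
    {"source": "shop", "customer_id": 42, "email": "other@example.com", "updated_at": 1},
]

staged = dedupe_latest(raw_customers, ["source", "customer_id"], "updated_at")
```

Including `source` in the key is what lets customer 42 from the CRM and customer 42 from the shop coexist without colliding.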

**Practical implementation note:** In dbt, these six dimensions map directly to test types: `unique` (uniqueness), `not_null` (completeness), `accepted_values` (validity), `relationships` (consistency), source `freshness` (timeliness), and custom reconciliation tests (accuracy). A well-tested dbt project operationalises data quality.

---

## 2. Data Governance Principles

> *"You can't manage what you don't know you have."* - Adapted from DAMA-DMBOK

Governance is not bureaucracy; it is the organisational scaffolding that makes data trustworthy, findable, and accountable. For an AE consultancy, governance is what separates a one-off model from a sustainable data product.

### 2.1 Data Ownership

**Principle:** Every data asset has a single, accountable owner responsible for its quality and lifecycle.

- **One owner per domain:** Data ownership follows organisational domains (finance owns financial data, the CRM team owns customer data). Avoid shared ownership: in practice it means no ownership.
- **Owners are business roles, not tech roles:** A data owner makes decisions about access and definitions; they are business leaders, not engineers.
- **Ownership is documented in the catalogue:** Every table, model, and dashboard should have a recorded owner. In dbt, models can carry this under `meta` (for example, an `owner` key) in their properties YAML, while exposures and groups have a dedicated `owner` property.

### 2.2 Data Stewardship

**Principle:** Data stewards are the operational bridges between ownership and engineering; they execute governance policy.

- **Stewards translate policy to practice:** Where owners set direction, stewards ensure the taxonomy is applied, metadata is maintained, and quality rules are enforced.
- **Stewardship is a role, not a job title:** In smaller organisations, the AE often wears the steward hat. Be explicit about this dual role and its limits.
- **Stewards maintain the data dictionary:** They own the definitions, business rules, and lineage annotations. If the documentation is stale, the steward is accountable.

### 2.3 Metadata Management

**Principle:** Metadata (data about data) is a first-class asset, managed with the same rigour as the data itself.

- **Three metadata tiers:**
    - **Technical metadata:** Schema, types, row counts, partitioning, freshness.
    - **Business metadata:** Definitions, owners, sensitivity classification, metric definitions.
    - **Operational metadata:** Query history, model run times, test results, cost per model.
- **Metadata as code:** Store metadata in version-controlled YAML files (dbt `properties.yml`, `schema.yml`), not in wiki pages that drift.
- **Auto-generated where possible:** Tools like dbt generate structural metadata automatically. Layer on business metadata manually but enforce it in CI.
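
One way to "enforce it in CI" is a small check that fails the build when business metadata is missing. The sketch below works on a dictionary shaped like a parsed dbt properties file (a `models` list with `name`, `description`, `columns`, and `meta`). The `meta.owner` convention and the function name are assumptions, and loading the YAML itself is left out to keep the example dependency-free.

```python
def missing_metadata(properties: dict) -> list[str]:
    """List models and columns that lack a description or an owner."""
    problems = []
    for model in properties.get("models", []):
        name = model.get("name", "<unnamed>")
        if not model.get("description"):
            problems.append(f"{name}: missing description")
        # Assumed convention: the owner is recorded under meta.owner.
        if not model.get("meta", {}).get("owner"):
            problems.append(f"{name}: missing meta.owner")
        for column in model.get("columns", []):
            if not column.get("description"):
                problems.append(f"{name}.{column.get('name')}: missing description")
    return problems


# Stand-in for the result of parsing a properties YAML file.
parsed = {
    "models": [
        {
            "name": "dim_customer",
            "description": "One row per customer.",
            "meta": {"owner": "crm-team"},
            "columns": [
                {"name": "customer_sk", "description": "Surrogate key."},
                {"name": "email"},
            ],
        },
        {"name": "fct_orders", "columns": []},
    ]
}

problems = missing_metadata(parsed)
```

A CI job could print `problems` and exit with a non-zero status whenever the list is non-empty.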

### 2.4 Data Lineage

**Principle:** Every data point's journey from source to consumption must be traceable, automated, and trustworthy.

- **End-to-end lineage:** From source system extraction through staging, intermediate, and mart models to BI dashboards. If you cannot trace a KPI back to its source table, your lineage is incomplete.
- **Lineage is derived, not documented:** Use dbt's `manifest.json` and DAG to auto-generate lineage. Manual documentation always drifts.
- **Lineage enables impact analysis:** Before changing a staging model, query the lineage to understand downstream consumers. This is how you avoid "I didn't know that fed the CEO's dashboard."
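
Impact analysis amounts to a graph traversal. The sketch below walks a mapping of node to direct children, which is the shape of the `child_map` section in dbt's `manifest.json`, and returns everything downstream of a node. The node IDs and the in-memory sample are hypothetical; in a real project the mapping would be read from the manifest file.

```python
from collections import deque


def downstream(child_map: dict[str, list[str]], node_id: str) -> list[str]:
    """Breadth-first walk returning every node reachable from node_id."""
    seen: set[str] = set()
    queue = deque([node_id])
    while queue:
        current = queue.popleft()
        for child in child_map.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Hypothetical excerpt shaped like manifest.json's child_map.
child_map = {
    "model.shop.stg_orders": ["model.shop.int_orders_enriched"],
    "model.shop.int_orders_enriched": ["model.shop.fct_orders"],
    "model.shop.fct_orders": ["exposure.shop.ceo_dashboard"],
    "model.shop.stg_customers": ["model.shop.dim_customer"],
}

impacted = downstream(child_map, "model.shop.stg_orders")
```

Here a change to `stg_orders` reaches the CEO dashboard exposure three hops away, while `dim_customer` is unaffected.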

### 2.5 Data Cataloguing

**Principle:** Data assets are discoverable, searchable, and documented in a central catalogue.

- **Catalogue early and incrementally:** Don't wait for a perfect taxonomy. Start with dbt docs and a basic data catalogue (Atlan, Alation, DataHub, or even a well-maintained Notion page).
- **Every model documented:** dbt's `description` fields on models and columns are non-negotiable. Undocumented models are undocumented architecture.
- **Tag and classify:** Use tags for domain, sensitivity, and refresh cadence. Classification (PII, confidential, public) should be automated where possible.
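
A simple starting point for automated classification is pattern matching on column names. The patterns and labels below are illustrative assumptions only; name-based matching will miss PII hidden in free-text columns, so it supplements human review rather than replacing it.

```python
import re

# Hypothetical name patterns that suggest personal data.
PII_PATTERNS = [
    re.compile(pattern)
    for pattern in (
        r"email",
        r"phone",
        r"(^|_)ssn($|_)",
        r"date_of_birth|(^|_)dob($|_)",
        r"(first|last|full)_name",
    )
]


def classify_column(name: str) -> str:
    """Return 'pii' if the column name matches a pattern, else 'unclassified'."""
    lowered = name.lower()
    if any(pattern.search(lowered) for pattern in PII_PATTERNS):
        return "pii"
    return "unclassified"


columns = ["customer_email", "Date_Of_Birth", "order_total", "last_name"]
classification = {column: classify_column(column) for column in columns}
```

Non-matching columns default to `unclassified` rather than `public`, so that nothing is treated as safe to share until someone has reviewed it.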

---

## Appendix A: Source References

| Source | Area | Key Contribution |
|--------|------|------------------|
| DAMA-DMBOK (2nd Ed.) | Data Quality, Governance | Six dimensions of data quality; DAMA-DMBOK knowledge areas |
| dbt Best Practices | AE, Architecture | Model organisation, testing, documentation, project structure |
| DataOps Manifesto (dataopsmanifesto.org) | DataOps | Eighteen principles of DataOps methodology |
| Google's Data Quality Framework | Data Quality | Cloud data quality dimensions and measurement |
| Snowflake Well-Architected Framework | Cloud Platform | Compute-storage separation, RBAC, cost optimisation |
| GDPR (Regulation 2016/679) | Privacy & Ethics | Privacy by design, data minimisation, right to erasure |
| Martin Fowler - DataMonolith Pattern | Architecture | Data mesh and domain-oriented data ownership |
| Gartner - Data Maturity Models | Consultancy | Maturity assessment frameworks |
| dbt Semantic Layer Documentation | AE, Metrics | Metric definitions, semantic layer architecture |
| The Data Contract Specification (datacontract.com) | Architecture | Data contract structure and enforcement |

Part of [[Data-Principles-for-Analytics-Engineering]]. See also: [[Data-Architecture]], [[DataOps-and-Data-Engineering]], [[AE-Consultancy-Delivery]].

## Sub-pages
- [[Data-Quality-and-Governance β Sub-pages]]