Data Architecture
Data architecture principles for an Analytics Engineering consultancy: how we design the structural integrity of the platform. Grounded in Martin Fowler, dbt best practices, and the Data Contract Specification.
## 3. Data Architecture Principles

> "Architecture is the decisions you wish you'd got right early on." (Martin Fowler, adapted)

Data architecture defines the structural integrity of your platform. These principles ensure your architecture scales, remains understandable, and supports reliable analytics.

### 3.1 Single Source of Truth

Principle: Each business concept is materialised in exactly one canonical location; all consumers reference that location.

- One model per grain per domain: A customer's current state lives in one mart model. If three dashboards need three slightly different customer views, build them as lightweight views off the canonical model, not as three separate pipelines.
- Centralise business logic, decentralise consumption: The mart layer is where business logic converges. Downstream consumption (dashboards, exports, ML features) diverges, but always references the same core model.
- Avoid the "just pull it from the source" anti-pattern: When analysts bypass the warehouse to query source systems directly, you lose version control, lineage, and testing. Ingest first, transform second, consume third.

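As a rough illustration of "lightweight views off the canonical model", the sketch below uses Python's built-in sqlite3 in place of a warehouse. The table dim_customer, its columns, and both view names are hypothetical; in a dbt project they would more likely be view-materialised models that reference the mart model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Canonical mart model: one row per customer (hypothetical columns).
conn.execute(
    "CREATE TABLE dim_customer ("
    " customer_id INTEGER PRIMARY KEY,"
    " region TEXT NOT NULL,"
    " is_active INTEGER NOT NULL,"
    " lifetime_value REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?, ?)",
    [(1, "EMEA", 1, 1200.0), (2, "EMEA", 0, 300.0), (3, "APAC", 1, 800.0)],
)

# Consumer-specific shapes are views over the canonical model, not new pipelines.
conn.execute(
    "CREATE VIEW v_active_customers AS"
    " SELECT customer_id, region FROM dim_customer WHERE is_active = 1"
)
conn.execute(
    "CREATE VIEW v_ltv_by_region AS"
    " SELECT region, SUM(lifetime_value) AS total_ltv"
    " FROM dim_customer GROUP BY region"
)

# A correction made once in the canonical model reaches every consumer.
conn.execute("UPDATE dim_customer SET is_active = 1 WHERE customer_id = 2")

active_ids = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM v_active_customers ORDER BY customer_id"
    )
]
ltv_by_region = dict(conn.execute("SELECT region, total_ltv FROM v_ltv_by_region"))
```

Because both views read from the same table, the single UPDATE is enough; with three separate pipelines the same fix would have to be repeated in each one.
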
### 3.2 Idempotent ELT

Principle: Every transformation must be safely re-runnable and produce identical results for identical inputs, regardless of how many times it runs.

- Idempotency is non-negotiable: A dbt incremental model must set unique_key and an appropriate incremental_strategy so that re-running never duplicates or corrupts data.
- Prefer insert_overwrite over append: Where the adapter supports it, insert_overwrite with a partition key replaces whole partitions, so a re-run rewrites the same partitions instead of adding rows. A raw append creates duplicates on re-run.
- Full-refresh as fallback: Always ensure models can be rebuilt with --full-refresh. If a model cannot be rebuilt from scratch, it is fragile by design.

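The difference between the two load styles can be sketched in plain Python. This is a toy model of the semantics only, not dbt's implementation; the function names, the event_date partition key, and the sample rows are all assumptions made for illustration.

```python
def append_load(target, batch):
    """Naive append: every run adds the whole batch again."""
    return target + batch


def insert_overwrite_load(target, batch, partition_key="event_date"):
    """Replace every partition that appears in the batch; keep the rest."""
    incoming = {row[partition_key] for row in batch}
    kept = [row for row in target if row[partition_key] not in incoming]
    return kept + batch


existing = [{"event_date": "2024-01-01", "order_id": 9, "amount": 20}]
batch = [
    {"event_date": "2024-01-02", "order_id": 10, "amount": 50},
    {"event_date": "2024-01-02", "order_id": 11, "amount": 75},
]

# Running the same batch twice with partition overwrite changes nothing.
once = insert_overwrite_load(existing, batch)
twice = insert_overwrite_load(once, batch)

# Running the same batch twice with append duplicates the partition.
appended_twice = append_load(append_load(existing, batch), batch)
```

Here `once == twice` holds, while `appended_twice` contains the 2024-01-02 rows twice. Availability and exact behaviour of insert_overwrite differ between dbt adapters, so check the adapter's documentation before relying on it.
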
### 3.3 Incremental Processing

Principle: Process only new or changed data; avoid full-table scans where incremental logic is safe and correct.

- Incremental for volume, full-refresh for simplicity: If a model processes millions of rows daily but only thousands are new, incremental processing saves time and money. For small tables, full-refresh is simpler and equally fast.
- Test incremental logic explicitly: The most dangerous bugs hide in incremental logic. Verify that an incremental run produces the same result as a full rebuild (run the model with dbt's --full-refresh flag and compare the outputs).
- Watermark and late-arriving data: Choose incremental keys (typically updated_at or _loaded_at) carefully. Account for late-arriving data with lookback windows.

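The watermark and lookback ideas above can be sketched as follows. This is a minimal in-memory model under stated assumptions: rows are dicts keyed on a hypothetical order_id, updated_at is the incremental key, and the three-hour lookback is an arbitrary value chosen for the example.

```python
from datetime import datetime, timedelta


def full_refresh(source):
    """Rebuild from scratch: keep the latest version of each order_id."""
    latest = {}
    for row in sorted(source, key=lambda r: r["updated_at"]):
        latest[row["order_id"]] = row
    return latest


def incremental_run(target, source, lookback=timedelta(0)):
    """Merge rows newer than (max updated_at in target - lookback) by order_id."""
    if not target:
        return full_refresh(source)
    watermark = max(r["updated_at"] for r in target.values()) - lookback
    merged = dict(target)
    new_rows = [r for r in source if r["updated_at"] > watermark]
    for row in sorted(new_rows, key=lambda r: r["updated_at"]):
        merged[row["order_id"]] = row  # upsert on the unique key, never append
    return merged


t = datetime(2024, 1, 1)
source = [
    {"order_id": 1, "status": "placed", "updated_at": t + timedelta(hours=10)},
    {"order_id": 2, "status": "placed", "updated_at": t + timedelta(hours=12)},
]
target = incremental_run({}, source)

# A row stamped 11:00 lands after the first run has already seen 12:00.
late_row = {"order_id": 3, "status": "placed", "updated_at": t + timedelta(hours=11)}
source_after = source + [late_row]

without_lookback = incremental_run(target, source_after)
with_lookback = incremental_run(target, source_after, lookback=timedelta(hours=3))
expected = full_refresh(source_after)
```

Without a lookback the late row is silently skipped, so `without_lookback` differs from `expected`. With the lookback the window is re-scanned and, because the merge is keyed on order_id, the reprocessed rows do not create duplicates. Comparing against `full_refresh` is the same check the bullet on testing incremental logic describes.
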
### 3.4 Separation of Raw / Staging / Mart

Principle: A clear three-tier architecture enforces separation of concerns: raw preserves source fidelity, staging standardises, marts serve business logic.

- Raw (Bronze / Landing): Load-and-forget. No transformations, no filtering, no renaming. Preserve source data exactly as received. This is your replay log: if you need to reprocess, you start here.
- Staging (Silver): One-to-one with source tables. Rename, recast, deduplicate, and apply lightweight business rules (e.g., is_valid_flag). Staging is the single point where source naming conventions meet warehouse naming conventions.
- Mart (Gold): Business-facing models organised by domain. Join, aggregate, and apply business logic. Mart models are what analysts and BI tools consume. They must be well-tested, documented, and stable.
- Intermediate models: Between staging and mart, use intermediate models for complex multi-step transformations. This keeps mart SQL readable and staging pure.

> dbt project structure mapping:
>
>     models/
>     ├── staging/        # One .sql per source table, light transforms
>     ├── intermediate/   # Multi-step joins, pivots, business logic prep
>     └── marts/          # Star schema, fact/dim tables, domain-organised
>         ├── finance/
>         ├── marketing/
>         └── operations/

### 3.5 Data Contracts

Principle: Interfaces between data producers and consumers are formally agreed, versioned, and enforced.

- A data contract defines: Schema (columns, types, constraints), freshness SLA, quality thresholds, ownership, and breaking change protocol.
- Contracts are bidirectional: Source system teams contract with the platform team on what they'll deliver; the platform team contracts with analysts on what they can rely on.
- Enforce with dbt contracts: Use dbt model contracts (a contract block in the model config with enforced: true) to lock column names and types and prevent silent schema changes. This is your strongest guardrail against drift.
- Version your contracts: Treat contracts like APIs. Breaking changes (removing a column, changing a type) require versioning and consumer notification. Non-breaking changes (adding nullable columns) can be additive.

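The enforcement and versioning rules above can be sketched as two small checks. This is an illustration only, not how dbt or the Data Contract Specification implement them: the schema representation (column name mapped to type and nullability), the function names, and the sample columns are hypothetical. Treating a newly added non-nullable column as breaking is an assumption of this sketch; the text above only classifies nullable additions.

```python
def contract_violations(contract, actual):
    """List the ways an actual schema departs from the contracted one."""
    problems = []
    for col, spec in contract.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col]["type"] != spec["type"]:
            problems.append(
                f"type drift on {col}: {spec['type']} -> {actual[col]['type']}"
            )
    return problems


def classify_change(old, new):
    """Return 'breaking' or 'non-breaking' for a proposed contract change."""
    for col, spec in old.items():
        if col not in new or new[col]["type"] != spec["type"]:
            return "breaking"  # column removed or type changed
    for col in new.keys() - old.keys():
        if not new[col]["nullable"]:
            return "breaking"  # assumption: required additions break producers
    return "non-breaking"


contract = {
    "customer_id": {"type": "integer", "nullable": False},
    "email": {"type": "text", "nullable": True},
}

# Silent drift: one type changed and one column dropped upstream.
drifted = {"customer_id": {"type": "text", "nullable": False}}
violations = contract_violations(contract, drifted)

# Additive change: a new nullable column only.
extended = dict(contract, signup_channel={"type": "text", "nullable": True})
additive_verdict = classify_change(contract, extended)
drift_verdict = classify_change(contract, drifted)
```
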
---
Part of [[Data-Principles-for-Analytics-Engineering]]. See also: [[Data-Quality-and-Governance]], [[DataOps-and-Data-Engineering]], [[AE-Consultancy-Delivery]].
Related
- [[Data-Principles-for-Analytics-Engineering]]
- [[Data-Quality-and-Governance]]
- [[DataOps-and-Data-Engineering]]
- [[AE-Consultancy-Delivery]]
- [[10-Opportunities-for-Additional-Value]]