Data Architecture

Data architecture principles for an Analytics Engineering consultancy: how we design the structural integrity of the platform. Grounded in Martin Fowler, dbt best practices, and the Data Contract Specification.


## 3. Data Architecture Principles

> "Architecture is the decisions you wish you'd got right early on." (Martin Fowler, adapted)

Data architecture defines the structural integrity of your platform. These principles ensure your architecture scales, remains understandable, and supports reliable analytics.

### 3.1 Single Source of Truth

Principle: Each business concept is materialised in exactly one canonical location; all consumers reference that location.

- One model per grain per domain: A customer's current state lives in one mart model. If three dashboards need three slightly different customer views, build them as lightweight views off the canonical model, not as three separate pipelines.
- Centralise business logic, decentralise consumption: The mart layer is where business logic converges. Downstream consumption (dashboards, exports, ML features) diverges, but always references the same core model.
- Avoid the "just pull it from the source" anti-pattern: When analysts bypass the warehouse to query source systems directly, you lose version control, lineage, and testing. Ingest first, transform second, consume third.

### 3.2 Idempotent ELT

Principle: Every transformation must be safely re-runnable and produce identical results for identical inputs, regardless of how many times it runs.

- Idempotency is non-negotiable: A dbt incremental model must set unique_key and an appropriate incremental_strategy so that re-running never duplicates or corrupts data.
- Prefer insert_overwrite over append: For incremental models, insert_overwrite with a partition key (on adapters that support it) replaces whole partitions, so a re-run rewrites the same rows instead of adding to them. Raw append creates duplicates on re-run.
- Full-refresh as fallback: Always ensure models can be rebuilt with the --full-refresh flag. If a model cannot be rebuilt from scratch, it is fragile by design.
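
The difference between append and partition overwrite can be sketched outside dbt. The example below is a minimal illustration using Python's built-in sqlite3 module as a stand-in warehouse; the table names (orders_append, orders_overwrite) and both loader functions are hypothetical, and the delete-then-insert step only approximates what a partition-replacing strategy does, it is not dbt's implementation.

```python
import sqlite3


def append_load(conn, rows):
    """Naive append: every run inserts the batch again."""
    conn.executemany("INSERT INTO orders_append VALUES (?, ?, ?)", rows)


def overwrite_partition_load(conn, rows):
    """Replace each partition (order_date) present in the batch, then insert."""
    dates = sorted({row[1] for row in rows})
    with conn:  # delete and insert commit together, or not at all
        conn.executemany(
            "DELETE FROM orders_overwrite WHERE order_date = ?",
            [(d,) for d in dates],
        )
        conn.executemany("INSERT INTO orders_overwrite VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_append (order_id INTEGER, order_date TEXT, amount REAL)")
conn.execute("CREATE TABLE orders_overwrite (order_id INTEGER, order_date TEXT, amount REAL)")

# Hypothetical batch: (order_id, order_date, amount)
batch = [(1, "2024-01-01", 10.0), (2, "2024-01-01", 25.0), (3, "2024-01-02", 7.5)]

# Simulate an accidental re-run of the same batch.
for _ in range(2):
    append_load(conn, batch)
    overwrite_partition_load(conn, batch)

append_count = conn.execute("SELECT COUNT(*) FROM orders_append").fetchone()[0]
overwrite_count = conn.execute("SELECT COUNT(*) FROM orders_overwrite").fetchone()[0]
overwrite_rows = conn.execute(
    "SELECT order_id, order_date, amount FROM orders_overwrite ORDER BY order_id"
).fetchall()

print(append_count, overwrite_count)  # 6 3
```

After two runs the appended table holds every row twice, while the overwritten table matches what a single clean load would have produced. That equality is the property to assert when you test a real incremental model.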

### 3.3 Incremental Processing

Principle: Process only new or changed data; avoid full-table scans where incremental logic is safe and correct.

- Incremental for volume, full-refresh for simplicity: If a model processes millions of rows daily but only thousands are new, incremental processing saves time and money. For small tables, full-refresh is simpler and equally fast.
- Test incremental logic explicitly: The most dangerous bugs hide in incremental logic. Test that re-running an incremental model produces the same result as a full-refresh (use dbt's --full-refresh flag to verify).
- Watermark and late-arriving data: Choose incremental keys (typically updated_at or _loaded_at) carefully. Account for late-arriving data with lookback windows.

### 3.4 Separation of Raw / Staging / Mart

Principle: A clear three-tier architecture enforces separation of concerns: raw preserves source fidelity, staging standardises, marts serve business logic.

- Raw (Bronze / Landing): Load-and-forget. No transformations, no filtering, no renaming. Preserve source data exactly as received. This is your replay log: if you need to reprocess, you start here.
- Staging (Silver): One-to-one with source tables. Rename, recast, deduplicate, and apply lightweight business rules (e.g., is_valid_flag). Staging is the single point where source naming conventions meet warehouse naming conventions.
- Mart (Gold): Business-facing models organised by domain. Join, aggregate, and apply business logic. Mart models are what analysts and BI tools consume. They must be well-tested, documented, and stable.
- Intermediate models: Between staging and mart, use intermediate models for complex multi-step transformations. This keeps mart SQL readable and staging pure.
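
The division of responsibilities between the tiers can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a dbt project: the source column names (OrderID, CustID, Amt), the sample rows, and the function names stg_orders and fct_customer_revenue are all hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

# Raw: kept exactly as received (strings, source column names, duplicates).
RAW_ORDERS = [
    {"OrderID": "1", "CustID": "A", "Amt": "10.00", "_loaded_at": "2024-01-01T00:00:00"},
    {"OrderID": "1", "CustID": "A", "Amt": "12.00", "_loaded_at": "2024-01-02T00:00:00"},
    {"OrderID": "2", "CustID": "B", "Amt": "5.50", "_loaded_at": "2024-01-01T00:00:00"},
]


def stg_orders(raw_rows):
    """Staging: rename, recast, and keep the latest version of each order."""
    latest = {}
    for row in raw_rows:
        staged = {
            "order_id": int(row["OrderID"]),
            "customer_id": row["CustID"],
            "amount": Decimal(row["Amt"]),
            "loaded_at": row["_loaded_at"],
        }
        current = latest.get(staged["order_id"])
        # ISO-8601 timestamps in the same format compare correctly as strings.
        if current is None or staged["loaded_at"] > current["loaded_at"]:
            latest[staged["order_id"]] = staged
    return sorted(latest.values(), key=lambda r: r["order_id"])


def fct_customer_revenue(staged_rows):
    """Mart: business-facing aggregate built only from staging output."""
    totals = defaultdict(Decimal)
    for row in staged_rows:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


staged = stg_orders(RAW_ORDERS)
revenue = fct_customer_revenue(staged)
print(revenue)
```

The raw list is never mutated, so it stays available as the replay log. The mart function knows nothing about source naming, which is the point of routing every source through staging first.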

> dbt project structure mapping:
>
>     models/
>     ├── staging/          # One .sql per source table, light transforms
>     ├── intermediate/     # Multi-step joins, pivots, business logic prep
>     └── marts/            # Star schema, fact/dim tables, domain-organised
>         ├── finance/
>         ├── marketing/
>         └── operations/

### 3.5 Data Contracts

Principle: Interfaces between data producers and consumers are formally agreed, versioned, and enforced.

- A data contract defines: Schema (columns, types, constraints), freshness SLA, quality thresholds, ownership, and breaking change protocol.
- Contracts are bidirectional: Source system teams contract with the platform team on what they'll deliver; the platform team contracts with analysts on what they can rely on.
- Enforce with dbt contracts: Use dbt model contracts (set contract: {enforced: true} in the model config) to lock column names and data types and prevent silent schema changes. This is your strongest guardrail against drift.
- Version your contracts: Treat contracts like APIs. Breaking changes (removing a column, changing a type) require versioning and consumer notification. Non-breaking changes (adding nullable columns) can be additive.

---
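
The breaking versus non-breaking distinction from Section 3.5 can be checked mechanically. The sketch below is a minimal, hypothetical schema comparison in plain Python; the Column class, the breaking_changes function, and the example columns are assumptions made for illustration and do not reflect how dbt or the Data Contract Specification implement contract checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str
    nullable: bool = True


def breaking_changes(old, new):
    """Return the reasons why schema `new` breaks consumers of schema `old`."""
    new_by_name = {col.name: col for col in new}
    problems = []
    for col in old:
        candidate = new_by_name.get(col.name)
        if candidate is None:
            problems.append(f"removed column: {col.name}")
        elif candidate.data_type != col.data_type:
            problems.append(
                f"type change on {col.name}: {col.data_type} -> {candidate.data_type}"
            )
    return problems


# Hypothetical contract for a customer mart, version 1.
v1 = [Column("customer_id", "integer", nullable=False), Column("email", "text")]

# Additive change: a new nullable column.
v2_additive = v1 + [Column("signup_channel", "text")]

# Breaking change: a type change plus a dropped column.
v2_breaking = [Column("customer_id", "text", nullable=False)]

additive_result = breaking_changes(v1, v2_additive)
breaking_result = breaking_changes(v1, v2_breaking)
print(additive_result)
print(breaking_result)
```

An empty result means the change can ship as additive; a non-empty one means a new contract version and consumer notification. A check like this could run in CI against the proposed schema before a model change is merged.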


Part of [[Data-Principles-for-Analytics-Engineering]]. See also: [[Data-Quality-and-Governance]], [[DataOps-and-Data-Engineering]], [[AE-Consultancy-Delivery]].

  • [[Data-Principles-for-Analytics-Engineering]]
  • [[Data-Quality-and-Governance]]
  • [[DataOps-and-Data-Engineering]]
  • [[AE-Consultancy-Delivery]]
  • [[10-Opportunities-for-Additional-Value]]