AE Consultancy Delivery¶
Consultancy craft principles for Analytics Engineering β dbt practices, cloud platform choices, privacy & ethics, client engagement, and knowledge transfer. Grounded in dbt Labs, Snowflake well-architected framework, GDPR, and Gartner maturity models.
251|## 5. Analytics Engineering Principles
252|
253|> "Analytics engineering is the data transformation layer between raw data and business decisions." β dbt Labs
254|
255|Analytics engineering (AE) is where data architecture, software engineering, and domain knowledge converge. These principles define how AE practitioners should work, particularly with dbt.
256|
257|### 5.1 dbt Model Organisation
258|
259|Principle: Models are organised by function and domain, not by source or team.
260|
261|- Three-tier structure (raw β staging β mart): See Β§3.4. Never skip layers. Each layer has a defined purpose and constraint.
262|- Domain-oriented marts: Organise mart models by business domain (finance, marketing, operations), not by source system. Domains are what business users understand.
263|- One thing per model: A model does one logical transformation. If a model does three joins and an aggregation, consider splitting into intermediate + mart models.
264|- Naming conventions:
265| - Staging: stg_[source]_[entity] (e.g., stg_stripe__charges)
266| - Intermediate: int_[domain]_[description] (e.g., int_marketing__campaign_performance)
267| - Mart: [domain]_[entity]_[grain] (e.g., fct_orders, dim_customers)
268|
269|### 5.2 dbt Testing
270|
271|Principle: Testing is the primary mechanism for ensuring data quality at the transformation layer.
272|
273|- Minimum test coverage:
274| - Every model: unique on primary key, not_null on primary key and critical business fields
275| - Every foreign key: relationships test
276| - Every status/boolean field: accepted_values test
277|- Custom generic tests for domain rules: Write reusable tests for common business patterns (positive revenue, valid date ranges, referential integrity with conditions).
278|- Store test severity: Not all tests are equal. Use error_if and warn_if to differentiate hard failures from soft warnings.
279|- Test results are part of the data product: Publish test results. Consumers should be able to see data quality scores alongside the data itself.
280|
281|### 5.3 dbt Documentation
282|
283|Principle: If a model isn't documented, it isn't ready for production.
284|
285|- Every model and column described: Use description fields in properties.yml. Business descriptions, not technical ones. "Amount paid by customer in GBP" not "numeric(18,2) column."
286|- Documentation as code, not wiki: Docs live in YAML alongside models. Not in Confluence. Not in READMEs that no one reads. Generate static docs with dbt docs generate.
287|- Document assumptions and caveits: If a model excludes certain edge cases (e.g., "excludes test accounts"), document it. Future-you will thank present-you.
288|
289|### 5.4 dbt Macros and Packages
290|
291|Principle: Macros and packages reduce duplication, enforce consistency, and encode institutional knowledge.
292|
293|- DRY principle (Don't Repeat Yourself): If you write the same SQL logic more than twice, it should be a macro. Common patterns: date truncation, surrogate key generation, SCD logic.
294|- Local macros before packages: If a macro is domain-specific, keep it in the project. If it's broadly useful, consider open-sourcing it. Only use packages for truly universal patterns.
295|- Evaluate packages critically: dbt_utils and dbt_expectations are near-essential. Beyond these, justify each package. Every dependency is a maintenance burden.
296|- Pin package versions: Always specify exact versions in packages.yml. Unpinned dependencies will break your project.
297|
298|### 5.5 Metric Layer and Semantic Layer
299|
300|Principle: Business metrics are defined once, computed consistently, and served everywhere.
301|
302|- The metric layer problem: Without a central metric definition, "revenue" means different things in different dashboards. This is the root cause of most "whose number is right?" disputes.
303|- dbt Semantic Layer / metrics: Use dbt's metric definitions (in YAML) as the canonical source. Metrics should have clear descriptions, dimensions, filters, and calculation methods.
304|- Metrics as code: Metric definitions are version-controlled, tested, and reviewed like any other code. They are the contract between AE and the business.
305|- From semantic layer to consumption: The semantic layer feeds BI tools, ad-hoc queries, and operational systems. It is the single point of metric truth.
306|
307|---
308|
309|## 6. Cloud Data Platform Principles
310|
311|> "Cloud-native data platforms maximise the separation of compute and storage, the elasticity of resources, and the governance of access." β Snowflake Architecture Principles
312|
313|These principles guide the selection and architecture of cloud data platforms, with a focus on the modern warehouse-first approach exemplified by Snowflake, BigQuery, and Databricks.
314|
315|### 6.1 Cloud-Native
316|
317|Principle: Design for the cloud's strengths β elasticity, managed services, and consumption-based pricing β not by replicating on-premises patterns.
318|
319|- Serverless where sensible: Use serverless compute (Snowflake auto-suspend, BigQuery on-demand) for variable workloads. Reserve provisioned compute for predictable, high-throughput pipelines.
320|- Managed services over self-hosted: Prefer managed Airflow (MWAA), managed dbt (dbt Cloud), and managed observability over self-hosted equivalents. Reduce operational overhead, not just cost.
321|- Resist the lift-and-shift temptation: Moving on-premises SSIS packages to EC2 is not cloud-native. Redesign pipelines for ELT patterns and cloud-native services.
322|
323|### 6.2 Warehouse-First Architecture
324|
325|Principle: The data warehouse is the central processing engine; the cloud provides storage, compute, and services around it.
326|
327|- Warehouse as transformation engine: All transformation logic (joins, aggregations, business rules) executes in the warehouse using SQL. Do not transform in Python scripts, Airflow tasks, or custom services.
328|- ELT, not ETL: Extract raw data, load it into the warehouse, then transform with dbt. The warehouse's distributed compute is far more powerful than any mid-pipeline transformation layer.
329|- Warehouse governs access: Role-based access, row-level security, and column masking are enforced in the warehouse. Not in the BI tool, not in the API layer, not in application logic.
330|
331|### 6.3 Compute-Storage Separation
332|
333|Principle: Storage and compute scale independently; never over-provision one to accommodate the other.
334|
335|- Snowflake virtual warehouses: Size warehouses to the workload. XS for dbt models, XL for large aggregations, auto-suspend after idle periods. One warehouse does not fit all.
336|- Storage is cheap, compute is expensive: Never optimise storage at the cost of compute complexity. Store data in its most usable form (wide, denormalised marts) even if it uses more disk.
337|- Multi-cluster for concurrency: Use Snowflake's multi-cluster warehouses for concurrent BI query patterns. Don't queue queries unnecessarily.
338|
339|### 6.4 Role-Based Access Control (RBAC)
340|
341|Principle: Access is granted to roles, not individuals. Roles follow the principle of least privilege and align with organisational structure.
342|
343|- Functional roles: DATA_ENGINEER, ANALYST, READONLY_REPORTER. Not JANE_DOE.
344|- Warehouse-role binding: Analysts should not run queries on the engineering warehouse. Bind roles to appropriately sized warehouses.
345|- Data classification drives access: PII data is restricted to PII_READER roles. Financial data to FINANCE_ANALYST. Public data to READONLY. Classify first, grant second.
346|- Row-level security for multi-tenant data: Use dynamic data masking and row-access policies (Snowflake) to restrict visibility without duplicating tables.
347|
348|### 6.5 Cost Optimisation
349|
350|Principle: Cloud data costs are variable and controllable; treat cost as a first-class architectural concern.
351|
352|- Monitor and attribute costs: Tag queries by team, project, and cost centre. Snowflake's QUERY_HISTORY and resource monitors make this possible.
353|- Right-size compute: Start small (XS), measure, then scale up. Over-provisioned warehouses waste more than under-provisioned ones (which just run slower).
354|- Caching is free performance: Clustering keys and result caching in Snowflake are free. Use them before scaling up compute.
355|- Credit budgets, not unlimited spending: Set resource monitors with hard and soft limits. Alert at 75%, suspend at 100%. No surprise bills.
356|- Optimise the top 10 queries: In most platforms, 10% of queries cause 90% of cost. Identify and optimise these first.
357|
358|---
359|
360|## 7. Privacy & Ethics Principles
361|
362|> "Privacy is not the opposite of analytics β poorly designed analytics is the opposite of good analytics." β Adapted from GDPR Recital 78
363|
364|Privacy and ethics are not regulatory afterthoughts β they are design constraints that, when properly integrated, produce better, more trustworthy data products.
365|
366|### 7.1 GDPR and Regulation by Design
367|
368|Principle: Data protection is embedded into the design and operation of data systems from the outset, not bolted on at the end.
369|
370|- Privacy by design (Article 25, GDPR): Collect only what's needed, process only what's justified, retain only as long as required. Build these constraints into dbt models (filter PII in staging where possible).
371|- Data Protection Impact Assessments (DPIAs): Before building a new data product that processes personal data, conduct a DPIA. Document the purpose, necessity, and proportionality.
372|- Lawful bases matter: Know your lawful basis for each processing activity (consent, contract, legitimate interest, etc.) and document it in your data catalogue.
373|- Regulation varies by jurisdiction: UK GDPR, EU GDPR, CCPA, and others have different requirements. Design for the strictest applicable regulation and localise where necessary.
374|
375|### 7.2 Data Minimisation
376|
377|Principle: Collect and retain only the data that is strictly necessary for the stated purpose.
378|
379|- Minimise at ingestion: When loading from source systems, exclude columns you don't need. If you'll never use the middle_name column, don't load it.
380|- Minimise at transformation: Mart models consumed by analysts should contain only the columns needed for analysis. A customer dimension for marketing doesn't need national_insurance_number.
381|- Aggregation as privacy tool: Aggregate data where individual-level records aren't needed. A daily sales summary is often sufficient and inherently less risky.
382|- Document minimisation decisions: If you exclude PII from a model, note it in the model description. This is evidence of responsible design.
383|
384|### 7.3 Right to Erasure
385|
386|Principle: Data subjects have the right to request deletion of their personal data; systems must be able to comply efficiently.
387|
388|- Design for deletability: If a system cannot efficiently delete an individual's data, it was not designed for compliance. Ensure every PII-bearing model can be filtered or rebuilt for a specific subject.
389|- Cascading deletes: When personal data is deleted from a source, the deletion must propagate through staging, intermediate, and mart models. Re-running dbt after source deletion achieves this.
390|- Retention policies: Define and enforce data retention periods. Archive or anonymise data that exceeds its retention window. Automate this with dbt models or warehouse tasks.
391|- Audit trail of erasure: Log erasure requests and completions. Regulators and data subjects may ask for proof.
392|
393|### 7.4 Consent Management
394|
395|Principle: Where consent is the lawful basis, it must be freely given, specific, informed, and unambiguous β and withdrawable.
396|
397|- Consent as metadata: Store consent flags with the data they govern. A marketing consent flag should travel with the customer record, not be a separate lookup.
398|- Consent withdrawal must cascade: If a subject withdraws consent for marketing, their data must be excluded from marketing models immediately. Build this as a filter in staging, not a manual process.
399|- Don't process without consent metadata: If you cannot verify consent status for a record, treat it as unconsented. Err on the side of non-processing.
400|
401|---
402|
403|## 8. Consultancy Delivery Principles
404|
405|> "A consultancy delivers value through expertise transfer, not expertise dependency."
406|
407|These principles govern how an Analytics Engineering consultancy engages clients, delivers work, and ensures lasting impact.
408|
409|### 8.1 Client Engagement
410|
411|Principle: Engagement begins with understanding the client's business context, not their technology stack.
412|
413|- Listen before architecting: Understand the business problem, the decision landscape, and the organisational culture before discussing Snowflake vs. BigQuery.
414|- Stakeholder mapping: Identify data producers, data consumers, executive sponsors, and potential blockers. The technical solution is often easier than the political one.
415|- Language matters: Speak the client's language. Finance teams think in margins and variances, not in fact and dimension tables. Translate accordingly.
416|- Set expectations early: Be clear about what AE can deliver, what requires upstream changes, and what is genuinely a people problem disguised as a data problem.
417|
418|### 8.2 Maturity Assessment
419|
420|Principle: Before proposing solutions, assess where the client sits on the data maturity spectrum.
421|
422|- Five maturity levels:
423| 1. Initial: Ad-hoc, no processes, hero-dependent.
424| 2. Developing: Some tools, limited standardisation, quality is inconsistent.
425| 3. Defined: Standard processes exist, but compliance varies.
426| 4. Managed: Processes are measured, monitored, and enforced.
427| 5. Optimised: Continuous improvement, automated governance, data-driven culture.
428|- Assess all dimensions: Technology alone is not maturity. Assess people (skills, culture), process (governance, workflows), data (quality, architecture), and technology (tooling, infrastructure).
429|- Diagnose the bottleneck: Most clients have one critical constraint. Identify it. A dbt implementation doesn't help if the constraint is "we have no source data pipeline."
430|
431|### 8.3 Phased Delivery
432|
433|Principle: Deliver in small, valuable increments β not in a big bang at the end.
434|
435|- Phase 0 β Discovery (1β2 weeks): Stakeholder interviews, current-state assessment, architecture review, maturity assessment. Deliverable: discovery report and roadmap.
436|- Phase 1 β Foundation (4β6 weeks): Source integration, raw layer, basic dbt project, CI/CD pipeline, initial tests. Deliverable: working end-to-end pipeline with 1β2 critical data products.
437|- Phase 2 β Acceleration (4β8 weeks): Expand marts, implement governance, add monitoring, train the client team. Deliverable: production-ready platform with documented processes.
438|- Phase 3 β Optimisation (ongoing): Performance tuning, advanced features, self-service enablement, knowledge transfer. Deliverable: client independence.
439|- Never ship without value: Each phase must deliver something the client can use. No "we spent six weeks and here's a framework."
440|
441|### 8.4 Knowledge Transfer
442|
443|Principle: The consultancy's success is measured by how little the client needs the consultancy after engagement.
444|
445|- Pair programming over documentation alone: Sit with client engineers and analysts. Walk through code together. Knowledge transfer is a social process, not a document.
446|- Train the trainers: Identify internal champions and invest disproportionately in them. They will propagate knowledge after you leave.
447|- Gradual responsibility transfer: Phase from "consultant does, client watches" to "client does, consultant watches" to "client does, consultant reviews" to "client is independent."
448|
449|### 8.5 Documentation as Deliverable
450|
451|Principle: Documentation is not an afterthought β it is a first-class deliverable, as important as the code.
452|
453|- Four documentation types:
454| 1. Architecture Decision Records (ADRs): Why we chose X over Y, with context and consequences.
455| 2. Runbooks: How to operate the platform. Incident response, pipeline restarts, monitoring interpretation.
456| 3. Data Dictionary: What each model means, its grain, freshness SLA, and known issues.
457| 4. Onboarding Guide: How a new team member gets started. Environment setup, access requests, where things live.
458|- Documentation as code: dbt YAML, README.md, and ADRs in Git. Not Word documents in a shared drive. Version-controlled documentation ages slower.
459|- Client signs off on documentation: Make documentation acceptance an explicit milestone. If the client doesn't accept the docs, the engagement isn't complete.
460|
461|---
501|
Part of [[Data-Principles-for-Analytics-Engineering]]. See also: [[Data-Quality-and-Governance]], [[Data-Architecture]], [[DataOps-and-Data-Engineering]].
Sub-pages¶
- [[AE-Consultancy-Delivery β Sub-pages]]