Breaking the monolith: Scalable transformations with Data Mesh and dbt
Divide and conquer
In the ever-evolving digital landscape, organizations generate and collect vast amounts of data from many sources, yet extracting meaningful insights from it can be a daunting task. Traditional data management approaches, often centered on centralized silos and monolithic architectures, struggle to handle the complexity of modern data ecosystems. This is where the Data Mesh concept and tools such as dbt come in: they address the challenges of data analytics and unlock insights by democratizing access to data. Let’s discover how to use them to implement a scalable, distributed data architecture.
Principles of Data Mesh
Data mesh is a relatively new data management paradigm that was introduced by Zhamak Dehghani in 2019 and described in her book Data Mesh: Delivering Data-Driven Value at Scale. However, the concept of decentralized data ownership and governance has been around for much longer. In the early days of computing, data was often siloed within individual business units, with little centralized oversight and few governance policies. This led to data quality issues, slow data delivery, and a lack of cross-functional data sharing. In the 1990s and 2000s, organizations began to adopt data warehouses and data lakes to centralize and integrate their data, enabling cross-team collaboration. This helped to improve data quality, accessibility and lineage visibility, but it also introduced new challenges. A centralized data platform can become a bottleneck, and it becomes difficult to manage as data volume and complexity grow. The Data Mesh approach addresses these architectural problems by following four principles:
- domain ownership,
- data as a product,
- self-serve data platform,
- federated computational governance.
Nurturing Decentralization: The Role of Domain Ownership in Data Mesh Architecture
Domain ownership advocates for a decentralized approach to data management. In a traditional data management model, data ownership is typically centralized, with an engineering department responsible for managing all of the organization's data. This can lead to bottlenecks and poor data quality, as the central data team may have neither the expertise nor the time to understand the specific needs of each domain. In a data mesh architecture, data ownership is delegated to data producers, who are responsible for managing the data that is relevant to their specific business domain. This approach has several benefits:
- Domain experts have a deep understanding of the business logic of the data and are therefore better equipped to ensure its quality and relevance.
- Domain experts are more likely to share data with other teams if they feel that they have ownership over it.
- Decentralized ownership speeds up the building of data products.
The overall goal is to create a more decentralized and agile data management environment that is better equipped to meet the needs of modern businesses.
Empowering Data Consumption: Unveiling the Essence of Data as a Product in Data Mesh Architecture
Data as a product emphasizes treating data as a valuable asset that is produced, packaged, and delivered to consumers professionally. Data products are not simply raw data sets; they are carefully curated and transformed datasets tailored to the specific needs of business users. This approach shifts the focus from data management to data delivery, empowering data consumers to access and use data insights without relying on IT specialists. Data products are packaged in a consistent, well-defined format, similar to any other commercial product, making them easy to find, consume, and share. Clear ownership and data lineage are tracked for each data product, providing transparency and traceability. Data products are delivered on demand, ensuring timely access to actionable insights that support business decisions and strategic initiatives.
Unlocking Data Empowerment: Self-Serve Platforms in Data Mesh
A self-serve data platform empowers domain experts to build and maintain their own data pipelines using self-service tools, reducing bottlenecks, accelerating data delivery, and promoting data democratization. This approach enables domain experts to become data product owners, responsible for creating, transforming, and managing the data they need to support their specific business functions. It also fosters collaboration and shared ownership of data, breaking down data silos and encouraging cross-functional data sharing. The self-serve data platform provides a standardized and user-friendly interface for data ingestion, transformation, and modeling, allowing domain experts to create data products without relying on specialized data engineering skills. This reduces the reliance on centralized data teams and lets domain experts take ownership of their data responsibilities. The self-service platform also facilitates the sharing and reuse of data products across the organization, ensuring that data is easily accessible to those who need it and promoting data-driven decision-making across all departments.
Dynamic Governance in Data Mesh: Federated Computational Governance Explained
Data governance is a crucial aspect of data mesh, ensuring that the decentralized data products are secure, trusted, and interconnected. Federated computational governance embraces constant change in the data landscape, delegates responsibility to individual domains, heavily automates data quality and risk management, and embeds data governance policies in each product. It addresses the limitations of traditional, centralized data governance, which doesn't align with the decentralized nature of data mesh. In addition to the traditional governance objectives, data mesh must also address new concerns such as data interoperability, scalability and standardization to enable the correlation of independent data products.
dbt as a scalable transformation enabler
dbt, created by Fishtown Analytics (now dbt Labs), is a command-line tool and open-source framework that enables data analysts and engineers to transform data within their warehouses. It employs a SQL-first approach, promoting modularity, scalability, and testability within data transformation workflows. dbt supports collaboration across teams and offers an analytics engineering workflow that aligns with established software development practices.
dbt places the power of SQL at the forefront of the data transformation process. This means that data analysts and engineers, no matter their level of experience, can use the language they're most comfortable with. For analysts specifically, dbt acts as an accessible entry point into the world of analytics engineering. Additionally, dbt encourages best practices by breaking down complex data transformations into smaller, manageable, and reusable components. This modular design significantly improves project organization and long-term maintainability.
Crucially, dbt prioritizes the reliability of data pipelines. Built-in tests, together with custom tests that users define for their specific datasets, let teams verify data quality and accuracy on every run. Data lineage and dependency graphs are automatically generated by dbt, showing how raw data is transformed throughout the project. This transparency reduces guesswork and promotes cross-team understanding of complex systems.
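As a rough illustration of the built-in tests mentioned above, the sketch below declares two of dbt's generic tests (unique and not_null) on a single column. The file path, the model name stg_orders and the column order_id are hypothetical and do not come from the project discussed in this article.

```yaml
# models/staging/_models.yml (hypothetical path and names)
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        # Generic tests shipped with dbt. Recent dbt releases also
        # accept the key `data_tests:` in place of `tests:`.
        tests:
          - unique
          - not_null
```

Running dbt test would then check both conditions against the built model. dbt also ships the accepted_values and relationships tests, and custom tests can be written in SQL.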
dbt is designed to integrate with the existing technology stack. Because a dbt project consists of plain text files, it fits naturally into version control systems, giving teams trackable workflows and streamlined review processes. Deployment can be automated by running dbt commands inside CI/CD pipelines. These integrations shorten development cycles and help ensure that data models are reliably delivered to end users.
Implementing dbt Mesh
During Coalesce 2023, an annual conference organized by dbt Labs, the concept of dbt Mesh was introduced. It is a pattern enabled by several dbt features that align with data mesh principles. Furthermore, a new CLI tool that automates the adoption of these features was presented: dbt-meshify.
Groups enable categorization and logical organization of related data models. They improve project maintainability and navigation, and they make complex data transformations easier to understand. It is possible to create groups like "marketing", "finance", and "sales", where each group contains data models specifically relevant to that domain. This helps teams easily find relevant models and keeps the project organized.
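A minimal sketch of a group definition might look as follows. The owner details and the model name fct_invoices are assumptions made for illustration only.

```yaml
# models/_groups.yml (hypothetical file name)
groups:
  - name: finance
    owner:
      # Hypothetical owner; a group owner needs a name or an email
      name: Finance Team
      email: finance@example.com

models:
  - name: fct_invoices   # hypothetical model
    config:
      group: finance     # assigns the model to the group above
```

Each model belongs to at most one group, and the group gives the model a named owner.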
Contracts define the expected structure of a model's output: its column names, data types, and constraints. When a contract is enforced, dbt checks at build time that the model returns exactly the declared columns and types, which protects downstream consumers from inconsistencies and unexpected consequences of changes. For example, a contract may state that a field named "customer_id" should never have null values, or that the column "total_revenue" should always be a positive number.
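The two example rules above could be expressed roughly as shown below. The model name dim_customers and the data types are assumptions. Whether a given constraint (especially check) is actually enforced depends on the data platform, so this should be read as a sketch rather than a guaranteed behavior.

```yaml
models:
  - name: dim_customers        # hypothetical model
    config:
      contract:
        enforced: true         # build fails if the output does not match
    columns:
      - name: customer_id
        data_type: int         # assumed type
        constraints:
          - type: not_null
      - name: total_revenue
        data_type: numeric     # assumed type
        constraints:
          - type: check        # platform support for check varies
            expression: "total_revenue > 0"
```

With an enforced contract, every column returned by the model must be declared with a data_type, so the YAML definition and the SQL have to stay in sync.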
Access control provides granular control over which models may reference a given model. Each model carries an access modifier: 'private' models can be referenced only from within their own group, 'protected' models (the default) from anywhere in the same project, and 'public' models from any group or project. For example, intermediate finance models may be kept private to the finance group, while curated marketing models are made public for wider reuse. Note that these modifiers govern references between dbt models, not user permissions in the warehouse.
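In dbt, access is declared per model with one of three modifiers (private, protected, public) that determine which other models may reference it. The sketch below uses hypothetical model names and assumes that a finance group is already defined elsewhere in the project.

```yaml
models:
  - name: int_payments_pivoted   # hypothetical intermediate model
    access: private              # only models in the same group may ref it
    config:
      group: finance             # private access requires a group

  - name: fct_invoices           # hypothetical mart
    access: protected            # any model in this project may ref it

  - name: dim_customers          # hypothetical shared model
    access: public               # may be referenced from other projects
```

A reference that violates these rules, such as a marketing model selecting from int_payments_pivoted, is rejected when dbt parses the project.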
Versions let a model evolve without breaking its consumers. Several versions of the same model can coexist, each with its own definition, so a breaking change (for example, removing or renaming a column) can be released as a new version while downstream users keep referencing the old one until they migrate. This also provides a historical reference point for troubleshooting, auditing, and tracking the evolution of transformation logic. If a new version introduces errors, consumers can stay pinned to the previous version, ensuring stability.
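A versioned model could be declared along these lines. The model name, the columns and the choice to drop country_name in version 2 are purely illustrative assumptions.

```yaml
models:
  - name: dim_customers          # hypothetical model
    latest_version: 1            # what an unpinned ref() resolves to
    columns:
      - name: customer_id
        data_type: int
      - name: country_name
        data_type: string
    versions:
      - v: 1                     # defined in dim_customers_v1.sql
      - v: 2                     # defined in dim_customers_v2.sql
        columns:
          # v2 keeps every column except the one being retired
          - include: all
            exclude: [country_name]
```

Keeping latest_version at 1 means existing consumers are unaffected while version 2 is tested. Raising it later promotes the new version for everyone who does not pin a version explicitly.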
Project dependencies are particularly valuable for large and complex dbt implementations. This feature streamlines the process of splitting expansive projects into manageable modules or linking resources across multiple, interrelated dbt projects. Organizations may have a core dbt project defining common data transformations. Separate departmental projects (like marketing and finance) may depend upon this core project, ensuring consistency across data usage.
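As a sketch of how a departmental project might declare its dependency on a core project, the downstream project can list the upstream one in dependencies.yml. The project name core_platform is hypothetical, and this form of cross-project dependency assumes a dbt Cloud setup that supports it.

```yaml
# dependencies.yml in the downstream (e.g. marketing) project
projects:
  - name: core_platform   # hypothetical upstream project name
```

Models in the downstream project could then select from the upstream project's public models using the two-argument form of ref, for example ref('core_platform', 'dim_customers'), where dim_customers is again a hypothetical model.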
Break the monolith with dbt-meshify
Consider a dbt project straining under the weight of a monolithic structure. It includes 6 sources, 6 staging models, 5 intermediate models, and 6 marts, all within a single repository, and it mainly serves the Finance and Marketing departments. Monolithic dbt projects hinder scalability, maintainability, and team autonomy. Data mesh principles offer a solution. Let's employ dbt-meshify to refactor the project.
Domain-Driven Grouping - we begin by using dbt-meshify's group command to create three distinct groups: finance, web_domain, and customer_domain. dbt-meshify automatically places existing models into their groups based on the provided selection.
Encapsulating Knowledge with Access Control - the access policies 'private' and 'protected' are applied automatically by the ‘dbt-meshify group’ command. 'Private' suits models that are tightly coupled to their own group, while models with external dependencies or serving as endpoints become 'protected'.
Enforcing Data Contracts (Edge Models) - edge models, where domains interact, necessitate clear data contracts. dbt-meshify simplifies contract enforcement, ensuring consistency and reliability. For all models with the ‘protected’ policy, contract enforcement is applied automatically by the ‘dbt-meshify group’ command. Contracts can also be added later with the ‘dbt-meshify operation add-contract’ command.
New version - to create a new model version and track how its logic evolves through the product development lifecycle, use the ‘dbt-meshify version’ command. While git provides version control for the model definition in the .sql file in the repository, dbt model versions let developers define new versions of a model stored as separate files. This dbt feature works seamlessly with the ref function, so the old version of the model can still be used in the production environment while the new version is tested for its upcoming release.
This operation creates a new SQL file for each selected model and configures versioning in the YAML files. To use a specific version of a model, update the existing ref calls by adding the version argument:
before:
select * from {{ ref('model_name') }}
after:
select * from {{ ref('model_name', version=1) }}
Configuration Formalization - the _groups.yml file holds the information of specified groups. Implementation details, column definitions, enforced contracts and model versions are defined within _models.yml.
Data Mesh Realized - our refactored project now adheres to data mesh standards. Domain teams manage their respective groups, producing curated models as data products. Federated governance maintains consistency across a shared data platform. The following image presents the project lineage after the described operations.
In addition, the ‘dbt-meshify split’ command can split the project into multiple subprojects based on specified selection criteria. The newly created projects may be managed in different repositories. In dbt Cloud, models from other projects can be referenced with cross-project ref functions. In dbt Core, upstream models can be declared as sources in the downstream project, or the upstream project can be installed as a package in the downstream one.
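The two dbt Core options can be sketched as follows. The repository URL, the revision, the schema and the model names are placeholders rather than values from the project described here.

```yaml
# Option 1: packages.yml in the downstream project.
# Installs the upstream project as a package (run `dbt deps` afterwards).
packages:
  - git: "https://github.com/example-org/core_platform.git"  # placeholder URL
    revision: "1.0.0"                                        # placeholder tag

# Option 2: a sources file in the downstream project.
# Declares the upstream project's output tables as sources instead.
sources:
  - name: core_platform        # hypothetical source name
    schema: analytics          # assumed schema of the upstream models
    tables:
      - name: dim_customers    # hypothetical upstream model
```

With the package option, the downstream project also builds the upstream models and can reference them with ref('core_platform', 'dim_customers'). With the source option, it only reads the tables that the upstream project has already built, using source('core_platform', 'dim_customers').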
The Future of Scalable Data Transformations: Data Mesh with dbt
The data industry is rapidly evolving as new concepts and technologies emerge, transforming the way data is collected, analyzed, and used. This rapid development is driven by several factors, including the increasing availability of data, the growing demand for analytical insights, and the need for organizations to stay ahead of the curve. Data mesh brings a powerful change to data management that addresses the limitations of traditional monolithic architectures. It advocates for decentralized data ownership, data as a product, self-serve data platforms, and federated computational governance. dbt emerges as a crucial tool in implementing data mesh, enabling domain experts to create, manage, share and store data products without relying on centralized data engineering teams. By combining data mesh principles with the capabilities of dbt and dbt-meshify, organizations can break down data silos, democratize data access, and accelerate the delivery of data products. Having trouble scaling your project or want to learn more about dbt? Contact us to schedule a meeting!