Data Mesh From an Engineering Perspective
Why You May Need a Data Mesh
Many organizations have invested in a central data lake and a data team with the expectation to drive their business based on data. However, after a few initial quick wins, they notice that the central data team often becomes a bottleneck. The team cannot handle all the analytical questions of management and product owners quickly enough. This is a massive problem because making timely data-driven decisions is crucial to stay competitive. For example: Is it a good idea to offer free shipping during Black Week? Do customers accept longer but more reliable shipping times? How does a product page change influence the checkout and returns rate?
The data team wants to answer all those questions quickly. In practice, however, they struggle because they need to spend too much time fixing broken data pipelines after operational database changes. In their little time remaining, the data team has to discover and understand the necessary domain data. For every question, they need to learn domain knowledge to give meaningful insights. Getting the required domain expertise is a daunting task.
On the other hand, organizations have also invested in domain-driven design, autonomous domain teams (also known as stream-aligned teams or product teams) and a decentralized microservice architecture. These domain teams own and know their domain, including the information needs of the business. They design, build, and run their web applications and APIs on their own. Despite knowing the domain and the relevant information needs, the domain teams have to reach out to the overloaded central data team to get the necessary data-driven insights.
With the eventual growth of the organization, the situation of the domain teams and the central data team becomes worse. A way out of this is to shift the responsibility for data from the central data team to the domain teams. This is the core idea behind the data mesh concept: Domain-oriented decentralization for analytical data. A data mesh architecture enables domain teams to perform cross-domain data analysis on their own and interconnects data, similar to APIs in a microservice architecture.
What Is Data Mesh?
The term data mesh was coined by Zhamak Dehghani in 2019 and is based on four fundamental principles that bundle well-known concepts: The domain ownership principle mandates the domain teams to take responsibility for their data. According to this principle, analytical data should be composed around domains, similar to the team boundaries aligning with the system’s bounded context. Following the domain-driven distributed architecture, analytical and operational data ownership is moved to the domain teams, away from the central data team. The data as a product principle projects a product thinking philosophy onto analytical data. This principle means that there are consumers for the data beyond the domain. The domain team is responsible for satisfying the needs of other domains by providing high-quality data. Basically, domain data should be treated as any other public API. The idea behind the self-serve data infrastructure platform is to adopt platform thinking to data infrastructure. A dedicated data platform team provides domain-agnostic functionality, tools, and systems to build, execute, and maintain interoperable data products for all domains. With its platform, the data platform team enables domain teams to seamlessly consume and create data products. The federated governance principle achieves interoperability of all data products through standardization, which is promoted through the whole data mesh by the governance guild. The main goal of federated governance is to create a data ecosystem with adherence to the organizational rules and industry regulations.
How To Design a Data Mesh?
A data mesh architecture is a decentralized approach that enables domain teams to perform cross-domain data analysis on their own. At its core is the domain with its responsible team and its operational and analytical data. The domain team ingests operational data and builds analytical data models to perform their own analysis. It uses analytical data to build data products based on other domains’ needs.
The domain team agrees with others on global policies, such as interoperability, security, and documentation standards in a federated governance guild, so that domain teams know how to discover, understand and use data products available in the data mesh. The self-serve domain-agnostic data platform, provided by the data platform team, enables domain teams to easily build their own data products and do their own analysis effectively. An enabling team guides domain teams on how to model analytical data, use the data platform, and build and maintain interoperable data products.
The mesh emerges when teams use other domain's data products. Using data from upstream domains simplifies data references and lookups (such as getting an article's price), while data from downstream domains enables analyzing effects, e.g. for A/B tests (such as changes in the conversion rate). Data from multiple other domains can be aggregated to build comprehensive reports and new data products.
Data mesh is primarily an organizational approach, and that's why you can't buy a data mesh from a vendor. Technology, however, is important still as it acts as an enabler for data mesh, and only useful and easy to use solutions will lead to domain teams' acceptance. The available offerings of cloud providers already provide a sufficient set of good self-serve data services to let you form a data platform for your data mesh. We want to show which services can be used to get started.