The data mesh is a new approach to designing and developing data architectures. Unlike a centralized and monolithic architecture based on a data warehouse or data lake, a data mesh is a highly decentralized data architecture.
In this blog, we’ll take a look at the data mesh and how data virtualization can help to develop one. Since it’s almost impossible to describe and do justice to data mesh in a single, short blog, I’ll limit myself to the essentials.
The majority of the data architectures designed over the past thirty years were built to integrate data from a large number of source systems, allowing a wide range of users to exploit that integrated data. For example, data warehouse architectures are designed to support reports and dashboards for which data must be merged from all kinds of source systems.
Data lakes, likewise, are developed to allow data scientists to analyze data from multiple sources. The approach in all such data architectures is to extract data from different source systems and copy it to a centralized, monolithic data store.
Although these centralized data architectures have served countless companies well, they have structural drawbacks.
The data mesh is designed to avoid such problems altogether, rather than trying to solve them. A data mesh is a highly distributed data architecture.
One of the most important differences is that source system engineers are also responsible for developing interfaces that allow all kinds of applications and users (including data scientists, self-service BI users, batch reports, etc.) to use the domain data. Note that these interfaces are not just service interfaces on top of source systems. Implementing an interface may involve a data warehouse, a data lake, a data mart, or all of these. So, the nodes of a data mesh are like source systems combined with miniature data architectures for data delivery. The interface takes care of data delivery in such a way that users can focus on using the data, without having to concern themselves with verifying data quality, security, and privacy, since these are taken care of by the nodes. Additionally, data mesh nodes should be able to provide all the relevant metadata to describe the data. A single node therefore incorporates one or more source systems, their data delivery solutions, and their interfaces.
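To make the idea of a node more tangible, here is a minimal Python sketch of a node that bundles domain data, descriptive metadata, and a simple privacy rule behind one interface. Everything in it (the `DataMeshNode` class, its fields, and the `privileged` flag) is hypothetical and invented for illustration; it is not the API of any data mesh or data virtualization product.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DataMeshNode:
    """Hypothetical node: domain data plus the interface that delivers it."""

    domain: str
    records: list[dict[str, Any]]
    metadata: dict[str, str]  # column name -> human-readable description
    restricted: set[str] = field(default_factory=set)  # columns hidden by default

    def describe(self) -> dict[str, str]:
        """Return the metadata that describes the node's data."""
        return dict(self.metadata)

    def read(self, privileged: bool = False) -> list[dict[str, Any]]:
        """Deliver the data; the privacy rule is applied by the node itself."""
        if privileged:
            return [dict(r) for r in self.records]
        return [
            {k: v for k, v in r.items() if k not in self.restricted}
            for r in self.records
        ]


# Illustrative data only.
patients = DataMeshNode(
    domain="patients",
    records=[{"id": 1, "name": "A. Jones", "ward": "B2"}],
    metadata={"id": "patient key", "name": "full name", "ward": "current ward"},
    restricted={"name"},
)
```

A consumer calling `patients.read()` receives rows without the `name` column and never has to implement the privacy rule, which is the point of letting the node own data delivery.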
Besides source system-based data mesh nodes, some nodes may support users who need data from multiple nodes. Such nodes, which may also contain databases that resemble data warehouses or data lakes, can also be accessed by users through an interface.
Data engineers of a data mesh node remain single-domain experts, but for the entire node, not just for the original source system. This minimizes communication problems and misinterpretations of data.
The interface of a data mesh node must support any form of data usage, from simple requests for a patient file or invoice, via straightforward dashboards and reports, to advanced forms of analytics, such as data science. Therefore, the technology used to develop these interfaces must support both record-oriented and set-oriented usage. This is where data virtualization comes into play. Data virtualization technology has been developed to create interfaces on almost any kind of system, and it makes the data in those systems accessible through a variety of interfaces, including record-oriented and set-oriented ones. Additionally, it enables the implementation of data security and privacy rules and provides users with metadata. Several technologies are available for developing those interfaces. What data virtualization brings to the table is that it already supports most of the required technology, enabling quick development of data meshes.
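The difference between record-oriented and set-oriented usage can be sketched in a few lines of Python. The example below uses an in-memory SQLite database as a stand-in for a source system; the `invoice` table, the `InvoiceNodeInterface` class, and its method names are assumptions made up for this illustration, and a data virtualization product would expose such access in its own way.

```python
import sqlite3
from typing import Optional


def build_demo_source() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for a source system (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoice (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO invoice (id, customer, amount) VALUES (?, ?, ?)",
        [(1, "acme", 100.0), (2, "acme", 50.0), (3, "globex", 75.0)],
    )
    return conn


class InvoiceNodeInterface:
    """Hypothetical node interface offering both styles of access."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def get_invoice(self, invoice_id: int) -> Optional[dict]:
        """Record-oriented: fetch one invoice by its key."""
        row = self._conn.execute(
            "SELECT id, customer, amount FROM invoice WHERE id = ?",
            (invoice_id,),
        ).fetchone()
        if row is None:
            return None
        return {"id": row[0], "customer": row[1], "amount": row[2]}

    def total_per_customer(self) -> dict:
        """Set-oriented: aggregate over all invoices at once."""
        rows = self._conn.execute(
            "SELECT customer, SUM(amount) FROM invoice GROUP BY customer"
        ).fetchall()
        return {customer: total for customer, total in rows}


node = InvoiceNodeInterface(build_demo_source())
single = node.get_invoice(2)         # what an application showing one invoice needs
totals = node.total_per_customer()   # what a dashboard or analyst needs
```

Both calls go through the same interface and hit the same underlying data, so the consumer chooses the access style without knowing how or where the data is stored.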
The data mesh architecture is too interesting to ignore, and I strongly recommend that architects study its principles and compare implementation technologies based on ease of development.