The importance of master data cannot be denied. A variety of articles, books, and events emphasize it. Yet many organizations struggle to organize their master data in such a way that it can be used by many data consumers. Those that have organized it well use master data primarily within a limited and controlled set of applications.
Note: In this article, reference data is regarded as a form of master data.
By Rick F. van der Lans
It is important that organizations focus on making master data available to all potential data consumers. Master data is not only useful for integrating data when it is copied to a data warehouse. It also benefits data scientists, business users developing ad-hoc reports, mobile apps that let customers manage their own bank accounts, and so on.
Even if an organization does not have a centralized master data management (MDM) system, it still has data that can be considered master data. That data is likely stored in a variety of systems, such as the following:
For example, the International Classification of Primary Care (ICPC) is accepted in several countries as the standard for coding and classifying medical complaints, symptoms and disorders in general practice. This can be considered as master data.
To bring all that master data together, the first solution that always comes to mind is to copy and store it in a centralized MDM. However, this seemingly simple solution does have its drawbacks, such as:
With all this master data distributed across many systems, it can be a challenge to make everything available in an integrated way to all the data consumers. A decentralized solution may be preferred. Such a decentralized solution can be implemented using data virtualization. In this case, a data virtualization layer is defined on all the internal and external systems that contain master data. Without copying it, the master data is presented to all the data consumers as integrated master data.
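To make the idea of a virtualization layer concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular data virtualization product works: the two source systems (`CRM_ROWS`, `BILLING_ROWS`), their field names, and the `VirtualMasterView` class are all hypothetical. The point it shows is that the layer holds no copy of the master data; each query reads the sources at that moment and maps their differing layouts onto one shared record shape.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class MasterRecord:
    """The unified shape presented to every data consumer (hypothetical)."""
    customer_id: str
    name: str
    source: str


# Hypothetical source systems, stood in for by in-memory lists.
# In practice these would be separate internal or external systems.
CRM_ROWS = [
    {"id": "C1", "full_name": "Ada Jones"},
    {"id": "C2", "full_name": "Bo Lin"},
]
BILLING_ROWS = [
    {"cust_no": "C3", "cust_name": "Cy Park"},
]


def read_crm() -> Iterator[MasterRecord]:
    # Map the CRM layout onto the unified record shape at read time.
    for row in CRM_ROWS:
        yield MasterRecord(row["id"], row["full_name"], "crm")


def read_billing() -> Iterator[MasterRecord]:
    # Map the billing layout onto the same unified record shape.
    for row in BILLING_ROWS:
        yield MasterRecord(row["cust_no"], row["cust_name"], "billing")


Reader = Callable[[], Iterable[MasterRecord]]


class VirtualMasterView:
    """Stores only references to source readers, never the data itself."""

    def __init__(self, readers: Iterable[Reader]) -> None:
        self._readers = list(readers)

    def query(
        self, predicate: Callable[[MasterRecord], bool] = lambda r: True
    ) -> List[MasterRecord]:
        # Every call goes back to the sources, so results reflect their
        # current contents rather than an earlier copy.
        return [
            rec
            for reader in self._readers
            for rec in reader()
            if predicate(rec)
        ]


view = VirtualMasterView([read_crm, read_billing])
all_customers = view.query()
billing_only = view.query(lambda r: r.source == "billing")
```

Because `query` reads the sources on every call, a row added to one of the source systems appears in the next query result without any synchronization step. A real virtualization layer would add query pushdown, caching, and security on top of this basic pattern.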
This data virtualization-based solution offers the following advantages:
However, it will still operate behind the data virtualization solution to hide from data consumers that the implementation has changed. In general, what data virtualization can do for master data is what it has always been able to do for data itself.
In my book, Data Virtualization for Business Intelligence Architectures, I have included the following definition of data virtualization:
“Data virtualization is the technology that offers data consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores.”
Perhaps this definition needs to be extended a bit:
If organizations recognize that their master data is stored in a heterogeneous set of internal and external systems, it is well worth considering data virtualization to virtually integrate all that master data. This approach avoids physically centralizing copied master data and makes it available to a wide range of data consumers quickly.