How Medallion Architecture With ER/Studio And Databricks Solves Data As A Product For Both Business And IT
Introduction
The demand for data accessibility with built-in governance is a common goal for organizations. To achieve this, a collaborative approach is not just beneficial but necessary. It’s a shared journey where Data Architecture, Data Governance, and Data Analytics teams work in sync within a common data ecosystem. This is where the power of Medallion Architecture, ER/Studio, and Databricks comes into play. These tools are instrumental in implementing Data Mesh programs that deliver Data Products within a Medallion architecture on the Databricks platform, forming a comprehensive Data Fabric of tools.
Effective data architecture, coupled with accepted and standardized data models, is not just a cornerstone but the foundation of success in data management. It plays a crucial role in overcoming the challenges of data swamps and the emergence of an uncontrolled “Wild West” of data. Striking the right balance between decentralization and rapid delivery while maintaining governance, standardization, and orchestration is a challenge that cannot be overlooked.
Let’s define some of the techniques.
Data Mesh
Data Mesh is not just a buzzword but a set of principles that align with decentralization, autonomy, and accountability. It forms the core of the Data as a Product (DaaP) concept. Let’s explore why Data Mesh is a vital component in implementing Data as a Product within the Medallion Architecture:
Data Mesh is integral to Data as a Product because it operationalizes the principles of decentralization, autonomy, and accountability in data management. By embracing Data Mesh, organizations gain data agility while ensuring the quality and consistency of their data assets, ultimately driving value and innovation across the organization.
The challenge with this approach is that decentralization can result in data silos. Each discrete domain may produce data products aggregating foundational data products. Many of those foundational data products may be owned by other domains. Ensuring consistency across these domains reduces ambiguity and improves interoperability. If a “Sales Figures” data product from the “Sales” domain refers to products that are different from the “Products” described by the “Product Management” domain, then we will have problems.
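The “Sales Figures” versus “Products” mismatch can be made concrete with a toy sketch (all keys and values here are hypothetical): when two domains identify “product” differently, an aggregate data product cannot be assembled until a conformed mapping exists.

```python
# Hypothetical illustration: two domains describe "product" differently.
# The Sales domain keys its figures by an internal SKU string, while
# Product Management keys its catalog by a numeric product ID.
sales_figures = {"SKU-001": 1200, "SKU-002": 800}      # Sales domain
product_catalog = {101: "Widget", 102: "Gadget"}       # Product Management domain

# Without a shared definition of "Product", the aggregate data product
# "revenue by product name" cannot be built: no sales key matches a catalog key.
unmatched = [sku for sku in sales_figures if sku not in product_catalog]
print(unmatched)  # every sales key is unmatched

# A conformed mapping, held in the enterprise data model, restores interoperability.
sku_to_product_id = {"SKU-001": 101, "SKU-002": 102}
revenue_by_name = {
    product_catalog[sku_to_product_id[sku]]: amount
    for sku, amount in sales_figures.items()
}
print(revenue_by_name)  # {'Widget': 1200, 'Gadget': 800}
```

The mapping table is exactly the kind of cross-domain agreement that an enterprise data model formalizes.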
Medallion Architecture
At its core, Medallion Architecture, proposed by Databricks and adopted as part of Microsoft Fabric, embodies a multi-tiered approach to data model design, fostering data quality within a data lakehouse:
Layered Structure:
This is not a new approach to warehousing. Bill Inmon advocated Staging, Enterprise Data Warehouse, and Data Mart layers decades ago, and that pattern has been used by many organizations since.
The Bronze layer maintains data in its raw form, which improves governance and preserves historical records. The Silver layer holds reconciled, cleansed, and verified data that is well documented, governed, and conforms to accepted standards; it ensures that data is reusable, well understood, and, through enforced data quality standards, reliable. The Gold layer ensures that data is fit for purpose and in a form that is consumable by the stakeholders.
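The flow through the three layers can be sketched on toy records (plain Python stands in for Spark and Delta tables here; all field names and rules are illustrative, not an ER/Studio or Databricks API):

```python
# Bronze: raw data exactly as ingested, preserving history (including bad rows).
bronze = [
    {"order_id": "1", "amount": "100.0", "country": "us"},
    {"order_id": "2", "amount": "oops", "country": "DE"},   # bad record kept in Bronze
    {"order_id": "3", "amount": "250.5", "country": "de"},
]

def to_silver(rows):
    """Silver: cleansed, standardized, verified records."""
    silver = []
    for row in rows:
        try:
            amount = float(row["amount"])       # verify the value is numeric
        except ValueError:
            continue                            # quarantine unparseable rows
        silver.append({
            "order_id": int(row["order_id"]),
            "amount": amount,
            "country": row["country"].upper(),  # standardize country codes
        })
    return silver

def to_gold(rows):
    """Gold: a fit-for-purpose aggregate for a stakeholder (revenue by country)."""
    revenue = {}
    for row in rows:
        revenue[row["country"]] = revenue.get(row["country"], 0.0) + row["amount"]
    return revenue

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'US': 100.0, 'DE': 250.5}
```

Note that the bad record survives in Bronze for audit and reprocessing, but never reaches Silver or Gold.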
Data Models Are Key
Our objective of delivering Data Products into the Gold layer as quickly as possible, with governance baked in, requires some coordination. Data models can help in several ways.
Data Models Of Source Data
Having an up-to-date data catalog makes the system ready to deliver. Physical data models of source data assets are the first step, but the metadata within the asset is often insufficient. We need to understand the nature of the data in these assets. We will need to understand each field to determine whether it holds valuable information for the business, and then attach it to an accepted definition. In practice, we will map to business terms or, better still, produce a logical data model that clearly shows the schema of the asset in business language, with helpful metadata about the asset. Now the asset is well understood and is ready when demand for that data arises.
The Enterprise Data Model
This is the framework that ties everything together. Here, we create standardized entities for the organization’s vital information. We will standardize the naming of entities and understand their taxonomies. Standardization of names will help when hardening requirements for Data Products. “We need a data product that describes the characteristics of all the people within a clinical trial.” Does this mean the “Human Subject” on whom the trial is being run, the “Healthcare Professional” already working with the Human Subject outside the trial, or the “Investigator” we’ve engaged to support the trial? Each has very different data and sources.
Once we have standardized the names and definitions of entities, we can agree on how these entities may be identified. They may have multiple ways of being identified, and we need to understand them and decide on the primary method. This will help when reconciling data from diverse sources.
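A small sketch of what agreeing on a primary identification method buys us (all entity and field names are hypothetical): if the enterprise model designates, say, a national ID as the primary identifier for a “Patient” entity, records from diverse sources can be merged on it even when each source also carries its own local identifiers.

```python
# Two sources describe the same person with different local identifiers (MRNs),
# but both carry the agreed primary identifier.
source_a = [{"national_id": "N-9", "mrn": "A-100", "name": "Ada"}]
source_b = [{"national_id": "N-9", "mrn": "B-555", "dob": "1990-01-01"}]

def reconcile(*sources, primary_key="national_id"):
    """Merge records from diverse sources on the agreed primary identifier."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[primary_key], {}).update(record)
    return merged

patients = reconcile(source_a, source_b)
print(patients["N-9"]["name"], patients["N-9"]["dob"])  # Ada 1990-01-01
```

In this naive merge, later sources simply overwrite conflicting secondary fields (here, the MRN); a real implementation would also need agreed survivorship rules for such conflicts.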
Then, we need to understand these entities’ key attributes, definitions, and nature. Again, this will help harden requirements for data products, reduce ambiguity, and speed up product delivery.
We now have a catalog of crucial standardized data entities. These can all be defined as foundational data products and assigned to the domains that own them.
Finally, we will connect the entities with relationships. This gives us a map that makes it easier to understand how data can be connected. Building aggregate data products becomes easy as we know precisely which foundational data products can be connected.
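The relationship map can be represented very simply (entity and relationship names below are illustrative): given the known relationships, we can check whether the foundational data products needed for an aggregate actually connect.

```python
# Relationships between standardized entities in the enterprise model.
relationships = {
    ("Customer", "Order"): "places",
    ("Order", "Product"): "contains",
    ("Product", "Supplier"): "supplied by",
}

def connectable(entity_a, entity_b):
    """True if two entities are directly related in the enterprise model."""
    return (entity_a, entity_b) in relationships or (entity_b, entity_a) in relationships

# An aggregate "customer orders with product details" product is feasible
# because every hop in the chain is a known relationship.
chain = ["Customer", "Order", "Product"]
feasible = all(connectable(a, b) for a, b in zip(chain, chain[1:]))
print(feasible)  # True
print(connectable("Customer", "Supplier"))  # False: no direct relationship
```

A modeling tool holds this map with far richer metadata, of course; the point is that once relationships are explicit, feasibility checks like this become mechanical.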
We have two challenges: 1) to reconcile the many data sources and 2) to coordinate Data Products across domains. If all data sources exist in our catalog mapped to our Enterprise Data Model, then sourcing data becomes straightforward. And because we have standardized, reusable “plug and play” foundational data products, we automatically ensure consistency across domains.
If our Enterprise Data Model is mapped to the business glossary where our data policies are set, then identifying the governance policies associated with Data Products also becomes easy.
This process doesn’t have to be a massive exercise that takes forever. The Enterprise Data Model can be built incrementally, starting from the most important entities and growing over time.
Data Models Of The Warehouse
As we stated above, the Silver layer of the warehouse needs to be well structured, with reusability and governance baked in. The Enterprise Data Model can provide the core structure of this layer, whether it uses a third normal form or a Data Vault schema. This makes it easier to reconcile source data assets and to understand what needs to be created to deliver against requirements.
The designs of our data products will drive the Gold layer. The logical models of these products can be used to create structures in the Gold layer, which may be realized as star schema tables or views. The logical models ensure that these structures are well documented and easily governed, and traceability back to the enterprise model makes sourcing data much easier.
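To make the logical-model-to-Gold-layer step concrete, here is a hand-rolled sketch that derives a star-schema view definition from a small model description. ER/Studio generates such DDL itself; this only illustrates the idea, and every schema, table, and column name below is a hypothetical example.

```python
# A toy logical model: one fact table plus dimensions, each dimension
# described by its join key and the attributes to expose.
logical_model = {
    "fact": ("silver.sales", ["amount", "quantity"]),
    "dimensions": {
        "silver.customer": ("customer_id", ["customer_name", "region"]),
        "silver.product": ("product_id", ["product_name", "category"]),
    },
}

def gold_view_ddl(name, model):
    """Emit a CREATE VIEW statement realizing the star schema in the Gold layer."""
    fact_table, measures = model["fact"]
    cols = [f"f.{m}" for m in measures]
    joins = []
    for dim_table, (key, attrs) in model["dimensions"].items():
        alias = dim_table.split(".")[-1][0]          # e.g. 'c' for customer
        cols += [f"{alias}.{a}" for a in attrs]
        joins.append(f"JOIN {dim_table} {alias} ON f.{key} = {alias}.{key}")
    return (
        f"CREATE OR REPLACE VIEW gold.{name} AS\n"
        f"SELECT {', '.join(cols)}\n"
        f"FROM {fact_table} f\n" + "\n".join(joins)
    )

print(gold_view_ddl("sales_star", logical_model))
```

Realizing Gold products as views over Silver keeps them cheap to regenerate whenever the logical model changes; materializing them as tables is the alternative when query performance matters more.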
Why ER/Studio?
ER/Studio has been designed to support enterprise data models. Building the models and mapping them to models of data sources and to business terms is baked in and easy to use. The ability to visualize these models and trace relationships is available within the Team Server component, so business and technical users alike can take advantage. Data Architect’s deep support for Databricks, with essential features like nested structures, means that physical models of the warehouse can be produced repeatably from enterprise models. ER/Studio will generate the code to create Databricks assets, so you don’t have to. Connectivity to your Data Governance tool means that your models are always properly classified and can be published back to the main data catalog. The Data Product design features of such tools, Microsoft Purview for example, can also be made actionable in ER/Studio.
Want to see more? Please register for a demo so our engineers can walk you through your use cases.