How Medallion Architecture With ER/Studio And Databricks Solves Data As A Product For Both Business And IT

Introduction

Data accessibility with built-in governance is a common goal for organizations. Achieving it requires a collaborative approach: Data Architecture, Data Governance, and Data Analytics teams working in sync within a shared data ecosystem. This is where the combination of Medallion Architecture, ER/Studio, and Databricks comes into play. Together, these tools support Data Mesh programs that deliver Data Products within a Medallion architecture on the Databricks platform, forming a comprehensive Data Fabric of tools.

Effective data architecture, coupled with accepted and standardized data models, is the foundation of success in data management. It plays a crucial role in avoiding data swamps and an uncontrolled “Wild West” of data. Striking the right balance between decentralization and rapid delivery on one hand, and governance, standardization, and orchestration on the other, is a challenge that cannot be overlooked.

Let’s define some of the techniques.

Data Mesh

Data Mesh is a set of principles built around decentralization, autonomy, and accountability, and it forms the core of the Data as a Product (DaaP) concept. Let’s explore why Data Mesh is a vital component in implementing Data as a Product within the Medallion Architecture:

  1. Decentralization of Data Ownership: In a Data Mesh architecture, data is treated as a product, and responsibility for specific data domains is decentralized to individual teams. This is a shift in mindset as much as in responsibility: teams own, and are accountable for, the data they produce, much as product teams own and manage their products.
  2. Autonomy and Empowerment: Assigning ownership of data domains to specific teams empowers them to shape their data products to their particular needs and use cases. This autonomy lets teams make their own decisions about data schema, quality, and access controls, ultimately driving innovation in data management.
  3. Scalability and Agility: Data Mesh facilitates scalability and agility by allowing organizations to scale their data infrastructure horizontally. Instead of relying on centralized data teams to manage all data-related tasks, Data Mesh empowers individual teams to manage their data domains independently. This distributed approach enables organizations to adapt quickly to changing business requirements and scale their data infrastructure as needed.
  4. Data Quality and Consistency: Despite decentralization, Data Mesh emphasizes the importance of data quality and consistency. Each team is responsible for ensuring the quality and integrity of the data they produce, aligning with the principles of Data as a Product. Additionally, Data Mesh promotes standardized data governance and quality control practices to maintain consistency across data domains.
  5. Cross-Functional Collaboration: Data Mesh promotes decentralization while fostering collaboration across teams and domains. Teams collaborate to define data contracts, establish data standards, and share best practices, facilitating cross-functional alignment and knowledge sharing (a minimal data-contract sketch follows this list). This collaborative approach ensures that data products meet the needs of various stakeholders and support the organization’s broader goals.
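
To make the data-contract idea concrete, here is a minimal sketch of how a domain team might describe a data product’s contract in plain Python. The product, columns, and freshness SLA are illustrative assumptions, not features of ER/Studio or Databricks.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ColumnSpec:
    name: str           # physical column name
    dtype: str          # declared data type
    business_term: str  # glossary term the column maps to
    nullable: bool = True

@dataclass(frozen=True)
class DataContract:
    product: str                   # data product name
    owner_domain: str              # accountable domain team
    version: str                   # contract version consumers depend on
    columns: list = field(default_factory=list)
    freshness_sla_hours: int = 24  # how stale the product may be

# Hypothetical contract for a "Sales Figures" product owned by the Sales domain.
sales_contract = DataContract(
    product="sales_figures",
    owner_domain="sales",
    version="1.2.0",
    columns=[
        ColumnSpec("product_id", "string", "Product Identifier", nullable=False),
        ColumnSpec("sale_date", "date", "Sale Date", nullable=False),
        ColumnSpec("net_amount", "decimal(18,2)", "Net Sales Amount"),
    ],
)
```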

Data Mesh is integral to Data as a Product because it operationalizes the principles of decentralization, autonomy, and accountability in data management. By embracing Data Mesh, organizations gain data agility while ensuring the quality and consistency of their data assets, ultimately driving value and innovation across the organization.

The challenge with this approach is that decentralization can result in data silos. Each discrete domain may produce data products that aggregate foundational data products, many of which are owned by other domains. Ensuring consistency across these domains reduces ambiguity and improves interoperability. If a “Sales Figures” data product from the “Sales” domain refers to products that differ from the “Products” described by the “Product Management” domain, then we will have problems.
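
One way to surface this kind of inconsistency is a simple referential check between the two domains’ products. Below is a minimal PySpark sketch; the catalog and table names are assumptions made for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # predefined as `spark` on Databricks

# Hypothetical tables published to the catalog by each domain.
sales = spark.table("sales.gold.sales_figures")
products = spark.table("product_management.gold.products")

# Sales rows whose product_id has no counterpart in the Product Management
# domain indicate the two domains disagree about what a "Product" is.
orphans = sales.join(products, on="product_id", how="left_anti")

print(f"{orphans.count()} sales rows reference unknown products")
```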

Medallion Architecture

At its core, Medallion Architecture, proposed by Databricks and adopted as part of Microsoft Fabric, embodies a multi-tiered approach to data model design, fostering data quality within a data lakehouse:

Layered Structure:

  • Bronze Layer: Raw data enters the Medallion Architecture at this stage, where it undergoes minimal processing and remains in its original form.
  • Silver Layer: Data is refined and transformed in the Silver Layer, ensuring cleanliness and structure before it progresses to the next stage.
  • Gold Layer: The pinnacle of the architecture houses data optimized for analysis and business use, enriched with additional context and aggregated for deeper insight.

This is not a new approach to warehousing: Bill Inmon advocated Staging, Enterprise Data Warehouse, and Data Mart layers decades ago, and that pattern has been used by many organizations.

Maintaining data in its raw form improves governance and preserves historical records. The Silver layer standardizes reconciled, cleansed, and verified data that is well documented, well governed, and conforms to accepted standards; it ensures that data is reusable and well understood, and data quality standards can be enforced so that the data is reliable. The Gold layer ensures that data is fit for purpose and in a form that stakeholders can consume.
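
As an illustration, here is a minimal PySpark sketch of a record’s journey through the three layers. The tables, columns, and cleansing rules are assumptions, not a prescribed Databricks pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # predefined as `spark` on Databricks

# Bronze: land the raw feed as-is, adding lineage columns for governance.
raw = (spark.read.format("json").load("/landing/orders/")
       .withColumn("_ingested_at", F.current_timestamp())
       .withColumn("_source_file", F.input_file_name()))
raw.write.format("delta").mode("append").saveAsTable("bronze.orders")

# Silver: cleanse, deduplicate, and conform to the standardized model.
silver = (spark.table("bronze.orders")
          .filter(F.col("order_id").isNotNull())
          .dropDuplicates(["order_id"])
          .withColumn("order_date", F.to_date("order_date")))
silver.write.format("delta").mode("overwrite").saveAsTable("silver.orders")

# Gold: aggregate into a consumable, analysis-ready data product.
gold = (spark.table("silver.orders")
        .groupBy("product_id", "order_date")
        .agg(F.sum("net_amount").alias("daily_net_sales")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold.daily_sales")
```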

Data Models Are Key

Our objective of delivering Data Products into the Gold layer as quickly as possible, with governance baked in, needs some coordination. Data models can help in several ways.

Data Models Of Source Data

An up-to-date data catalog makes the system ready to deliver. Physical data models of source data assets are the first step, but the metadata within the asset is often insufficient. We need to understand the nature of the data in these assets: each field must be examined to determine whether it holds valuable information for the business and then attached to an accepted definition. In practice, we map fields to business terms or, better still, produce a logical data model that clearly shows the schema of the asset in business language, with helpful metadata about the asset. Now the asset is well understood and ready whenever that data is in demand.
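
For illustration, the sketch below attaches accepted business definitions to the physical columns of a source asset as Delta column comments, so the catalog carries the mapping. The table, columns, and definitions are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # predefined as `spark` on Databricks

# Hypothetical mapping from physical fields in a source asset to accepted
# business terms from the glossary / logical model.
business_terms = {
    "cust_no": "Customer Identifier: the primary key issued by the CRM.",
    "dob": "Date of Birth of the Customer.",
    "seg_cd": "Customer Segment Code as defined by Marketing.",
}

# Persist each definition as a column comment so the asset is self-describing.
for column, definition in business_terms.items():
    spark.sql(
        f"ALTER TABLE bronze.crm_customers "
        f"ALTER COLUMN {column} COMMENT '{definition}'"
    )
```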

The Enterprise Data Model

This is the framework that ties everything together. Here, we create standardized entities for the organization’s vital information, standardizing the naming of entities and understanding their taxonomies. Standardized names help when hardening requirements for Data Products. “We need a data product that describes the characteristics of all the people within a clinical trial.” Does this mean the “Human Subject” on whom the trial is being run, the “Healthcare Professional” already working with the Human Subject outside the trial, or the “Investigator” we’ve engaged to support the trial? Each has very different data and sources.

Once we have standardized the names and definitions of entities, we can agree on how these entities may be identified. They may have multiple ways of being identified, and we need to understand them all and decide on the primary method. This will help when reconciling data from diverse sources.
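
A minimal sketch of the idea, assuming a “Human Subject” entity identified differently in several systems: a crosswalk resolves every known identifier to the agreed primary one. All identifiers here are invented for illustration.

```python
# The enterprise model nominates one primary way to identify a Human Subject.
PRIMARY_IDENTIFIER = "subject_id"

# Hypothetical crosswalk from alternate identifiers to the primary key.
crosswalk = {
    ("mrn", "MRN-00042"): "SUBJ-001",       # hospital medical record number
    ("trial_code", "TR7-115"): "SUBJ-001",  # identifier used by the trial system
}

def resolve(identifier_type: str, value: str) -> str | None:
    """Map any known identifier to the agreed primary identifier."""
    return crosswalk.get((identifier_type, value))

# Records from two diverse sources reconcile to the same entity.
assert resolve("mrn", "MRN-00042") == resolve("trial_code", "TR7-115")
```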

Then, we need to understand these entities’ key attributes, definitions, and nature. Again, this will help harden requirements for data products, reduce ambiguity, and speed up product delivery.

We now have a catalog of crucial standardized data entities. These can all be defined as foundational data products and assigned to the domains that own them.

Finally, we connect the entities with relationships. This gives us a map that makes it easier to understand how data can be connected, so building aggregate data products becomes easy: we know precisely which foundational data products can be joined, and on what.
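
A sketch of such a relationship map in plain Python, reusing the clinical-trial example; the product names and join keys are illustrative assumptions.

```python
# Illustrative relationship map from the enterprise model: which foundational
# data products connect, and on which keys.
RELATIONSHIPS = {
    ("human_subject", "clinical_trial"): "subject_id",
    ("clinical_trial", "investigator"): "investigator_id",
}

def join_key(a: str, b: str) -> str | None:
    """Return the agreed key connecting two foundational products, if any."""
    return RELATIONSHIPS.get((a, b)) or RELATIONSHIPS.get((b, a))

# Building an aggregate product: we know precisely how the pieces connect.
assert join_key("clinical_trial", "human_subject") == "subject_id"
```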

We have two challenges: 1) reconciling the many data sources, and 2) coordinating Data Products across domains. If all data sources exist in our catalog mapped to our Enterprise Data Model, then sourcing data becomes straightforward. And because our foundational data products are standardized, reusable, and “plug and play”, consistency across domains follows automatically.

If our Enterprise Data Model is mapped to the business glossary where our data policies are set, then identifying the governance policies associated with Data Products also becomes easy.

This process doesn’t have to be a massive exercise that takes forever. The Enterprise Data Model can be built incrementally, starting from the most important entities and growing over time.

Data Models Of The Warehouse

As we stated above, the Silver layer of the warehouse needs to be well structured, with reusability and governance baked in. The Enterprise Data Model can provide the core structure of this layer, whether it uses a third-normal-form or Data Vault schema. This makes it easier to reconcile source data assets and to understand what needs to be created to deliver against requirements.
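
As an illustration, assuming a Data Vault style, the Silver layer might materialize the standardized “Product” entity as a hub plus a satellite. The DDL below is a sketch issued from Python, not generated ER/Studio output.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # predefined as `spark` on Databricks

# Hub: one row per business key, exactly as standardized in the enterprise model.
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.hub_product (
        product_hk    STRING NOT NULL,  -- hash of the agreed business key
        product_id    STRING NOT NULL,  -- the standardized business key itself
        load_ts       TIMESTAMP,
        record_source STRING
    ) USING DELTA
""")

# Satellite: descriptive attributes, versioned over time for governance.
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.sat_product_details (
        product_hk    STRING NOT NULL,
        product_name  STRING,
        category      STRING,
        load_ts       TIMESTAMP,
        record_source STRING
    ) USING DELTA
""")
```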

The designs of our data products will drive the Gold layer. The logical models of these products can be used to create structures in the Gold layer, which may be realized as star-schema tables or views. The logical models ensure that these structures are well documented and easily governed, and traceability back to the enterprise model makes sourcing data much more straightforward.
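
For example, a data product’s logical model might be realized in the Gold layer as a view over governed Silver structures. A minimal sketch, with assumed table and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # predefined as `spark` on Databricks

# Gold-layer fact realized as a view, shaped by the data product's logical model.
spark.sql("""
    CREATE OR REPLACE VIEW gold.fact_daily_sales AS
    SELECT o.product_id,
           o.order_date,
           SUM(o.net_amount) AS daily_net_sales
    FROM silver.orders AS o
    GROUP BY o.product_id, o.order_date
""")
```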

Why ER/Studio?

ER/Studio has been designed to support enterprise data models. Building the models and mapping them to models of data sources and to business terms is baked in and easy to use. The ability to visualize these models and trace relationships is available within the Team Server component, so business and technical users alike can take advantage of it. Data Architect’s deep support for Databricks, with essential features like nested structures, means that physical models of the warehouse can be produced repeatably from enterprise models. ER/Studio will generate the code to create Databricks assets, so you don’t have to. Connectivity to your Data Governance tool means that your models are always properly classified and can be published back to the main data catalog. The Data Product design features of such tools, Microsoft Purview among them, can also be made actionable in ER/Studio.

Want to see more? Please register for a demo so our engineers can walk you through your use cases.
