Data Lake 101: Architecture

Shanoj Kumar V

VP - Technology Architect & Data Engineering | AWS | AI & ML | Big Data & Analytics | Digital Transformation Leader | Author

Published Feb 5, 2024

A Data Lake is a centralized location designed to store, process, and protect large amounts of data from various sources in its original format. It is built to manage the scale, versatility, and complexity of big data, which includes structured, semi-structured, and unstructured data. It provides extensive data storage, efficient data management, and advanced analytical processing across different data types. The logical architecture of a Data Lake typically consists of several layers, each with a distinct purpose in the data lifecycle, from data intake to utilization.

Data Delivery Type and Production Cadence

Data within the Data Lake can be delivered in multiple forms, including table rows, data streams, and discrete data files. It supports various production cadences, catering to batch processing and real-time streaming, to meet different operational and analytical needs.

Landing / Raw Zone

The Landing or Raw Zone is the initial repository for all incoming data, where it is stored in its original, unprocessed form. This area serves as the data’s entry point, maintaining its integrity and ensuring traceability by keeping it immutable.
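One way to picture the write-once behaviour of this zone is a path scheme that never lets a new arrival overwrite an old one. The sketch below is illustrative only: it uses a local directory in place of object storage, and the `landing_path` and `land` helpers, the `ingest_date=` folder convention, and the `.raw` suffix are all assumptions rather than anything a particular platform prescribes.

```python
import hashlib
from datetime import datetime
from pathlib import Path


def landing_path(root: Path, source: str, payload: bytes, received_at: datetime) -> Path:
    """Build a write-once path: source / ingestion date / content hash."""
    digest = hashlib.sha256(payload).hexdigest()
    day = received_at.strftime("%Y-%m-%d")
    return root / source / f"ingest_date={day}" / f"{digest}.raw"


def land(root: Path, source: str, payload: bytes, received_at: datetime) -> Path:
    """Store the payload byte-for-byte and never overwrite an existing file."""
    target = landing_path(root, source, payload, received_at)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        # The bytes are written exactly as received: no parsing, no cleanup.
        target.write_bytes(payload)
    return target
```

Because the file name is derived from the content hash, landing the same payload twice on the same day resolves to the same path, and a changed payload always gets a new one.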

Clean/Transform Zone

Following the landing zone, data is moved to the Clean/Transform Zone, where it undergoes cleaning, normalization, and transformation. This step prepares the data for analysis by standardizing its format and structure, enhancing data quality and usability.
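As a rough illustration of what cleaning and normalization can mean in practice, the sketch below standardizes a batch of raw rows. The field names (`customer_id`, `country`, `signup_date`), the day/month/year input format, and the choice to set rejected rows aside are hypothetical; it also assumes every value arrives as a string, as it would from a CSV file.

```python
from datetime import datetime


def clean_record(raw: dict) -> dict:
    """Normalize one raw record: trim text, lowercase keys, standardize types."""
    record = {key.strip().lower(): value.strip() for key, value in raw.items()}
    return {
        "customer_id": int(record["customer_id"]),
        "country": record["country"].upper(),
        # Accept day/month/year input and emit ISO 8601 dates.
        "signup_date": datetime.strptime(record["signup_date"], "%d/%m/%Y").date().isoformat(),
    }


def clean_batch(rows):
    """Return (clean rows, rejected rows) so bad input is kept for review."""
    clean, rejected = [], []
    for row in rows:
        try:
            clean.append(clean_record(row))
        except (KeyError, ValueError):
            # Missing fields or unparseable values: set the row aside unchanged.
            rejected.append(row)
    return clean, rejected
```

Keeping the rejected rows, rather than dropping them, makes it possible to trace a quality problem back to the untouched copy in the landing zone.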

Cataloguing & Search Layer

The Ingestion Layer manages data entry into the Data Lake, capturing essential metadata and categorizing data appropriately. It supports various data ingestion methods, including batch and real-time streams, facilitating efficient data discovery and management.

Data Structure

The Data Lake accommodates a wide range of data structures, from structured databases and CSV files to semi-structured, like JSON and XML, and unstructured data, including text documents and multimedia files.
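To make the three families of structure concrete, here is a small sketch that reads the same kind of record from CSV, JSON Lines, and XML using only the Python standard library. The helper names and the sample field names are made up for illustration; a real lake would typically rely on a processing engine rather than hand-written readers.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def read_csv(text: str) -> list:
    """Structured: every row has the same columns, taken from the header line."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_lines(text: str) -> list:
    """Semi-structured: one JSON object per line, fields may vary by record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def read_xml(text: str, tag: str) -> list:
    """Semi-structured: each <tag> element becomes a dict of its child elements."""
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in item} for item in root.iter(tag)]
```

Note that the CSV and XML readers return every value as a string, while the JSON reader preserves numbers and nested lists, which is one reason a later cleaning step is needed to agree on types.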

Processing Layer

The Processing Layer is at the heart of the Data Lake, equipped with powerful tools and engines for data manipulation, transformation, and analysis. It facilitates complex data processing tasks, enabling advanced analytics and data science projects.

Curated/Enriched Zone

Data that has been cleaned and transformed is further refined in the Curated/Enriched Zone. It is enriched with additional context or combined with other data sources, making it highly valuable for analytical and business intelligence purposes. This zone hosts data ready for consumption by end-users and applications.
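Enrichment often amounts to joining cleaned records against a reference data set. The following is a minimal sketch of that idea, assuming hypothetical `orders` and `customers` collections keyed by `customer_id`; the field names are invented for the example.

```python
def enrich(orders, customers):
    """Attach customer attributes to each order (a left join on customer_id)."""
    by_id = {customer["customer_id"]: customer for customer in customers}
    enriched = []
    for order in orders:
        customer = by_id.get(order["customer_id"], {})
        enriched.append({
            **order,
            # Orders without a matching customer keep None for the added fields.
            "country": customer.get("country"),
            "segment": customer.get("segment"),
        })
    return enriched
```

A left join is used here so that no order disappears when reference data is missing; whether unmatched rows should instead be excluded is a design decision for each curated data set.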

Consumption Layer

Finally, the Consumption Layer provides mechanisms for end-users to access and utilize the data. Through various tools and applications, including business intelligence platforms, data visualization tools, and APIs, users can extract insights and drive decision-making processes based on the data stored in the Data Lake.

AWS Data Lakehouse Architecture

An AWS Data Lakehouse combines a data lake and a data warehouse on Amazon Web Services to provide a centralized data storage solution. It holds raw data in its original form while also supporting the structured, precise queries that detailed analysis requires. By breaking down data silos, a Data Lakehouse strengthens data governance and security while simplifying advanced analytics. It gives businesses a way to uncover new insights while keeping data management and analytical capabilities flexible.

Kinesis Firehose

Amazon Kinesis Firehose is a fully managed service provided by Amazon Web Services (AWS) that captures streaming data and loads it into data stores and analytics tools. With Kinesis Firehose, you can ingest, transform, and deliver data in near real time to destinations such as Amazon S3, Amazon Redshift, and Amazon OpenSearch Service (formerly Amazon Elasticsearch Service). The service scales automatically with the volume of streaming data and requires no ongoing administration. Kinesis Firehose works with data formats such as JSON, CSV, and Apache Parquet, among others, and provides built-in data transformation capabilities to prepare data for analysis. With Kinesis Firehose, you can focus on your data processing logic and leave the data delivery infrastructure to AWS.
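A producer usually hands events to Firehose in batches. The sketch below shows one way to do that, assuming `client` is a Firehose client such as the one returned by `boto3.client("firehose")`; the helper names, the stream name, and the newline-delimited JSON encoding are illustrative choices, not requirements of the service. The batch size of 500 reflects the per-call record limit of `put_record_batch`.

```python
import json


def to_firehose_records(events):
    """Encode each event as one line of JSON so delivered files stay line-delimited."""
    return [{"Data": (json.dumps(event) + "\n").encode("utf-8")} for event in events]


def send_events(client, stream_name, events, batch_size=500):
    """Send events in batches and return how many records the service reported as failed."""
    failed = 0
    records = to_firehose_records(events)
    for start in range(0, len(records), batch_size):
        response = client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[start:start + batch_size],
        )
        # A call can succeed overall while individual records fail.
        failed += response.get("FailedPutCount", 0)
    return failed
```

In a real producer, a non-zero return value would be a signal to inspect the per-record responses and retry only the records that failed.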

Amazon CloudWatch

Amazon CloudWatch is a monitoring service that tracks your operational metrics and logs and raises alarms when they cross thresholds you define. It collects data from resources such as EC2 instances, RDS databases, and Lambda functions in near real time. With CloudWatch, you can gain insights into your application's performance and troubleshoot issues quickly.

Amazon S3 for State Backend

The Amazon S3 state backend serves as the backbone of the Data Lakehouse. It acts as a durable repository for the state of streaming data, so that state outlives any single processing job.

Amazon Kinesis Data Analytics

Amazon Kinesis Data Analytics uses SQL and Apache Flink to run real-time analytics on streaming data as it arrives.

Amazon S3

Amazon S3 is a secure, scalable, and resilient storage for the Data Lakehouse's data.

AWS Glue Data Catalog

The AWS Glue Data Catalog is a fully managed metadata repository that enables easy data discovery, organization, and management for streamlined analytics and processing in the Data Lakehouse. It provides a unified view of all data assets, including databases, tables, and partitions, making it easier for data engineers, analysts, and scientists to find and use the data they need. The AWS Glue Data Catalog also supports automatic schema discovery and inference, making it easier to maintain accurate and up-to-date metadata for all data assets. With the AWS Glue Data Catalog, organizations can improve data governance and compliance, reduce data silos, and accelerate time-to-insight.

Amazon Athena

Amazon Athena enables users to query data in Amazon S3 using standard SQL without ETL complexities, thanks to its serverless and interactive architecture.
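Submitting a query to Athena from code might look like the sketch below. It assumes `client` is an Athena client such as the one returned by `boto3.client("athena")`; the database name, table name, column names, and results location are hypothetical placeholders.

```python
# Hypothetical table and columns, for illustration only.
DAILY_ORDERS_SQL = """
SELECT country, COUNT(*) AS orders
FROM curated_orders
GROUP BY country
ORDER BY orders DESC
"""


def run_query(client, sql, database, output_location):
    """Start an Athena query and return its execution id."""
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The call returns as soon as the query is accepted, so a caller would normally poll for completion with the returned id before reading the results.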

Amazon Redshift

Amazon Redshift is a highly efficient and scalable data warehouse service that streamlines the process of data analysis. It is designed to enable users to query vast amounts of structured and semi-structured data stored across their data warehouse, operational database, and data lake using standard SQL. With Amazon Redshift, users can gain valuable insights and make data-driven decisions quickly and easily. Additionally, Amazon Redshift is fully managed, allowing users to focus on their data analysis efforts rather than worrying about infrastructure management. Its flexible pricing model, based on usage, makes it a cost-effective solution for businesses of all sizes.

Consumption Layer

The Consumption Layer includes business intelligence tools and applications like Amazon QuickSight. This layer allows end-users to visualize, analyze, and interpret the processed data to derive actionable business insights.
