A Comprehensive Guide to Data Engineering - Part Eight: Mastering Data Transformation (3)
Data, in its raw form, is often a complex maze of numbers and facts. To unlock its true potential, it must be transformed into a more accessible, understandable, and actionable format. Data transformations are critical in managing and analyzing data effectively. They enable the integration of disparate data sources into a unified format, optimize resource usage, and support the creation of complex data processing pipelines, ultimately enhancing the overall value and utility of the data.
What are Transformations?
While a query retrieves data based on specific conditions and join logic, a transformation persists the results of those queries for future use. This persistence can be either temporary or permanent.
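As a rough sketch of the difference, the PySpark snippet below runs the same aggregation as a plain query and then persists the result; the orders table and column names are invented for the example.

```python
# A minimal PySpark sketch contrasting a query with a transformation.
# The `orders` table and its columns are assumed to exist for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformation-example").getOrCreate()

# A query: the result is computed and returned, but not persisted anywhere.
daily_totals = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""")

# A transformation: the same result is persisted for downstream use,
# either temporarily (a temp view) or permanently (a managed table).
daily_totals.createOrReplaceTempView("daily_totals_tmp")           # temporary
daily_totals.write.mode("overwrite").saveAsTable("order_totals")   # permanent
```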
Data transformations often involve building complex pipelines that integrate data from multiple sources and may reuse intermediate results. These pipelines can include data normalization, modelling, aggregation, and feature extraction. While complex dataflows can be built within a single query, that approach quickly becomes unwieldy and hard to keep consistent. Transformations offer a more structured and manageable alternative.
Transformations rely heavily on orchestration, one of the undercurrents of the data engineering lifecycle. They involve storing data temporarily or permanently for use in downstream transformations or for serving to end users. Modern transformation pipelines often span multiple tables, datasets, and even systems, making efficient orchestration essential.
Batch Transformations
Batch transformations process data in discrete chunks, or batches. The data is collected over a period of time and then processed all at once. These transformations can be scheduled to run at specific intervals, such as daily, hourly, or every 15 minutes, depending on the requirements of the project or the specific use case. There are various patterns and technologies specific to batch transformations.
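As an illustration of scheduling, a batch transformation is often triggered by an orchestrator such as Apache Airflow. The sketch below assumes Airflow 2.x; the dag_id, task, and run_batch_transform function are hypothetical placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_batch_transform():
    # Placeholder for the actual batch transformation logic.
    pass

with DAG(
    dag_id="daily_batch_transform",      # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",          # could also be "@hourly" or "*/15 * * * *"
    catchup=False,
) as dag:
    PythonOperator(task_id="transform", python_callable=run_batch_transform)
```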
Distributed Join
A distributed join breaks down a larger, logical join (defined by query logic) into smaller joins that are executed on individual servers within a cluster. This approach is used to manage and process large data sets that cannot be handled by a single server due to their size or complexity.
Broadcast Join
Broadcast joins are a powerful technique in distributed data processing used to efficiently handle join operations between a large and a small table. By broadcasting the smaller table across the cluster and joining it with segments of the larger table at each node, broadcast joins minimize resource usage and improve query performance, especially when combined with smart query optimization techniques like join reordering and prefiltering.
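A minimal PySpark sketch of this pattern follows; it assumes an existing SparkSession named spark, that dim_country is small enough to fit in memory on each executor, and that the table and column names exist as shown.

```python
from pyspark.sql.functions import broadcast

events = spark.table("events")            # large fact table (assumed to exist)
dim_country = spark.table("dim_country")  # small dimension table (assumed)

# The broadcast hint ships the small table to every node, so each node can
# join its local slice of `events` without shuffling the large table.
enriched = events.join(broadcast(dim_country), on="country_code", how="left")
```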
Shuffle Hash Joins
Shuffle hash join is a technique used in distributed data processing for joining large datasets that are initially distributed across multiple nodes. It involves repartitioning the data based on the join key using a hashing scheme and then reshuffling the data so that the relevant parts of the tables can be joined on the same node. This method is more resource-intensive compared to broadcast joins due to the extensive data movement required.
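In PySpark this strategy can be requested explicitly with a join hint (available in Spark 3.0+). The sketch below assumes an existing SparkSession named spark and illustrative table names.

```python
orders = spark.table("orders")        # large table (assumed to exist)
customers = spark.table("customers")  # also too large to broadcast (assumed)

# The shuffle_hash hint asks Spark to hash-partition both sides on the join
# key so that matching partitions land on the same node before joining.
joined = orders.join(customers.hint("shuffle_hash"), on="customer_id")
```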
ETL, ELT, and Data Pipelines
The evolution from traditional ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform) reflects advancements in data management practices. Initially, ETL involved extracting data from source systems, transforming it externally, and then loading it into target systems like data warehouses, constrained by limited storage and processing power.
With the advent of more powerful data warehouses, the ELT approach gained prominence, allowing raw data to be loaded directly into warehouses and transformed there, streamlining the process. This shift is further nuanced with the emergence of data lakes, where data is often loaded without immediate transformation, risking the creation of unorganized 'data swamps'.
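The toy sketch below contrasts the two orderings, using pandas and an in-memory SQLite database as a stand-in warehouse; the data and table names are invented for the example.

```python
# A self-contained, toy contrast of ETL vs ELT. Real systems would use a
# proper warehouse or lakehouse instead of SQLite.
import pandas as pd
import sqlite3

raw = pd.DataFrame({"order_id": [1, 2, 3], "amount": ["10.5", None, "7.25"]})
warehouse = sqlite3.connect(":memory:")

# ETL: transform first (outside the warehouse), then load the clean result.
clean = raw.dropna(subset=["amount"]).assign(amount=lambda d: d["amount"].astype(float))
clean.to_sql("sales_clean_etl", warehouse, index=False)

# ELT: load the raw data as-is, then transform inside the warehouse with SQL.
raw.to_sql("sales_raw", warehouse, index=False)
warehouse.execute("""
    CREATE TABLE sales_clean_elt AS
    SELECT order_id, CAST(amount AS REAL) AS amount
    FROM sales_raw
    WHERE amount IS NOT NULL
""")
```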
The distinction between ETL and ELT blurs in modern data lakehouse environments. Rather than standardizing on one strategy across the organization, it is better to apply these strategies flexibly at the micro-level, within individual pipelines, taking a tailored, case-by-case approach to data transformation.
SQL-Based and Non-SQL-Based Transformation Systems
In the realm of big data processing, the distinction between SQL-based and non-SQL-based transformation systems is increasingly nuanced. SQL, once seen as less flexible due to its declarative nature, has become integral in big data platforms like Hive, Spark, and streaming frameworks such as Kafka and Flink. This development blurs the lines between SQL-only tools and those supporting broader programming paradigms.
While SQL excels in creating complex data workflows and is effective in certain scenarios, it has limitations, especially in cases requiring procedural programming or advanced data manipulations, where Spark or PySpark might be more suitable. When choosing between SQL and programming languages in data processing environments, considerations include the complexity and maintainability of the code, the potential for code reuse, and the use of user-defined functions.
A balanced approach, leveraging the strengths of both SQL and programming languages, is recommended, with an emphasis on optimization skills for complex data processing tasks.
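As a small illustration of that trade-off, the PySpark sketch below expresses the same cleanup with built-in SQL functions and with a Python user-defined function; it assumes an existing SparkSession named spark, and the sample data is invented.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = spark.createDataFrame([("  Alice ",), ("BOB",)], ["name"])

# Built-in SQL functions: declarative, and visible to the query optimizer.
sql_clean = df.select(F.lower(F.trim(F.col("name"))).alias("name_clean"))

# A Python UDF: handy for procedural logic, but opaque to the optimizer and
# slower because rows are serialized between the JVM and Python.
@F.udf(returnType=StringType())
def clean_name(value):
    return value.strip().lower() if value is not None else None

udf_clean = df.select(clean_name("name").alias("name_clean"))
```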
Update Patterns
In data engineering, transformations often involve persisting data, and a common challenge is updating this persisted data. Updates can be tricky, particularly when transitioning between different data engineering technologies or dealing with SQL Data Manipulation Language (DML). Commonly used update patterns include truncate and reload, insert only, delete, and upsert/merge.
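A minimal sketch of the upsert/merge pattern is shown below. It assumes a table format that supports MERGE (such as Delta Lake, Apache Iceberg, or Apache Hudi), an existing SparkSession named spark, and illustrative table and column names.

```python
# Upsert: update matching rows, insert new ones, in a single statement.
# `customers` and `customer_updates` are assumed to exist for illustration.
spark.sql("""
    MERGE INTO customers AS target
    USING customer_updates AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN
        UPDATE SET target.email = source.email,
                   target.updated_at = source.updated_at
    WHEN NOT MATCHED THEN
        INSERT (customer_id, email, updated_at)
        VALUES (source.customer_id, source.email, source.updated_at)
""")
```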
Data Wrangling
Data wrangling is an essential process in data engineering in which messy, often unstructured data is transformed into clean, usable data, typically through batch transformation. The process involves ingesting data, which can be challenging due to its poor quality, and then parsing and segmenting it to correct anomalies and edge cases. Data wrangling tools, often equipped with graphical interfaces, streamline this work by automating parsing and ingestion and letting users apply processing steps for issues like data type correction and field splitting. These tools, sometimes viewed as 'no-code' IDEs for malformed data, are particularly valuable in organizations that frequently take on new, messy data sources, where they can save data engineers significant time and let them focus on more advanced tasks.
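As a toy example of the kind of cleanup involved, the pandas sketch below corrects a data type and splits a combined field; the messy input and column names are invented for the example.

```python
import pandas as pd

messy = pd.DataFrame({
    "full_name": ["Doe, Jane", "Smith,  John"],
    "signup_date": ["2024-01-03", "2024-02-15"],
    "amount": ["1,250.00", "$980"],
})

# Split the combined name field into separate columns.
names = messy["full_name"].str.split(",", expand=True)

clean = messy.assign(
    last_name=names[0].str.strip(),
    first_name=names[1].str.strip(),
    signup_date=pd.to_datetime(messy["signup_date"]),                       # fix the type
    amount=messy["amount"].str.replace(r"[,$]", "", regex=True).astype(float),  # strip symbols
).drop(columns=["full_name"])
```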
MapReduce
MapReduce, a paradigm introduced by Google and popularized by Hadoop, played a pivotal role in the early days of big data, significantly influencing today's data engineering practices. It's a batch data transformation pattern involving three primary steps: map, shuffle, and reduce.
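The pure-Python word count below is a minimal, in-memory illustration of those three steps; a real MapReduce system distributes them across a cluster, with the shuffle moving data between nodes.

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key (in a cluster, this step moves data across nodes).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```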
Materialized Views, Federation, and Query Virtualization
Several techniques in data engineering virtualize query results, presenting them as table-like objects, useful in transformation pipelines or just before end-user data consumption.
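As a rough contrast, the sketch below creates a virtualized view (re-executed on every read) and approximates a materialized view by persisting the result as a table. It assumes a SparkSession named spark and an orders table; note that Spark itself does not maintain true materialized views, so the persisted table would need scheduled refreshes, whereas engines with native materialized views refresh them automatically.

```python
# Virtualized: a view stores only the query; it is re-executed on every read.
spark.sql("""
    CREATE OR REPLACE VIEW daily_revenue_v AS
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

# Materialized (approximation): persist the result so readers hit precomputed data.
spark.table("daily_revenue_v").write.mode("overwrite").saveAsTable("daily_revenue_mv")
```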
Streaming Transformations
Streaming transformations and queries, while related, serve different purposes in data processing. Streaming queries continuously provide a dynamic view of the data, reflecting its latest state. In contrast, streaming transformations prepare data for downstream use, such as enriching IoT event streams with additional metadata for further analysis.
The distinction between transformations and queries becomes blurrier in streaming compared to batch processing, especially in operations like computing statistics on data windows. Streaming Directed Acyclic Graphs (DAGs) facilitate complex real-time operations like merging and splitting multiple streams.
A key debate in stream processing is between micro-batch and true streaming approaches. Micro-batching processes small, frequent batches and suits scenarios where immediate real-time processing isn't crucial. True streaming, on the other hand, handles events individually and is ideal for low-latency requirements. The choice between these methods depends on the specific performance needs and the nature of the data.
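As a sketch of a streaming transformation, the Spark Structured Streaming job below (which is micro-batch based) enriches an incoming IoT event stream with device metadata and writes the result every 30 seconds; the Kafka topic, broker address, file paths, and table names are assumptions for the example, and an existing SparkSession named spark is assumed.

```python
from pyspark.sql import functions as F

# Read raw IoT events from a Kafka topic (topic and broker are illustrative).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "iot-events")
    .load()
    .select(
        F.col("key").cast("string").alias("device_id"),
        F.col("value").cast("string").alias("payload"),
    )
)

devices = spark.table("device_metadata")  # static dimension table (assumed)

# Enrich each event with device metadata (a stream-static join), then write
# the transformed stream out in small batches every 30 seconds.
enriched = events.join(devices, on="device_id", how="left")

query = (
    enriched.writeStream.format("parquet")
    .option("path", "/data/enriched_events")
    .option("checkpointLocation", "/data/checkpoints/enriched_events")
    .trigger(processingTime="30 seconds")
    .start()
)
```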
Choosing the right approach involves understanding the specific requirements of the project, domain expertise, and careful evaluation of the technology's capabilities. It's also important to be cautious of vendor-provided benchmarks, as they might not always align with real-world scenarios.
Undercurrents and The Transformation Stage
Each of the undercurrents of the data engineering lifecycle carries its own considerations at the transformation stage.