A Comprehensive Guide to Data Engineering - Part Eight: Mastering Data Transformation (3)

Data, in its raw form, is often a complex maze of numbers and facts. To unlock its true potential, it must be transformed into a more accessible, understandable, and actionable format. Data transformations are critical in managing and analyzing data effectively. They enable the integration of disparate data sources into a unified format, optimize resource usage, and support the creation of complex data processing pipelines, ultimately enhancing the overall value and utility of the data.

What are Transformations?

While a query retrieves data based on specific conditions and join logic, a transformation persists the results of those queries for future use. This persistence can be either temporary or permanent.

Data transformations often involve building complex pipelines that integrate data from multiple sources and may reuse intermediate results. These pipelines can include data normalization, modelling, aggregation, and feature extraction. While a complex dataflow can be expressed as a single query, this approach quickly becomes unwieldy and inconsistent. Transformations offer a more structured and manageable alternative.

Transformations rely heavily on orchestration, an undercurrent of the data engineering lifecycle. Transformation involves storing data temporarily or permanently for use in downstream transformations or for serving to end users. Modern transformation pipelines often span multiple tables, datasets, and even systems, necessitating efficient orchestration.

Batch Transformations

Batch transformations process data in discrete chunks or batches. This means that the data is collected over some time and then processed all at once. These transformations can be scheduled to run at specific intervals, such as daily, hourly, or every 15 minutes. The schedule depends on the requirements of the project or the specific use case. There are various patterns and technologies specific to batch transformations.

Distributed Join

A distributed join breaks down a larger, logical join (defined by query logic) into smaller joins that are executed on individual servers within a cluster. This approach is used to manage and process large data sets that cannot be handled by a single server due to their size or complexity.

Broadcast Join

Broadcast joins are a powerful technique in distributed data processing used to efficiently handle join operations between a large and a small table. By broadcasting the smaller table across the cluster and joining it with segments of the larger table at each node, broadcast joins minimize resource usage and improve query performance, especially when combined with smart query optimization techniques like join reordering and prefiltering.
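As a rough illustration, the sketch below shows a broadcast join in PySpark; the paths, table names, and join column are placeholders rather than anything from a real pipeline.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

    # Illustrative inputs: a large fact table and a small lookup table.
    orders = spark.read.parquet("s3://example-bucket/orders/")        # large
    countries = spark.read.parquet("s3://example-bucket/countries/")  # small

    # The broadcast hint asks Spark to replicate the small table to every executor,
    # so each node joins its local partitions of orders without shuffling them.
    enriched = orders.join(broadcast(countries), on="country_code", how="left")

    enriched.write.mode("overwrite").parquet("s3://example-bucket/orders_enriched/")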

Shuffle Hash Joins

Shuffle hash join is a technique used in distributed data processing for joining large datasets that are initially distributed across multiple nodes. It involves repartitioning the data based on the join key using a hashing scheme and then reshuffling the data so that the relevant parts of the tables can be joined on the same node. This method is more resource-intensive compared to broadcast joins due to the extensive data movement required.
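The sketch below shows the same idea in PySpark when both sides are too large to broadcast; the SHUFFLE_HASH hint is available in Spark 3 and later, and the table paths and keys are again placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("shuffle-join-sketch").getOrCreate()

    # Both inputs are assumed to be large, so neither can be broadcast.
    clicks = spark.read.parquet("s3://example-bucket/clicks/")
    impressions = spark.read.parquet("s3://example-bucket/impressions/")

    # With the hint, Spark repartitions both tables by a hash of the join key so
    # that matching rows land on the same node; without it, Spark typically falls
    # back to a shuffle-based sort-merge join, which moves data in a similar way.
    joined = clicks.join(impressions.hint("SHUFFLE_HASH"), on="ad_id", how="inner")

    joined.explain()  # the physical plan shows the exchange (shuffle) stages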

ETL, ELT, and Data Pipelines

The evolution from traditional ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform) reflects advancements in data management practices. Initially, ETL involved extracting data from source systems, transforming it externally, and then loading it into target systems like data warehouses, constrained by limited storage and processing power.

With the advent of more powerful data warehouses, the ELT approach gained prominence, allowing raw data to be loaded directly into warehouses and transformed there, streamlining the process. This shift is further nuanced with the emergence of data lakes, where data is often loaded without immediate transformation, risking the creation of unorganized 'data swamps'.

The distinction between ETL and ELT blurs in modern data lakehouse environments, leading to a recommendation that these strategies be applied more flexibly at the micro-level within individual pipelines rather than as a standardized organizational approach, emphasizing a tailored, case-by-case methodology in data transformation practices.
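To make the ELT pattern concrete, here is a minimal PySpark sketch that lands raw data first and transforms it afterwards inside the warehouse. It assumes a lakehouse catalog (such as Delta Lake) that supports CREATE OR REPLACE TABLE AS SELECT, and all bucket, table, and column names are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

    # Extract + Load: land the raw export as-is, with no transformation yet.
    raw = spark.read.json("s3://example-bucket/raw/customers/")
    raw.write.mode("overwrite").saveAsTable("staging.customers_raw")

    # Transform: reshape the data in place with SQL, using the warehouse's compute.
    spark.sql("""
        CREATE OR REPLACE TABLE analytics.customers AS
        SELECT
            customer_id,
            lower(trim(email))  AS email,
            to_date(signup_ts)  AS signup_date
        FROM staging.customers_raw
        WHERE customer_id IS NOT NULL
    """)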

SQL-Based and Non-SQL-Based Transformation Systems

In the realm of big data processing, the distinction between SQL-based and non-SQL-based transformation systems is increasingly nuanced. SQL, once seen as less flexible due to its declarative nature, has become integral in big data platforms like Hive, Spark, and streaming frameworks such as Kafka and Flink. This development blurs the lines between SQL-only tools and those supporting broader programming paradigms.

While SQL excels in creating complex data workflows and is effective in certain scenarios, it has limitations, especially in cases requiring procedural programming or advanced data manipulations, where Spark or PySpark might be more suitable. When choosing between SQL and programming languages in data processing environments, considerations include the complexity and maintainability of the code, the potential for code reuse, and the use of user-defined functions.
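The contrast is easiest to see side by side. The sketch below expresses the same daily revenue aggregation once as Spark SQL and once as a PySpark DataFrame pipeline; the orders table and its columns are assumptions made for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sql-vs-dataframe-sketch").getOrCreate()

    orders = spark.read.parquet("s3://example-bucket/orders/")
    orders.createOrReplaceTempView("orders")

    # Declarative SQL: concise and readable for set-based aggregation.
    daily_sql = spark.sql("""
        SELECT order_date, SUM(amount) AS revenue
        FROM orders
        GROUP BY order_date
    """)

    # The same logic as a DataFrame pipeline: easier to parameterize, unit test,
    # and extend with procedural logic or user-defined functions.
    daily_df = (
        orders
        .groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"))
    )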

A balanced approach, leveraging the strengths of both SQL and programming languages, is recommended, with an emphasis on optimization skills for complex data processing tasks.

Update Patterns

In data engineering, transformations often involve persisting data, and a common challenge is updating this persisted data. Updating data can be tricky, particularly when transitioning between different data engineering technologies or dealing with SQL Data Manipulation Language (DML). Several update patterns are commonly used:

  1. Truncate and Reload: This pattern involves completely clearing a table and reloading it with transformed data. It's a straightforward approach but can be resource-intensive.
  2. Insert Only: In this approach, new records are appended without altering or deleting old ones, which preserves history. The current state of the data must then be reconstructed at query time (for example, by taking the latest record per key), which becomes more computationally expensive as the table grows.
  3. Deletes: Deletion is essential, especially with regulations like GDPR requiring targeted data removal. In columnar systems and data lakes, deletes are more resource-intensive than inserts and can be handled as either hard or soft deletes.
  4. Upsert/Merge: This is a more complex pattern, especially when transitioning from row-based to column-based systems. It involves updating existing records or inserting new ones based on a key match (a minimal sketch follows this list). While powerful, it can be performance-intensive due to the nature of file-based systems and their reliance on copy-on-write (COW) mechanisms.
  5. Schema Updates: Managing schema updates in columnar systems is easier than in row-based systems. However, organizational challenges persist, like automating schema updates or managing changes without breaking downstream processes.
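As a minimal sketch of the upsert/merge pattern, the snippet below uses a Delta Lake-style MERGE statement issued through Spark SQL; other engines such as Snowflake and BigQuery support MERGE with slightly different syntax, and the table names here are assumptions for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("upsert-sketch").getOrCreate()

    # An incremental extract containing the latest state of each customer.
    updates = spark.read.parquet("s3://example-bucket/customer_updates/")
    updates.createOrReplaceTempView("customer_updates")

    # Matched rows are rewritten (copy-on-write under the hood in file-based
    # tables); unmatched rows are inserted as new records.
    spark.sql("""
        MERGE INTO analytics.customers AS t
        USING customer_updates AS s
        ON t.customer_id = s.customer_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)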

Data Wrangling

Data wrangling is an essential process in data engineering where messy, often unstructured data is transformed into clean, usable data, typically through batch transformation. The process involves ingesting data that is challenging to work with due to its poor quality, then parsing and segmenting it to correct anomalies and edge cases. Data wrangling tools, often equipped with graphical interfaces, streamline this work by automating parsing and ingestion and letting users apply processing steps for issues like data type correction and field splitting. Sometimes described as 'no-code' IDEs for malformed data, these tools are particularly valuable in organizations that frequently take on new, messy data sources, and they can save data engineers significant time, freeing them to focus on more advanced tasks.
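Even without a dedicated wrangling tool, the basic moves (splitting overloaded fields, correcting types, dropping unusable rows) look roughly like the PySpark sketch below; the file path, column names, and formats are invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("wrangling-sketch").getOrCreate()

    # A messy CSV export: every column arrives as a string.
    raw = spark.read.option("header", True).csv("s3://example-bucket/messy_export.csv")

    cleaned = (
        raw
        # Field splitting: "Doe, Jane" becomes separate last_name / first_name columns.
        .withColumn("last_name", F.trim(F.split(F.col("full_name"), ",").getItem(0)))
        .withColumn("first_name", F.trim(F.split(F.col("full_name"), ",").getItem(1)))
        # Type correction: amounts arrive as strings such as " 1,204.50 ".
        .withColumn("amount", F.regexp_replace(F.trim("amount"), ",", "").cast("double"))
        # Drop rows where a required key could not be recovered.
        .filter(F.col("customer_id").isNotNull())
    )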

MapReduce

MapReduce, a paradigm introduced by Google and popularized by Hadoop, played a pivotal role in the early days of big data, significantly influencing today's data engineering practices. It's a batch data transformation pattern involving three primary steps: map, shuffle, and reduce.

  1. Map Step: In a MapReduce job, the map tasks operate on data blocks distributed across different nodes. Each map task processes a data block independently, a concept known as "embarrassing parallelism," where tasks are executed in parallel without interdependencies, scaling linearly with the number of nodes.
  2. Shuffle Step: After the map step, a shuffle process redistributes the data across the cluster. This step involves reorganizing the data so that all data related to a particular key is brought to the same node. The shuffle often uses a hashing algorithm on the keys to efficiently distribute the data.
  3. Reduce Step: The final stage is the reduce step, where the data is aggregated on each node based on the key. The results from all nodes can then be combined to form the final output.
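The classic word count shows all three steps in a few lines. The sketch below uses Spark's RDD API rather than Hadoop MapReduce itself, but the structure (a parallel map over input blocks, a shuffle keyed by a hash of the word, then a per-key reduce) is the same; the input and output paths are placeholders.

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")

    lines = sc.textFile("s3://example-bucket/logs/")    # input blocks are read in parallel

    counts = (
        lines.flatMap(lambda line: line.split())        # map: emit one record per word
             .map(lambda word: (word, 1))               # map: pair each word with a count of 1
             .reduceByKey(lambda a, b: a + b)           # shuffle by hashed key, then reduce per key
    )

    counts.saveAsTextFile("s3://example-bucket/word_counts/")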

Materialized Views, Federation, and Query Virtualization

Several techniques in data engineering virtualize query results, presenting them as table-like objects, useful in transformation pipelines or just before end-user data consumption.

  1. Views: These are database objects that can be selected like any other table but are essentially queries referencing other tables. When a view is selected, the database combines the view's query with the user's query for optimization and execution. Views serve several roles: security (restricting access by exposing only specific columns and filtered rows), data deduplication (presenting a current, deduplicated picture of the data, especially useful with insert-only patterns; a sketch follows this list), and common data access (simplifying frequently used complex queries, such as multi-table joins).
  2. Materialized Views: Unlike regular views that don’t precompute results, materialized views do some or all computations in advance, saving results for faster access. They can be particularly beneficial in optimizing queries that don’t directly reference them but resemble the materialized view, allowing the optimizer to select from pre-computed results.
  3. Composable Materialized Views: Traditional materialized views don’t allow for composition (selecting from another materialized view), but newer tools like Databricks’ live tables support this, updating tables as new data arrives and allowing data to flow down to subsequent tables asynchronously.
  4. Federated Queries: These allow OLAP databases to select from external data sources, such as object storage or other RDBMS. For instance, Snowflake can define external tables on S3 buckets, combining S3 data with internal database tables, enhancing compatibility with data lake environments.
  5. Data Virtualization: This entails a data processing and query system that doesn’t store data internally, exemplified by systems like Trino and Presto. A key aspect is query pushdown, aiming to offload computation to source databases and minimize data transfer across networks. Data virtualization is suitable for organizations with data spread across various sources but requires careful application to avoid impacting production systems. It can be part of data ingestion and processing pipelines, helping to unify data access across an organization and fitting well with the concept of a data mesh.
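To ground the deduplication role of views mentioned above, here is a small Spark SQL sketch that layers a current-state view over an insert-only history table; the table and column names are assumptions, and similar syntax works in most warehouses.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("views-sketch").getOrCreate()

    # A plain view stores no data; its query is merged with whatever query
    # later selects from it.
    spark.sql("""
        CREATE OR REPLACE VIEW analytics.customers_current AS
        SELECT *
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY customer_id
                       ORDER BY updated_at DESC
                   ) AS rn
            FROM analytics.customers_history
        ) AS ranked
        WHERE rn = 1
    """)

    # Downstream consumers simply read the deduplicated, current picture of the data.
    spark.sql("SELECT customer_id, email FROM analytics.customers_current").show()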

Streaming Transformations

Streaming transformations and queries, while related, serve different purposes in data processing. Streaming queries continuously provide a dynamic view of the data, reflecting its latest state. In contrast, streaming transformations prepare data for downstream use, such as enriching IoT event streams with additional metadata for further analysis.
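A common shape for such an enrichment is a stream-static join in Spark Structured Streaming, sketched below; the Kafka broker, topic, schema, and storage paths are all illustrative assumptions rather than a reference setup.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StringType, DoubleType

    spark = SparkSession.builder.appName("stream-enrichment-sketch").getOrCreate()

    event_schema = (StructType()
                    .add("device_id", StringType())
                    .add("temperature", DoubleType())
                    .add("event_time", StringType()))

    # Streaming source: raw IoT events arriving on a Kafka topic.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "iot-events")
              .load()
              .select(from_json(col("value").cast("string"), event_schema).alias("e"))
              .select("e.*"))

    # Static metadata table used to enrich each event with device attributes.
    devices = spark.read.parquet("s3://example-bucket/device_metadata/")

    enriched = events.join(devices, on="device_id", how="left")

    # Micro-batch sink; shrinking the trigger interval moves this closer to real time.
    (enriched.writeStream
             .format("parquet")
             .option("path", "s3://example-bucket/enriched_events/")
             .option("checkpointLocation", "s3://example-bucket/checkpoints/enriched/")
             .trigger(processingTime="1 minute")
             .start())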

The distinction between transformations and queries becomes blurrier in streaming compared to batch processing, especially in operations like computing statistics on data windows. Streaming Directed Acyclic Graphs (DAGs) facilitate complex real-time operations like merging and splitting multiple streams.

A key debate in stream processing is between micro-batch and true streaming approaches. Micro-batching processes small, frequent batches and suits scenarios where immediate real-time processing isn't crucial. True streaming, on the other hand, handles events individually and is ideal for low-latency requirements. The choice between these methods depends on the specific performance needs and the nature of the data.

Choosing the right approach involves understanding the specific requirements of the project, domain expertise, and careful evaluation of the technology's capabilities. It's also important to be cautious of vendor-provided benchmarks, as they might not always align with real-world scenarios.

Undercurrents and The Transformation Stage

Here is how the undercurrents of the data engineering lifecycle apply to the transformation stage.

  1. Security: Security is paramount in data transformations. It's important to control who has access to the new datasets created from transformations and to manage access at the column, row, and cell levels. Protecting databases against attack vectors and maintaining tight control over read/write privileges is essential. Credentials should be kept secure and not embedded in code or unencrypted files.
  2. Data Management: Data management is crucial during the transformation phase. New datasets created by transformations need careful management. Involving all stakeholders in the data modelling process and agreeing on naming conventions and data definitions is important. Tools like data catalogues can provide clarity on data fields, including their meaning, creation, and maintenance details. Ensuring data accuracy and adherence to business logic is vital, and semantic or metrics layers can be used to maintain this logic independently of the transformations.
  3. DataOps: In DataOps, monitoring both data and systems is key. For data, this involves ensuring the correctness of inputs and outputs, schema accuracy, and data quality. For operations, monitoring system performance metrics like query queue length, memory usage, and disk I/O is important. Identifying and addressing bottlenecks and performance issues is crucial. In cloud environments, managing costs efficiently is also a significant aspect of DataOps.
  4. Data Architecture: The architecture chosen for data ingestion and storage greatly influences the ability to perform reliable queries and transformations. Using the right data pipelines and databases is essential to prevent performance issues. Understanding the trade-offs in architectural choices is necessary to ensure effective query performance.
  5. Orchestration: Managing transformation pipelines often starts with simple schedules but can quickly become complex. Using orchestration tools can help manage these pipelines more efficiently, especially when dealing with dependencies and multiple systems.
  6. Software Engineering: When writing transformation code, choosing the right language and platform is crucial. Best practices should be followed to ensure clean, performant code. The rise of analytics engineering and tools like dbt empowers analysts and data scientists to perform transformations using SQL, shifting some responsibilities from data engineers. Even when using GUI-based low-code tools, understanding the underlying processes is important for debugging and performance optimization.

Resources

  1. Fundamentals of Data Engineering by Joe Reis & Matt Housley




