Creating Efficient ETL Pipelines with Databricks Delta
- DataThrive

- Feb 19, 2023
- 4 min read

Data engineering is a fundamental part of the data lifecycle. Efficiently processing and transforming data into a structured format is essential for data analysts and scientists to extract insights and derive meaningful conclusions. One of the most popular tools for building ETL pipelines is Apache Spark, and Databricks Delta extends Spark's capabilities for efficient, reliable data processing.
In this blog post, we will explore how Databricks Delta can help us create efficient ETL pipelines. We will start by understanding what Databricks Delta is and how it builds on Apache Spark, and then move on to creating an ETL pipeline using Databricks Delta with both PySpark and Spark SQL.
What is Databricks Delta?
Databricks Delta (also known as Delta Lake) is not a separate engine but a storage layer that runs on top of Apache Spark and adds reliability to data processing. It is designed to handle large datasets and is particularly useful for managing data lakes. Compared with writing plain CSV or Parquet files from Spark, Delta tables offer faster queries and better performance through features such as data skipping and file compaction. Delta also adds ACID transactions and scalable metadata handling, which let users track every change to a table and ensure data integrity.
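To make this concrete, here is a small sketch of what Delta's transaction log enables in practice: every write becomes a versioned commit, and earlier versions of a table can be read back with time travel. The path /mnt/delta_demo and the sample data are hypothetical and used only for illustration; on a Databricks cluster the spark session and the Delta libraries are already available.
# Write an initial batch of rows as a Delta table (this becomes version 0 of the table)
events_v0 = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])
events_v0.write.format("delta").mode("overwrite").save("/mnt/delta_demo")
# Overwrite the table with a second batch (this becomes version 1)
events_v1 = spark.createDataFrame([(3, "purchase")], ["id", "event"])
events_v1.write.format("delta").mode("overwrite").save("/mnt/delta_demo")
# Time travel: read the table as it was at version 0
spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta_demo").show()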
Creating an ETL Pipeline with Databricks Delta using PySpark
For this example, we will use a more complex dataset containing online retail transactions. Our goal is to create an ETL pipeline that processes this data and saves the transformed data to a new location. We will perform the following operations in our pipeline:
- Read the data from the source location.
- Filter out rows with a null InvoiceNo.
- Create a new column that contains the total transaction amount.
- Aggregate the data by country and month.
- Join the aggregated data with a lookup table.
- Save the transformed data to a new location.
Here is the PySpark code for the ETL pipeline using Databricks Delta:
from pyspark.sql.functions import col, sum, year, month
# Read the data from the source location, inferring column types so that amounts and dates are typed correctly
source_data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/mnt/source_location")
# Filter out rows with a null InvoiceNo
filtered_data = source_data.filter(col("InvoiceNo").isNotNull())
# Create a new column that contains the total transaction amount
transformed_data = filtered_data.withColumn("total_amount", col("Quantity") * col("UnitPrice"))
# Aggregate the data by country and month
aggregated_data = transformed_data.groupBy("Country", year("InvoiceDate").alias("year"), month("InvoiceDate").alias("month")).agg(sum("total_amount").alias("total_sales"))
# Join the aggregated data with a lookup table
lookup_table = spark.read.format("delta").load("/mnt/lookup_table_location")
joined_data = aggregated_data.join(lookup_table, "Country")
# Save the transformed data to a new location in the Delta format
joined_data.write.format("delta").mode("overwrite").save("/mnt/destination_location")
In this code, we read the data from the source location using the CSV format and filter out rows where InvoiceNo is null. We then create a new column for the total transaction amount and aggregate the data by country and month. Next, we join the aggregated data with the lookup table on the Country column, and finally we save the transformed data to the destination location using the Databricks Delta format.
Creating an ETL Pipeline with Databricks Delta using Spark SQL
For this example, we will use the same dataset containing information about online retail transactions as we used in the PySpark example. Our goal is to create an ETL pipeline that processes this data and saves the transformed data to a new location. We will perform the following operations in our pipeline:
- Create a table from the data in the source location using the CSV format.
- Filter out rows with a null InvoiceNo and create a temporary view of the filtered data.
- Create another temporary view with a new column for the total transaction amount.
- Aggregate the data by country and month, and create a temporary view of the aggregated data.
- Join the aggregated data with a lookup table on the Country column and create another temporary view of the joined data.
- Save the transformed data to a new location in the Databricks Delta format by creating a new table from the joined data.
Here is the Spark SQL code for the ETL pipeline using Databricks Delta:
-- Read the data from the source location
CREATE TABLE source_data
USING csv
OPTIONS (header "true", path "/mnt/source_location");
-- Filter out rows with a null InvoiceNo
CREATE OR REPLACE TEMPORARY VIEW filtered_data AS
SELECT *
FROM source_data
WHERE InvoiceNo IS NOT NULL;
-- Create a new column that contains the total transaction amount
CREATE OR REPLACE TEMPORARY VIEW transformed_data AS
SELECT *, Quantity * UnitPrice AS total_amount
FROM filtered_data;
-- Aggregate the data by country and month
CREATE OR REPLACE TEMPORARY VIEW aggregated_data AS
SELECT Country, year(InvoiceDate) AS year, month(InvoiceDate) AS month, sum(total_amount) AS total_sales
FROM transformed_data
GROUP BY Country, year(InvoiceDate), month(InvoiceDate);
-- Join the aggregated data with a lookup table
CREATE TABLE lookup_table
USING delta
OPTIONS (path "/mnt/lookup_table_location");
CREATE OR REPLACE TEMPORARY VIEW joined_data AS
SELECT a.Country, a.year, a.month, a.total_sales, b.lookup_value
FROM aggregated_data a
JOIN lookup_table b
ON a.Country = b.Country;
-- Save the transformed data to a new location as a Delta table
-- (named final_sales_data so it does not clash with the transformed_data view above)
CREATE TABLE final_sales_data
USING delta
OPTIONS (path "/mnt/destination_location")
AS SELECT *
FROM joined_data;
In this Spark SQL code, we create a table from the data in the source location using the CSV format. We filter out rows where InvoiceNo is null and create a temporary view of the filtered data, then another temporary view with a new column for the total transaction amount. We aggregate the data by country and month, join the aggregated data with the lookup table on the Country column, and finally save the transformed data to the destination location as a Delta table.
Spark SQL provides an alternative approach for creating ETL pipelines, and it can be useful for those who are more comfortable with SQL-based transformations. It also provides a way to combine the power of Spark with the simplicity of SQL.
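As a quick illustration of combining the two styles, a SQL query against a temporary view returns an ordinary DataFrame that can be processed further with PySpark. This is a minimal sketch that assumes the aggregated_data view from the example above exists; the 10000 threshold is an arbitrary value chosen for illustration.
# Query the temporary view with SQL and continue in PySpark
top_markets = spark.sql("""
    SELECT Country, year, month, total_sales
    FROM aggregated_data
    WHERE total_sales > 10000
""")
# The result is a regular DataFrame, so DataFrame transformations can follow
top_markets.orderBy("total_sales", ascending=False).show(5)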
Conclusion
In conclusion, Databricks Delta is a powerful tool that extends the capabilities of Apache Spark for efficient data processing. We have explored how to create an ETL pipeline with Databricks Delta using both PySpark and Spark SQL, and discussed the advantages Delta adds on top of writing plain files with Spark. By using Databricks Delta, data engineers and data scientists can build ETL pipelines that are faster, more reliable, and easier to maintain.

