July 21, 2023

Time travel with Apache Iceberg data lakehouses on AWS

In this article, we explain why data lakehouses are useful, what problems they solve, and demonstrate basic usage with the Apache Iceberg table format.

Rafał Mituła

What is a data lakehouse?

In an ever-growing world of data, traditional data lakes were assumed to be immutable: once written, data remains unchanged. This idea came from their main use case, which involved storing huge amounts of raw data for historical analysis and reporting.

As businesses grow, some use cases become hard (or even too expensive) to build with existing toolsets. In recent years, the need for handling data mutations at big data scale, such as updates and deletes, has grown. This is where the data lakehouse steps in. By combining the capabilities of data lakes and data warehouses, lakehouses allow for transactional operations, like updates and deletes, similar to what we see in data warehouses. A lakehouse is designed to handle both structured and unstructured data, supports various data types, and offers the flexibility to run different types of analytics, from machine learning to business intelligence.

Data lake + data warehouse = data lakehouse

In the architecture of a data lakehouse (often called a "transactional data lake") there is one essential component: a table format, such as Apache Iceberg, Apache Hudi, or Delta Lake. These technologies provide transactional capabilities, data versioning, rollback, time travel and upsert features, making them crucial for getting the most out of your data lakehouse and unlocking the true potential of your data.

What is an open table format like Apache Iceberg?

Apache Iceberg is an open-source table format for large-scale data processing, initially developed by Netflix and Apple. Iceberg was created to address the limitations and challenges of managing large analytic datasets stored in file formats such as Apache Parquet. Apache Iceberg adds ACID (atomicity, consistency, isolation, and durability) transactions, snapshot isolation, time travel, schema evolution and more. It is designed to provide efficient and scalable data storage and analytics capabilities, particularly for big data workloads.


Iceberg provides a table format abstraction that allows users to work with data using familiar SQL-like semantics. It captures rich metadata information about the dataset when individual data files are created. Iceberg tables consist of three layers: the Iceberg catalog, the metadata layer, and the data layer leveraging immutable file formats like Parquet, Avro and ORC.

Common use cases for Apache Iceberg

Some key features of Apache Iceberg include ACID transactions, snapshot isolation, time travel, schema evolution, data versioning with rollback, and upserts and deletes at scale.

But to fully unlock the potential of Apache Iceberg, we need to place it in a fully integrated environment. This is where AWS steps in.

Apache Iceberg on AWS

Apache Iceberg works with data processing frameworks like Apache Spark, Flink, Hive and Presto, and with AWS services like Amazon Athena, Amazon EMR and AWS Glue. These AWS services, combined with Iceberg, support a data lakehouse architecture with data stored in an Amazon S3 bucket and metadata in the AWS Glue Data Catalog.

The following diagram demonstrates how we can approach it on AWS:

Data lakehouse platform leveraging Apache Iceberg and AWS

The process starts with a data lake that functions as a primary repository for raw, unprocessed data. This initial stage consists of three integral components: Amazon S3 for storage, the AWS Glue Data Catalog for metadata, and AWS Lake Formation for access control.

Following this, arrows signify data transfer from the data lake to one of the three processing components: AWS Glue, Amazon EMR, or Amazon Athena. We can use any of these tools to gather our data and move it into the next stage.

All three tools perform data transformation, cleaning, and loading operations and are capable of saving processed data in Apache Iceberg table format. 

The last stage is the final data lakehouse, which is effectively the same infrastructure stack as the initial data lake (Amazon S3, AWS Glue Data Catalog, AWS Lake Formation), but with data stored in the optimized Apache Iceberg table format. Such a lakehouse can then be consumed by downstream services, such as Amazon Athena for interactive SQL analytics, as well as business intelligence and machine learning workloads.

Show time!

After this brief introduction, we will now demonstrate the basic features of a data lakehouse architecture on AWS using Apache Iceberg, an open-source table format for large datasets. 

Please note that this section is neither an end-to-end tutorial to follow, nor a deep dive into specific Iceberg features. Our focus here is to provide a basic walkthrough of Iceberg features and give you a sense of the core components used in a data lakehouse on AWS. The writing skips some boilerplate details, letting you focus only on the important parts.

We start with ingesting data into the Apache Iceberg realm. Then, we’ll show how you can query that data from Amazon Athena while using the time travel feature. Lastly, we will demonstrate how to generate a table with differences between each version of your datasets using the changelog mechanism of Apache Iceberg and Apache Spark, also known as change data capture.

Simplified data lakehouse platform architecture

Saving raw data to Apache Iceberg data lakehouse

We will now focus on the first part of the process, which is handled by an AWS Glue job. Its main purpose is to collect raw, unprocessed data and transform it into the Apache Iceberg format, which serves as the core of the data lakehouse.

Moving data from the data lake to the data lakehouse using an AWS Glue job

Let’s assume we already have some data in our Raw Data Zone Amazon S3 bucket that we want to process and analyze. For storage purposes, we use the JSON Lines format, which is a popular option for raw data. The data describes products and looks as follows:

Sample products records are stored in the Raw Data Zone under /day=01 partition
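The original records are shown in the article as a screenshot and the exact fields are not spelled out in the text, so the JSON Lines below are only an illustrative sketch of what such product records could look like (one JSON object per line; field names and values are assumptions, except Product A, which appears later in the article):

```json
{"product_id": 1, "name": "Product A", "category": "Electronics", "price": 49.99, "updated_at": "2023-07-01T08:00:00Z"}
{"product_id": 2, "name": "Product B", "category": "Electronics", "price": 19.99, "updated_at": "2023-07-01T08:00:00Z"}
{"product_id": 3, "name": "Product C", "category": "Home", "price": 9.99, "updated_at": "2023-07-01T08:05:00Z"}
```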

In our Raw Data Zone, we partition the data by ingestion date, so each day we put the ingestion results under a specific prefix in Amazon S3, which looks like this:

Amazon S3 Bucket structure under Raw Data Zone
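As an illustration, the layout could look roughly like this (the bucket and file names are assumptions; only the day= partitioning scheme comes from the article):

```
s3://raw-data-zone-bucket/products/day=01/products-0001.json
s3://raw-data-zone-bucket/products/day=02/products-0001.json
s3://raw-data-zone-bucket/products/day=03/products-0001.json
```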

Now, let's transfer our data into what's called a curated data zone: a data lakehouse where we'll store it in the Apache Iceberg table format, ready for further analytics. For that, we are going to use an AWS Glue job written in Python that uses PySpark (Apache Spark) to read and write data. The AWS Glue job needs to be configured to load the necessary libraries for Apache Iceberg. This can be done in three different ways:

  1. The AWS Glue job parameter --datalake-formats set to iceberg. This automatically loads the Apache Iceberg version that comes with the given AWS Glue version; for instance, AWS Glue 4.0 ships with Apache Iceberg 1.0.0.
  2. Apache Iceberg Connector for AWS Glue, which is available with a free AWS Marketplace subscription. However, the version of Apache Iceberg is pre-set, and not always the most recent one. At the time of writing, the latest version was 1.2.1.
  3. Manually download official Apache Iceberg jar files, upload them to an Amazon S3 bucket, and then reference them in the AWS Glue job using the --extra-jars parameter (a short sketch of these parameters follows this list). This allows you to choose any Apache Iceberg version. We used this method to access the latest features of Apache Iceberg, version 1.3.0 at the time of writing.
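For reference, the job parameters for options 1 and 3 could look roughly like this (the jar name matches the official Apache Iceberg 1.3.0 release artifact for Spark 3.3, while the bucket path is an assumption):

```
# Option 1: use the Iceberg version bundled with the AWS Glue version
--datalake-formats  iceberg

# Option 3: bring your own Iceberg release via --extra-jars
# (additional AWS SDK dependency jars may be required as well)
--extra-jars  s3://my-artifacts-bucket/jars/iceberg-spark-runtime-3.3_2.12-1.3.0.jar
```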

The following script is used to load data from Amazon S3 for a specific day, perform data transformation tasks, and then either merge the results with an existing Apache Iceberg table or create a new one, depending on the presence of a specific table in AWS Glue Data Catalog. Besides the Amazon S3 Bucket, this script also requires the creation of the AWS Glue Data Catalog database, which in our case is called 'apache_iceberg_showcase'. As a result, the job will save the data in Amazon S3 and update the AWS Glue Data Catalog table with its schema:

Sample Python script in AWS Glue Job leverages Apache Spark to transform JSON data from the Raw Data Zone into Apache Iceberg format in the Curated Data Zone, simultaneously updating the AWS Glue Data Catalog
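The original script is included in the article as a screenshot. Below is a minimal, hypothetical sketch of how such a job could be structured: the Iceberg-on-Glue catalog settings follow the pattern documented by AWS, while the bucket names, the ingestion_day job argument and the product_id merge key are assumptions made for illustration.

```python
# Hypothetical AWS Glue job (PySpark): raw JSON Lines -> Apache Iceberg table.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "ingestion_day"])  # e.g. ingestion_day=01

# Register the AWS Glue Data Catalog as the Iceberg catalog named "glue_catalog".
conf = SparkConf()
conf.set("spark.sql.extensions",
         "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.catalog-impl",
         "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.warehouse",
         "s3://curated-data-zone-bucket/warehouse/")  # assumption

sc = SparkContext(conf=conf)
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

DATABASE = "apache_iceberg_showcase"  # must already exist in the Glue Data Catalog
TABLE = f"glue_catalog.{DATABASE}.products"

# Read one day of raw JSON Lines data from the Raw Data Zone.
raw_path = f"s3://raw-data-zone-bucket/products/day={args['ingestion_day']}/"  # assumption
products = spark.read.json(raw_path)
products.createOrReplaceTempView("incoming_products")

# Any cleaning / transformation logic would go here.

existing_tables = [row.tableName for row in
                   spark.sql(f"SHOW TABLES IN glue_catalog.{DATABASE}").collect()]

if "products" not in existing_tables:
    # First run: create the Iceberg table from the incoming data.
    products.writeTo(TABLE).using("iceberg").create()
else:
    # Subsequent runs: upsert incoming records into the existing table.
    spark.sql(f"""
        MERGE INTO {TABLE} AS target
        USING incoming_products AS source
        ON target.product_id = source.product_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

job.commit()
```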

After the AWS Glue Job is executed successfully, our data is stored in an Amazon S3 Bucket, and it's ready to be queried in Amazon Athena as an Apache Iceberg table listed in the AWS Glue Data Catalog.

The picture below illustrates how the products table schema appears in the AWS Glue Data Catalog table after successful AWS Glue job execution.

AWS Glue Data Catalog console showing the Apache Iceberg table created and updated by the AWS Glue job

This is how Apache Iceberg keeps data in the Amazon S3 bucket, leveraging immutable file formats like Parquet and Avro. It is also why we call Apache Iceberg a table format: it layers table semantics on top of open source data files stored in an Amazon S3 bucket.

Amazon S3 Bucket structure in a curated data zone
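As a rough sketch, the table's prefix in the curated zone could look like this (the bucket and warehouse path are assumptions; the data/ and metadata/ split is how Iceberg lays out a table):

```
s3://curated-data-zone-bucket/warehouse/apache_iceberg_showcase.db/products/
├── data/          # Parquet files holding the table rows
│   └── 00000-0-...-00001.parquet
└── metadata/      # table metadata files, manifest lists and manifests
    ├── 00001-...-.metadata.json
    ├── snap-...-1.avro
    └── ...-m0.avro
```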

The data directory (data layer) works in conjunction with the metadata directory (metadata layer). While the data directory contains the raw data, the metadata directory contains information about the table layout, the schema, and the partitioning config, as well as the snapshots of the table's contents. The metadata allows for efficient data querying and supports the time travel feature, making it easy to query the table at different points in time. The Iceberg catalog, in turn, stores the pointer to the current table metadata file, which in our case is the AWS Glue Data Catalog table.

Apache Iceberg table overview matched with AWS services, showing metadata and data layers (source)

Once we've stored the data in the Apache Iceberg format, we can begin to query it. We'll start with a basic query in Amazon Athena. This query uses the products table from the AWS Glue Data Catalog.

In this instance, we've simply selected all records. This appears similar to a standard query made in Amazon Athena.
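For illustration, such a query could be as simple as the following sketch (the database name comes from the article, the table name from the screenshots):

```sql
SELECT * FROM "apache_iceberg_showcase"."products";
```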

Amazon Athena console with a SQL query that previews all of the data stored in the Apache Iceberg products table

But hold on, there's more. The Apache Iceberg table format allows us to easily update these tables, a feature that wasn't nearly as straightforward in the traditional data lake architecture.

Updating Apache Iceberg table data

Suppose we want to revise the price of Product A. Below is the record in our dataset that we're planning to modify. We are going to put this record in the next partition under the Raw Data Zone Amazon S3 Bucket and re-run the same AWS Glue job shown before.

Updated products record of Product A stored in the Raw Data Zone under /day=02 partition.
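Continuing the illustrative sample data from earlier, the updated record could look like this (the new price value is an assumption; only the fact that Product A's price increases comes from the article):

```json
{"product_id": 1, "name": "Product A", "category": "Electronics", "price": 59.99, "updated_at": "2023-07-02T08:00:00Z"}
```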

So, after we've executed the AWS Glue job once again on the data added under the /day=02 partition in the Raw Data Zone of our Amazon S3 Bucket, we are ready for querying. If we run the same query again, the products table's current state will be updated to reflect the price increase shown below.

Amazon Athena console with a SQL query that previews all of the data stored in the Apache Iceberg products table after performing the update operation on Product A

Time travel: querying Apache Iceberg tables in Amazon Athena

Now, we want to check the past price of products. Apache Iceberg makes this really easy. Because of its integration with Amazon Athena, we can use simple SQL commands to look back in time.

Querying an Apache Iceberg table using Amazon Athena

In the image below, there is a query used in Amazon Athena. This query targets an Apache Iceberg table. A special suffix, $history, is added to the table name to query its metadata. This allows us to see the history of actions performed on the table over time.
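A minimal sketch of such a metadata query in Amazon Athena could look like this (database and table names follow the earlier examples):

```sql
SELECT * FROM "apache_iceberg_showcase"."products$history";
```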

Amazon Athena console with a SQL query that shows the history of actions performed on the table over time

Once we know the exact timestamps of the table modifications, we can perform a time travel query. The image below displays the original state of the table before we increased the price of Product A, effectively moving us back in time.
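As a sketch, an Athena time travel query (supported on Athena engine version 3) could look like the following; the timestamp literal is illustrative and should be taken from the $history output, pointing just before the update:

```sql
SELECT * FROM "apache_iceberg_showcase"."products"
FOR TIMESTAMP AS OF TIMESTAMP '2023-07-01 10:00:00 UTC';
```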

Amazon Athena console with a SQL query that shows the previous price of Product A (timestamp before we updated the Apache Iceberg table)

Let's take a look at how the table changes after we've updated the price by adjusting the timestamp value.

Amazon Athena console with a SQL query that shows the updated price of Product A (timestamp after we updated the Apache Iceberg table)

As demonstrated, we can easily specify any given point in time to view the state of the table at that particular moment.

But what if we want to track when certain changes were made throughout the course of time?

Creating a changelog table with AWS Glue and Apache Iceberg

Let's now look at the AWS Glue job that creates the changelog table. Here, we need another AWS Glue job that uses the same job configuration as the one mentioned before; what differs is the code it runs.

Creation of the Apache Iceberg changelog view using an AWS Glue job and querying it with Amazon Athena

The code below generates the Apache Iceberg changelog view and saves it to an Amazon S3 bucket as plain Parquet, registered in the AWS Glue Data Catalog as the products_changelog table. Note that this time we are saving the data without the Apache Iceberg table format over it, for ad-hoc querying. The AWS Glue job execution allows us to compute the data changelog between specific snapshots or exact timestamps.

Sample Python script in the AWS Glue job that utilizes Apache Spark to run an Apache Iceberg procedure, creating a changelog table on Amazon S3 and updating the products_changelog table in the AWS Glue Data Catalog.
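The original script is again shown as a screenshot; the sketch below is one hypothetical way such a job could be written. It relies on Iceberg's create_changelog_view Spark procedure (available since Iceberg 1.2). The catalog settings mirror the previous job, while the bucket paths, the product_id identifier column, and the use of the Glue Data Catalog as the job's Hive metastore (for saveAsTable) are assumptions.

```python
# Hypothetical AWS Glue job (PySpark): build a changelog of the Iceberg products table
# and store it as plain Parquet for ad-hoc querying.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

# Same Iceberg-on-Glue catalog configuration as in the previous job.
conf = SparkConf()
conf.set("spark.sql.extensions",
         "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.catalog-impl",
         "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.warehouse",
         "s3://curated-data-zone-bucket/warehouse/")  # assumption

sc = SparkContext(conf=conf)
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Create a temporary changelog view over the products table. Snapshot ids or
# timestamps could be passed in the `options` map to restrict the time range.
spark.sql("""
    CALL glue_catalog.system.create_changelog_view(
        table => 'apache_iceberg_showcase.products',
        changelog_view => 'products_changelog_view',
        identifier_columns => array('product_id'),
        compute_updates => true
    )
""")

changelog = spark.sql("SELECT * FROM products_changelog_view")

# Persist the changelog as plain Parquet and register it in the Glue Data Catalog
# (assuming the job uses the Data Catalog as its Hive metastore).
(changelog.write
    .mode("overwrite")
    .option("path", "s3://curated-data-zone-bucket/changelog/products_changelog/")
    .saveAsTable("apache_iceberg_showcase.products_changelog"))

job.commit()
```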

After successful AWS Glue job execution, we can query the table from Amazon Athena and get a changelog of a specific record over time. We can see the history of Product A, which was modified earlier, along with the commit timestamps. We see the state of the record before and after a particular change was applied, indicated by the _change_type column.
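A sketch of such a query over the changelog table might look like this (the _change_type and _change_ordinal columns come from the Iceberg changelog view; the name column follows the illustrative sample data and is an assumption):

```sql
SELECT *
FROM "apache_iceberg_showcase"."products_changelog"
WHERE name = 'Product A'
ORDER BY _change_ordinal;
```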

Amazon Athena console with a SQL query that shows the changelog with commit timestamps

Summary

We've traveled back and forth in time with Apache Iceberg, taken a tour of the data lakehouse on AWS, and seen why it is such a big deal for the data game.

Traditional data lakes are big stores of data that don't change. However, modern businesses often need to update their data more frequently. To do this, we have data lakehouses. These blend features from both data lakes and data warehouses, allowing changes to be made to the data, such as adding or removing information.

Apache Iceberg is a table format that helps manage large amounts of data better. It can track changes over time, update data without rewriting everything, and make finding and accessing data easier. It also works well with AWS services, allowing data to be stored in a flexible and efficient way. This unlocks features like time travel, seamless handling of updates, and incremental data processing for data stored on Amazon S3.
