Time travel with Apache Iceberg tables on AWS
Exploring Apache Iceberg in AWS data lakehouses by implementing time travel for Iceberg tables in Amazon Athena.
Apache Iceberg
Amazon Athena
AWS Glue
Amazon S3
Python
In the ever-growing world of data, traditional data lakes were assumed to be immutable: once written, data remains unchanged. This idea came from their main use case, which involved storing huge amounts of raw data for historical analysis and reporting.
As businesses grow, some use cases become hard (or even too expensive) to build with existing toolsets. In recent years, the need to handle data mutations at big data scale, such as updates and deletes, has grown. This is where the data lakehouse comes in. By combining the capabilities of data lakes and data warehouses, lakehouses allow for transactional operations, like updates and deletes, similar to what we see in data warehouses. A lakehouse is designed to handle both structured and unstructured data, provides support for various data types, and offers the flexibility to run different types of analytics, from machine learning to business intelligence.
In essence: Data lake + data warehouse = data lakehouse.
The architecture of a data lakehouse (often called a “transactional data lake”) contains an essential component: a table format, such as Apache Iceberg, Apache Hudi, or Delta Lake. Table formats provide transactional capabilities, data versioning, rollback, time travel, and upsert features, so it’s easy to see why they could be considered crucial for most use cases.
What is Apache Iceberg?
Apache Iceberg is an open-source table format for large-scale data processing, initially developed by engineers at Netflix and Apple. Iceberg was created to address the limitations and challenges of earlier approaches to organizing data lake tables, such as the Apache Hive table format. Apache Iceberg adds ACID (atomicity, consistency, isolation, and durability) transactions, snapshot isolation, time travel, schema evolution and more. It is designed to provide efficient and scalable data storage and analytics capabilities, particularly for big data workloads.

Iceberg provides a table format abstraction that allows users to work with data using familiar SQL-like semantics. It captures rich metadata information about the dataset when individual data files are created. Iceberg tables consist of three layers: the Iceberg catalog, the metadata layer, and the data layer leveraging immutable file formats like Parquet, Avro and ORC.
Key features of Apache Iceberg
- Transactions: Apache Iceberg brings ACID transactional guarantees to data lakes on Amazon S3. If you’re currently struggling with multiple processes overwriting the same dataset in your data lake, this solves the problem!
- Time travel: Iceberg maintains a history of table snapshots. This feature is useful for data auditing, debugging, and data recovery - you can easily query your data as it existed at different points in time. No data loss and no additional work required to implement that feature!
- Incremental processing: Iceberg supports efficient incremental processing by tracking changes made to a table over time. This enables data processing frameworks like Apache Spark to perform incremental operations, such as appends and updates, without scanning the entire dataset. This saves money and time, because ETL processes need fewer resources to run and spend less time processing.
- Schema evolution: Iceberg supports schema evolution by allowing users to add, remove, or modify columns in a table without rewriting the entire dataset. This makes it easier to evolve data models over time. If your schema changes frequently, this is the simplest way to handle it (see the sketch after this list).
- Partitioning: Iceberg allows users to partition data based on one or more columns, improving query performance by eliminating the need to scan the entire dataset. Partitioning enables efficient data pruning and filtering: most cloud query engines bill by the amount of data scanned, so scanning less data means a lower cloud bill!
- Data catalog integration: Iceberg integrates with popular data catalogs like Apache Hive and AWS Glue Data Catalog. It leverages the catalog’s metadata management capabilities, making it easier to discover and access Iceberg tables. You can integrate your existing data ecosystem with Iceberg.
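To make the schema evolution point more concrete, here is a minimal sketch of how such a change could look in Spark SQL. It assumes a Spark session that already has the Iceberg extensions and a Glue-backed catalog configured; the catalog, database, table, and column names simply match the example used later in this post and are illustrative only:

from pyspark.sql import SparkSession

# Illustrative sketch: assumes the Iceberg libraries are on the classpath and an
# Iceberg catalog named "glue_catalog" is already configured for this session.
spark = SparkSession.builder.getOrCreate()

# Add a new optional column; existing data files are not rewritten.
spark.sql("""
    ALTER TABLE glue_catalog.apache_iceberg_showcase.products
    ADD COLUMN brand string
""")

# Rename a column; Iceberg tracks columns by ID, so older snapshots remain readable.
spark.sql("""
    ALTER TABLE glue_catalog.apache_iceberg_showcase.products
    RENAME COLUMN brand TO manufacturer
""")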
But to fully unlock the potential of Apache Iceberg, we need to place it in a fully integrated environment. This is where AWS steps in.
Apache Iceberg on AWS
Apache Iceberg works with data processing frameworks like Apache Spark, Flink, Hive, and Presto, and with AWS services like Amazon Athena, Amazon EMR, and AWS Glue. These AWS services, combined with Iceberg, support a data lakehouse architecture with the data stored in an Amazon S3 Bucket and the metadata in the AWS Glue Data Catalog.
The following diagram demonstrates how we can approach it on AWS:

The process starts with a data lake that functions as a primary repository for raw, unprocessed data. This initial stage contains three integral components:
- Amazon S3 (Simple Storage Service) is represented as the actual storage for the raw data within the data lake.
- AWS Glue Data Catalog serves as the centralized metadata repository, which stores metadata related to data assets stored in Amazon S3 and enables querying the data using Amazon Athena.
- AWS Lake Formation is shown as the security and governance layer, which defines the permissions for data access and use in the data lake.
Following this, arrows signify data transfer from the data lake to one of the three processing components. We can use any of these tools to gather our data and put it into the next stage:
- AWS Glue Job is represented as the serverless data integration service that prepares and loads the data for analytics using popular frameworks such as Apache Spark.
- Amazon EMR (Elastic MapReduce) is depicted as the cloud-native big data platform, used for processing large datasets using popular frameworks.
- AWS Glue Studio is shown as the visual interface to create, run, and monitor extract, transform, and load (ETL) jobs.
All three tools perform data transformation, cleaning, and loading operations and are capable of saving processed data in Apache Iceberg table format.
The last stage is the final data lakehouse, which is effectively the same infrastructure stack as the initial data lake (Amazon S3, AWS Glue Data Catalog, AWS Lake Formation), but with data stored in the optimized Apache Iceberg table format. Such lakehouses can then be consumed by downstream services, such as:
- Amazon Athena — a serverless analytics service that supports open table formats and offers flexible analysis of data stored in Amazon S3 Buckets. Athena supports read, time travel, write, and DDL queries for Apache Iceberg tables that use the Apache Parquet format for data and the AWS Glue catalog for their metastore.
- Amazon QuickSight — a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud, and is used for data visualization and business reporting.
- Amazon SageMaker — a fully managed machine learning service, allowing developers and data scientists to quickly build, train, and deploy machine learning models.
Show time!
After this brief introduction, we will now demonstrate the basic features of a data lakehouse architecture on AWS using Apache Iceberg, an open-source table format for large datasets.
Please note that this section is neither an end-to-end tutorial to follow, nor a deep dive into specific Iceberg features. Our focus here is to provide a basic walkthrough of Iceberg features and give you a sense of the core components used in a data lakehouse on AWS. The writing skips some boilerplate details, letting you focus only on the important parts.
We start with ingesting data into the Apache Iceberg realm. Then, we’ll show how you can query that data from Amazon Athena while using the time travel feature. Lastly, we will demonstrate how to generate a table with differences between each version of your datasets using the changelog mechanism of Apache Iceberg and Apache Spark, also known as change data capture.

Saving raw data to an Apache Iceberg data lakehouse
We will now focus on the first part of the process, which is mainly handled by an AWS Glue job. Its main purpose is to collect raw, unprocessed data and then transform it into the Apache Iceberg format, which is used as the core of the data lakehouse.

Let’s assume we already have some data in our Raw Data Zone Amazon S3 Bucket that we want to process and analyze. For storage purposes, we utilize the JSON Lines format, which is a popular option for raw data. The data we used describes products and looks as follows:
{
    "product_id": "29e17633-8d1e-4d63-8291-7a34fd79a4e5",
    "name": "Product A",
    "category": "Electronics",
    "variants": [
        {
            "color": "black",
            "size": "M",
            "stock": 100
        },
        {
            "color": "white",
            "size": "L",
            "stock": 75
        }
    ],
    "specifications": {
        "weight": "1kg",
        "dimensions": "10x5x2cm"
    },
    "ratings": {
        "average": 4.5,
        "reviews": 300
    },
    "price": {
        "retail": 100,
        "discounted": 90
    }
}
The product record stored under the /day=01 partition

In our Raw Data Zone, we partitioned the data by the ingestion date. So each day we put our ingestion results under a specific prefix in Amazon S3, which looks like this:
raw-zone/
└── products/
    └── year=2023/
        ├── month=05/
        │   ├── day=01/
        │   │   └── data.json
        │   ├── day=02/
        │   │   └── data.json
        │   └── (...)
        └── (...)
Now, let’s transfer our data into what’s called a curated data zone: a data lakehouse where we’ll store it in the Apache Iceberg table format, ready for further analytics. For that, we are going to use an AWS Glue job written in Python that uses PySpark (Apache Spark) to read and write the data. The AWS Glue job needs to be configured to load the necessary libraries for Apache Iceberg. This can be done in three different ways:
- AWS Glue Job parameter called --datalake-formats set to iceberg as a value. This automatically loads the Apache Iceberg version that comes with the AWS Glue version; for instance, AWS Glue 4.0 comes with version 1.0.0 of Apache Iceberg.
- Apache Iceberg Connector for AWS Glue, which is available with a free AWS Marketplace subscription. However, the version of Apache Iceberg is pre-set and not always the most recent one. At the time of writing, the latest version was 1.2.1.
- Manually downloading the official Apache Iceberg jar files, uploading them to an Amazon S3 Bucket, and referencing them in the AWS Glue Job using the --extra-jars parameter. This allows you to choose any Apache Iceberg version. We used this method to access the latest features of Apache Iceberg, version 1.3.0 at the time of writing.
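Whichever of these three options provides the Iceberg libraries, the Spark session still has to be told about an Iceberg catalog backed by the AWS Glue Data Catalog. The snippet below is a minimal sketch of setting that configuration directly in the job script; the warehouse bucket name is a placeholder, and the same settings can alternatively be passed through the job's --conf parameter:

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Sketch of the Spark settings Iceberg needs in order to use the AWS Glue Data Catalog;
# "your-bucket" is a placeholder for the curated-zone bucket.
conf = SparkConf()
conf.set("spark.sql.extensions",
         "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.catalog-impl",
         "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://your-bucket/curated-zone/")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session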
The following script is used to load data from Amazon S3 for a specific day, perform data transformation tasks, and then either merge the results with an existing Apache Iceberg table or create a new one, depending on whether the table already exists in the AWS Glue Data Catalog. Besides the Amazon S3 Bucket, this script also requires an AWS Glue Data Catalog database, which in our case is called apache_iceberg_showcase. As a result, the job will save the data in Amazon S3 and update the AWS Glue Data Catalog table with its schema:
import sys
import boto3
from pyspark.sql.functions import concat_ws, lpad
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize Spark and Glue context
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Retrieve and initialize job parameters
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Function to check if a given table exists in the database
def table_exist(database, table_name):
    client = boto3.client("glue")
    try:
        response = client.get_table(DatabaseName=database, Name=table_name)
        return True
    except client.exceptions.EntityNotFoundException:
        return False

# Define catalog, schema, and table name
catalog_name = "glue_catalog"
database_name = "apache_iceberg_showcase"
table_name = "products"

# Load data from S3 into a DataFrame
df = spark.read.format("json").load("s3:///raw-zone/products/year=2023/month=05/day=01/")

# Print the schema of the dataframe
df.printSchema()

# Check if the table exists
if table_exist(database_name, table_name):
    # Create a temporary view for the dataframe
    temporary_view = "TempView"
    df.createOrReplaceTempView(temporary_view)

    # Perform the UPSERT operation using SQL statement
    spark.sql(f"""
        MERGE INTO {catalog_name}.{database_name}.{table_name} AS target
        USING {temporary_view} AS source
        ON target.product_id = source.product_id
        WHEN MATCHED THEN
            UPDATE SET
                target.name = source.name,
                target.category = source.category,
                target.variants = source.variants,
                target.specifications = source.specifications,
                target.ratings = source.ratings,
                target.price = source.price
        WHEN NOT MATCHED THEN
            INSERT *
    """)
else:
    # If the table doesn't exist, create the table and perform the INSERT operation
    df.writeTo(f"{catalog_name}.{database_name}.{table_name}") \
        .tableProperty("format-version", "2") \
        .tableProperty("location", "s3:///curated-zone/products") \
        .create()

# Commit the job
job.commit()
After the AWS Glue Job is executed successfully, our data is stored in an Amazon S3 Bucket, and it’s ready to be queried in Amazon Athena as an Apache Iceberg table listed in the AWS Glue Data Catalog.
The picture below illustrates how the products table schema appears in the AWS Glue Data Catalog after a successful AWS Glue job execution.

The listing below shows how Apache Iceberg keeps data in the Amazon S3 Bucket, leveraging immutable file formats like Parquet and Avro. This is why we call Apache Iceberg a table format: it does all of its work on top of open-source file formats stored in the Amazon S3 Bucket.
curated-zone/
└── products/
    ├── data/
    │   └── 00000-(...)-00001.parquet
    └── metadata/
        ├── 00000-(...)-7c99bf9d1216.metadata.json
        ├── 5b9bf671-(...)-06c3dd2fe777.avro
The data directory (data layer) works in conjunction with the metadata directory (metadata layer). While the data directory contains the raw data files, the metadata directory contains information about the table layout, the schema, and the partitioning configuration, as well as snapshots of the table’s contents. This metadata allows for efficient data querying and supports the time travel feature, making it easy to query the table at different points in time. The Iceberg catalog, in turn, stores the pointer to the current table metadata file; in our case, that role is played by the AWS Glue Data Catalog table.
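To get a feeling for what this metadata contains, the sketch below reads one of the metadata.json files directly from the bucket with boto3 and prints its snapshot information. The bucket name is a placeholder, and the object key follows the layout shown above:

import json
import boto3

# Illustrative sketch: the bucket name and metadata file name are placeholders
# that follow the curated-zone layout shown above.
s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="your-curated-zone-bucket",
    Key="curated-zone/products/metadata/00000-(...)-7c99bf9d1216.metadata.json",
)
metadata = json.loads(obj["Body"].read())

# The current snapshot is what "normal" queries see; older snapshots stay
# available for time travel until they are expired.
print("current snapshot:", metadata["current-snapshot-id"])
for snapshot in metadata.get("snapshots", []):
    print(snapshot["snapshot-id"], snapshot["timestamp-ms"], snapshot["summary"]["operation"])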

Once we’ve stored the data in the Apache Iceberg format, we can begin to query it. We’ll start with a basic query in Amazon Athena that uses the products table from the AWS Glue Data Catalog.
In this instance, we’ve simply selected all records. This appears similar to a standard query made in Amazon Athena.
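For completeness, the same query can also be submitted programmatically through the Athena API. Below is a minimal boto3 sketch; the query results output location is a placeholder bucket:

import time
import boto3

# Minimal sketch of running the same query through the Athena API;
# the query result location is a placeholder bucket.
athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString='SELECT * FROM "apache_iceberg_showcase"."products"',
    QueryExecutionContext={"Database": "apache_iceberg_showcase"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([field.get("VarCharValue") for field in row["Data"]])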

The products table queried in Amazon Athena

But hold on, there’s more. The Apache Iceberg table format allows us to easily update these tables, a feature that wasn’t nearly as straightforward in the traditional data lake architecture.
Updating data in Apache Iceberg tables
Suppose we want to revise the price of Product A. Below is the record in our dataset that we’re planning to modify. We are going to put this record in the next partition under the Raw Data Zone Amazon S3 Bucket and re-run the same AWS Glue job shown before.
{
    "product_id": "29e17633-8d1e-4d63-8291-7a34fd79a4e5",
    "name": "Product A",
    "category": "Electronics",
    "variants": [
        {
            "color": "black",
            "size": "M",
            "stock": 100
        },
        {
            "color": "white",
            "size": "L",
            "stock": 75
        }
    ],
    "specifications": {
        "weight": "1kg",
        "dimensions": "10x5x2cm"
    },
    "ratings": {
        "average": 4.5,
        "reviews": 300
    },
    "price": {
        "retail": 120, <- NEW PRICE (WAS 100)
        "discounted": 100 <- NEW PRICE (WAS 90)
    }
}
The updated product record stored under the /day=02 partition

So, after we’ve executed the AWS Glue job once again on the data added under the /day=02 partition in the Raw Data Zone of our Amazon S3 Bucket, we are ready for querying. If we run the same query again, the products table’s current state will be updated to reflect the price increase shown below.

The products table after performing the update operation on Product A
Time travel: querying Apache Iceberg tables in Amazon Athena
Now, we want to check the past price of products. Apache Iceberg makes this really easy. Because of its integration with Amazon Athena, we can use simple SQL commands to look back in time.

The image below shows a query used in Amazon Athena. This query targets an Apache Iceberg table. A special suffix, $history, is added to the table name to query its metadata. This allows us to see the history of actions performed on the table over time.
SELECT * FROM "apache_iceberg_showcase"."products$history";

Once we know the exact timestamp of when the table was modified, we can finally perform a time travel query.
SELECT *
FROM "apache_iceberg_showcase"."products"
FOR TIMESTAMP AS OF TIMESTAMP '2023-06-14 13:49:00 UTC'
WHERE name = 'Product A';

Product A (timestamp before we updated the Apache Iceberg table)

The image shows the original condition of the table, i.e. before we increased the price of Product A. In other words, we’ve effectively moved back in time.
Let’s take a look at how the table’s state changes once we adjust the timestamp to a moment after the price update.
SELECT *
FROM "apache_iceberg_showcase"."products"
FOR TIMESTAMP AS OF TIMESTAMP '2023-06-14 13:51:00 UTC'
WHERE name = 'Product A';

Product A (timestamp after we updated the Apache Iceberg table)

As demonstrated, we can easily specify any given point in time to view the state of the table at that particular moment.
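The same kind of time travel is also available from Apache Spark itself, which can be handy inside AWS Glue jobs. Here is a minimal sketch, assuming a Spark 3.3+ session already configured with the glue_catalog Iceberg catalog; the snapshot ID and the timestamp are placeholders:

from pyspark.sql import SparkSession

# Illustrative sketch only: assumes a Spark 3.3+ session that is already
# configured with the Iceberg "glue_catalog" catalog, as in the AWS Glue jobs above.
spark = SparkSession.builder.getOrCreate()

# Time travel by snapshot ID (placeholder value taken from the $history/$snapshots output).
spark.sql("""
    SELECT * FROM glue_catalog.apache_iceberg_showcase.products
    VERSION AS OF 8355263591683472575
    WHERE name = 'Product A'
""").show()

# Time travel by timestamp (placeholder value).
spark.sql("""
    SELECT * FROM glue_catalog.apache_iceberg_showcase.products
    TIMESTAMP AS OF '2023-06-14 13:49:00'
    WHERE name = 'Product A'
""").show()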
But what if we want to track when certain changes were made throughout the course of time?
Creating a changelog table with AWS Glue and Apache Iceberg
Let’s now look at the AWS Glue job that creates the changelog table. For this, we need another AWS Glue job that uses the same job configuration as the one we mentioned before. What differs is the code it runs.

The code below generates the Apache Iceberg changelog view and saves it to an Amazon S3 bucket, registered via the AWS Glue Data Catalog as the products_changelog table. Note that this time we are saving the data as plain Parquet (without the Apache Iceberg table format on top of it) for ad-hoc querying. The AWS Glue job execution allows us to compute the data changelog between specific snapshots or exact timestamps.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize Spark and Glue context
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Retrieve and initialize job parameters
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Define catalog, schema, and table name
database_name = "apache_iceberg_showcase"
table_name = "products"

# Define Change Data Capture settings
changelog_table_name = "products_changelog"
identifier_columns = "product_id"
start_snapshot_id = "8355263591683472575"  # Value based on example
end_snapshot_id = "4192901519627873695"  # Value based on example

# Create a changelog view covering the changes between the two snapshots
spark.sql(f"""
    CALL glue_catalog.system.create_changelog_view(
        table => '{database_name}.{table_name}',
        options => map(
            'start-snapshot-id', '{start_snapshot_id}',
            'end-snapshot-id', '{end_snapshot_id}'
        ),
        changelog_view => '{changelog_table_name}',
        compute_updates => true,
        identifier_columns => array('{identifier_columns}')
    )
""")

# Save the changelog view as a plain Parquet table registered in the Glue Data Catalog
changelog_df = spark.sql(f"SELECT * FROM {changelog_table_name}")
changelog_df.write \
    .option("path", f"s3:///curated-zone/{changelog_table_name}") \
    .mode("append") \
    .saveAsTable(f"{database_name}.{changelog_table_name}")

job.commit()
The products_changelog table in the AWS Glue Data Catalog

After a successful AWS Glue job execution, we can query the table from Amazon Athena and get the changelog of a specific record over time. We can see the history of Product A, which was modified earlier, along with the commit timestamps. We also see the state of the record before and after a particular change was applied, as indicated by the _change_type column.
SELECT
changelog.product_id,
changelog.name,
changelog.price,
changelog._change_type,
changelog._commit_snapshot_id,
snapshots.committed_at
FROM "apache_iceberg_showcase"."products_changelog" AS changelog
INNER JOIN "apache_iceberg_showcase"."products$snapshots" AS snapshots
ON changelog._commit_snapshot_id = snapshots.snapshot_id
WHERE "name" = 'Product A'

Summary
We’ve traveled back and forth in time with Apache Iceberg, taken a tour of the data lakehouse on AWS, and seen why this approach is such a big deal for the data game.
Traditional data lakes are big stores of data that don’t change. However, modern businesses often need to update their data more frequently. To do this, we have data lakehouses. These blend features from both data lakes and data warehouses, allowing changes to be made to the data, such as adding or removing information.
Apache Iceberg is a tool that helps manage large amounts of data better. It can track changes over time, change data without rewriting everything, and make finding and accessing data easier. Also, it works well with AWS services, allowing data to be stored in a flexible and efficient way. This unlocks features like time travel, seamless handling of updates, and incremental data processing for data stored on Amazon S3.