Time travel with Apache Iceberg tables on AWS
Exploring Apache Iceberg in AWS data lakehouses by implementing time travel for Iceberg tables in Amazon Athena.
Apache Iceberg
Amazon Athena
AWS Glue
Amazon S3
Python
In the ever-growing world of data, traditional data lakes were assumed to be immutable: once written, data remains unchanged. This idea came from their main use case, which involved storing huge amounts of raw data for historical analysis and reporting.
As businesses grow, some use cases become hard (or even too expensive) to build with existing toolsets. In recent years, the need to handle data mutations at big data scale, such as updates and deletes, has grown. This is where the data lakehouse comes in. By combining the capabilities of data lakes and data warehouses, lakehouses allow for transactional operations, like updates and deletes, similar to what we see in data warehouses. A lakehouse is designed to handle both structured and unstructured data, provides support for various data types, and offers the flexibility to run different types of analytics, from machine learning to business intelligence.
In essence: Data lake + data warehouse = data lakehouse.
The architecture of a data lakehouse (often called a “transactional data lake”) contains an essential component: a table format, such as Apache Iceberg, Apache Hudi, or Delta Lake. Table formats provide transactional capabilities, data versioning, rollback, time travel, and upsert features, so it’s easy to see why they could be considered crucial for most use cases.
What is Apache Iceberg?
Apache Iceberg is an open-source table format for large-scale data processing, initially developed by engineers at Netflix and Apple. Iceberg was created to address the limitations and challenges of earlier approaches to organizing data lake tables, such as the Apache Hive table format. Apache Iceberg adds ACID (atomicity, consistency, isolation, and durability) transactions, snapshot isolation, time travel, schema evolution and more. It is designed to provide efficient and scalable data storage and analytics capabilities, particularly for big data workloads.

Iceberg provides a table format abstraction that allows users to work with data using familiar SQL-like semantics. It captures rich metadata information about the dataset when individual data files are created. Iceberg tables consist of three layers: the Iceberg catalog, the metadata layer, and the data layer leveraging immutable file formats like Parquet, Avro and ORC.
Key features of Apache Iceberg
- Transactions: Apache Iceberg brings ACID transactional guarantees to data lakes on Amazon S3. If you’re currently struggling with multiple processes overwriting the same dataset in your data lake, this solves the problem!
- Time travel: Iceberg maintains a history of table snapshots. This feature is useful for data auditing, debugging, and data recovery - you can easily query your data as it existed at different points in time. No data loss and no additional work required to implement that feature!
- Incremental processing: Iceberg supports efficient incremental processing by tracking changes made to a table over time. This enables data processing frameworks like Apache Spark to perform incremental operations, such as appends and updates, without scanning the entire dataset. This saves money and time, because ETL processes need fewer resources to run and spend less time processing.
- Schema evolution: Iceberg supports schema evolution by allowing users to add, remove, or modify columns in a table without rewriting the entire dataset. This makes it easier to evolve data models over time. If your schema changes frequently, this is the simplest way to handle it (see the sketch after this list).
- Partitioning: Iceberg allows users to partition data based on one or more columns, improving query performance by eliminating the need to scan the entire dataset. Partitioning enables efficient data pruning and filtering: most cloud query engines bill by the amount of data scanned, so scanning less data means a lower cloud bill!
- Data catalog integration: Iceberg integrates with popular data catalogs like Apache Hive and AWS Glue Data Catalog. It leverages the catalog’s metadata management capabilities, making it easier to discover and access Iceberg tables. You can integrate your existing data ecosystem with Iceberg.
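To make the schema evolution point more concrete, here is a minimal sketch of how such a change could look in Spark SQL. It assumes a Spark session that already has the Iceberg extensions and a Glue-backed catalog configured; the catalog, database, table, and column names simply match the example used later in this post and are illustrative only:

from pyspark.sql import SparkSession

# Illustrative sketch: assumes the Iceberg libraries are on the classpath and an
# Iceberg catalog named "glue_catalog" is already configured for this session.
spark = SparkSession.builder.getOrCreate()

# Add a new optional column; existing data files are not rewritten.
spark.sql("""
    ALTER TABLE glue_catalog.apache_iceberg_showcase.products
    ADD COLUMN brand string
""")

# Rename a column; Iceberg tracks columns by ID, so older snapshots remain readable.
spark.sql("""
    ALTER TABLE glue_catalog.apache_iceberg_showcase.products
    RENAME COLUMN brand TO manufacturer
""")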
But to fully unlock the potential of Apache Iceberg, we need to place it in a fully integrated environment. This is where AWS steps in.
Apache Iceberg on AWS
Apache Iceberg works with data processing frameworks like Apache Spark, Flink, Hive, and Presto, and with AWS services like Amazon Athena, Amazon EMR, and AWS Glue. These AWS services, combined with Iceberg, support a data lakehouse architecture with the data stored in an Amazon S3 Bucket and the metadata in the AWS Glue Data Catalog.
The following diagram demonstrates how we can approach it on AWS:

The process starts with a data lake that functions as a primary repository for raw, unprocessed data. This initial stage contains three integral components:
- Amazon S3 (Simple Storage Service) is represented as the actual storage for the raw data within the data lake.
- AWS Glue Data Catalog serves as the centralized metadata repository, which stores metadata related to data assets stored in Amazon S3 and enables querying the data using Amazon Athena.
- AWS Lake Formation is shown as the security and governance layer, which defines the permissions for data access and use in the data lake.
Following this, arrows signify data transfer from the data lake to one of the three processing components. We can use any of these tools to gather our data and put it into the next stage:
- AWS Glue Job is represented as the serverless data integration service that prepares and loads the data for analytics using popular frameworks such as Apache Spark.
- Amazon EMR (Elastic MapReduce) is depicted as the cloud-native big data platform, used for processing large datasets using popular frameworks.
- AWS Glue Studio is shown as the visual interface to create, run, and monitor extract, transform, and load (ETL) jobs.
All three tools perform data transformation, cleaning, and loading operations and are capable of saving processed data in Apache Iceberg table format.
The last stage is the final data lakehouse, which is effectively the same infrastructure stack as the initial data lake (Amazon S3, AWS Glue Data Catalog, AWS Lake Formation), but with data stored in the optimized Apache Iceberg table format. Such lakehouses can then be consumed by downstream services, such as:
- Amazon Athena — a serverless analytics service that supports open table formats and offers flexible analysis of data stored in Amazon S3 Buckets. Athena supports read, time travel, write, and DDL queries for Apache Iceberg tables that use the Apache Parquet format for data and the AWS Glue catalog for their metastore.
- Amazon QuickSight — a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud, and is used for data visualization and business reporting.
- Amazon SageMaker — a fully managed machine learning service, allowing developers and data scientists to quickly build, train, and deploy machine learning models.
Show time!
After this brief introduction, we will now demonstrate the basic features of a data lakehouse architecture on AWS using Apache Iceberg, an open-source table format for large datasets.
Please note that this section is neither an end-to-end tutorial to follow, nor a deep dive into specific Iceberg features. Our focus here is to provide a basic walkthrough of Iceberg features and give you a sense of the core components used in a data lakehouse on AWS. The writing skips some boilerplate details, letting you focus only on the important parts.
We start with ingesting data into the Apache Iceberg realm. Then, we’ll show how you can query that data from Amazon Athena while using the time travel feature. Lastly, we will demonstrate how to generate a table with differences between each version of your datasets using the changelog mechanism of Apache Iceberg and Apache Spark, also known as change data capture.

Saving raw data to an Apache Iceberg data lakehouse
We will now focus on the first part of the process, which is mainly handled by an AWS Glue job. Its main purpose is to collect raw, unprocessed data and then transform it into the Apache Iceberg format, which is used as the core of the data lakehouse.

Let’s assume we already have some data in our Raw Data Zone Amazon S3 Bucket that we want to process and analyze. For storage purposes, we utilize the JSON Lines format, which is a popular option for raw data. The data we used describes products and looks as follows:
{
    "product_id": "29e17633-8d1e-4d63-8291-7a34fd79a4e5",
    "name": "Product A",
    "category": "Electronics",
    "variants": [
        {
            "color": "black",
            "size": "M",
            "stock": 100
        },
        {
            "color": "white",
            "size": "L",
            "stock": 75
        }
    ],
    "specifications": {
        "weight": "1kg",
        "dimensions": "10x5x2cm"
    },
    "ratings": {
        "average": 4.5,
        "reviews": 300
    },
    "price": {
        "retail": 100,
        "discounted": 90
    }
}
The product record stored under the /day=01 partition

In our Raw Data Zone, we partitioned the data by the ingestion date. So each day we put our ingestion results under a specific prefix in Amazon S3, which looks like this:
raw-zone/
└── products/
    └── year=2023/
        ├── month=05/
        │   ├── day=01/
        │   │   └── data.json
        │   ├── day=02/
        │   │   └── data.json
        │   └── (...)
        └── (...)
Now, let’s transfer our data into what’s called a curated data zone: a data lakehouse where we’ll store it in the Apache Iceberg table format, ready for further analytics. For that, we are going to use an AWS Glue job written in Python that uses PySpark (Apache Spark) to read and write the data. The AWS Glue job needs to be configured to load the necessary libraries for Apache Iceberg. This can be done in three different ways:
- AWS Glue Job parameter called --datalake-formats set to iceberg as a value. This automatically loads the Apache Iceberg version that comes with the AWS Glue version; for instance, AWS Glue 4.0 comes with version 1.0.0 of Apache Iceberg.
- Apache Iceberg Connector for AWS Glue, which is available with a free AWS Marketplace subscription. However, the version of Apache Iceberg is pre-set and not always the most recent one. At the time of writing, the latest version was 1.2.1.
- Manually downloading the official Apache Iceberg jar files, uploading them to an Amazon S3 Bucket, and referencing them in the AWS Glue Job using the --extra-jars parameter. This allows you to choose any Apache Iceberg version. We used this method to access the latest features of Apache Iceberg, version 1.3.0 at the time of writing.
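Whichever of these three options provides the Iceberg libraries, the Spark session still has to be told about an Iceberg catalog backed by the AWS Glue Data Catalog. The snippet below is a minimal sketch of setting that configuration directly in the job script; the warehouse bucket name is a placeholder, and the same settings can alternatively be passed through the job's --conf parameter:

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Sketch of the Spark settings Iceberg needs in order to use the AWS Glue Data Catalog;
# "your-bucket" is a placeholder for the curated-zone bucket.
conf = SparkConf()
conf.set("spark.sql.extensions",
         "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.catalog-impl",
         "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://your-bucket/curated-zone/")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session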
The following script is used to load data from Amazon S3 for a specific day, perform data transformation tasks, and then either merge the results with an existing Apache Iceberg table or create a new one, depending on whether the table already exists in the AWS Glue Data Catalog. Besides the Amazon S3 Bucket, this script also requires an AWS Glue Data Catalog database, which in our case is called apache_iceberg_showcase. As a result, the job will save the data in Amazon S3 and update the AWS Glue Data Catalog table with its schema:
import sys
import boto3
from pyspark.sql.functions import concat_ws, lpad
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize Spark and Glue context
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Retrieve and initialize job parameters
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Function to check if a given table exists in the database
def table_exist(database, table_name):
    client = boto3.client("glue")
    try:
        response = client.get_table(DatabaseName=database, Name=table_name)
        return True
    except client.exceptions.EntityNotFoundException:
        return False

# Define catalog, schema, and table name
catalog_name = "glue_catalog"
database_name = "apache_iceberg_showcase"
table_name = "products"

# Load data from S3 into a DataFrame
df = spark.read.format("json").load("s3:///raw-zone/products/year=2023/month=05/day=01/")

# Print the schema of the dataframe
df.printSchema()

# Check if the table exists
if table_exist(database_name, table_name):
    # Create a temporary view for the dataframe
    temporary_view = "TempView"
    df.createOrReplaceTempView(temporary_view)

    # Perform the UPSERT operation using SQL statement
    spark.sql(f"""
        MERGE INTO {catalog_name}.{database_name}.{table_name} AS target
        USING {temporary_view} AS source
        ON target.product_id = source.product_id
        WHEN MATCHED THEN
            UPDATE SET
                target.name = source.name,
                target.category = source.category,
                target.variants = source.variants,
                target.specifications = source.specifications,
                target.ratings = source.ratings,
                target.price = source.price
        WHEN NOT MATCHED THEN
            INSERT *
    """)
else:
    # If the table doesn't exist, create the table and perform the INSERT operation
    df.writeTo(f"{catalog_name}.{database_name}.{table_name}") \
        .tableProperty("format-version", "2") \
        .tableProperty("location", "s3:///curated-zone/products") \
        .create()

# Commit the job
job.commit()
After the AWS Glue Job is executed successfully, our data is stored in an Amazon S3 Bucket, and it’s ready to be queried in Amazon Athena as an Apache Iceberg table listed in the AWS Glue Data Catalog.
The picture below illustrates how the products table schema appears in the AWS Glue Data Catalog after a successful AWS Glue job execution.

The listing below shows how Apache Iceberg keeps data in the Amazon S3 Bucket, leveraging immutable file formats like Parquet and Avro. This is why we call Apache Iceberg a table format: it does all of its work on top of open-source file formats stored in the Amazon S3 Bucket.
curated-zone/
└── products/
    ├── data/
    │   └── 00000-(...)-00001.parquet
    └── metadata/
        ├── 00000-(...)-7c99bf9d1216.metadata.json
        ├── 5b9bf671-(...)-06c3dd2fe777.avro
The data directory (data layer) works in conjunction with the metadata directory (metadata layer). While the data directory contains the raw data files, the metadata directory contains information about the table layout, the schema, and the partitioning configuration, as well as snapshots of the table’s contents. This metadata allows for efficient data querying and supports the time travel feature, making it easy to query the table at different points in time. The Iceberg catalog, in turn, stores the pointer to the current table metadata file; in our case, that role is played by the AWS Glue Data Catalog table.
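To get a feeling for what this metadata contains, the sketch below reads one of the metadata.json files directly from the bucket with boto3 and prints its snapshot information. The bucket name is a placeholder, and the object key follows the layout shown above:

import json
import boto3

# Illustrative sketch: the bucket name and metadata file name are placeholders
# that follow the curated-zone layout shown above.
s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="your-curated-zone-bucket",
    Key="curated-zone/products/metadata/00000-(...)-7c99bf9d1216.metadata.json",
)
metadata = json.loads(obj["Body"].read())

# The current snapshot is what "normal" queries see; older snapshots stay
# available for time travel until they are expired.
print("current snapshot:", metadata["current-snapshot-id"])
for snapshot in metadata.get("snapshots", []):
    print(snapshot["snapshot-id"], snapshot["timestamp-ms"], snapshot["summary"]["operation"])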

Once we’ve stored the data in the Apache Iceberg format, we can begin to query it. We’ll start with a basic query in Amazon Athena that uses the products table from the AWS Glue Data Catalog.
In this instance, we’ve simply selected all records. This appears similar to a standard query made in Amazon Athena.
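For completeness, the same query can also be submitted programmatically through the Athena API. Below is a minimal boto3 sketch; the query results output location is a placeholder bucket:

import time
import boto3

# Minimal sketch of running the same query through the Athena API;
# the query result location is a placeholder bucket.
athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString='SELECT * FROM "apache_iceberg_showcase"."products"',
    QueryExecutionContext={"Database": "apache_iceberg_showcase"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([field.get("VarCharValue") for field in row["Data"]])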

The products table queried in Amazon Athena

But hold on, there’s more. The Apache Iceberg table format allows us to easily update these tables, a feature that wasn’t nearly as straightforward in the traditional data lake architecture.
Updating data in Apache Iceberg tables
Suppose we want to revise the price of Product A. Below is the record in our dataset that we’re planning to modify. We are going to put this record in the next partition under the Raw Data Zone Amazon S3 Bucket and re-run the same AWS Glue job shown before.
{
    "product_id": "29e17633-8d1e-4d63-8291-7a34fd79a4e5",
    "name": "Product A",
    "category": "Electronics",
    "variants": [
        {
            "color": "black",
            "size": "M",
            "stock": 100
        },
        {
            "color": "white",
            "size": "L",
            "stock": 75
        }
    ],
    "specifications": {
        "weight": "1kg",
        "dimensions": "10x5x2cm"
    },
    "ratings": {
        "average": 4.5,
        "reviews": 300
    },
    "price": {
        "retail": 120, <- NEW PRICE (WAS 100)
        "discounted": 100 <- NEW PRICE (WAS 90)
    }
}
The updated product record stored under the /day=02 partition

So, after we’ve executed the AWS Glue job once again on the data added under the /day=02 partition in the Raw Data Zone of our Amazon S3 Bucket, we are ready for querying. If we run the same query again, the products table’s current state will be updated to reflect the price increase shown below.

The products table after performing the update operation on Product A
Time travel: querying Apache Iceberg tables in Amazon Athena
Now, we want to check the past price of products. Apache Iceberg makes this really easy. Because of its integration with Amazon Athena, we can use simple SQL commands to look back in time.

The image below shows a query used in Amazon Athena. This query targets an Apache Iceberg table. A special suffix, $history, is added to the table name to query its metadata. This allows us to see the history of actions performed on the table over time.
SELECT * FROM "apache_iceberg_showcase"."products$history";

Once we know the exact timestamp of when the table was modified, we can finally perform a time travel query.
SELECT *
FROM "apache_iceberg_showcase"."products"
FOR TIMESTAMP AS OF TIMESTAMP '2023-06-14 13:49:00 UTC'
WHERE name = 'Product A';

Product A (timestamp before we updated the Apache Iceberg table)

The image shows the original condition of the table, i.e. before we increased the price of Product A. In other words, we’ve effectively moved back in time.
Let’s take a look at how the table’s state changes once we adjust the timestamp to a moment after the price update.
SELECT *
FROM "apache_iceberg_showcase"."products"
FOR TIMESTAMP AS OF TIMESTAMP '2023-06-14 13:51:00 UTC'
WHERE name = 'Product A';

Product A (timestamp after we updated the Apache Iceberg table)

As demonstrated, we can easily specify any given point in time to view the state of the table at that particular moment.
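The same kind of time travel is also available from Apache Spark itself, which can be handy inside AWS Glue jobs. Here is a minimal sketch, assuming a Spark 3.3+ session already configured with the glue_catalog Iceberg catalog; the snapshot ID and the timestamp are placeholders:

from pyspark.sql import SparkSession

# Illustrative sketch only: assumes a Spark 3.3+ session that is already
# configured with the Iceberg "glue_catalog" catalog, as in the AWS Glue jobs above.
spark = SparkSession.builder.getOrCreate()

# Time travel by snapshot ID (placeholder value taken from the $history/$snapshots output).
spark.sql("""
    SELECT * FROM glue_catalog.apache_iceberg_showcase.products
    VERSION AS OF 8355263591683472575
    WHERE name = 'Product A'
""").show()

# Time travel by timestamp (placeholder value).
spark.sql("""
    SELECT * FROM glue_catalog.apache_iceberg_showcase.products
    TIMESTAMP AS OF '2023-06-14 13:49:00'
    WHERE name = 'Product A'
""").show()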
But what if we want to track when certain changes were made throughout the course of time?
Creating a changelog table with AWS Glue and Apache Iceberg
Let’s now look at the AWS Glue job that creates the changelog table. For this, we need another AWS Glue job that uses the same job configuration as the one we mentioned before. What differs is the code it runs.

The code below generates the Apache Iceberg changelog view and saves it to an Amazon S3 bucket, registered via the AWS Glue Data Catalog as the products_changelog table. Note that this time we are saving the data as plain Parquet (without the Apache Iceberg table format on top of it) for ad-hoc querying. The AWS Glue job execution allows us to compute the data changelog between specific snapshots or exact timestamps.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize Spark and Glue context
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Retrieve and initialize job parameters
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Define catalog, schema, and table name
database_name = "apache_iceberg_showcase"
table_name = "products"

# Define Change Data Capture settings
changelog_table_name = "products_changelog"
identifier_columns = "product_id"
start_snapshot_id = "8355263591683472575"  # Value based on example
end_snapshot_id = "4192901519627873695"  # Value based on example

# Create a changelog view covering the changes between the two snapshots
spark.sql(f"""
    CALL glue_catalog.system.create_changelog_view(
        table => '{database_name}.{table_name}',
        options => map(
            'start-snapshot-id', '{start_snapshot_id}',
            'end-snapshot-id', '{end_snapshot_id}'
        ),
        changelog_view => '{changelog_table_name}',
        compute_updates => true,
        identifier_columns => array('{identifier_columns}')
    )
""")

# Save the changelog view as a plain Parquet table registered in the Glue Data Catalog
changelog_df = spark.sql(f"SELECT * FROM {changelog_table_name}")
changelog_df.write \
    .option("path", f"s3:///curated-zone/{changelog_table_name}") \
    .mode("append") \
    .saveAsTable(f"{database_name}.{changelog_table_name}")

job.commit()
The products_changelog table in the AWS Glue Data Catalog

After a successful AWS Glue job execution, we can query the table from Amazon Athena and get the changelog of a specific record over time. We can see the history of Product A, which was modified earlier, along with the commit timestamps. We also see the state of the record before and after a particular change was applied, as indicated by the _change_type column.
SELECT
changelog.product_id,
changelog.name,
changelog.price,
changelog._change_type,
changelog._commit_snapshot_id,
snapshots.committed_at
FROM "apache_iceberg_showcase"."products_changelog" AS changelog
INNER JOIN "apache_iceberg_showcase"."products$snapshots" AS snapshots
ON changelog._commit_snapshot_id = snapshots.snapshot_id
WHERE "name" = 'Product A'

Summary
We’ve traveled back and forth in time with Apache Iceberg, taken a tour of the data lakehouse on AWS, and seen why this approach is such a big deal for the data game.
Traditional data lakes are big stores of data that don’t change. However, modern businesses often need to update their data more frequently. To do this, we have data lakehouses. These blend features from both data lakes and data warehouses, allowing changes to be made to the data, such as adding or removing information.
Apache Iceberg is a tool that helps manage large amounts of data better. It can track changes over time, change data without rewriting everything, and make finding and accessing data easier. Also, it works well with AWS services, allowing data to be stored in a flexible and efficient way. This unlocks features like time travel, seamless handling of updates, and incremental data processing for data stored on Amazon S3.