Enhance Apache Iceberg Query Performance with Sort and Z-Order Compaction on Amazon S3
Amazon Web Services (AWS) continues to evolve its offerings to enhance data management and query performance. A significant update has arrived for Apache Iceberg, a popular open table format for managing large-scale analytical datasets, particularly on Amazon Simple Storage Service (Amazon S3). This update introduces two new compaction strategies aimed at optimizing query performance: sort and z-order compaction. Let’s delve into these enhancements and explore how they can benefit users managing data on Amazon S3.
Understanding Apache Iceberg and Its Application
Apache Iceberg is a high-performance table format designed specifically for managing large-scale datasets. It is particularly well-suited for environments where data is frequently ingested or updated, such as data lakes. Iceberg tables are often used in conjunction with AWS Glue Data Catalog or S3 Tables to support various use cases including concurrent streaming and batch ingestion, schema evolution, and even time travel, which allows users to query past versions of their data.
However, one of the challenges users face when dealing with high-ingest datasets or frequently updated data is the accumulation of numerous small files. These small files can significantly impact both the cost and performance of queries. To address this, Amazon S3 has introduced two advanced compaction strategies: sort and z-order.
Introducing Sort and Z-Order Compaction
Amazon S3’s new compaction strategies aim to optimize data layout and improve query efficiency. These strategies are available for both fully managed S3 Tables and Iceberg tables stored in general-purpose S3 buckets, leveraging AWS Glue Data Catalog optimizations.
Sort Compaction
Sort compaction organizes files based on a user-defined column order. This means that if your tables have a defined sort order, the compaction process will cluster similar values together. By organizing data in this way, query execution becomes more efficient because fewer files need to be scanned. For instance, if a table is sorted by columns such as state and zip_code, queries filtering on these columns will benefit from reduced latency and lower query engine costs.
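Why sorting helps can be shown with a toy Python model of min/max file pruning (this is an illustration of the general technique, not S3's internal implementation): each data file records per-column lower and upper bounds, and a query engine skips any file whose bounds cannot match the filter. When data lands in ingestion order, nearly every file's bounds overlap the filter value; after sorting by `state`, only the files that actually hold that state are scanned.

```python
# Toy model of min/max file pruning. Each "file" keeps lower/upper bounds
# for the state column; a query skips files whose bounds cannot match.
# Column names (state, zip_code) follow the example in the text above.

def make_files(rows, rows_per_file):
    """Split rows into 'files' and record each file's min/max state value."""
    files = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]
    return [(min(r[0] for r in f), max(r[0] for r in f), f) for f in files]

def files_scanned(files, state):
    """Count files whose [lower, upper] bound range could contain `state`."""
    return sum(1 for lo, hi, _ in files if lo <= state <= hi)

# 400 rows arriving in ingestion order: states interleaved per zip code.
rows = [(s, z) for z in range(100) for s in ("CA", "NY", "TX", "WA")]

unsorted_files = make_files(rows, 40)        # ingestion order: bounds overlap
sorted_files = make_files(sorted(rows), 40)  # sorted by (state, zip_code)

print(files_scanned(unsorted_files, "NY"))   # 10 (every file overlaps "NY")
print(files_scanned(sorted_files, "NY"))     # 3 (only files that hold "NY")
```

With unsorted data all 10 files must be read for a `state = 'NY'` filter; after sorting, bound-based pruning cuts that to 3.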
Z-Order Compaction
Z-order compaction takes optimization a step further by enabling efficient file pruning across multiple dimensions. It interleaves the binary representation of values from multiple columns into a single scalar that can be sorted. This strategy is especially useful for spatial or multidimensional queries. For example, if your workload includes queries filtering by pickup_location, dropoff_location, and fare_amount simultaneously, z-order compaction can significantly reduce the number of files scanned compared to traditional sort-based layouts.
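The interleaving idea can be sketched in a few lines of Python. This is a minimal Morton-code demonstration of the general technique, not S3's implementation: real engines normalize and interleave full column encodings, while this toy uses small unsigned integers for clarity.

```python
# Minimal sketch of the bit interleaving behind z-ordering: two column
# values are merged bit-by-bit into one sortable integer, so rows that are
# close in BOTH dimensions end up close together in the sorted layout.

def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x supplies the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y supplies the odd bit positions
    return z

# Neighbouring 2-D points get nearby z-values, unlike a plain sort on x alone:
points = [(0, 0), (1, 0), (0, 1), (1, 1), (7, 7)]
print(sorted(points, key=lambda p: z_value(*p)))
# → [(0, 0), (1, 0), (0, 1), (1, 1), (7, 7)]
```

Because the z-value keeps multidimensional neighbors together, min/max pruning stays effective for filters on any of the interleaved columns, which is what makes z-order attractive for queries on several dimensions at once.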
How to Implement Sort and Z-Order Compaction
To utilize these new compaction strategies, users need to update their table maintenance configurations. For S3 Tables, the Iceberg table metadata determines the current sort order. If a table has a sort order defined, sort compaction is automatically applied during ongoing maintenance without any additional configuration. To use z-order compaction, users must update the table maintenance configuration via the S3 Tables API and set the strategy to z-order.
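As a rough sketch of what that configuration call might look like from Python using the boto3 `s3tables` client: the ARN, account, namespace, and table names below are placeholders, and the exact shape of the `value` payload (including the accepted strategy string) is an assumption that should be verified against the current S3 Tables API reference.

```python
# Hedged sketch: enable z-order compaction for one table via the S3 Tables
# maintenance API. Requires AWS credentials; all names and the payload
# fields are illustrative, not confirmed values from the article.
import boto3

s3tables = boto3.client("s3tables")

s3tables.put_table_maintenance_configuration(
    tableBucketARN="arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-example-bucket",
    namespace="testnamespace",
    name="testtable",
    type="icebergCompaction",
    value={
        "status": "enabled",
        "settings": {"icebergCompaction": {"strategy": "z-order"}},
    },
)
```

The equivalent change can also be made with the `aws s3tables put-table-maintenance-configuration` CLI command; switching the strategy back to sort is the same call with a different strategy value.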
For Iceberg tables stored in general-purpose S3 buckets, the AWS Glue Data Catalog can be configured to apply sort or z-order compaction during optimization by updating the compaction settings.
It’s important to note that only new data written after enabling sort or z-order compaction will be affected. Existing compacted files remain unchanged unless explicitly rewritten by increasing the target file size in table maintenance settings or using standard Iceberg tools.
Practical Example: Applying Sort Compaction with Apache Spark
Let’s walk through a simplified example using Apache Spark and the AWS Command Line Interface (CLI). Suppose you have a Spark cluster and an S3 table bucket containing a table named testtable in a namespace named testnamespace. Initially, you might temporarily disable compaction while adding data to the table, so the raw ingest layout stays visible. After adding data, inspecting the table’s file structure reveals multiple small files with overlapping upper and lower bounds, indicating unsorted data.
By setting the table sort order using Spark SQL and enabling table compaction through the AWS CLI, the compaction job will eventually trigger. Once the compaction process completes, the files within the table will be reorganized into fewer, larger files, showing that data was sorted efficiently.
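The two steps above might look as follows from a PySpark session. This is a sketch, not a runnable standalone script: it assumes a live Spark session (`spark`) configured with the Iceberg Spark runtime and SQL extensions, and the catalog name `s3tablesbucket` is a placeholder for whatever catalog is wired to your table bucket.

```python
# Hedged sketch: define a sort order on the table, then eagerly rewrite
# existing files. Assumes an existing SparkSession `spark` with the Iceberg
# runtime configured; the catalog name `s3tablesbucket` is a placeholder.

# Step 1: declare the table's write sort order. Iceberg's Spark SQL
# extensions support WRITE ORDERED BY for this.
spark.sql("""
    ALTER TABLE s3tablesbucket.testnamespace.testtable
    WRITE ORDERED BY state ASC, zip_code ASC
""")

# Step 2 (optional): ongoing maintenance will compact new data, but
# already-written files can be rewritten now with Iceberg's standard
# rewrite_data_files procedure using the sort strategy.
spark.sql("""
    CALL s3tablesbucket.system.rewrite_data_files(
        table => 'testnamespace.testtable',
        strategy => 'sort'
    )
""")
```

After enabling table compaction through the AWS CLI as described above, subsequent maintenance runs will honor this sort order without further Spark-side intervention.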
To use z-order compaction, similar steps are followed, but the strategy is set to z-order in the maintenance configuration.
Availability and Cost Considerations
Sort and z-order compaction are available in all AWS Regions where Amazon S3 Tables are supported, as well as for general-purpose S3 buckets optimized with AWS Glue Data Catalog. There is no additional charge for S3 Tables beyond existing usage and maintenance fees. However, for Data Catalog optimizations, compute charges apply during the compaction process.
Performance Improvements and User Experience
With these advancements, queries filtering on sort or z-order columns experience faster scan times and reduced engine costs. Users have reported performance improvements of threefold or more when switching from the default binpack strategy to sort or z-order compaction, depending on data layout and query patterns. AWS encourages users to share their experiences and performance gains when applying these new strategies to their data.
For those interested in exploring these new compaction strategies further, AWS offers detailed documentation and resources. The Amazon S3 Tables product page and the S3 Tables maintenance documentation provide valuable insights. Users can also start testing the new strategies on their own tables using the S3 Tables API or AWS Glue optimizations.
In summary, the introduction of sort and z-order compaction strategies for Apache Iceberg on Amazon S3 marks a significant enhancement in data management and query performance. These strategies offer users more efficient data organization, reduced query latency, and potentially lower costs, making them a valuable addition to AWS’s suite of data management tools.
For more information, refer to this article.