Iceberg AWS Integrations

Iceberg provides integration with different AWS services through the iceberg-aws module. This section describes how to use Iceberg with AWS.

Enabling AWS Integration

The iceberg-aws module is bundled with Spark and Flink engine runtimes for all versions from 0.11.0 onwards. However, the AWS clients are not bundled, so that you can use the same client version as your application. You need to provide the AWS SDK v2, because that is what Iceberg depends on. You can use the AWS SDK bundle or, for a minimal dependency footprint, individual AWS client packages (Glue, S3, DynamoDB, KMS, STS).

All the default AWS clients use the Apache HTTP Client for HTTP connection management. This dependency is not part of the AWS SDK bundle and needs to be added separately. To choose a different HTTP client library, such as the URL Connection HTTP Client, see the client customization section.

All AWS module features can be loaded through custom catalog properties. See each engine's documentation for how to load a custom catalog. Here are some examples.

Spark

For example, to use AWS features with Spark 3.4 (with Scala 2.12) and the AWS clients (which are packaged in the iceberg-aws-bundle), you can start the Spark SQL shell with:

# start Spark SQL client shell
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.11.0,org.apache.iceberg:iceberg-aws-bundle:1.11.0 \
    --conf spark.sql.defaultCatalog=my_catalog \
    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
    --conf spark.sql.catalog.my_catalog.type=glue \
    --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO

In the shell command above, --packages specifies the additional iceberg-aws-bundle, which contains all relevant AWS dependencies.

Flink

To use the AWS module with Flink, download the necessary dependencies and specify them when starting the Flink SQL client:

# download Iceberg dependency
ICEBERG_VERSION=1.11.0
MAVEN_URL=https://repo1.maven.org/maven2
ICEBERG_MAVEN_URL=$MAVEN_URL/org/apache/iceberg

wget $ICEBERG_MAVEN_URL/iceberg-flink-runtime/$ICEBERG_VERSION/iceberg-flink-runtime-$ICEBERG_VERSION.jar

wget $ICEBERG_MAVEN_URL/iceberg-aws-bundle/$ICEBERG_VERSION/iceberg-aws-bundle-$ICEBERG_VERSION.jar

# start Flink SQL client shell
/path/to/bin/sql-client.sh embedded \
    -j iceberg-flink-runtime-$ICEBERG_VERSION.jar \
    -j iceberg-aws-bundle-$ICEBERG_VERSION.jar \
    shell

With those dependencies, you can create a Flink catalog like the following:

CREATE CATALOG my_catalog WITH (
  'type'='iceberg',
  'warehouse'='s3://my-bucket/my/key/prefix',
  'catalog-type'='glue',
  'io-impl'='org.apache.iceberg.aws.s3.S3FileIO'
);

You can also specify the catalog configurations in sql-client-defaults.yaml to preload it:

catalogs: 
  - name: my_catalog
    type: iceberg
    warehouse: s3://my-bucket/my/key/prefix
    catalog-type: glue
    io-impl: org.apache.iceberg.aws.s3.S3FileIO

Hive

To use the AWS module with Hive, download the necessary dependencies as in the Flink example, and then either add them to the Hive classpath or add the jars at runtime in the CLI:

add jar /my/path/to/iceberg-hive-runtime.jar;
add jar /my/path/to/aws/bundle.jar;

With those dependencies, you can register a Glue catalog and create external tables in Hive at runtime in the CLI:

SET iceberg.engine.hive.enabled=true;
SET hive.vectorized.execution.enabled=false;
SET iceberg.catalog.glue.type=glue;
SET iceberg.catalog.glue.warehouse=s3://my-bucket/my/key/prefix;

-- suppose you have an Iceberg table database_a.table_a created by GlueCatalog
CREATE EXTERNAL TABLE database_a.table_a
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
TBLPROPERTIES ('iceberg.catalog'='glue');

You can also preload the catalog by setting the configurations above in hive-site.xml.

Catalogs

Users can choose among several options for building an Iceberg catalog with AWS.

Glue Catalog

Iceberg enables the use of AWS Glue as the Catalog implementation. When used, an Iceberg namespace is stored as a Glue Database, an Iceberg table is stored as a Glue Table, and every Iceberg table version is stored as a Glue TableVersion. You can start using the Glue catalog by setting catalog-impl to org.apache.iceberg.aws.glue.GlueCatalog or by setting catalog-type to glue, as shown in the Enabling AWS Integration section above. More details about loading the catalog can be found in the individual engine pages, such as Spark and Flink.

Glue Catalog ID

There is a unique Glue metastore in each AWS account and each AWS region. By default, GlueCatalog chooses the Glue metastore to use based on the user's default AWS client credential and region setup. To point to a Glue catalog in a different AWS account, specify the Glue catalog ID through the glue.id catalog property. The Glue catalog ID is the numeric ID of the AWS account that owns the catalog. If the Glue catalog is in a different region, configure your AWS client to point to the correct region; see AWS client customization for more details.
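
As a minimal sketch of the cross-account case, the Spark example from the Enabling AWS Integration section can be extended with the glue.id property. The account ID 123456789012 is a placeholder, and GLUE_CATALOG_ID and SPARK_CONF are shell variable names chosen here for illustration only. The command is printed rather than executed so that it can be reviewed first.

```shell
# Placeholder: numeric ID of the AWS account that owns the Glue catalog.
GLUE_CATALOG_ID=123456789012

# Catalog settings from the Spark example, plus glue.id for the other account.
SPARK_CONF="--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
--conf spark.sql.catalog.my_catalog.type=glue \
--conf spark.sql.catalog.my_catalog.glue.id=$GLUE_CATALOG_ID"

# Print the command for review; remove the leading echo to start the shell.
echo spark-sql \
    --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.11.0,org.apache.iceberg:iceberg-aws-bundle:1.11.0 \
    $SPARK_CONF
```

The credentials used by the AWS client still need permission to access the Glue catalog in the other account; glue.id only selects which catalog the requests target.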

Skip Archive

AWS Glue can archive older table versions, and a user can roll back the table to any historical version if needed. By default, the Iceberg Glue Catalog skips the archival of older table versions. To archive older table versions, set glue.skip-archive to false. Note that for streaming ingestion into Iceberg tables, setting glue.skip-archive to false quickly creates a large number of Glue table versions. For more details, see Glue Quotas and the UpdateTable API.
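
For a table that is written infrequently and where Glue-side version history is wanted, archiving can be turned back on. The following is a sketch under that assumption. It reuses the catalog name and warehouse path from the Spark example in the Enabling AWS Integration section, SPARK_CONF is an illustrative variable name, and the command is printed rather than executed.

```shell
# Catalog settings from the Spark example, with Glue archiving re-enabled.
# glue.skip-archive=false makes Glue keep older table versions.
SPARK_CONF="--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
--conf spark.sql.catalog.my_catalog.type=glue \
--conf spark.sql.catalog.my_catalog.glue.skip-archive=false"

# Print the command for review; remove the leading echo to start the shell.
echo spark-sql \
    --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.11.0,org.apache.iceberg:iceberg-aws-bundle:1.11.0 \
    $SPARK_CONF
```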

Skip Name Validation

Users can skip name validation for table names and namespaces. It is recommended to stick to Glue best practices to make sure operations are Hive compatible. This option exists only for users who have existing conventions that use non-standard characters. When database name and table name validation are skipped, there is no guarantee that all downstream systems will support the names.

Optimistic Locking

By default, Iceberg uses Glue's optimistic locking for concurrent updates to a table. With optimistic locking, each table has a version ID. When users retrieve the table metadata, Iceberg records the version ID of that table.