Spark DDL🔗

To use Iceberg in Spark, first configure Spark catalogs. Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations.

`CREATE TABLE`🔗

Spark 3 can create tables in any Iceberg catalog with the clause USING iceberg:

CREATE TABLE prod.db.sample (
    id bigint NOT NULL COMMENT 'unique id',
    data string)
USING iceberg;

Iceberg will convert the column type in Spark to corresponding Iceberg type. Please check the section of type compatibility on creating table for details.

Table create commands, including CTAS and RTAS, support the full range of Spark create clauses, including:

PARTITIONED BY (partition-expressions) to configure partitioning
LOCATION '(fully-qualified-uri)' to set the table location
COMMENT 'table documentation' to set a table description
TBLPROPERTIES ('key'='value', ...) to set table configuration

Create commands may also set the default format with the USING clause. This is only supported for SparkCatalog because Spark handles the USING clause differently for the built-in catalog.

CREATE TABLE ... LIKE ... syntax is not supported.

`PARTITIONED BY`🔗

To create a partitioned table, use PARTITIONED BY:

CREATE TABLE prod.db.sample (
    id bigint,
    data string,
    category string)
USING iceberg
PARTITIONED BY (category);

The PARTITIONED BY clause supports transform expressions to create hidden partitions.

CREATE TABLE prod.db.sample (
    id bigint,
    data string,
    category string,
    ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category);

Supported transformations are:

year(ts): partition by year
month(ts): partition by month
day(ts) or date(ts): equivalent to dateint partitioning
hour(ts) or date_hour(ts): equivalent to dateint and hour partitioning
bucket(N, col): partition by hashed value mod N buckets
truncate(L, col): partition by value truncated to L
- Strings are truncated to the given length
- Integers and longs truncate to bins: truncate(10, i) produces partitions 0, 10, 20, 30, ...

Note: Old syntax of years(ts), months(ts), days(ts) and hours(ts) are also supported for compatibility.

`CREATE TABLE ... AS SELECT`🔗

Iceberg supports CTAS as an atomic operation when using a SparkCatalog. CTAS is supported, but is not atomic when using SparkSessionCatalog.

CREATE TABLE prod.db.sample
USING iceberg
AS SELECT ...

The newly created table won't inherit the partition spec and table properties from the source table in SELECT, you can use PARTITIONED BY and TBLPROPERTIES in CTAS to declare partition spec and table properties for the new table.

CREATE TABLE prod.db.sample
USING iceberg
PARTITIONED BY (part)
TBLPROPERTIES ('key'='value')
AS SELECT ...

`REPLACE TABLE ... AS SELECT`🔗

Iceberg supports RTAS as an atomic operation when using a SparkCatalog. RTAS is supported, but is not atomic when using SparkSessionCatalog.

Atomic table replacement creates a new snapshot with the results of the SELECT query, but keeps table history.

REPLACE TABLE prod.db.sample
USING iceberg
AS SELECT ...

REPLACE TABLE prod.db.sample
USING iceberg
PARTITIONED BY (part)
TBLPROPERTIES ('key'='value')
AS SELECT ...

CREATE OR REPLACE TABLE prod.db.sample
USING iceberg
AS SELECT ...

The schema and partition spec will be replaced if changed. To avoid modifying the table's schema and partitioning, use INSERT OVERWRITE instead of REPLACE TABLE. The new table properties in the REPLACE TABLE command will be merged with any existing table properties. The existing table properties will be updated if changed else they are preserved.

`DROP TABLE`🔗

The drop table behavior changed in 0.14.

Prior to 0.14, running DROP TABLE would remove the table from the catalog and delete the table contents as well.

From 0.14 onwards, DROP TABLE would only remove the table from the catalog. In order to delete the table contents DROP TABLE PURGE should be used.

`DROP TABLE`🔗

To drop the table from the catalog, run:

DROP TABLE prod.db.sample;

`DROP TABLE PURGE`🔗

To drop the table from the catalog and delete the table's contents, run:

DROP TABLE prod.db.sample PURGE;

`ALTER TABLE`🔗

Iceberg has full ALTER TABLE support in Spark 3, including:

Renaming a table
Setting or removing table properties
Adding, deleting, and renaming columns
Adding, deleting, and renaming nested fields
Reordering top-level columns and nested struct fields
Widening the type of int, float, and decimal fields
Making required columns optional

In addition, SQL extensions can be used to add support for partition evolution and setting a table's write order

`ALTER TABLE ... RENAME TO`🔗

ALTER TABLE prod.db.sample RENAME TO prod.db.new_name;

Spark DDL🔗

CREATE TABLE🔗

PARTITIONED BY🔗

CREATE TABLE ... AS SELECT🔗

REPLACE TABLE ... AS SELECT🔗

DROP TABLE🔗

DROP TABLE🔗

DROP TABLE PURGE🔗

ALTER TABLE🔗

ALTER TABLE ... RENAME TO🔗

ALTER TABLE ... SET TBLPROPERTIES

`CREATE TABLE`🔗

`PARTITIONED BY`🔗

`CREATE TABLE ... AS SELECT`🔗

`REPLACE TABLE ... AS SELECT`🔗

`DROP TABLE`🔗

`DROP TABLE`🔗

`DROP TABLE PURGE`🔗

`ALTER TABLE`🔗

`ALTER TABLE ... RENAME TO`🔗

`ALTER TABLE ... SET TBLPROPERTIES`