Spark DDL🔗
To use Iceberg in Spark, first configure Spark catalogs. Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations.
CREATE TABLE🔗
Spark 3 can create tables in any Iceberg catalog with the clause USING iceberg:
Iceberg will convert the column type in Spark to corresponding Iceberg type. Please check the section of type compatibility on creating table for details.
Table create commands, including CTAS and RTAS, support the full range of Spark create clauses, including:
PARTITIONED BY (partition-expressions)to configure partitioningLOCATION '(fully-qualified-uri)'to set the table locationCOMMENT 'table documentation'to set a table descriptionTBLPROPERTIES ('key'='value', ...)to set table configuration
Create commands may also set the default format with the USING clause. This is only supported for SparkCatalog because Spark handles the USING clause differently for the built-in catalog.
CREATE TABLE ... LIKE ... syntax is not supported.
PARTITIONED BY🔗
To create a partitioned table, use PARTITIONED BY:
CREATE TABLE prod.db.sample (
id bigint,
data string,
category string)
USING iceberg
PARTITIONED BY (category);
The PARTITIONED BY clause supports transform expressions to create hidden partitions.
CREATE TABLE prod.db.sample (
id bigint,
data string,
category string,
ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category);
Supported transformations are:
year(ts): partition by yearmonth(ts): partition by monthday(ts)ordate(ts): equivalent to dateint partitioninghour(ts)ordate_hour(ts): equivalent to dateint and hour partitioningbucket(N, col): partition by hashed value mod N bucketstruncate(L, col): partition by value truncated to L- Strings are truncated to the given length
- Integers and longs truncate to bins:
truncate(10, i)produces partitions 0, 10, 20, 30, ...
Note: Old syntax of years(ts), months(ts), days(ts) and hours(ts) are also supported for compatibility.
CREATE TABLE ... AS SELECT🔗
Iceberg supports CTAS as an atomic operation when using a SparkCatalog. CTAS is supported, but is not atomic when using SparkSessionCatalog.
The newly created table won't inherit the partition spec and table properties from the source table in SELECT, you can use PARTITIONED BY and TBLPROPERTIES in CTAS to declare partition spec and table properties for the new table.
CREATE TABLE prod.db.sample
USING iceberg
PARTITIONED BY (part)
TBLPROPERTIES ('key'='value')
AS SELECT ...
REPLACE TABLE ... AS SELECT🔗
Iceberg supports RTAS as an atomic operation when using a SparkCatalog. RTAS is supported, but is not atomic when using SparkSessionCatalog.
Atomic table replacement creates a new snapshot with the results of the SELECT query, but keeps table history.
REPLACE TABLE prod.db.sample
USING iceberg
PARTITIONED BY (part)
TBLPROPERTIES ('key'='value')
AS SELECT ...
The schema and partition spec will be replaced if changed. To avoid modifying the table's schema and partitioning, use INSERT OVERWRITE instead of REPLACE TABLE.
The new table properties in the REPLACE TABLE command will be merged with any existing table properties. The existing table properties will be updated if changed else they are preserved.
DROP TABLE🔗
The drop table behavior changed in 0.14.
Prior to 0.14, running DROP TABLE would remove the table from the catalog and delete the table contents as well.
From 0.14 onwards, DROP TABLE would only remove the table from the catalog.
In order to delete the table contents DROP TABLE PURGE should be used.
DROP TABLE🔗
To drop the table from the catalog, run:
DROP TABLE PURGE🔗
To drop the table from the catalog and delete the table's contents, run:
ALTER TABLE🔗
Iceberg has full ALTER TABLE support in Spark 3, including:
- Renaming a table
- Setting or removing table properties
- Adding, deleting, and renaming columns
- Adding, deleting, and renaming nested fields
- Reordering top-level columns and nested struct fields
- Widening the type of
int,float, anddecimalfields - Making required columns optional
In addition, SQL extensions can be used to add support for partition evolution and setting a table's write order