Skip to content

Spark: Correct partition transform functions to match spec.md#8192

Merged
nastra merged 12 commits into
apache:masterfrom
clettieri:docs-fix-partition-transforms
Sep 25, 2023
Merged

Spark: Correct partition transform functions to match spec.md#8192
nastra merged 12 commits into
apache:masterfrom
clettieri:docs-fix-partition-transforms

Conversation

@clettieri

@clettieri clettieri commented Jul 31, 2023

Copy link
Copy Markdown
Contributor

Given the spec for the time transform functions, this PR adds the singular (instead of plural) names of the transform functions.

The spec specifies year, month, hour, day, but previously Spark only supported the plural version of these words (years, months, etc).

Hopefully this saves someone else a few cycles of debugging :)

@Fokko

Fokko commented Jul 31, 2023

Copy link
Copy Markdown
Contributor

@clettieri Thanks for raising this. This is a known issue, and I also run into it. However, maybe it is better to add day as an option to Spark since the spec is considered the source of truth.

Comment thread format/spec.md Outdated
| **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz` | `int` |
| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz` | `int` |
| **`hour`** | Extract a timestamp hour, as hours from 1970-01-01 00:00:00 | `timestamp`, `timestamptz` | `int` |
| **`years`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz` | `int` |

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might impact Trino, Dremio and other engines which has naming conventions according to spec.

https://trino.io/docs/current/connector/iceberg.html#partitioned-tables
https://docs.dremio.com/cloud/reference/sql/commands/create-table/

IMO, better to change the spark transform name and deprecate the old ones.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that.

Would it be useful to keep the unintended "backward" compatibility though? Or do you think just force the Spark API to match the spec?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to maintain backward compatibility. I would suggest adding the singular options and also keeping the plural ones as well. They don't conflict and avoids people to run into errors. Also curious about what others think of it.

@clettieri clettieri Aug 1, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added singular options to each Spark3 version. How does this look?

I can add some tests as well if you think that would be useful.

Also, should I change the prefix of this PR to Spark: ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add some tests as well if you think that would be useful.

You can add a test in TestAlterTablePartitionFields to make sure that we don't break this in the future. It could be something similar as testSparkTableAddDropPartitions where you use the singular transform names.

Also, should I change the prefix of this PR to Spark: ?

Yes please :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the tests, what if I replicate these but for some or all of year, month, hour, day ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would be perfect

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the tests.

I ran /gradlew :iceberg-spark:iceberg-spark-extensions-3.4_2.12:test --tests TestAlterTablePartitionFields locally and saw things pass. LMK if I can do anything else here (I'll try to catch up on the contributing guidelines tomorrow). Thanks!

@clettieri clettieri force-pushed the docs-fix-partition-transforms branch from 433c8bf to 10ddc9b Compare August 1, 2023 13:41
@github-actions github-actions Bot added the spark label Aug 1, 2023
Comment thread spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java Outdated
Comment thread spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/Spark3Util.java Outdated
@clettieri clettieri changed the title Docs: Correct partition transform functions on spec.md Spark: Correct partition transform functions on spec.md Aug 1, 2023
@clettieri clettieri changed the title Spark: Correct partition transform functions on spec.md Spark: Correct partition transform functions to match spec.md Aug 1, 2023

@Fokko Fokko left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also ran into this, more than one :3 Would be good to get this in! Thanks for working on this @clettieri

@Fokko Fokko added this to the Iceberg 1.4.0 milestone Aug 2, 2023
@RussellSpitzer

Copy link
Copy Markdown
Member

I would want @rdblue to check this. If I remember correctly we originally did this because the base implementation happened to be wrong so we had to add another implementation that took account for 0 correctly.

The other issue here is that the non-plural names are Spark native functions, is that an issue?

@nastra nastra requested a review from rdblue September 13, 2023 05:58
PartitionSpec expected =
PartitionSpec.builderFor(table.schema()).withSpecId(1).year("ts").build();

Assert.assertEquals("Should have new spec field", expected, table.spec());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be great to not add more JUnit4-style assertions (even if the underlying class already uses them). Can you please convert those to AssertJ? That will make migration to JUnit5 easier. See https://iceberg.apache.org/contribute/#testing for some details

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nastra updated the tests, let me know what you think

@aokolnychyi

Copy link
Copy Markdown
Contributor

@clettieri, sorry it took so long to get back to this PR.

Could you, please, add tests that the new name is also supported in CREATE TABLE statements too? I believe this change only covers ALTER TABLE. In addition, this PR should be rebased and include Spark 3.5.

@clettieri

clettieri commented Sep 20, 2023

Copy link
Copy Markdown
Contributor Author

@clettieri, sorry it took so long to get back to this PR.

Could you, please, add tests that the new name is also supported in CREATE TABLE statements too? I believe this change only covers ALTER TABLE. In addition, this PR should be rebased and include Spark 3.5.

Hey @aokolnychyi, I can rebase and include Spark 3.5 👍🏼 .

Regarding the CREATE TABLE tests though, I don't see any for the previous transform functions and am unclear what new functionality they would test compared to what we have now. The current tests seem sufficient IMO to validate the transform functions can be called. Am I missing something here? Should I create a new test class to verify creating tables with these transform functions succeed?

Thanks!

@clettieri clettieri force-pushed the docs-fix-partition-transforms branch from 569e5bf to 0b9c76e Compare September 20, 2023 17:59
@aokolnychyi

Copy link
Copy Markdown
Contributor

@clettieri, what about TestCreateTable? Unlike what is covered by new tests, we should also check the new syntax is supported by Spark without extensions.

@aokolnychyi

Copy link
Copy Markdown
Contributor

Looks like there are some spotless issues that fail the build.

@clettieri

Copy link
Copy Markdown
Contributor Author

Looks like there are some spotless issues that fail the build.

Yep, I see that. I'll try to get to that on Monday. I'm a bit new to the JVM/Spark world and was having some trouble switching Spark environments to run Gradle locally. :)

@aokolnychyi

Copy link
Copy Markdown
Contributor

Thank you, @clettieri!

@rdblue

rdblue commented Sep 24, 2023

Copy link
Copy Markdown
Contributor

I fixed a typo and opened clettieri#1 with the spotless changes.

@clettieri

Copy link
Copy Markdown
Contributor Author

I fixed a typo and opened clettieri#1 with the spotless changes.

Thank you for saving me the time this morning :)

@nastra

nastra commented Sep 25, 2023

Copy link
Copy Markdown
Contributor

thanks @clettieri

@nastra nastra merged commit c0bed74 into apache:master Sep 25, 2023
@ajantha-bhat

ajantha-bhat commented Sep 25, 2023

Copy link
Copy Markdown
Member

Thanks for fixing. I couldn't take a look at it again before.
LGTM.

But we missed to update the documentation in https://github.com/apache/iceberg/blob/master/docs/spark-ddl.md.

@nk1506 would you like to work on it?

nk1506 added a commit to nk1506/iceberg that referenced this pull request Sep 25, 2023
This particular apache#8192 has fixed the code but it seems documented is not in sync. Hence the follow up PR.
nk1506 added a commit to nk1506/iceberg that referenced this pull request Sep 25, 2023
This particular apache#8192 has fixed the code but it seems documented is not in sync. Hence the follow up PR.
nk1506 added a commit to nk1506/iceberg that referenced this pull request Sep 25, 2023
This particular apache#8192 has fixed the code but it seems documented is not in sync. Hence the follow up PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants