Spark: Initial support for 3.2 by aokolnychyi · Pull Request #3335 · apache/iceberg

aokolnychyi · 2021-10-21T01:42:53Z

This PR adds initial support for Spark 3.2 and contains 3 commits. The first one includes the initial move of classes and build-related changes. The second contains changes for the core functionality. The third one contains changes for extensions.

TODO items:

Support multiple Scala versions
Support row-level commands

kbendick

Still in the process of going through all of this but thank you so much for doing this!

kbendick · 2021-10-21T01:48:42Z

+          key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle') }}
+          restore-keys: ${{ runner.os }}-gradle
+      - run: echo -e "$(ip addr show eth0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)\t$(hostname -f) $(hostname -s)" | sudo tee -a /etc/hosts
+      - run: ./gradlew -DsparkVersions=${{ matrix.spark }} -DhiveVersions= -DflinkVersions= :iceberg-spark:iceberg-spark_3.2:check :iceberg-spark:iceberg-spark_3.2-extensions:check :iceberg-spark:iceberg-spark_3.2-runtime:check -Pquick=true -x javadoc


Unrelated: Are the -DflinkVersions= really necessary? I assume so. That's kind of a pain. I might take a stab at getting rid of that.

I copied it from Spark 3.0. I think it is done to disable Flink tests.

It might be? The default is to include Flink 1.13 so that we can run things like assemble to trigger compilation of all modules. Then we remove Flink by specifically setting the versions to an empty string where we don't want to even load the project, like in the "core" tests job where we want to run check but exclude Flink. I think otherwise we'd have to remove specific targets for Flink or call check for each module.

Here, since we are already calling check for each module individually we may not need it.

I am keeping it as is for consistency now. If we want to update, we should probably update all jobs.

pan3793 · 2021-10-21T02:18:52Z

 }

+if (sparkVersions.contains("3.2")) {
+  include ':iceberg-spark:spark_3.2'


Does this naming strategy conflict with scala?

This PR does not take care of multiple Scala versions. It is a follow-up item and I'd be glad if someone could work on that.

It's not about the support of multiple Scala versions, some applications like mvnrepository.com will identify the Scala version by such pattern from package name.

For reference, Elasticsearch Spark use suffix spark-16 spark-30, https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-30

If possible, I suggest not to break this rule, otherwise it will confuse users and some applications.

I assumed if we named it as spark_3.2_2.12, it would work. However, I have no issues updating to use spark-32 instead. I'll make the change.

I think we should get way from the existing naming and start using spark_3.2_2.12. That shouldn't have a Scala problem because the Scala version is the last.

This is a good idea. Scala 2.13 support will be coming pretty soon to Spark (I think the tests have been running on 2.13 for a bit).

How do you think spark-3.2_2.12? Usually, the package name is preferred to use - instead of _.

I'll use spark-3.2_2.12. That's a good compromise.

wypoon · 2021-10-21T22:42:38Z

Can we please keep the subdirectory naming the same? i.e.,

spark/
  v3.0/
    spark3/
    spark3-extensions/
    spark3-runtime/
  v3.2/
    spark3/
    spark3-extensions/
    spark3-runtime/

The project names can be different (:iceberg-spark:iceberg-spark3 and :iceberg-spark:iceberg-spark32, etc).
It'd be simpler to port as well as to diff code between Spark 3 versions.
(For consistency, I also prefer :iceberg-spark:iceberg-spark32, :iceberg-spark:iceberg-spark32-extensions and :iceberg-spark:iceberg-spark32-runtime to :iceberg-spark:iceberg-spark-32, :iceberg-spark:iceberg-spark-32-extensions and :iceberg-spark:iceberg-spark-32-runtime, but that's a minor suggestion.)

rdblue · 2021-10-21T23:08:07Z

+          key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle') }}
+          restore-keys: ${{ runner.os }}-gradle
+      - run: echo -e "$(ip addr show eth0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)\t$(hostname -f) $(hostname -s)" | sudo tee -a /etc/hosts
+      - run: ./gradlew -DsparkVersions=${{ matrix.spark }} -DhiveVersions= -DflinkVersions= :iceberg-spark:iceberg-spark-32:check :iceberg-spark:iceberg-spark-32-extensions:check :iceberg-spark:iceberg-spark-32-runtime:check -Pquick=true -x javadoc


I'd prefer to use the spark_3.2_2.12 naming scheme for modules. That would allow us to parameterize this with another reference to ${{ matrix.spark }} when calling the Spark version module's check task. Similarly, we'd be able to support Scala version in a follow-up.

I think the main argument for a module named iceberg-spark32 is consistency with the existing module names, but I don't think they are something we want to preserve now that we have more than modules for Spark major versions. It is more clear to use _3.2_2.13 everywhere that we can, vs relying on people knowing that 32 is actually Spark 3.2.

I'm fine with using _3.2_2.12 and _3.2_2.13 names for the modules.
For a developer running gradle locally on their machine, is there a single ./gradlew command that they can use with -DsparkVersions=3.0,3.2 or -DsparkVersions=3.2 that builds all the applicable Spark 3 modules or runs their tests (less general than build or check or build -x test)?

Regardless of the naming of the modules, can we still keep the subdirectory names under each Spark 3 version uniform? It's independent of the module names and I think it'd be good if the subdirectory names are the same for each version.

I'll drop versions from folder names to keep things consistent and potentially simplify cherry-picks.

I dropped versions for 3.2. I'll do the same for other versions to keep this PR smaller.

Yeah, I agree with keeping the same folder names to make moving changes around more easy.

rdblue · 2021-10-21T23:09:15Z

+  project(':iceberg-spark:spark-32').projectDir = file('spark/v3.2/spark-32')
+  project(':iceberg-spark:spark-32').name = 'iceberg-spark-32'
+  project(':iceberg-spark:spark-32-extensions').projectDir = file('spark/v3.2/spark-32-extensions')
+  project(':iceberg-spark:spark-32-extensions').name = 'iceberg-spark-32-extensions'


I thought we were going to merge extensions into the base module? Is that a follow-up?

I also thought about this but was not sure if it is worth the effort and whether it will make cherry-picks harder.

Right. We should either do it everywhere or nowhere. I'm leaning toward everywhere, but it doesn't matter much to me.

…d to spark/v3.0

wypoon · 2021-10-22T00:37:33Z

I created #3345 simply to enable easier visualization of the diff between the v3.0 code and the v3.2 code. Changes in the build are not reflected (the build change in #3345 is just a quick hack to enable building v3.0 against Spark 3.2).

wypoon · 2021-10-22T00:45:31Z

+    });
+
+    conf.forEach((confKey, confValue) -> {
+      if (!confKey.startsWith(SparkSQLProperties.PREFIX) && !sqlConf.isModifiable(confKey)) {


Isn't the Spark 3.2 equivalent of SQLConf.staticConfKeys().contains(confKey) simply SQLConf.isStaticConfigKey(confKey)?
This appears in AvroDataTest.java and IcebergSourceBenchmark.java as well.

Good point, I missed this. Updated all 3 places.

wypoon · 2021-10-22T03:54:11Z

+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        jvm: [11]


Is there a reason not to include Java 8? Afaik, Spark 3.2 is still supported on Java 8, just deprecated for Java 8 prior to version 8u201.

I had issues with it internally but it may work with OSS Spark. Added, let's see.

rdblue · 2021-10-22T19:45:57Z

Thanks @aokolnychyi! And thanks for reviewing, everyone!

aokolnychyi · 2021-10-22T22:25:56Z

Thanks, @wypoon @pan3793 @kbendick @rdblue!

Let me know if anyone has time to pick up the Scala version parameterization. I guess Spark 3.2 supports 2.12 and 2.13.

wypoon · 2021-10-22T23:14:04Z

@aokolnychyi I saw that in v3.2, quite a number of rules have been removed from IcebergSparkSessionExtensions (and the classes removed): AlignRowLevelOperations, RowLevelOperationsPredicateCheck, OptimizeConditionsInRowLevelOperations, PullupCorrelatedPredicatesInRowLevelOperations, RewriteDelete, RewriteUpdate, and RewriteMergeInto. I also saw that TestCopyOnWriteDelete, TestCopyOnWriteMerge and TestCopyOnWriteUpdate are all being skipped entirely. Do you plan follow up work rewriting support for delete/update/merge using Spark 3.2 features? As of this change, what functionality works in 3.0 but not in 3.2?
I understand that some of the removed classes are things that are no longer needed because new API is now available in Spark 3.2 (some already in 3.1), e.g., ExtendedBatchScanExec and the dynamic file filter stuff. But I'd like to understand the delete/update/merge support situation (please pardon ignorance, I'm not familiar with this area yet).

aokolnychyi · 2021-10-23T17:24:06Z

@wypoon, sure! Like I stated in the PR description, this change did not cover multiple Scala versions support and row-level commands. I am reworking row-level commands at this moment to also support merge-on-read and leverage native dynamic filtering in Spark. If I manage to make it work, then the implementation will be different and I did not want to invest time into porting existing extensions if we may rework them. To sum up, expect to see PRs for row-level commands and other 3.2 related things next week.

wypoon · 2021-10-23T19:40:19Z

@aokolnychyi thanks for elaborating.

rawataaryan9 · 2021-10-27T06:51:17Z

@aokolnychyi, Just wanted to get a better understanding with respect to row-level command support. In 3.2, are we scoping the changes within iceberg code base itself (PR you are planning this week) and then from 3.3 onwards the changes would go inside the spark code base which would be on the lines of the SPIP and the draft PR. I might be missing information here and wanted to better understand the approach.

github-actions Bot added build INFRA spark labels Oct 21, 2021

aokolnychyi force-pushed the oss-3.2-initial branch from 23ea69d to 58ce778 Compare October 21, 2021 01:48

kbendick reviewed Oct 21, 2021

View reviewed changes

aokolnychyi force-pushed the oss-3.2-initial branch from 58ce778 to da40f9f Compare October 21, 2021 02:00

pan3793 reviewed Oct 21, 2021

View reviewed changes

aokolnychyi force-pushed the oss-3.2-initial branch 2 times, most recently from b585d3e to 77c0ab0 Compare October 21, 2021 03:20

aokolnychyi added 3 commits October 20, 2021 20:23

Initial copy & build changes

044a22c

Migrate non-extensions code

5b35e40

Fix extensions

7644ddc

aokolnychyi force-pushed the oss-3.2-initial branch from 77c0ab0 to 7644ddc Compare October 21, 2021 03:24

rdblue reviewed Oct 21, 2021

View reviewed changes

wypoon added a commit to wypoon/iceberg that referenced this pull request Oct 22, 2021

[POC] Anton Okolnychyi's addition of spark/v3.2 in apache#3335 applie…

0d440e4

…d to spark/v3.0

wypoon reviewed Oct 22, 2021

View reviewed changes

aokolnychyi added 2 commits October 21, 2021 23:26

Review comments

a8cb59c

Fix import

e7b0417

rdblue approved these changes Oct 22, 2021

View reviewed changes

rdblue merged commit f5a7537 into apache:master Oct 22, 2021

aokolnychyi mentioned this pull request Oct 23, 2021

Spark: Use same folder structure for all versions #3360

Merged

KnightChess mentioned this pull request Nov 20, 2021

Spark 3.2.0 java.lang.IncompatibleClassChangeError when using IcebergSparkSessionExtensions #3585

Closed

racevedoo mentioned this pull request Nov 29, 2021

Broken dependency resolution in 0.13.0-SNAPSHOT and Spark 3.2 #3620

Closed

dramaticlly mentioned this pull request Jul 7, 2022

Missing DynamicFileFilter when use iceberg in Spark 3.2 #5224

Closed

Uh oh!

Conversation

aokolnychyi commented Oct 21, 2021

Uh oh!

kbendick left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wypoon commented Oct 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wypoon commented Oct 22, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdblue commented Oct 22, 2021

Uh oh!

aokolnychyi commented Oct 22, 2021

Uh oh!

wypoon commented Oct 22, 2021

Uh oh!

aokolnychyi commented Oct 23, 2021

Uh oh!

wypoon commented Oct 23, 2021

Uh oh!

rawataaryan9 commented Oct 27, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

wypoon commented Oct 21, 2021 •

edited

Loading