Skip to content

Spark: Initial support for 3.2#3335

Merged
rdblue merged 5 commits into
apache:masterfrom
aokolnychyi:oss-3.2-initial
Oct 22, 2021
Merged

Spark: Initial support for 3.2#3335
rdblue merged 5 commits into
apache:masterfrom
aokolnychyi:oss-3.2-initial

Conversation

@aokolnychyi

Copy link
Copy Markdown
Contributor

This PR adds initial support for Spark 3.2 and contains 3 commits. The first one includes the initial move of classes and build-related changes. The second contains changes for the core functionality. The third one contains changes for extensions.

TODO items:

  • Support multiple Scala versions
  • Support row-level commands

@kbendick kbendick left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still in the process of going through all of this but thank you so much for doing this!

Comment thread .github/workflows/spark-ci.yml Outdated
key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle') }}
restore-keys: ${{ runner.os }}-gradle
- run: echo -e "$(ip addr show eth0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)\t$(hostname -f) $(hostname -s)" | sudo tee -a /etc/hosts
- run: ./gradlew -DsparkVersions=${{ matrix.spark }} -DhiveVersions= -DflinkVersions= :iceberg-spark:iceberg-spark_3.2:check :iceberg-spark:iceberg-spark_3.2-extensions:check :iceberg-spark:iceberg-spark_3.2-runtime:check -Pquick=true -x javadoc

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated: Are the -DflinkVersions= really necessary? I assume so. That's kind of a pain. I might take a stab at getting rid of that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied it from Spark 3.0. I think it is done to disable Flink tests.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be? The default is to include Flink 1.13 so that we can run things like assemble to trigger compilation of all modules. Then we remove Flink by specifically setting the versions to an empty string where we don't want to even load the project, like in the "core" tests job where we want to run check but exclude Flink. I think otherwise we'd have to remove specific targets for Flink or call check for each module.

Here, since we are already calling check for each module individually we may not need it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am keeping it as is for consistency now. If we want to update, we should probably update all jobs.

Comment thread settings.gradle Outdated
}

if (sparkVersions.contains("3.2")) {
include ':iceberg-spark:spark_3.2'

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this naming strategy conflict with scala?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR does not take care of multiple Scala versions. It is a follow-up item and I'd be glad if someone could work on that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not about the support of multiple Scala versions, some applications like mvnrepository.com will identify the Scala version by such pattern from package name.

For reference, Elasticsearch Spark use suffix spark-16 spark-30, https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-30

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, I suggest not to break this rule, otherwise it will confuse users and some applications.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed if we named it as spark_3.2_2.12, it would work. However, I have no issues updating to use spark-32 instead. I'll make the change.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should get way from the existing naming and start using spark_3.2_2.12. That shouldn't have a Scala problem because the Scala version is the last.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea. Scala 2.13 support will be coming pretty soon to Spark (I think the tests have been running on 2.13 for a bit).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you think spark-3.2_2.12? Usually, the package name is preferred to use - instead of _.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll use spark-3.2_2.12. That's a good compromise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@aokolnychyi aokolnychyi force-pushed the oss-3.2-initial branch 2 times, most recently from b585d3e to 77c0ab0 Compare October 21, 2021 03:20
@wypoon

wypoon commented Oct 21, 2021

Copy link
Copy Markdown
Contributor

Can we please keep the subdirectory naming the same? i.e.,

spark/
  v3.0/
    spark3/
    spark3-extensions/
    spark3-runtime/
  v3.2/
    spark3/
    spark3-extensions/
    spark3-runtime/

The project names can be different (:iceberg-spark:iceberg-spark3 and :iceberg-spark:iceberg-spark32, etc).
It'd be simpler to port as well as to diff code between Spark 3 versions.
(For consistency, I also prefer :iceberg-spark:iceberg-spark32, :iceberg-spark:iceberg-spark32-extensions and :iceberg-spark:iceberg-spark32-runtime to :iceberg-spark:iceberg-spark-32, :iceberg-spark:iceberg-spark-32-extensions and :iceberg-spark:iceberg-spark-32-runtime, but that's a minor suggestion.)

Comment thread .github/workflows/spark-ci.yml Outdated
key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle') }}
restore-keys: ${{ runner.os }}-gradle
- run: echo -e "$(ip addr show eth0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)\t$(hostname -f) $(hostname -s)" | sudo tee -a /etc/hosts
- run: ./gradlew -DsparkVersions=${{ matrix.spark }} -DhiveVersions= -DflinkVersions= :iceberg-spark:iceberg-spark-32:check :iceberg-spark:iceberg-spark-32-extensions:check :iceberg-spark:iceberg-spark-32-runtime:check -Pquick=true -x javadoc

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to use the spark_3.2_2.12 naming scheme for modules. That would allow us to parameterize this with another reference to ${{ matrix.spark }} when calling the Spark version module's check task. Similarly, we'd be able to support Scala version in a follow-up.

I think the main argument for a module named iceberg-spark32 is consistency with the existing module names, but I don't think they are something we want to preserve now that we have more than modules for Spark major versions. It is more clear to use _3.2_2.13 everywhere that we can, vs relying on people knowing that 32 is actually Spark 3.2.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with using _3.2_2.12 and _3.2_2.13 names for the modules.
For a developer running gradle locally on their machine, is there a single ./gradlew command that they can use with -DsparkVersions=3.0,3.2 or -DsparkVersions=3.2 that builds all the applicable Spark 3 modules or runs their tests (less general than build or check or build -x test)?

Regardless of the naming of the modules, can we still keep the subdirectory names under each Spark 3 version uniform? It's independent of the module names and I think it'd be good if the subdirectory names are the same for each version.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll drop versions from folder names to keep things consistent and potentially simplify cherry-picks.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped versions for 3.2. I'll do the same for other versions to keep this PR smaller.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with keeping the same folder names to make moving changes around more easy.

Comment thread settings.gradle Outdated
project(':iceberg-spark:spark-32').projectDir = file('spark/v3.2/spark-32')
project(':iceberg-spark:spark-32').name = 'iceberg-spark-32'
project(':iceberg-spark:spark-32-extensions').projectDir = file('spark/v3.2/spark-32-extensions')
project(':iceberg-spark:spark-32-extensions').name = 'iceberg-spark-32-extensions'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were going to merge extensions into the base module? Is that a follow-up?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also thought about this but was not sure if it is worth the effort and whether it will make cherry-picks harder.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. We should either do it everywhere or nowhere. I'm leaning toward everywhere, but it doesn't matter much to me.

wypoon added a commit to wypoon/iceberg that referenced this pull request Oct 22, 2021
@wypoon

wypoon commented Oct 22, 2021

Copy link
Copy Markdown
Contributor

I created #3345 simply to enable easier visualization of the diff between the v3.0 code and the v3.2 code. Changes in the build are not reflected (the build change in #3345 is just a quick hack to enable building v3.0 against Spark 3.2).

});

conf.forEach((confKey, confValue) -> {
if (!confKey.startsWith(SparkSQLProperties.PREFIX) && !sqlConf.isModifiable(confKey)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the Spark 3.2 equivalent of SQLConf.staticConfKeys().contains(confKey) simply SQLConf.isStaticConfigKey(confKey)?
This appears in AvroDataTest.java and IcebergSourceBenchmark.java as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I missed this. Updated all 3 places.

Comment thread .github/workflows/spark-ci.yml Outdated
runs-on: ubuntu-latest
strategy:
matrix:
jvm: [11]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason not to include Java 8? Afaik, Spark 3.2 is still supported on Java 8, just deprecated for Java 8 prior to version 8u201.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had issues with it internally but it may work with OSS Spark. Added, let's see.

@rdblue rdblue merged commit f5a7537 into apache:master Oct 22, 2021
@rdblue

rdblue commented Oct 22, 2021

Copy link
Copy Markdown
Contributor

Thanks @aokolnychyi! And thanks for reviewing, everyone!

@aokolnychyi

Copy link
Copy Markdown
Contributor Author

Thanks, @wypoon @pan3793 @kbendick @rdblue!

Let me know if anyone has time to pick up the Scala version parameterization. I guess Spark 3.2 supports 2.12 and 2.13.

@wypoon

wypoon commented Oct 22, 2021

Copy link
Copy Markdown
Contributor

@aokolnychyi I saw that in v3.2, quite a number of rules have been removed from IcebergSparkSessionExtensions (and the classes removed): AlignRowLevelOperations, RowLevelOperationsPredicateCheck, OptimizeConditionsInRowLevelOperations, PullupCorrelatedPredicatesInRowLevelOperations, RewriteDelete, RewriteUpdate, and RewriteMergeInto. I also saw that TestCopyOnWriteDelete, TestCopyOnWriteMerge and TestCopyOnWriteUpdate are all being skipped entirely. Do you plan follow up work rewriting support for delete/update/merge using Spark 3.2 features? As of this change, what functionality works in 3.0 but not in 3.2?
I understand that some of the removed classes are things that are no longer needed because new API is now available in Spark 3.2 (some already in 3.1), e.g., ExtendedBatchScanExec and the dynamic file filter stuff. But I'd like to understand the delete/update/merge support situation (please pardon ignorance, I'm not familiar with this area yet).

@aokolnychyi

Copy link
Copy Markdown
Contributor Author

@wypoon, sure! Like I stated in the PR description, this change did not cover multiple Scala versions support and row-level commands. I am reworking row-level commands at this moment to also support merge-on-read and leverage native dynamic filtering in Spark. If I manage to make it work, then the implementation will be different and I did not want to invest time into porting existing extensions if we may rework them. To sum up, expect to see PRs for row-level commands and other 3.2 related things next week.

@wypoon

wypoon commented Oct 23, 2021

Copy link
Copy Markdown
Contributor

@aokolnychyi thanks for elaborating.

@rawataaryan9

Copy link
Copy Markdown
Contributor

@aokolnychyi, Just wanted to get a better understanding with respect to row-level command support. In 3.2, are we scoping the changes within iceberg code base itself (PR you are planning this week) and then from 3.3 onwards the changes would go inside the spark code base which would be on the lines of the SPIP and the draft PR. I might be missing information here and wanted to better understand the approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants