Spark: Initial support for 3.2#3335
Conversation
23ea69d to
58ce778
Compare
kbendick
left a comment
There was a problem hiding this comment.
Still in the process of going through all of this but thank you so much for doing this!
| key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle') }} | ||
| restore-keys: ${{ runner.os }}-gradle | ||
| - run: echo -e "$(ip addr show eth0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)\t$(hostname -f) $(hostname -s)" | sudo tee -a /etc/hosts | ||
| - run: ./gradlew -DsparkVersions=${{ matrix.spark }} -DhiveVersions= -DflinkVersions= :iceberg-spark:iceberg-spark_3.2:check :iceberg-spark:iceberg-spark_3.2-extensions:check :iceberg-spark:iceberg-spark_3.2-runtime:check -Pquick=true -x javadoc |
There was a problem hiding this comment.
Unrelated: Are the -DflinkVersions= really necessary? I assume so. That's kind of a pain. I might take a stab at getting rid of that.
There was a problem hiding this comment.
I copied it from Spark 3.0. I think it is done to disable Flink tests.
There was a problem hiding this comment.
It might be? The default is to include Flink 1.13 so that we can run things like assemble to trigger compilation of all modules. Then we remove Flink by specifically setting the versions to an empty string where we don't want to even load the project, like in the "core" tests job where we want to run check but exclude Flink. I think otherwise we'd have to remove specific targets for Flink or call check for each module.
Here, since we are already calling check for each module individually we may not need it.
There was a problem hiding this comment.
I am keeping it as is for consistency now. If we want to update, we should probably update all jobs.
58ce778 to
da40f9f
Compare
| } | ||
|
|
||
| if (sparkVersions.contains("3.2")) { | ||
| include ':iceberg-spark:spark_3.2' |
There was a problem hiding this comment.
Does this naming strategy conflict with scala?
There was a problem hiding this comment.
This PR does not take care of multiple Scala versions. It is a follow-up item and I'd be glad if someone could work on that.
There was a problem hiding this comment.
It's not about the support of multiple Scala versions, some applications like mvnrepository.com will identify the Scala version by such pattern from package name.
For reference, Elasticsearch Spark use suffix spark-16 spark-30, https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-30
There was a problem hiding this comment.
If possible, I suggest not to break this rule, otherwise it will confuse users and some applications.
There was a problem hiding this comment.
I assumed if we named it as spark_3.2_2.12, it would work. However, I have no issues updating to use spark-32 instead. I'll make the change.
There was a problem hiding this comment.
I think we should get way from the existing naming and start using spark_3.2_2.12. That shouldn't have a Scala problem because the Scala version is the last.
There was a problem hiding this comment.
This is a good idea. Scala 2.13 support will be coming pretty soon to Spark (I think the tests have been running on 2.13 for a bit).
There was a problem hiding this comment.
How do you think spark-3.2_2.12? Usually, the package name is preferred to use - instead of _.
There was a problem hiding this comment.
I'll use spark-3.2_2.12. That's a good compromise.
b585d3e to
77c0ab0
Compare
77c0ab0 to
7644ddc
Compare
|
Can we please keep the subdirectory naming the same? i.e., The project names can be different ( |
| key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle') }} | ||
| restore-keys: ${{ runner.os }}-gradle | ||
| - run: echo -e "$(ip addr show eth0 | grep "inet\b" | awk '{print $2}' | cut -d/ -f1)\t$(hostname -f) $(hostname -s)" | sudo tee -a /etc/hosts | ||
| - run: ./gradlew -DsparkVersions=${{ matrix.spark }} -DhiveVersions= -DflinkVersions= :iceberg-spark:iceberg-spark-32:check :iceberg-spark:iceberg-spark-32-extensions:check :iceberg-spark:iceberg-spark-32-runtime:check -Pquick=true -x javadoc |
There was a problem hiding this comment.
I'd prefer to use the spark_3.2_2.12 naming scheme for modules. That would allow us to parameterize this with another reference to ${{ matrix.spark }} when calling the Spark version module's check task. Similarly, we'd be able to support Scala version in a follow-up.
I think the main argument for a module named iceberg-spark32 is consistency with the existing module names, but I don't think they are something we want to preserve now that we have more than modules for Spark major versions. It is more clear to use _3.2_2.13 everywhere that we can, vs relying on people knowing that 32 is actually Spark 3.2.
There was a problem hiding this comment.
I'm fine with using _3.2_2.12 and _3.2_2.13 names for the modules.
For a developer running gradle locally on their machine, is there a single ./gradlew command that they can use with -DsparkVersions=3.0,3.2 or -DsparkVersions=3.2 that builds all the applicable Spark 3 modules or runs their tests (less general than build or check or build -x test)?
Regardless of the naming of the modules, can we still keep the subdirectory names under each Spark 3 version uniform? It's independent of the module names and I think it'd be good if the subdirectory names are the same for each version.
There was a problem hiding this comment.
I'll drop versions from folder names to keep things consistent and potentially simplify cherry-picks.
There was a problem hiding this comment.
I dropped versions for 3.2. I'll do the same for other versions to keep this PR smaller.
There was a problem hiding this comment.
Yeah, I agree with keeping the same folder names to make moving changes around more easy.
| project(':iceberg-spark:spark-32').projectDir = file('spark/v3.2/spark-32') | ||
| project(':iceberg-spark:spark-32').name = 'iceberg-spark-32' | ||
| project(':iceberg-spark:spark-32-extensions').projectDir = file('spark/v3.2/spark-32-extensions') | ||
| project(':iceberg-spark:spark-32-extensions').name = 'iceberg-spark-32-extensions' |
There was a problem hiding this comment.
I thought we were going to merge extensions into the base module? Is that a follow-up?
There was a problem hiding this comment.
I also thought about this but was not sure if it is worth the effort and whether it will make cherry-picks harder.
There was a problem hiding this comment.
Right. We should either do it everywhere or nowhere. I'm leaning toward everywhere, but it doesn't matter much to me.
…d to spark/v3.0
| }); | ||
|
|
||
| conf.forEach((confKey, confValue) -> { | ||
| if (!confKey.startsWith(SparkSQLProperties.PREFIX) && !sqlConf.isModifiable(confKey)) { |
There was a problem hiding this comment.
Isn't the Spark 3.2 equivalent of SQLConf.staticConfKeys().contains(confKey) simply SQLConf.isStaticConfigKey(confKey)?
This appears in AvroDataTest.java and IcebergSourceBenchmark.java as well.
There was a problem hiding this comment.
Good point, I missed this. Updated all 3 places.
| runs-on: ubuntu-latest | ||
| strategy: | ||
| matrix: | ||
| jvm: [11] |
There was a problem hiding this comment.
Is there a reason not to include Java 8? Afaik, Spark 3.2 is still supported on Java 8, just deprecated for Java 8 prior to version 8u201.
There was a problem hiding this comment.
I had issues with it internally but it may work with OSS Spark. Added, let's see.
|
Thanks @aokolnychyi! And thanks for reviewing, everyone! |
|
@aokolnychyi I saw that in v3.2, quite a number of rules have been removed from |
|
@wypoon, sure! Like I stated in the PR description, this change did not cover multiple Scala versions support and row-level commands. I am reworking row-level commands at this moment to also support merge-on-read and leverage native dynamic filtering in Spark. If I manage to make it work, then the implementation will be different and I did not want to invest time into porting existing extensions if we may rework them. To sum up, expect to see PRs for row-level commands and other 3.2 related things next week. |
|
@aokolnychyi thanks for elaborating. |
|
@aokolnychyi, Just wanted to get a better understanding with respect to row-level command support. In 3.2, are we scoping the changes within iceberg code base itself (PR you are planning this week) and then from 3.3 onwards the changes would go inside the spark code base which would be on the lines of the SPIP and the draft PR. I might be missing information here and wanted to better understand the approach. |
This PR adds initial support for Spark 3.2 and contains 3 commits. The first one includes the initial move of classes and build-related changes. The second contains changes for the core functionality. The third one contains changes for extensions.
TODO items: