Skip to content

Spark 3.4: Support fanout writers in SparkPositionDeltaWrite#7703

Merged
aokolnychyi merged 3 commits into
apache:masterfrom
aokolnychyi:mor-fanout-dist-and-ordering
May 26, 2023
Merged

Spark 3.4: Support fanout writers in SparkPositionDeltaWrite#7703
aokolnychyi merged 3 commits into
apache:masterfrom
aokolnychyi:mor-fanout-dist-and-ordering

Conversation

@aokolnychyi

Copy link
Copy Markdown
Contributor

This PR adds support for fanout writers in SparkPositionDeltaWrite.

@github-actions github-actions Bot added the spark label May 25, 2023
@aokolnychyi

Copy link
Copy Markdown
Contributor Author

}

public SparkWriteRequirements positionDeltaRequirements(Command command) {
if (ignoreTableDistributionAndOrdering()) {

@aokolnychyi aokolnychyi May 25, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matches CoW and insert requirements now (defined above).

@@ -130,15 +130,16 @@ private static Distribution copyOnWriteDeleteUpdateDistribution(

/** Builds requirements for merge-on-read DELETE, UPDATE, MERGE operations. */
public static SparkWriteRequirements positionDeltaRequirements(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a note at the top that this class is for internal purposes and has no API guarantees.

long targetFileSize = context.targetDataFileSize();

if (fanoutEnabled && !inputOrdered) {
return new FanoutDataWriter<>(writers, files, io, targetFileSize);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to use the fanout writer if input is ordered. The local sort always include partition columns.

@szehon-ho szehon-ho left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me


// the spec requires position deletes to be ordered by file and position
// use a fanout writer if the input is unordered no matter whether fanout writers are enabled
// clustered writers only validate records for the same spec/paritition are co-located,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: clustered writers only validate records for the same spec/paritition are co-located which is not enough for position deletes => clustered writers assume that the position deletes are already ordered by file and position?

(the validation seems a bit orthogonal to previous sentence. I dont mind putting it in a as well somewhere, but doesnt seem its in the other comment in newDataWriter, so hence thinking its better to omit)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, let me update.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

// -------------------------------------------------------------------------
// delete mode is NOT SET -> CLUSTER BY _spec_id, _partition, _file +
// LOCALLY ORDERED BY _spec_id, _partition, _file, _pos
// delete mode is NOT SET (fanout) -> CLUSTER BY _spec_id, _partition, _file + empty ordering

@szehon-ho szehon-ho May 25, 2023

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably annoying to add another example with ORDERED BY, but is it still correct with "ORDERED" above? Is empty_ordering not correct?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a table sort order, we still respect it by default and fanout writer config has no effect.

@szehon-ho szehon-ho May 26, 2023

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I meant , in this comment as it is ORDERED & UNORDERED, fanout should not be 'empty ordering', but rather whatever the order column was?

Not sure if I'm missing something

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is DELETE. We only have positions, no data, so the order does not matter.

@szehon-ho szehon-ho May 26, 2023

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. ok. Well i guess it matters in the end, but the fanoutWriter here will sort it

@amogh-jahagirdar amogh-jahagirdar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks @aokolnychyi

@aokolnychyi aokolnychyi merged commit b430775 into apache:master May 26, 2023
@aokolnychyi

Copy link
Copy Markdown
Contributor Author

Thanks for reviewing, @szehon-ho @amogh-jahagirdar!

rodmeneses pushed a commit to rodmeneses/iceberg that referenced this pull request Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants