Spec: Adds Row Lineage by RussellSpitzer · Pull Request #11130 · apache/iceberg

RussellSpitzer · 2024-09-13T22:26:53Z

Proposal here:

https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit#heading=h.f2e8ffw3fu7n

Adds Row Lineage to the Spec

The end goal is to provide two fields for all rows:

_row_id: a unique long which identifies every row added to the table
_last_update: the sequence number of the last commit to modify the row

Fixes #11129

dyfrgi · 2024-09-16T16:54:28Z

Is there a path for upgrading an existing Iceberg table to use row-lineage?

RussellSpitzer · 2024-09-16T17:16:49Z

Is there a path for upgrading an existing Iceberg table to use row-lineage?

Turning on row-lineage would start tracking for all rows added after that point. I'm not sure we have a way of going back and adding history for previously existing rows. We could, if we like, specify that existing rows should be treated as if they were created in the manifest in which they appear, but that sounds a bit complicated.

ashvina · 2024-09-25T05:26:10Z

+| **`2147483543  _row_id`**        | `long`        | A unique long assigned when row-lineage is enabled see [Row Lineage](#row-lineage)                      |
+| **`2147483542  _last_update`**   | `long`        | The sequence number which last updated this row when row-lineage is enabled [Row Lineage](#row-lineage) |
+
+### Row Lineage


(I’m not sure if the lineage proposal document is still active for discussion, hence posting this comment here. If it is preferred, I can move this discussion there.)

The term row lineage has limited scope -- it refers to a "sequence number" which indicates when a row was created or modified. However, it does not reference the source of a modified row. I.e. it does not provide details about evolution of a row (the original row which was modified). Is this definition of "evolution" sufficient?

Also, the proposal document mentions that the impact on reads should be minimal. If any metrics are available, especially on the increase in data file size, it might be beneficial to include them in this document so that users can understand the impact of enabling this feature.

Every "row_id" can be used to track the creation of a row by checking the "row_id" high water mark of each snapshot in the table history. This allows a user (with sufficient snapshot history) to determine when any particular row was initially added to the table. The second field, last-updated-seq, points to the update in which the row was last modified.

Together these allow you to determine when a row was made and when it was last changed. The origin of a modified row is always the row with the exact same _row_id in the commit before last-updated-seq.

Impact on read should be 0 since these columns do not need to actually be materialized by scans. The cost of merge/copy statements should increase slightly because more data has to go through the compute engine, although the overhead will differ by engine.

On file size this should be relatively low impact but we can do some benchmarks once the reference implementation is done. For use cases without row-level-updates it would be very very cheap since any materialized row_id and last-updated-sequence values should be either very very similar (and compressible) or identical.
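The high-water-mark lookup described in this reply could be sketched roughly as follows. This is only an illustration under stated assumptions, not Iceberg's API: `SnapshotMark`, `creating_snapshot`, and the idea that each retained snapshot records the row ID high water mark in effect after it committed are hypothetical names and structures introduced for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SnapshotMark:
    """Hypothetical record: one entry per snapshot, in commit order."""
    snapshot_id: int
    next_row_id: int  # assumed: row ID high water mark after this snapshot committed


def creating_snapshot(history: List[SnapshotMark], row_id: int) -> Optional[int]:
    """Return the first retained snapshot whose high water mark exceeds row_id.

    High water marks never decrease, so a binary search is enough.
    If older snapshots have expired, the answer is only an upper bound:
    the row was created in that snapshot or in an expired one before it.
    """
    marks = [s.next_row_id for s in history]
    idx = bisect_right(marks, row_id)  # first index with marks[idx] > row_id
    return history[idx].snapshot_id if idx < len(history) else None


history = [SnapshotMark(1, 100), SnapshotMark(2, 100), SnapshotMark(3, 250)]
first_seen = creating_snapshot(history, 100)  # snapshot 2 added no rows
```

A row ID at or above the latest mark has not been assigned yet, which is why the function returns `None` in that case.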

flyrain

In case of rollback, the same row ID could point to different rows. These rows are in different branches, which may be fine until we try to merge branches. With that, we may need to rewrite the data files when cherry-picking or adding a data file from another branch.

flyrain · 2024-09-27T16:33:08Z


+#### First Row ID Assignment
+
+Row ID inheritance is used when row lineage is enabled. When not enabled, a manifest's `first_row_id` must always be set to `null`. The rest of this section applies when row lineage is enabled.


Do we allow disabling row lineage for a table? If it is allowed, should we rewrite the manifest files and data files when we disable it?

@rdblue also has some thoughts on this. I don't have a problem with enabling and disabling and having slightly odd behavior when that happens. I think flipping it back and forth is a pretty unlikely situation, and users who do that can expect some odd events.

I don't think that we should allow disabling it. That would create strange situations and we would not be able to trust the metadata. There may be a way to handle this, but I don't want to block the initial feature taking the time to design it.

I would assume we would treat any commits in the gap the same as we would treat writes before tracking was on.

A related question: if we revert the table to a snapshot before enabling the row lineage, should we disable row lineage? If not, what about next_row_id?

Can we be more explicit in the spec that disabling row lineage is not allowed at this moment? I think the engine/catalog should guard it so that users won't accidentally disable it.

A table snapshot reversion is the same as resetting current-snapshot. In this case we don't change next_row_id and we would not disable row_lineage. Turning on row_lineage is not a snapshot operation. I'll write that explicitly.
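The reversion behavior described in this reply could be sketched as below. It is a toy model under assumptions: `TableState` and `rollback_to` are hypothetical names, and the real table metadata carries many more fields.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TableState:
    """Hypothetical, stripped-down view of the table metadata."""
    current_snapshot_id: int
    next_row_id: int
    row_lineage: bool


def rollback_to(state: TableState, snapshot_id: int) -> TableState:
    """Reset only the current snapshot pointer.

    next_row_id and row_lineage are table-level fields, not snapshot state,
    so they are carried over unchanged.
    """
    return replace(state, current_snapshot_id=snapshot_id)


before = TableState(current_snapshot_id=5, next_row_id=1225, row_lineage=True)
after = rollback_to(before, 3)
```

Keeping `next_row_id` at its latest value means IDs handed out by the abandoned snapshots are never reassigned, which speaks to the rollback concern raised earlier in this thread.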

flyrain · 2024-09-27T17:00:15Z

+
+#### Row lineage assignment
+
+Row lineage fields are written when row lineage is enabled. When not enabled, row lineage fields (`_row_id` and `_last_update`) must not be written to data files. The rest of this section applies when row lineage is enabled.


The same clean-up question here. Do we rewrite data files when row lineage is disabled, or do we disallow disabling?

I think disabling is allowed; it just means we stop changing any of the metadata, and lineage information may be lost by clients which don't support row-lineage. I don't think we need to prevent this.

rdblue · 2024-10-07T23:35:26Z

+| _optional_ | _optional_ | _optional_ | **`metadata-log`**          | A list (optional) of timestamp and metadata file location pairs that encodes changes to the previous metadata files for the table. Each time a new metadata file is created, a new entry of the previous metadata file location should be added to the list. Tables can be configured to remove oldest metadata log entries and keep a fixed-size log of the most recent entries after a commit. |
+| _optional_ | _required_ | _required_ | **`sort-orders`**           | A list of sort orders, stored as full sort order objects.                                                                                                                                                                                                                                                                                                                                        |
+| _optional_ | _required_ | _required_ | **`default-sort-order-id`** | Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files.                                                                                                                                                                                                                                  |
+|            | _optional_ | _optional_ | **`refs`**                  | A map of snapshot references. The map keys are the unique snapshot reference names in the table, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null.                                                                                                                              |


Not related to this PR, but a general v3 question: Should we make refs required in v3?

@aokolnychyi, @danielcweeks, @jackye1995, @RussellSpitzer, @flyrain, what do you think?

I don't think there are any negatives to that. What was the reason it wasn't required in V3?


amogh-jahagirdar · 2024-10-09T02:34:28Z

What was the reason it wasn't required in V3?

@RussellSpitzer Do you mean V2? I think it was largely because branching/tagging came after the voting/adoption of V2 and so for compatibility purposes it needed to be optional for writers.

I'm +1 on making it required in V3 though. I think in general it's good to standardize on fields in newer format versions, when those fields are fairly adopted in the previous version and it's not that much of a metadata overhead to write or maintain (for a user not using branches/tags the worst case is just a mapping of main to the details of main). It makes it easier for developers to assume which metadata properties exist or not.

sumedhsakdeo · 2024-10-09T17:01:05Z

+| **`2147483545  pos`**            | `long`        | Ordinal position of a row, used in position-based delete files                                          |
+| **`2147483544  row`**            | `struct<...>` | Deleted row values, used in position-based delete files                                                 |
+| **`2147483543  _row_id`**        | `long`        | A unique long assigned when row-lineage is enabled see [Row Lineage](#row-lineage)                      |
+| **`2147483542  _last_updated_seq`**   | `long`        | The sequence number which last updated this row when row-lineage is enabled [Row Lineage](#row-lineage) |


I am wondering if we should assign the sequence number regardless of whether row-lineage is enabled.

The goal here is to not make it a requirement for engines which are not compatible with row-lineage.

Do we need engine-side changes for the last_updated_seq, or can it be done completely in the iceberg-core layer?

It's a change in the underlying Parquet files, so all engines will need to participate.

aokolnychyi · 2024-10-16T23:18:58Z


 The set of metadata columns is:

-| Field id, name              | Type          | Description |


Any chance we can avoid unnecessary changes? I usually use a text editor. I believe there are ways to disable auto-formatting in IntelliJ as well.

aokolnychyi · 2024-10-16T23:30:24Z

+
+In v3 and later, an Iceberg table can track row lineage fields for all newly created rows.  Row lineage is enabled by setting the field `row-lineage` to true in the table's metadata. When enabled, engines must maintain the `next-row-id` table field and the following row-level fields when writing data files:
+
+* `_row_id` a unique long identifier for every row within the table. The value is assigned via inheritance when a row is first added to the table and the existing value is explicitly written when the row is written to a new file.


`written to a new file` -> `copied to a new file`, or something like the wording used below, `written to a different data file`?

I think I disagree with the "copied" language. When "the row" is written (even if it is modified) the row ID should be preserved. To me, "copied" implies that the row is not modified, opening a question of how to handle row modification.

aokolnychyi · 2024-10-16T23:34:51Z

+
+When a row is added or modified, the `_last_updated_sequence_number` field is set to `null` so that it is inherited when reading. Similarly, the `_row_id` field for an added row is set to `null` and assigned when reading.
+
+A data file with only new rows for the table may omit the `_last_updated_sequence_number` and `_row_id`. When reading, such files must set both columns to null for all rows.


If we allow writers to skip these fields in new data files, does it mean we lose the ability to distinguish a data file added prior to enabling row lineage from a data file with all new rows? Do we even care about that?

Probably not, we still have first_row_id on the data file. Never mind.

aokolnychyi · 2024-10-16T23:35:55Z

+
+When a row is added or modified, the `_last_updated_sequence_number` field is set to `null` so that it is inherited when reading. Similarly, the `_row_id` field for an added row is set to `null` and assigned when reading.
+
+A data file with only new rows for the table may omit the `_last_updated_sequence_number` and `_row_id`. When reading, such files must set both columns to null for all rows.


such files must set both column -> shouldn't this be the reader's responsibility?

Yes, let me reword that

RussellSpitzer · 2024-10-22T16:40:02Z

@rdblue
@nastra
@sumedhsakdeo
@flyrain
@stevenzwu
@wgtmac
@aokolnychyi
@ashvina
@amogh-jahagirdar

Pinging everyone: we've had the vote on the mailing list and I'd like to wrap this up and merge soon if possible. Please ping me with any additional feedback.

rdblue · 2024-10-22T23:44:30Z

+1 to merge this since the vote has passed. We can do minor cleanup as we go right?

amogh-jahagirdar

I'm good with the spec definition here as is, if there's stylistic/formatting cleanup we could do follow ups.

sumedhsakdeo

Thanks @RussellSpitzer for a very clear description of updates to the spec for row lineage. It looks great! Left some questions for my clarifications.

sumedhsakdeo · 2024-10-23T05:35:25Z

+
+When a row is added or modified, the `_last_updated_sequence_number` field is set to `null` so that it is inherited when reading. Similarly, the `_row_id` field for an added row is set to `null` and assigned when reading.
+
+A data file with only new rows for the table may omit the `_last_updated_sequence_number` and `_row_id`. If the columns are missing, readers should treat both columns as if they exist and are set to null for all rows.


readers should treat both columns as if they exist and are set to null for all rows

Clarifying: are we also saying here that _last_updated_sequence_number and _row_id are reserved column names in a table created with the v3 spec?

There are no reserved column names, only IDs. And this does update the table of reserved IDs.
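The read-time handling of null or missing lineage columns quoted in this thread could look roughly like the sketch below. The hunk only says the values are "inherited" or "assigned when reading"; the concrete rule used here (row ID = the file's `first_row_id` plus the row's position in the file, sequence number = the data file's sequence number) is an assumption of this example, and `resolve_lineage` and `Lineage` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Lineage(NamedTuple):
    row_id: int
    last_updated_sequence_number: int


def resolve_lineage(
    stored_row_id: Optional[int],
    stored_sequence_number: Optional[int],
    file_first_row_id: int,
    row_position: int,
    file_sequence_number: int,
) -> Lineage:
    """Fill in lineage values a writer left null (or omitted entirely)."""
    # Assumed rule: a new row inherits its ID from the file's first_row_id
    # plus its ordinal position within the file.
    row_id = stored_row_id if stored_row_id is not None else file_first_row_id + row_position
    # Assumed rule: an added or modified row inherits the sequence number
    # of the data file that contains it.
    seq = stored_sequence_number if stored_sequence_number is not None else file_sequence_number
    return Lineage(row_id, seq)


# A brand-new row: both columns are null or missing in the file.
new_row = resolve_lineage(None, None, file_first_row_id=1000, row_position=7, file_sequence_number=42)
# A modified row: the writer preserved _row_id and nulled the sequence number.
modified_row = resolve_lineage(315, None, file_first_row_id=1000, row_position=8, file_sequence_number=42)
```

A file that omits both columns is treated the same as one that stores null in every row, which is the reader-side responsibility raised earlier in the review.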

sumedhsakdeo · 2024-10-23T05:53:36Z

+
+Values for `_row_id` and `_last_updated_sequence_number` are either read from the data file or assigned at read time. As a result on read, rows in a table always have non-null values for these fields when lineage is enabled.
+
+When an existing row is moved to a different data file for any reason, writers are required to write `_row_id` and `_last_updated_sequence_number` according to the following rules:


When a user does INSERT OVERWRITE of an entire partition / table, some rows might be overwritten implicitly, as the operation is not copy-on-write, unlike MERGE INTO or UPDATE. For such cases, are we saying the rows are treated as new rows, and existing row _row_id or _last_updated_sequence_number will not be carried forward?

Yes. INSERT OVERWRITE is an INSERT and the rows should be treated as new rows.

I know there are cases here where users have historically built patterns around INSERT OVERWRITE when MERGE was not available. For example, the read-union-overwrite pattern was heavily used at Netflix to add new data to existing partitions. The problem is that engines can't detect the intent and carry row information through. These patterns also can't be optimized by engines, so I think the best choice is to use the INSERT semantics here.

In addition, the Iceberg community has discouraged using INSERT OVERWRITE for years because of the challenges with implicit data overwrites. Implicitly overwriting a directory of data means that the physical layout needs to implicitly align with writes. That's not a good pattern to use.

sumedhsakdeo · 2024-10-23T06:04:31Z

+
+Files `data2` and `data3` are written with `null` for `first_row_id` and are assigned `first_row_id` at read time based on the manifest's `first_row_id` and the `record_count` of previously listed ADDED files in this manifest: (1,000 + 0) and (1,000 + 50).
+
+When the new snapshot is committed, the table's `next-row-id` must also be updated (even if the new snapshot is not in the main branch). Because 225 rows were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 1,000 + 225 = 1,225:


+1, examples made it easy to follow.
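The arithmetic in the quoted example can be reproduced with a short sketch. The function names are hypothetical, and the record count used for `data3` is an assumption because the quoted hunk only implies that `data2` holds 50 rows.

```python
from typing import List


def assign_first_row_ids(manifest_first_row_id: int, added_record_counts: List[int]) -> List[int]:
    """Assign first_row_id to each ADDED file in manifest order.

    Each file starts where the previously listed ADDED files end.
    """
    assigned = []
    offset = 0
    for record_count in added_record_counts:
        assigned.append(manifest_first_row_id + offset)
        offset += record_count
    return assigned


def next_row_id_after_commit(current_next_row_id: int, added_rows_per_manifest: List[int]) -> int:
    """Advance the table's next-row-id by the total number of added rows."""
    return current_next_row_id + sum(added_rows_per_manifest)


# data2 has 50 rows (implied by 1,000 + 50); data3's count of 50 is assumed.
file_ids = assign_first_row_ids(1000, [50, 50])
# added1: 100, added2: 0, added3: 125, as in the quoted text.
new_next_row_id = next_row_id_after_commit(1000, [100, 0, 125])
```

With these inputs `file_ids` is `[1000, 1050]` and `new_next_row_id` is `1225`, matching the (1,000 + 0), (1,000 + 50), and 1,225 values in the spec text.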

sumedhsakdeo · 2024-10-23T06:06:58Z

+All files that were added before `row-lineage` was enabled should propagate null for all of the `row-lineage` related
+fields. The values for `_row_id` and `_last_updated_sequence_number` should always return null and when these rows are copied, 
+null should be explicitly written. After this point, rows are treated as if they were just created 
+and assigned `row_id` and `_last_updated_sequence_number` as if they were new rows.


For completeness, should we add a line on the expected behavior if row lineage is disabled after being enabled for some time?

It is not possible to disable row lineage after enabling it.

Should we call that out in the spec? I know it is merged already. Maybe a follow-up PR?

rdblue · 2024-10-23T21:59:43Z

Merged! Thanks for the awesome work on this, @RussellSpitzer!

Spec: Adds Row Lineage (3cdeb9a)

github-actions bot added the Specification label ("Issues that may introduce spec changes") Sep 13, 2024

RussellSpitzer commented Sep 13, 2024 (2 comments)

Change Reserved Field Ids (8b9ff29)

stevenzwu reviewed Sep 19, 2024

Reviewer Comments, column renames (69efedc)

rdblue reviewed Sep 23, 2024 (6 reviews)

stevenzwu reviewed Sep 24, 2024

CJDrew mentioned this pull request Sep 24, 2024: Flink SQL with Iceberg snapshots doesn't react if table has upsert #9948 (Closed)

ashvina reviewed Sep 25, 2024

nastra added this to the Iceberg V3 Spec milestone Sep 25, 2024

rdblue reviewed Sep 26, 2024 (3 reviews)

rdblue and others added 2 commits September 26, 2024 16:24

Rearrange some sections and clarify. (7ba4d25)

Merge pull request #4 from rdblue/RowLineage (bcfd8f7): Changes from a thorough review

flyrain reviewed Sep 27, 2024

RussellSpitzer added 2 commits September 27, 2024 14:19

Clarifications and More Review Comments (dccb597)

Further reviewer comments (f3ead5b)

rdblue reviewed Oct 7, 2024 (4 reviews)

RussellSpitzer and others added 3 commits October 8, 2024 11:51

Apply suggestions from code review (baba71e), Co-authored-by: Ryan Blue <blue@apache.org>

Additional review changes (6bd6f47)

Merge remote-tracking branch 'refs/remotes/origin/RowLineage' into Ro… …wLineage (71aeb9b)

amogh-jahagirdar reviewed Oct 9, 2024

Review Comment (ed9aebd)

sumedhsakdeo reviewed Oct 9, 2024

RussellSpitzer commented Oct 9, 2024 (2 comments)

Update Reserved Field IDs to not Overlap with Existing Fields (ffd08b3)

rdblue mentioned this pull request Oct 9, 2024: Spec v3: Add deletion vectors to the table spec #11240 (Merged)

nastra reviewed Oct 10, 2024

stevenzwu reviewed Oct 10, 2024

RussellSpitzer added 2 commits October 14, 2024 16:55

Additional Reviewer Comments (28f4090)

Merge remote-tracking branch 'refs/remotes/origin/RowLineage' into Ro… …wLineage (7062ca3)

aokolnychyi reviewed Oct 16, 2024

More Reviewer Comments (943e7c4)

amogh-jahagirdar approved these changes Oct 22, 2024

sumedhsakdeo approved these changes Oct 23, 2024

stevenzwu approved these changes Oct 23, 2024

rdblue approved these changes Oct 23, 2024

rdblue merged commit 02a988b into apache:main Oct 23, 2024

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024

Spec: Adds Row Lineage (apache#11130) (eb0357c), Co-authored-by: Ryan Blue <blue@apache.org>
