Skip to content

Spec: Adds Row Lineage#11130

Merged
rdblue merged 15 commits into
apache:mainfrom
RussellSpitzer:RowLineage
Oct 23, 2024
Merged

Spec: Adds Row Lineage#11130
rdblue merged 15 commits into
apache:mainfrom
RussellSpitzer:RowLineage

Conversation

@RussellSpitzer

@RussellSpitzer RussellSpitzer commented Sep 13, 2024

Copy link
Copy Markdown
Member

Proposal Here :

https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit#heading=h.f2e8ffw3fu7n

Adds Row Lineage to the Spec

End goal is to provide two fields to all rows

_row_id a unique long which identifies every row added to the table
_last_update the sequence number of the last commit to modify the row

Fixes #11129

@github-actions github-actions Bot added the Specification Issues that may introduce spec changes. label Sep 13, 2024
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
@dyfrgi

dyfrgi commented Sep 16, 2024

Copy link
Copy Markdown

Is there a path for upgrading an existing Iceberg table to use row-lineage?

@RussellSpitzer

RussellSpitzer commented Sep 16, 2024

Copy link
Copy Markdown
Member Author

Is there a path for upgrading an existing Iceberg table to use row-lineage?

Turning on row-lineage would start tracking for all rows added after that point, i'm not sure we have a way of going back and adding history for previously existing rows. We could if we like, specify that existing rows should be treated as if they were created in the manifest in which they appear but that sounds a bit complicated.

Comment thread format/spec.md
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md
Comment thread format/spec.md
| **`2147483543 _row_id`** | `long` | A unique long assigned when row-lineage is enabled see [Row Lineage](#row-lineage) |
| **`2147483542 _last_update`** | `long` | The sequence number which last updated this row when row-lineage is enabled [Row Lineage](#row-lineage) |

### Row Lineage

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I’m not sure if the lineage proposal document is still active for discussion, hence posting this comment here. If it is preferred, I can move this discussion there.)

The term row lineage has limited scope -- it refers to a "sequence number" which indicates when a row was created or modified. However, it does not reference the source of a modified row. I.e. it does not provide details about evolution of a row (the original row which was modified). Is this definition of "evolution" sufficient?

Also, the proposal document mentions that the impact on reads should be minimal. If there are any available metrics, specially increase in data file size, it might be beneficial to include them in this document so that users can understand the impact of enabling this feature.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every "row_id" can be used to track the creation of a row by checking the "row_id" high water mark of each snapshot in the table history. This allows a user (with sufficent snapshot history) to determine when any particular row was initially added to the table. The second field last-updated-seq points to the update in which the row was last modified.

Together these allow you determine when a row was made and when it was last changed. The origin of a modified row is always the row with the exact same _row_id in the commit before last-updated-seq.

Impact on read should be 0 since these columns do not need to actually be materialized by scans. Impact on merge statements/copy statements should be slightly increased because more data has to go through the compute engine although this will differ in efficiency based on the engine.

On file size this should be relatively low impact but we can do some benchmarks once the reference implementation is done. For use cases without row-level-updates it would be very very cheap since any materialized row_id and last-updated-sequence values should be either very very similar (and compressible) or identical.

@nastra nastra added this to the Iceberg V3 Spec milestone Sep 25, 2024
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated

@flyrain flyrain left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of rollback, there could be the same row id pointing to the different rows. These rows are in different branches, which may be fine until we try to merge branches. With that, we may need to rewrite the data files in case of cherry-picking or adding data file from another branch.

Comment thread format/spec.md
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated

#### First Row ID Assignment

Row ID inheritance is used when row lineage is enabled. When not enabled, a manifest's `first_row_id` must always be set to `null`. The rest of this section applies when row lineage is enabled.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we allow to disable row lineage for a table? If it is allowed, should we rewrite the manifest files and data files when we disable it?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue also has some thoughts on this. I don't have a problem with enabling and disabling and having slightly odd behavior when that happens. I think it's a pretty unlikely situation to have it flip back and forth, for users who do that they can expect some odd events.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we should allow disabling it. That would create strange situations and we would not be able to trust the metadata. There may be a way to handle this, but I don't want to block the initial feature taking the time to design it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume we would treat any commits in the gap the same as we would treat writes before tracking was on.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A related question: if we revert the table to a snapshot before enabling the row lineage, should we disable row lineage? If not, what about next_row_id?

@flyrain flyrain Oct 14, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more explicit in the spec about disabling row lineage is not allowed at this moment? I think the engine/catalog should guard it so that users won't accidentally disable it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A table snapshot reversion is the same as resetting current-snapshot. In this case we don't change next_row_id and we would not disable row_lineage. Turning on row_lineage is not a snapshot operation. I'll write that explicitly.

Comment thread format/spec.md Outdated
Comment thread format/spec.md
Comment thread format/spec.md Outdated

#### Row lineage assignment

Row lineage fields are written when row lineage is enabled. When not enabled, row lineage fields (`_row_id` and `_last_update`) must not be written to data files. The rest of this section applies when row lineage is enabled.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same clean-up question here. Do we rewrite data files in case of disabling row lineage or we disallow disabling?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think disabling is allowed it just means we stop changing any of the metadata and lineage information may be possibly lost by clients which don't support row-lineage. I don't think we need to prevent this.

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md
| _optional_ | _optional_ | _optional_ | **`metadata-log`** | A list (optional) of timestamp and metadata file location pairs that encodes changes to the previous metadata files for the table. Each time a new metadata file is created, a new entry of the previous metadata file location should be added to the list. Tables can be configured to remove oldest metadata log entries and keep a fixed-size log of the most recent entries after a commit. |
| _optional_ | _required_ | _required_ | **`sort-orders`** | A list of sort orders, stored as full sort order objects. |
| _optional_ | _required_ | _required_ | **`default-sort-order-id`** | Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. |
| | _optional_ | _optional_ | **`refs`** | A map of snapshot references. The map keys are the unique snapshot reference names in the table, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR, but a general v3 question: Should we make refs required in v3?

@aokolnychyi, @danielcweeks, @jackye1995, @RussellSpitzer, @flyrain, what do you think?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there are any negatives to that. What was the reason it wasn't required in V3?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason it wasn't required in V3?

@RussellSpitzer Do you mean V2? I think it was largely because branching/tagging came after the voting/adoption of V2 and so for compatibility purposes it needed to be optional for writers.

I'm +1 on making it required in V3 though. I think in general it's good to standardize on fields in newer format versions, when those fields are fairly adopted in the previous version and it's not that much of a metadata overhead to write it or maintain (for a user not using branches/tags the worst case is a just a mapping of main to the details of main). It makes it easier for developers to assume which metadata properties exist or not.

Comment thread format/spec.md Outdated
Comment thread format/spec.md
| _optional_ | _optional_ | _optional_ | **`metadata-log`** | A list (optional) of timestamp and metadata file location pairs that encodes changes to the previous metadata files for the table. Each time a new metadata file is created, a new entry of the previous metadata file location should be added to the list. Tables can be configured to remove oldest metadata log entries and keep a fixed-size log of the most recent entries after a commit. |
| _optional_ | _required_ | _required_ | **`sort-orders`** | A list of sort orders, stored as full sort order objects. |
| _optional_ | _required_ | _required_ | **`default-sort-order-id`** | Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. |
| | _optional_ | _optional_ | **`refs`** | A map of snapshot references. The map keys are the unique snapshot reference names in the table, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason it wasn't required in V3?

@RussellSpitzer Do you mean V2? I think it was largely because branching/tagging came after the voting/adoption of V2 and so for compatibility purposes it needed to be optional for writers.

I'm +1 on making it required in V3 though. I think in general it's good to standardize on fields in newer format versions, when those fields are fairly adopted in the previous version and it's not that much of a metadata overhead to write it or maintain (for a user not using branches/tags the worst case is a just a mapping of main to the details of main). It makes it easier for developers to assume which metadata properties exist or not.

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
| **`2147483545 pos`** | `long` | Ordinal position of a row, used in position-based delete files |
| **`2147483544 row`** | `struct<...>` | Deleted row values, used in position-based delete files |
| **`2147483543 _row_id`** | `long` | A unique long assigned when row-lineage is enabled see [Row Lineage](#row-lineage) |
| **`2147483542 _last_updated_seq`** | `long` | The sequence number which last updated this row when row-lineage is enabled [Row Lineage](#row-lineage) |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if we should assign sequence number regardless of if row-lineage is enabled or not.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal here is to not make it a requirement for engines which are not compatible with row-lineage.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to engine side changes for the last_updated_seq, or it can be done completely in the iceberg-core layer?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a change in the underlying parquet files so all engines will need to participate.

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md
Comment thread format/spec.md Outdated
Comment thread format/spec.md

The set of metadata columns is:

| Field id, name | Type | Description |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance we can avoid unnecessary changes? I usually use a text editor. I believe there are ways to disable auto-formatting in IntelliJ as well.

Comment thread format/spec.md Outdated

In v3 and later, an Iceberg table can track row lineage fields for all newly created rows. Row lineage is enabled by setting the field `row-lineage` to true in the table's metadata. When enabled, engines must maintain the `next-row-id` table field and the following row-level fields when writing data files:

* `_row_id` a unique long identifier for every row within the table. The value is assigned via inheritance when a row is first added to the table and the existing value is explicitly written when the row is written to a new file.

@aokolnychyi aokolnychyi Oct 16, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

written to a new file -> copied to a new file or something like below written to a different data file?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I disagree with the "copied" language. When "the row" is written (even if it is modified) the row ID should be preserved. To me, "copied" implies that the row is not modified, opening a question of how to handle row modification.

Comment thread format/spec.md Outdated

When a row is added or modified, the `_last_updated_sequence_number` field is set to `null` so that it is inherited when reading. Similarly, the `_row_id` field for an added row is set to `null` and assigned when reading.

A data file with only new rows for the table may omit the `_last_updated_sequence_number` and `_row_id`. When reading, such files must set both columns to null for all rows.

@aokolnychyi aokolnychyi Oct 16, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we allow writers to skip these fields in new data files, does it mean we loose the ability to distinguish a data file added prior to enabling row lineage and a data file with all new rows? Do we even care about that?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, we still have first_row_id on the data file. Never mind.

Comment thread format/spec.md Outdated

When a row is added or modified, the `_last_updated_sequence_number` field is set to `null` so that it is inherited when reading. Similarly, the `_row_id` field for an added row is set to `null` and assigned when reading.

A data file with only new rows for the table may omit the `_last_updated_sequence_number` and `_row_id`. When reading, such files must set both columns to null for all rows.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

such files must set both column -> shouldn't this be the reader's responsibility?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let me reword that

@RussellSpitzer

Copy link
Copy Markdown
Member Author

@rdblue
@nastra
@sumedhsakdeo
@flyrain
@stevenzwu
@wgtmac
@aokolnychyi
@ashvina
@amogh-jahagirdar

Ping everyone, we've had the vote on the Mailing list and I'd like to wrap this up and merge soon if possible. Please ping with any additional feedback

@rdblue

rdblue commented Oct 22, 2024

Copy link
Copy Markdown
Contributor

+1 to merge this since the vote has passed. We can do minor cleanup as we go right?

@amogh-jahagirdar amogh-jahagirdar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with the spec definition here as is, if there's stylistic/formatting cleanup we could do follow ups.

@sumedhsakdeo sumedhsakdeo left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RussellSpitzer for a very clear description of updates to the spec for row lineage. It looks great! Left some questions for my clarifications.

Comment thread format/spec.md

When a row is added or modified, the `_last_updated_sequence_number` field is set to `null` so that it is inherited when reading. Similarly, the `_row_id` field for an added row is set to `null` and assigned when reading.

A data file with only new rows for the table may omit the `_last_updated_sequence_number` and `_row_id`. If the columns are missing, readers should treat both columns as if they exist and are set to null for all rows.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readers should treat both columns as if they exist and are set to null for all rows

Clarifying, if we are also saying here, that _last_updated_sequence_number and _row_id are reserved column names in a table created with v3 spec.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no reserved column names, only IDs. And this does update the table of reserved IDs.

Comment thread format/spec.md

Values for `_row_id` and `_last_updated_sequence_number` are either read from the data file or assigned at read time. As a result on read, rows in a table always have non-null values for these fields when lineage is enabled.

When an existing row is moved to a different data file for any reason, writers are required to write `_row_id` and `_last_updated_sequence_number` according to the following rules:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a user does INSERT OVERWRITE of an entire partition / table, some rows might be overwritten implicitly, as the operation is not copy-on-write, unlike MERGE INTO or UPDATE. For such cases, are we saying the rows are treated as new rows, and existing row _row_id or _last_updated_sequence_number will not be carried forward?

@rdblue rdblue Oct 23, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. INSERT OVERWRITE is an INSERT and the rows should be treated as new rows.

I know there are cases here where users have historically built patterns around INSERT OVERWRITE when MERGE was not available. For example, the read-union-overwrite pattern was heavily used at Netflix to add new data to existing partitions. The problem is that engines can't detect the intent and carry row information through. These patterns also can't be optimized by engines, so I think the best choice is to use the INSERT semantics here.

In addition, the Iceberg community has discouraged using INSERT OVERWRITE for years because of the challenges with implicit data overwrites. Implicitly overwriting a directory of data means that the physical layout needs to implicitly align with writes. That's not a good pattern to use.

Comment thread format/spec.md

Files `data2` and `data3` are written with `null` for `first_row_id` and are assigned `first_row_id` at read time based on the manifest's `first_row_id` and the `record_count` of previously listed ADDED files in this manifest: (1,000 + 0) and (1,000 + 50).

When the new snapshot is committed, the table's `next-row-id` must also be updated (even if the new snapshot is not in the main branch). Because 225 rows were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 1,000 + 225 = 1,225:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, examples made it easy to follow.

Comment thread format/spec.md
All files that were added before `row-lineage` was enabled should propagate null for all of the `row-lineage` related
fields. The values for `_row_id` and `_last_updated_sequence_number` should always return null and when these rows are copied,
null should be explicitly written. After this point, rows are treated as if they were just created
and assigned `row_id` and `_last_updated_sequence_number` as if they were new rows.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, should we add line on expected behavior if disabling row lineage after enabling it for some time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not possible to disable row lineage after enabling it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call that out in the spec? I know it is merged already. May be a followup PR?

@rdblue rdblue merged commit 02a988b into apache:main Oct 23, 2024
@rdblue

rdblue commented Oct 23, 2024

Copy link
Copy Markdown
Contributor

Merged! Thanks for the awesome work on this, @RussellSpitzer!

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Co-authored-by: Ryan Blue <blue@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Row Lineage for V3