Spec: Adds Row Lineage#11130
Conversation
|
Is there a path for upgrading an existing Iceberg table to use row-lineage? |
Turning on row-lineage would start tracking for all rows added after that point, i'm not sure we have a way of going back and adding history for previously existing rows. We could if we like, specify that existing rows should be treated as if they were created in the manifest in which they appear but that sounds a bit complicated. |
| | **`2147483543 _row_id`** | `long` | A unique long assigned when row-lineage is enabled see [Row Lineage](#row-lineage) | | ||
| | **`2147483542 _last_update`** | `long` | The sequence number which last updated this row when row-lineage is enabled [Row Lineage](#row-lineage) | | ||
|
|
||
| ### Row Lineage |
There was a problem hiding this comment.
(I’m not sure if the lineage proposal document is still active for discussion, hence posting this comment here. If it is preferred, I can move this discussion there.)
The term row lineage has limited scope -- it refers to a "sequence number" which indicates when a row was created or modified. However, it does not reference the source of a modified row. I.e. it does not provide details about evolution of a row (the original row which was modified). Is this definition of "evolution" sufficient?
Also, the proposal document mentions that the impact on reads should be minimal. If there are any available metrics, specially increase in data file size, it might be beneficial to include them in this document so that users can understand the impact of enabling this feature.
There was a problem hiding this comment.
Every "row_id" can be used to track the creation of a row by checking the "row_id" high water mark of each snapshot in the table history. This allows a user (with sufficent snapshot history) to determine when any particular row was initially added to the table. The second field last-updated-seq points to the update in which the row was last modified.
Together these allow you determine when a row was made and when it was last changed. The origin of a modified row is always the row with the exact same _row_id in the commit before last-updated-seq.
Impact on read should be 0 since these columns do not need to actually be materialized by scans. Impact on merge statements/copy statements should be slightly increased because more data has to go through the compute engine although this will differ in efficiency based on the engine.
On file size this should be relatively low impact but we can do some benchmarks once the reference implementation is done. For use cases without row-level-updates it would be very very cheap since any materialized row_id and last-updated-sequence values should be either very very similar (and compressible) or identical.
Changes from a thorough review
flyrain
left a comment
There was a problem hiding this comment.
In case of rollback, there could be the same row id pointing to the different rows. These rows are in different branches, which may be fine until we try to merge branches. With that, we may need to rewrite the data files in case of cherry-picking or adding data file from another branch.
|
|
||
| #### First Row ID Assignment | ||
|
|
||
| Row ID inheritance is used when row lineage is enabled. When not enabled, a manifest's `first_row_id` must always be set to `null`. The rest of this section applies when row lineage is enabled. |
There was a problem hiding this comment.
Do we allow to disable row lineage for a table? If it is allowed, should we rewrite the manifest files and data files when we disable it?
There was a problem hiding this comment.
@rdblue also has some thoughts on this. I don't have a problem with enabling and disabling and having slightly odd behavior when that happens. I think it's a pretty unlikely situation to have it flip back and forth, for users who do that they can expect some odd events.
There was a problem hiding this comment.
I don't think that we should allow disabling it. That would create strange situations and we would not be able to trust the metadata. There may be a way to handle this, but I don't want to block the initial feature taking the time to design it.
There was a problem hiding this comment.
I would assume we would treat any commits in the gap the same as we would treat writes before tracking was on.
There was a problem hiding this comment.
A related question: if we revert the table to a snapshot before enabling the row lineage, should we disable row lineage? If not, what about next_row_id?
There was a problem hiding this comment.
Can we be more explicit in the spec about disabling row lineage is not allowed at this moment? I think the engine/catalog should guard it so that users won't accidentally disable it.
There was a problem hiding this comment.
A table snapshot reversion is the same as resetting current-snapshot. In this case we don't change next_row_id and we would not disable row_lineage. Turning on row_lineage is not a snapshot operation. I'll write that explicitly.
|
|
||
| #### Row lineage assignment | ||
|
|
||
| Row lineage fields are written when row lineage is enabled. When not enabled, row lineage fields (`_row_id` and `_last_update`) must not be written to data files. The rest of this section applies when row lineage is enabled. |
There was a problem hiding this comment.
The same clean-up question here. Do we rewrite data files in case of disabling row lineage or we disallow disabling?
There was a problem hiding this comment.
I think disabling is allowed it just means we stop changing any of the metadata and lineage information may be possibly lost by clients which don't support row-lineage. I don't think we need to prevent this.
| | _optional_ | _optional_ | _optional_ | **`metadata-log`** | A list (optional) of timestamp and metadata file location pairs that encodes changes to the previous metadata files for the table. Each time a new metadata file is created, a new entry of the previous metadata file location should be added to the list. Tables can be configured to remove oldest metadata log entries and keep a fixed-size log of the most recent entries after a commit. | | ||
| | _optional_ | _required_ | _required_ | **`sort-orders`** | A list of sort orders, stored as full sort order objects. | | ||
| | _optional_ | _required_ | _required_ | **`default-sort-order-id`** | Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. | | ||
| | | _optional_ | _optional_ | **`refs`** | A map of snapshot references. The map keys are the unique snapshot reference names in the table, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null. | |
There was a problem hiding this comment.
Not related to this PR, but a general v3 question: Should we make refs required in v3?
@aokolnychyi, @danielcweeks, @jackye1995, @RussellSpitzer, @flyrain, what do you think?
There was a problem hiding this comment.
I don't think there are any negatives to that. What was the reason it wasn't required in V3?
There was a problem hiding this comment.
What was the reason it wasn't required in V3?
@RussellSpitzer Do you mean V2? I think it was largely because branching/tagging came after the voting/adoption of V2 and so for compatibility purposes it needed to be optional for writers.
I'm +1 on making it required in V3 though. I think in general it's good to standardize on fields in newer format versions, when those fields are fairly adopted in the previous version and it's not that much of a metadata overhead to write it or maintain (for a user not using branches/tags the worst case is a just a mapping of main to the details of main). It makes it easier for developers to assume which metadata properties exist or not.
Co-authored-by: Ryan Blue <blue@apache.org>
| | _optional_ | _optional_ | _optional_ | **`metadata-log`** | A list (optional) of timestamp and metadata file location pairs that encodes changes to the previous metadata files for the table. Each time a new metadata file is created, a new entry of the previous metadata file location should be added to the list. Tables can be configured to remove oldest metadata log entries and keep a fixed-size log of the most recent entries after a commit. | | ||
| | _optional_ | _required_ | _required_ | **`sort-orders`** | A list of sort orders, stored as full sort order objects. | | ||
| | _optional_ | _required_ | _required_ | **`default-sort-order-id`** | Default sort order id of the table. Note that this could be used by writers, but is not used when reading because reads use the specs stored in manifest files. | | ||
| | | _optional_ | _optional_ | **`refs`** | A map of snapshot references. The map keys are the unique snapshot reference names in the table, and the map values are snapshot reference objects. There is always a `main` branch reference pointing to the `current-snapshot-id` even if the `refs` map is null. | |
There was a problem hiding this comment.
What was the reason it wasn't required in V3?
@RussellSpitzer Do you mean V2? I think it was largely because branching/tagging came after the voting/adoption of V2 and so for compatibility purposes it needed to be optional for writers.
I'm +1 on making it required in V3 though. I think in general it's good to standardize on fields in newer format versions, when those fields are fairly adopted in the previous version and it's not that much of a metadata overhead to write it or maintain (for a user not using branches/tags the worst case is a just a mapping of main to the details of main). It makes it easier for developers to assume which metadata properties exist or not.
| | **`2147483545 pos`** | `long` | Ordinal position of a row, used in position-based delete files | | ||
| | **`2147483544 row`** | `struct<...>` | Deleted row values, used in position-based delete files | | ||
| | **`2147483543 _row_id`** | `long` | A unique long assigned when row-lineage is enabled see [Row Lineage](#row-lineage) | | ||
| | **`2147483542 _last_updated_seq`** | `long` | The sequence number which last updated this row when row-lineage is enabled [Row Lineage](#row-lineage) | |
There was a problem hiding this comment.
I am wondering if we should assign sequence number regardless of if row-lineage is enabled or not.
There was a problem hiding this comment.
The goal here is to not make it a requirement for engines which are not compatible with row-lineage.
There was a problem hiding this comment.
Do we need to engine side changes for the last_updated_seq, or it can be done completely in the iceberg-core layer?
There was a problem hiding this comment.
It's a change in the underlying parquet files so all engines will need to participate.
|
|
||
| The set of metadata columns is: | ||
|
|
||
| | Field id, name | Type | Description | |
There was a problem hiding this comment.
Any chance we can avoid unnecessary changes? I usually use a text editor. I believe there are ways to disable auto-formatting in IntelliJ as well.
|
|
||
| In v3 and later, an Iceberg table can track row lineage fields for all newly created rows. Row lineage is enabled by setting the field `row-lineage` to true in the table's metadata. When enabled, engines must maintain the `next-row-id` table field and the following row-level fields when writing data files: | ||
|
|
||
| * `_row_id` a unique long identifier for every row within the table. The value is assigned via inheritance when a row is first added to the table and the existing value is explicitly written when the row is written to a new file. |
There was a problem hiding this comment.
written to a new file -> copied to a new file or something like below written to a different data file?
There was a problem hiding this comment.
I think I disagree with the "copied" language. When "the row" is written (even if it is modified) the row ID should be preserved. To me, "copied" implies that the row is not modified, opening a question of how to handle row modification.
|
|
||
| When a row is added or modified, the `_last_updated_sequence_number` field is set to `null` so that it is inherited when reading. Similarly, the `_row_id` field for an added row is set to `null` and assigned when reading. | ||
|
|
||
| A data file with only new rows for the table may omit the `_last_updated_sequence_number` and `_row_id`. When reading, such files must set both columns to null for all rows. |
There was a problem hiding this comment.
If we allow writers to skip these fields in new data files, does it mean we loose the ability to distinguish a data file added prior to enabling row lineage and a data file with all new rows? Do we even care about that?
There was a problem hiding this comment.
Probably not, we still have first_row_id on the data file. Never mind.
|
|
||
| When a row is added or modified, the `_last_updated_sequence_number` field is set to `null` so that it is inherited when reading. Similarly, the `_row_id` field for an added row is set to `null` and assigned when reading. | ||
|
|
||
| A data file with only new rows for the table may omit the `_last_updated_sequence_number` and `_row_id`. When reading, such files must set both columns to null for all rows. |
There was a problem hiding this comment.
such files must set both column -> shouldn't this be the reader's responsibility?
There was a problem hiding this comment.
Yes, let me reword that
|
@rdblue Ping everyone, we've had the vote on the Mailing list and I'd like to wrap this up and merge soon if possible. Please ping with any additional feedback |
|
+1 to merge this since the vote has passed. We can do minor cleanup as we go right? |
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
I'm good with the spec definition here as is, if there's stylistic/formatting cleanup we could do follow ups.
sumedhsakdeo
left a comment
There was a problem hiding this comment.
Thanks @RussellSpitzer for a very clear description of updates to the spec for row lineage. It looks great! Left some questions for my clarifications.
|
|
||
| When a row is added or modified, the `_last_updated_sequence_number` field is set to `null` so that it is inherited when reading. Similarly, the `_row_id` field for an added row is set to `null` and assigned when reading. | ||
|
|
||
| A data file with only new rows for the table may omit the `_last_updated_sequence_number` and `_row_id`. If the columns are missing, readers should treat both columns as if they exist and are set to null for all rows. |
There was a problem hiding this comment.
readers should treat both columns as if they exist and are set to null for all rows
Clarifying, if we are also saying here, that _last_updated_sequence_number and _row_id are reserved column names in a table created with v3 spec.
There was a problem hiding this comment.
There are no reserved column names, only IDs. And this does update the table of reserved IDs.
|
|
||
| Values for `_row_id` and `_last_updated_sequence_number` are either read from the data file or assigned at read time. As a result on read, rows in a table always have non-null values for these fields when lineage is enabled. | ||
|
|
||
| When an existing row is moved to a different data file for any reason, writers are required to write `_row_id` and `_last_updated_sequence_number` according to the following rules: |
There was a problem hiding this comment.
When a user does INSERT OVERWRITE of an entire partition / table, some rows might be overwritten implicitly, as the operation is not copy-on-write, unlike MERGE INTO or UPDATE. For such cases, are we saying the rows are treated as new rows, and existing row _row_id or _last_updated_sequence_number will not be carried forward?
There was a problem hiding this comment.
Yes. INSERT OVERWRITE is an INSERT and the rows should be treated as new rows.
I know there are cases here where users have historically built patterns around INSERT OVERWRITE when MERGE was not available. For example, the read-union-overwrite pattern was heavily used at Netflix to add new data to existing partitions. The problem is that engines can't detect the intent and carry row information through. These patterns also can't be optimized by engines, so I think the best choice is to use the INSERT semantics here.
In addition, the Iceberg community has discouraged using INSERT OVERWRITE for years because of the challenges with implicit data overwrites. Implicitly overwriting a directory of data means that the physical layout needs to implicitly align with writes. That's not a good pattern to use.
|
|
||
| Files `data2` and `data3` are written with `null` for `first_row_id` and are assigned `first_row_id` at read time based on the manifest's `first_row_id` and the `record_count` of previously listed ADDED files in this manifest: (1,000 + 0) and (1,000 + 50). | ||
|
|
||
| When the new snapshot is committed, the table's `next-row-id` must also be updated (even if the new snapshot is not in the main branch). Because 225 rows were added (`added1`: 100 + `added2`: 0 + `added3`: 125), the new value is 1,000 + 225 = 1,225: |
There was a problem hiding this comment.
+1, examples made it easy to follow.
| All files that were added before `row-lineage` was enabled should propagate null for all of the `row-lineage` related | ||
| fields. The values for `_row_id` and `_last_updated_sequence_number` should always return null and when these rows are copied, | ||
| null should be explicitly written. After this point, rows are treated as if they were just created | ||
| and assigned `row_id` and `_last_updated_sequence_number` as if they were new rows. |
There was a problem hiding this comment.
For completeness, should we add line on expected behavior if disabling row lineage after enabling it for some time.
There was a problem hiding this comment.
It is not possible to disable row lineage after enabling it.
There was a problem hiding this comment.
Should we call that out in the spec? I know it is merged already. May be a followup PR?
|
Merged! Thanks for the awesome work on this, @RussellSpitzer! |
Co-authored-by: Ryan Blue <blue@apache.org>
Proposal Here :
https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit#heading=h.f2e8ffw3fu7n
Adds Row Lineage to the Spec
End goal is to provide two fields to all rows
_row_ida unique long which identifies every row added to the table_last_updatethe sequence number of the last commit to modify the rowFixes #11129