You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/31 22:38:56 UTC

[GitHub] [iceberg] rdblue opened a new pull request #2654: Update spec for v2 changes

rdblue opened a new pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654


   This updates the spec for v2 changes that are in master or pull requests:
   * Adds `identifier-field-ids` to schema (#2354)
   * Adds partition evolution section (#1025)
   * Adds `schemas` and `current-schema-id` to table metadata
   * Adds `key_metadata` to manifest lists
   
   Closes #1141.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#discussion_r652201301



##########
File path: site/docs/spec.md
##########
@@ -176,6 +178,15 @@ Columns in Iceberg data files are selected by field id. The table schema's colum
 For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.
 
 
+#### Identifier Field IDs
+
+A schema can optionally track the set of primitive fields that identify rows in a table, using the property `identifier-field-ids` (see JSON encoding in Appendix C).
+
+Two rows are the "same"---that is, the rows represent the same entity---if the identifier fields are equal. However, uniqueness of rows by this identifier is not guaranteed or required by Iceberg because it is needlessly inefficient in some cases. Reasonable steps to enforce uniqueness are the responsibility of processing engines.

Review comment:
       Updated to use your suggested wording.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] danielcweeks commented on a change in pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

danielcweeks commented on a change in pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#discussion_r651439115



##########
File path: site/docs/spec.md
##########
@@ -176,6 +178,15 @@ Columns in Iceberg data files are selected by field id. The table schema's colum
 For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.
 
 
+#### Identifier Field IDs
+
+A schema can optionally track the set of primitive fields that identify rows in a table, using the property `identifier-field-ids` (see JSON encoding in Appendix C).
+
+Two rows are the "same"---that is, the rows represent the same entity---if the identifier fields are equal. However, uniqueness of rows by this identifier is not guaranteed or required by Iceberg because it is needlessly inefficient in some cases. Reasonable steps to enforce uniqueness are the responsibility of processing engines.
+
+Optional fields can be used as identifier fields. Identifier fields may be nested in structs but may not be nested within maps or lists; a nested identifier field is considered null if it or any parent struct is null. For identifier comparison, null is considered equal to itself.

Review comment:
       I assume that we treat `null` and undefined in the same way (e.g. map value is null vs. map key is not present)?  Seems like we might want to clarify that behavior at this point.  Also not sure if there should be special consideration for other values like `nan` or `+/- infinity` 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#discussion_r652032390



##########
File path: site/docs/spec.md
##########
@@ -176,6 +178,15 @@ Columns in Iceberg data files are selected by field id. The table schema's colum
 For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.
 
 
+#### Identifier Field IDs
+
+A schema can optionally track the set of primitive fields that identify rows in a table, using the property `identifier-field-ids` (see JSON encoding in Appendix C).
+
+Two rows are the "same"---that is, the rows represent the same entity---if the identifier fields are equal. However, uniqueness of rows by this identifier is not guaranteed or required by Iceberg because it is needlessly inefficient in some cases. Reasonable steps to enforce uniqueness are the responsibility of processing engines.
+
+Optional fields can be used as identifier fields. Identifier fields may be nested in structs but may not be nested within maps or lists; a nested identifier field is considered null if it or any parent struct is null. For identifier comparison, null is considered equal to itself.

Review comment:
       I'll change it so that only required columns are allowed. FYI @jackye1995, we'll need to update the implementation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#issuecomment-861162434


   This looks great to me but optional identity columns does not seem right. Do we have any systems where that's possible?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on a change in pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#discussion_r651440289



##########
File path: site/docs/spec.md
##########
@@ -254,6 +268,23 @@ Notes:
 2. The width, `W`, used to truncate decimal values is applied using the scale of the decimal column to avoid additional (and potentially conflicting) parameters.
 
 
+#### Partition Evolution
+
+Table partitioning can be evolved by adding, removing, renaming, or reordering partition spec fields.
+
+Changing a partition spec produces a new spec identified by a unique spec ID that is added to the table's list of partition specs and may be set as the table's default spec.
+
+When evolving a spec, changes should not cause partition field IDs to change because the partition field IDs are used as the partition tuple field IDs in manifest files.
+
+In v2, partition field IDs must be explicitly tracked for each partition field. New IDs are assigned based on the last assigned partition ID in table metadata.
+
+In v1, partition field IDs were not tracked, but were assigned sequentially starting at 1000 in the reference implementation. This assignment caused problems when reading metadata tables based on manifest files from multiple specs because partition fields with the same ID may contain different data types. For compatibility with old versions, the following rules are recommended for partition evolution in v1 tables:
+
+1. Do not reorder partition fields

Review comment:
       Good summary, thanks for doing this!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#issuecomment-861717269


   > . . . optional identity columns does not seem right. Do we have any systems where that's possible?
   
   I'm okay removing the ability to use optional columns. That's what @jackye1995 had originally and it aligns better with the SQL spec.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#discussion_r652031438



##########
File path: site/docs/spec.md
##########
@@ -176,6 +178,15 @@ Columns in Iceberg data files are selected by field id. The table schema's colum
 For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.
 
 
+#### Identifier Field IDs
+
+A schema can optionally track the set of primitive fields that identify rows in a table, using the property `identifier-field-ids` (see JSON encoding in Appendix C).
+
+Two rows are the "same"---that is, the rows represent the same entity---if the identifier fields are equal. However, uniqueness of rows by this identifier is not guaranteed or required by Iceberg because it is needlessly inefficient in some cases. Reasonable steps to enforce uniqueness are the responsibility of processing engines.
+
+Optional fields can be used as identifier fields. Identifier fields may be nested in structs but may not be nested within maps or lists; a nested identifier field is considered null if it or any parent struct is null. For identifier comparison, null is considered equal to itself.

Review comment:
       Good point about NaN. I think we should consider all NaN values equal, even signaling NaNs.
   
   For optional, I think we will probably remove this because it causes problems with SQL booleans. That limits use to only existing required columns until fields can have default values, but it is probably worth it to avoid SQL boolean confusion.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#discussion_r652200065



##########
File path: site/docs/spec.md
##########
@@ -176,6 +178,15 @@ Columns in Iceberg data files are selected by field id. The table schema's colum
 For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.
 
 
+#### Identifier Field IDs
+
+A schema can optionally track the set of primitive fields that identify rows in a table, using the property `identifier-field-ids` (see JSON encoding in Appendix C).
+
+Two rows are the "same"---that is, the rows represent the same entity---if the identifier fields are equal. However, uniqueness of rows by this identifier is not guaranteed or required by Iceberg because it is needlessly inefficient in some cases. Reasonable steps to enforce uniqueness are the responsibility of processing engines.
+
+Optional fields can be used as identifier fields. Identifier fields may be nested in structs but may not be nested within maps or lists; a nested identifier field is considered null if it or any parent struct is null. For identifier comparison, null is considered equal to itself.

Review comment:
       I'm going to remove this. My originally logic was that we the feature is limited if we can only add required fields, but that is only because we don't currently support default values. If we supported default values, then I think we would choose to avoid the ambiguity and only allow required columns. Since we plan to add default values, I think that it makes less sense to work around it by allowing optional fields here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] stevenzwu commented on a change in pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

stevenzwu commented on a change in pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#discussion_r643612208



##########
File path: site/docs/spec.md
##########
@@ -495,10 +528,12 @@ Table metadata consists of the following fields:
 |            | _required_ | **`last-sequence-number`**| The table's highest assigned sequence number, a monotonically increasing long that tracks the order of snapshots in a table. |
 | _required_ | _required_ | **`last-updated-ms`**| Timestamp in milliseconds from the unix epoch when the table was last updated. Each table metadata file should update this field just before writing. |
 | _required_ | _required_ | **`last-column-id`**| An integer; the highest assigned column ID for the table. This is used to ensure columns are always assigned an unused ID when evolving schemas. |
-| _required_ | _required_ | **`schema`**| The table’s current schema. |
+| _required_ | _required_ | **`schema`**| The table’s current schema. In v2, this must be the schema identified by the `current-schema-id`. |
+| _optional_ | _required_ | **`schemas`**| A list of schemas, stored as objects with `schema-id`. |
+| _optional_ | _required_ | **`current-schema-id`**| ID of the table's current schema. |
 | _required_ |            | **`partition-spec`**| The table’s current partition spec, stored as only fields. Note that this is used by writers to partition data, but is not used when reading because reads use the specs stored in manifest files. (**Deprecated**: use `partition-specs` and `default-spec-id`instead ) |
 | _optional_ | _required_ | **`partition-specs`**| A list of partition specs, stored as full partition spec objects. |
-| _optional_ | _required_ | **`default-spec-id`**| ID of the “current” spec that writers should use by default. |
+| _optional_ | _required_ | **`default-spec-id`**| ID of the "current" spec that writers should use by default. |

Review comment:
       nit: a little bit inconsistent here. `current-schema-id` vs `default-spec-id`. I guess we can't rename `default-spec-id` for compatibility reason.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#issuecomment-851706224


   @yyanyy, @aokolnychyi, and @openinx, can you take a look at these spec changes? This documents most of the remaining changes for v2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#discussion_r644081533



##########
File path: site/docs/spec.md
##########
@@ -495,10 +528,12 @@ Table metadata consists of the following fields:
 |            | _required_ | **`last-sequence-number`**| The table's highest assigned sequence number, a monotonically increasing long that tracks the order of snapshots in a table. |
 | _required_ | _required_ | **`last-updated-ms`**| Timestamp in milliseconds from the unix epoch when the table was last updated. Each table metadata file should update this field just before writing. |
 | _required_ | _required_ | **`last-column-id`**| An integer; the highest assigned column ID for the table. This is used to ensure columns are always assigned an unused ID when evolving schemas. |
-| _required_ | _required_ | **`schema`**| The table’s current schema. |
+| _required_ | _required_ | **`schema`**| The table’s current schema. In v2, this must be the schema identified by the `current-schema-id`. |
+| _optional_ | _required_ | **`schemas`**| A list of schemas, stored as objects with `schema-id`. |
+| _optional_ | _required_ | **`current-schema-id`**| ID of the table's current schema. |
 | _required_ |            | **`partition-spec`**| The table’s current partition spec, stored as only fields. Note that this is used by writers to partition data, but is not used when reading because reads use the specs stored in manifest files. (**Deprecated**: use `partition-specs` and `default-spec-id`instead ) |
 | _optional_ | _required_ | **`partition-specs`**| A list of partition specs, stored as full partition spec objects. |
-| _optional_ | _required_ | **`default-spec-id`**| ID of the “current” spec that writers should use by default. |
+| _optional_ | _required_ | **`default-spec-id`**| ID of the "current" spec that writers should use by default. |

Review comment:
       This distinction is on purpose. A table can have only one schema and that is the current one. We track old schemas for older snapshots. But a table can have multiple valid partition specs and it is fine to write new data into either one. That's why we track a "default" spec to use when writing if you aren't doing something that overrides it like migrating data from one spec to another.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on a change in pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#discussion_r651436461



##########
File path: site/docs/spec.md
##########
@@ -176,6 +178,15 @@ Columns in Iceberg data files are selected by field id. The table schema's colum
 For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.
 
 
+#### Identifier Field IDs
+
+A schema can optionally track the set of primitive fields that identify rows in a table, using the property `identifier-field-ids` (see JSON encoding in Appendix C).
+
+Two rows are the "same"---that is, the rows represent the same entity---if the identifier fields are equal. However, uniqueness of rows by this identifier is not guaranteed or required by Iceberg because it is needlessly inefficient in some cases. Reasonable steps to enforce uniqueness are the responsibility of processing engines.
+
+Optional fields can be used as identifier fields. Identifier fields may be nested in structs but may not be nested within maps or lists; a nested identifier field is considered null if it or any parent struct is null. For identifier comparison, null is considered equal to itself.

Review comment:
       This seems a bit controversial as columns in a primary key are never nullable (even though identity fields are not technically a primary key) and null is never equal to null in SQL.
   
   Would it be reasonable to follow the SQL standard and require the identity columns as well as columns used in equality deletes be always not null? My worry is that it is going to be tricky if we interpret this differently from SQL. What if I upsert by (1, 'a', null)? I think that row should match nothing from the SQL perspective.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#issuecomment-861880524


   I think my confusion comes from the fact that PK means unique and not null and that having null in any column in the tuple we upsert by, means that tuple matches nothing from the SQL perspective. It feels easier to assume columns we use in equality deletes are not nullable. I agree about the downside of not being able to use optional columns added later but I am not sure how important that is. We can probably cover that use case with default values that LinkedIn is working on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] danielcweeks commented on a change in pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

danielcweeks commented on a change in pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#discussion_r651437984



##########
File path: site/docs/spec.md
##########
@@ -176,6 +178,15 @@ Columns in Iceberg data files are selected by field id. The table schema's colum
 For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.
 
 
+#### Identifier Field IDs
+
+A schema can optionally track the set of primitive fields that identify rows in a table, using the property `identifier-field-ids` (see JSON encoding in Appendix C).
+
+Two rows are the "same"---that is, the rows represent the same entity---if the identifier fields are equal. However, uniqueness of rows by this identifier is not guaranteed or required by Iceberg because it is needlessly inefficient in some cases. Reasonable steps to enforce uniqueness are the responsibility of processing engines.

Review comment:
       I feel like we should change the wording of the second sentence.  I don't think "needlessly inefficient" is a great way to represent what we're trying to convey.  Maybe just:  However, uniqueness of rows by this identifier is not guaranteed or required by Iceberg and is the responsibility of processing engines or data producers to enforce.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#issuecomment-861728965


   trying to remember why we decided to make it optional: 
   
   https://github.com/apache/iceberg/pull/2465#discussion_r615313057
   
   My current thinking is that it does not hurt to make it optional. If the engine enforces the primary key to be not nullable, it will fail in that engine during CREATE TABLE anyway, just like what we see in Flink.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

aokolnychyi edited a comment on pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#issuecomment-861880524


   I think my confusion comes from the fact that PK means unique and not null and that having null in any column in the tuple we upsert by, means that tuple matches nothing from the SQL perspective. It feels easier to assume columns we use in equality deletes are not nullable. I agree about the downside of not being able to use optional columns added later as identity columns but I am not sure how important that is. We can probably cover that use case with default values that LinkedIn is working on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2654: Update spec for v2 changes

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2654:
URL: https://github.com/apache/iceberg/pull/2654#discussion_r652201213



##########
File path: site/docs/spec.md
##########
@@ -176,6 +178,15 @@ Columns in Iceberg data files are selected by field id. The table schema's colum
 For example, a file may be written with schema `1: a int, 2: b string, 3: c double` and read using projection schema `3: measurement, 2: name, 4: a`. This must select file columns `c` (renamed to `measurement`), `b` (now called `name`), and a column of `null` values called `a`; in that order.
 
 
+#### Identifier Field IDs
+
+A schema can optionally track the set of primitive fields that identify rows in a table, using the property `identifier-field-ids` (see JSON encoding in Appendix C).
+
+Two rows are the "same"---that is, the rows represent the same entity---if the identifier fields are equal. However, uniqueness of rows by this identifier is not guaranteed or required by Iceberg because it is needlessly inefficient in some cases. Reasonable steps to enforce uniqueness are the responsibility of processing engines.
+
+Optional fields can be used as identifier fields. Identifier fields may be nested in structs but may not be nested within maps or lists; a nested identifier field is considered null if it or any parent struct is null. For identifier comparison, null is considered equal to itself.

Review comment:
       I'm going to update this so that float and double columns aren't allowed. I think there are too many issues if we were to allow them and implementations would probably have bugs handling signalling NaN values and -0. We already don't allow floating points in some places where they make little sense (like bucket partitioning) so I think being conservative here is the right call.
   
   Also, I doubt there are many valid use cases for having a floating point value in a row identifier.
   
   And while I'm updating this, I'm also going to disallow float and double columns used for delete columns in equality delete files because of the same issues with signalling NaN and -0 values.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org