You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/12 00:21:27 UTC

[GitHub] [iceberg] rdblue opened a new pull request #2805: Spec: Add back distinct_counts in data_file metadata

rdblue opened a new pull request #2805:
URL: https://github.com/apache/iceberg/pull/2805


   This removes undeprecates `distinct_counts` in the `data_file` struct stored in manifests. While these counts are difficult to use, discussion from the dev list indicates that having them is better than having no data at all. This also tightens the description to clarify that the distinct counts must be produced from values in the file and not by merging distinct counts from other sources of metadata.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] danielcweeks commented on a change in pull request #2805: Spec: Add back distinct_counts in data_file metadata

Posted by GitBox <gi...@apache.org>.

danielcweeks commented on a change in pull request #2805:
URL: https://github.com/apache/iceberg/pull/2805#discussion_r668134476



##########
File path: site/docs/spec.md
##########
@@ -375,7 +375,7 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | _optional_ | _optional_ | **`109  value_counts`**           | `map<119: int, 120: long>`   | Map from column id to number of values in the column (including null and NaN values) |
 | _optional_ | _optional_ | **`110  null_value_counts`**      | `map<121: int, 122: long>`   | Map from column id to number of null values in the column |
 | _optional_ | _optional_ | **`137  nan_value_counts`**       | `map<138: int, 139: long>`   | Map from column id to number of NaN values in the column |
-| _optional_ |            | ~~**`111 distinct_counts`**~~     | `map<123: int, 124: long>`   | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`111  distinct_counts`**        | `map<123: int, 124: long>`   | Map from column id to number of distinct values in the column; distinct counts must be produced using values in the file, not on merged counts from other metadata |

Review comment:
       We probably don't need this qualification in the spec `not on merged counts from other metadata`.  The spec should be clear enough that it should reflect the distinct count in the file (anything else is out of spec by definition).  You might want to add that if an operation is performed that would compromise the correctness of the value, the field should be dropped.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2805: Spec: Add back distinct_counts in data_file metadata

Posted by GitBox <gi...@apache.org>.

jackye1995 commented on a change in pull request #2805:
URL: https://github.com/apache/iceberg/pull/2805#discussion_r668523084



##########
File path: site/docs/spec.md
##########
@@ -375,7 +375,7 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | _optional_ | _optional_ | **`109  value_counts`**           | `map<119: int, 120: long>`   | Map from column id to number of values in the column (including null and NaN values) |
 | _optional_ | _optional_ | **`110  null_value_counts`**      | `map<121: int, 122: long>`   | Map from column id to number of null values in the column |
 | _optional_ | _optional_ | **`137  nan_value_counts`**       | `map<138: int, 139: long>`   | Map from column id to number of NaN values in the column |
-| _optional_ |            | ~~**`111 distinct_counts`**~~     | `map<123: int, 124: long>`   | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`111  distinct_counts`**        | `map<123: int, 124: long>`   | Map from column id to number of distinct values in the column; distinct counts must be produced using values in the file, not on merged counts from other metadata |

Review comment:
       I agree that it is better to state that it should not be derived by merging counts because it is easy to go towards that route. But it feels to me that the definition is a bit too specific. What I am thinking is to say that:
   
   > distinct counts must be derived using values in the file. This can be done through methods like counting or sketch, but not through merging counts from other metadata.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2805: Spec: Add back distinct_counts in data_file metadata

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2805:
URL: https://github.com/apache/iceberg/pull/2805#discussion_r676191184



##########
File path: site/docs/spec.md
##########
@@ -375,7 +375,7 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | _optional_ | _optional_ | **`109  value_counts`**           | `map<119: int, 120: long>`   | Map from column id to number of values in the column (including null and NaN values) |
 | _optional_ | _optional_ | **`110  null_value_counts`**      | `map<121: int, 122: long>`   | Map from column id to number of null values in the column |
 | _optional_ | _optional_ | **`137  nan_value_counts`**       | `map<138: int, 139: long>`   | Map from column id to number of NaN values in the column |
-| _optional_ |            | ~~**`111 distinct_counts`**~~     | `map<123: int, 124: long>`   | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`111  distinct_counts`**        | `map<123: int, 124: long>`   | Map from column id to number of distinct values in the column; distinct counts must be produced using values in the file, not on merged counts from other metadata |

Review comment:
       I updated this to the wording that @jackye1995 suggested, but a bit more condensed.
   
   I also removed the line that @nastra found.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a change in pull request #2805: Spec: Add back distinct_counts in data_file metadata

Posted by GitBox <gi...@apache.org>.

nastra commented on a change in pull request #2805:
URL: https://github.com/apache/iceberg/pull/2805#discussion_r670424323



##########
File path: site/docs/spec.md
##########
@@ -375,7 +375,7 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | _optional_ | _optional_ | **`109  value_counts`**           | `map<119: int, 120: long>`   | Map from column id to number of values in the column (including null and NaN values) |
 | _optional_ | _optional_ | **`110  null_value_counts`**      | `map<121: int, 122: long>`   | Map from column id to number of null values in the column |
 | _optional_ | _optional_ | **`137  nan_value_counts`**       | `map<138: int, 139: long>`   | Map from column id to number of NaN values in the column |
-| _optional_ |            | ~~**`111 distinct_counts`**~~     | `map<123: int, 124: long>`   | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`111  distinct_counts`**        | `map<123: int, 124: long>`   | Map from column id to number of distinct values in the column; distinct counts must be produced using values in the file, not on merged counts from other metadata |

Review comment:
       @rdblue I think the line `distinct_counts was removed` in https://github.com/apache/iceberg/blob/072a86a9d83d6c43d7e19cef0bbba80a94714794/site/docs/spec.md#L1073 needs to be removed as well




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] electrum commented on a change in pull request #2805: Spec: Add back distinct_counts in data_file metadata

Posted by GitBox <gi...@apache.org>.

electrum commented on a change in pull request #2805:
URL: https://github.com/apache/iceberg/pull/2805#discussion_r678715686



##########
File path: site/docs/spec.md
##########
@@ -375,7 +375,7 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | _optional_ | _optional_ | **`109  value_counts`**           | `map<119: int, 120: long>`   | Map from column id to number of values in the column (including null and NaN values) |
 | _optional_ | _optional_ | **`110  null_value_counts`**      | `map<121: int, 122: long>`   | Map from column id to number of null values in the column |
 | _optional_ | _optional_ | **`137  nan_value_counts`**       | `map<138: int, 139: long>`   | Map from column id to number of NaN values in the column |
-| _optional_ |            | ~~**`111 distinct_counts`**~~     | `map<123: int, 124: long>`   | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`111  distinct_counts`**        | `map<123: int, 124: long>`   | Map from column id to number of distinct values in the column; distinct counts must be produced using values in the file, not on merged counts from other metadata |

Review comment:
       Should the spec explicitly say "approximate number of distinct values"? It says that implicitly by saying "sketches", but it's not obvious at a glance.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] findepi commented on a change in pull request #2805: Spec: Add back distinct_counts in data_file metadata

Posted by GitBox <gi...@apache.org>.

findepi commented on a change in pull request #2805:
URL: https://github.com/apache/iceberg/pull/2805#discussion_r678914504



##########
File path: site/docs/spec.md
##########
@@ -375,7 +375,7 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | _optional_ | _optional_ | **`109  value_counts`**           | `map<119: int, 120: long>`   | Map from column id to number of values in the column (including null and NaN values) |
 | _optional_ | _optional_ | **`110  null_value_counts`**      | `map<121: int, 122: long>`   | Map from column id to number of null values in the column |
 | _optional_ | _optional_ | **`137  nan_value_counts`**       | `map<138: int, 139: long>`   | Map from column id to number of NaN values in the column |
-| _optional_ |            | ~~**`111 distinct_counts`**~~     | `map<123: int, 124: long>`   | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`111  distinct_counts`**        | `map<123: int, 124: long>`   | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts |

Review comment:
       Can we clarify "distinct values" -> "distinct non-null values"?
   
   (some query engines include NULL within NDV and some do not)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2805: Spec: Add back distinct_counts in data_file metadata

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2805:
URL: https://github.com/apache/iceberg/pull/2805#discussion_r668347737



##########
File path: site/docs/spec.md
##########
@@ -375,7 +375,7 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | _optional_ | _optional_ | **`109  value_counts`**           | `map<119: int, 120: long>`   | Map from column id to number of values in the column (including null and NaN values) |
 | _optional_ | _optional_ | **`110  null_value_counts`**      | `map<121: int, 122: long>`   | Map from column id to number of null values in the column |
 | _optional_ | _optional_ | **`137  nan_value_counts`**       | `map<138: int, 139: long>`   | Map from column id to number of NaN values in the column |
-| _optional_ |            | ~~**`111 distinct_counts`**~~     | `map<123: int, 124: long>`   | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`111  distinct_counts`**        | `map<123: int, 124: long>`   | Map from column id to number of distinct values in the column; distinct counts must be produced using values in the file, not on merged counts from other metadata |

Review comment:
       I think it is a bit more clear to state that this can't be a merged count. If someone is compacting data files without reading the data, like merging Parquet row groups without re-encoding then it may seem reasonable to merge the counts somehow. I'm not sure that people would consider merging an operation that doesn't reflect the distinct count in the file, that's why I mentioned it specifically.
   
   Is there anything else that you're including in cases that would not "reflect the distinct count in the file"? I'm trying to use language that allows non-exact values. I think we should be in agreement that sketch-based counts are okay.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] danielcweeks merged pull request #2805: Spec: Add back distinct_counts in data_file metadata

Posted by GitBox <gi...@apache.org>.

danielcweeks merged pull request #2805:
URL: https://github.com/apache/iceberg/pull/2805


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #2805: Spec: Add back distinct_counts in data_file metadata

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #2805:
URL: https://github.com/apache/iceberg/pull/2805#issuecomment-877887232


   @jackye1995, you may want to take a look at this as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org