Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/13 08:02:28 UTC

[GitHub] [iceberg] jackye1995 commented on a change in pull request #2805: Spec: Add back distinct_counts in data_file metadata

jackye1995 commented on a change in pull request #2805:
URL: https://github.com/apache/iceberg/pull/2805#discussion_r668523084



##########
File path: site/docs/spec.md
##########
@@ -375,7 +375,7 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | _optional_ | _optional_ | **`109  value_counts`**           | `map<119: int, 120: long>`   | Map from column id to number of values in the column (including null and NaN values) |
 | _optional_ | _optional_ | **`110  null_value_counts`**      | `map<121: int, 122: long>`   | Map from column id to number of null values in the column |
 | _optional_ | _optional_ | **`137  nan_value_counts`**       | `map<138: int, 139: long>`   | Map from column id to number of NaN values in the column |
-| _optional_ |            | ~~**`111 distinct_counts`**~~     | `map<123: int, 124: long>`   | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`111  distinct_counts`**        | `map<123: int, 124: long>`   | Map from column id to number of distinct values in the column; distinct counts must be produced using values in the file, not on merged counts from other metadata |

Review comment:
       I agree that it is better to state explicitly that distinct counts should not be derived by merging counts, because that is an easy mistake to make. However, the proposed definition feels a bit too specific to me. I would suggest wording it as:
   
   > distinct counts must be derived using values in the file. This can be done through methods like counting or sketching, but not through merging counts from other metadata.
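
To illustrate why merged counts cannot stand in for a count derived from the values themselves, here is a minimal Python sketch. The file contents and the `distinct_count` helper are hypothetical and exist only for this example; they are not part of the spec or of any Iceberg API.

```python
def distinct_count(values):
    """Exact distinct count derived directly from the values of one file."""
    return len(set(values))


# Hypothetical column values from two data files that share some values.
file_a = ["x", "y", "z"]
file_b = ["y", "z", "w"]

count_a = distinct_count(file_a)  # derived from file_a's own values
count_b = distinct_count(file_b)  # derived from file_b's own values

# Attempting to "merge" the per-file counts without looking at the values:
merged_by_sum = count_a + count_b      # overstates when files overlap
merged_by_max = max(count_a, count_b)  # understates when files differ

# A count derived from the values in the combined data:
true_count = distinct_count(file_a + file_b)

print(count_a, count_b, merged_by_sum, merged_by_max, true_count)
```

Neither the sum nor the maximum of the per-file counts matches the count computed from the values, because a plain count carries no information about which values overlap. Counting the values (or building a sketch from them) avoids this.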




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


