Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/06/24 12:36:17 UTC

[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #4945: Add table spec changes for statistics information in table snapshot

ajantha-bhat commented on code in PR #4945:
URL: https://github.com/apache/iceberg/pull/4945#discussion_r906020375


##########
format/spec.md:
##########
@@ -631,6 +632,30 @@ When expiring snapshots, retention policies in table and snapshot references are
     2. The snapshot is not one of the first `min-snapshots-to-keep` in the branch (including the branch's referenced snapshot)
 5. Expire any snapshot not in the set of snapshots to retain.
 
+#### Statistics file
+
+Statistics files are valid [Puffin files](../puffin-spec). Statistics are informational. A reader can choose to

Review Comment:
   Do we need to add one more abstraction layer for `Statistics`?
   
   Consider partition stats: if I store row count and file count stats for each partition value, there can be millions of rows.
   Puffin may not be efficient for that.
   
   So, can we have an abstraction for statistics metadata,
   where stats like sketches are stored in Puffin and simple stats are stored in Parquet, HFile, or some other format?
   I can also see there being more than one stats file per snapshot (one in Puffin format and others in other formats).
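
To make the idea concrete, here is a rough sketch of what such an abstraction could look like. Every name in it (`StatisticsFile`, `StatisticsMetadata`, the `file_format` and `kind` values, the file paths) is hypothetical and made up for illustration; none of it comes from the spec or from any existing Iceberg API. It only shows one snapshot pointing at several statistics files in different formats.

```python
from collections import defaultdict
from dataclasses import dataclass, field


# Hypothetical model: these names are illustrative, not from the Iceberg spec.
@dataclass(frozen=True)
class StatisticsFile:
    snapshot_id: int
    path: str          # made-up example paths below
    file_format: str   # assumed values, e.g. "puffin" or "parquet"
    kind: str          # assumed values, e.g. "sketch" or "partition-stats"


@dataclass
class StatisticsMetadata:
    # snapshot id -> every statistics file attached to that snapshot
    files_by_snapshot: dict = field(default_factory=lambda: defaultdict(list))

    def add(self, stats_file):
        self.files_by_snapshot[stats_file.snapshot_id].append(stats_file)

    def files_for(self, snapshot_id, file_format=None):
        # Return all files for the snapshot, optionally filtered by format.
        files = self.files_by_snapshot.get(snapshot_id, [])
        if file_format is None:
            return list(files)
        return [f for f in files if f.file_format == file_format]


# One snapshot with a Puffin file for sketches and a Parquet file for
# per-partition row/file counts.
metadata = StatisticsMetadata()
metadata.add(StatisticsFile(1, "stats/sketches.puffin", "puffin", "sketch"))
metadata.add(StatisticsFile(1, "stats/partitions.parquet", "parquet", "partition-stats"))
```

With something like this, a reader that only understands Puffin could call `metadata.files_for(1, "puffin")` and ignore the rest, which fits the statement in the diff that statistics are informational.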
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


For additional commands, e-mail: issues-help@iceberg.apache.org