Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/17 16:14:33 UTC

[GitHub] [iceberg] findepi commented on a diff in pull request #6582: Add a Spark procedure to collect NDV

findepi commented on code in PR #6582:
URL: https://github.com/apache/iceberg/pull/6582#discussion_r1072412666


##########
core/src/main/java/org/apache/iceberg/puffin/StandardBlobTypes.java:
##########
@@ -26,4 +26,6 @@ private StandardBlobTypes() {}
    * href="https://datasketches.apache.org/">Apache DataSketches</a> library
    */
   public static final String APACHE_DATASKETCHES_THETA_V1 = "apache-datasketches-theta-v1";
+
+  public static final String NDV_BLOB = "ndv-blob";

Review Comment:
   > Spark doesn't use Apache DataSketches to collect approximate NDV
   
   Same was true for Trino. Trino uses HLL by default.
   I introduced DataSketches Theta aggregation so that we can be compatible. For that I had to revamp the stats collection SPI so that connectors can request the desired sketch format. Previously it was hard-coded: a connector that wanted NDV information got an HLL sketch.
   
   For NDV information, without an updatable sketch, we shouldn't use a blob at all. The NDV number is just a property of the actual updatable sketch stored in the Puffin file. For a POC, you can use a fake blob with empty content and the associated NDV number as its property. Just give it a blob type name that makes clear it's a temporary fake. For production use, we want to write Theta sketches.
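
A minimal sketch of the POC idea above, not the actual Iceberg Puffin API: `PlaceholderBlob`, `placeholder`, `ndvOf`, and the type name `ndv-placeholder-temporary` are all hypothetical names made up for illustration. The `ndv` property key is assumed to mirror the property that Theta sketch blobs carry in their metadata. The point is only that the blob content stays empty and the NDV number travels as a property.

```java
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class NdvPlaceholderBlob {
  // Hypothetical type name; chosen only to signal that the blob is a temporary stand-in.
  static final String PLACEHOLDER_TYPE = "ndv-placeholder-temporary";

  // Assumption: same key as the NDV property attached to Theta sketch blobs.
  static final String NDV_PROPERTY = "ndv";

  // Hypothetical, simplified model of a blob plus its metadata (not the Iceberg class).
  static final class PlaceholderBlob {
    final String type;
    final List<Integer> fields;
    final ByteBuffer content;
    final Map<String, String> properties;

    PlaceholderBlob(
        String type, List<Integer> fields, ByteBuffer content, Map<String, String> properties) {
      this.type = type;
      this.fields = fields;
      this.content = content;
      this.properties = properties;
    }
  }

  // Builds a blob with zero bytes of content; the NDV estimate lives only in the properties.
  static PlaceholderBlob placeholder(int fieldId, long ndv) {
    return new PlaceholderBlob(
        PLACEHOLDER_TYPE,
        Collections.singletonList(fieldId),
        ByteBuffer.allocate(0),
        Collections.singletonMap(NDV_PROPERTY, Long.toString(ndv)));
  }

  // Reads the NDV estimate back from the blob properties.
  static long ndvOf(PlaceholderBlob blob) {
    return Long.parseLong(blob.properties.get(NDV_PROPERTY));
  }

  public static void main(String[] args) {
    PlaceholderBlob blob = placeholder(1, 12345L);
    System.out.println(
        blob.type + " ndv=" + ndvOf(blob) + " bytes=" + blob.content.remaining());
  }
}
```

A reader that only needs the NDV number can use the property and ignore the content, so swapping the placeholder for a real Theta sketch blob later would not change how the number is read.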
   
   cc @rdblue 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

