Posted to commits@datasketches.apache.org by "leerho (via GitHub)" <gi...@apache.org> on 2023/06/08 01:39:42 UTC

[GitHub] [datasketches-java] leerho commented on issue #446: Sketches for Histogram and NDV

leerho commented on issue #446:
URL: https://github.com/apache/datasketches-java/issues/446#issuecomment-1581761804

   If your data is in Hive and you can afford two passes over it, you could use a KLL sketch on the first pass to establish the histogram boundaries you are interested in. On the second pass, feed an array of HLL sketches, one per histogram range, so that each sketch counts only the distinct values that fall within its range.  This is a little clumsy, but it would provide reliable accuracy bounds based on the HLL configuration, and it avoids the approximation-of-approximations issue @jmalkin mentioned.
   
   Of course, the resulting histogram boundaries are also approximations, but at least you would have independent control over the accuracy of the boundaries and over the accuracy of the NDV in each bin :)
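
   The two-pass flow described above could be sketched roughly as follows. To keep the example runnable without any dependency, it uses exact stand-ins: a sorted copy of the data in place of the KLL quantiles sketch, and one `HashSet` per bin in place of each HLL sketch. The class and method names (`TwoPassHistogramNdv`, `splitPoints`, `distinctPerBin`) are made up for illustration, and the equal-rank choice of boundaries is an assumption. In a real job the stand-ins would be swapped for the library's KLL and HLL sketches, which would make both the boundaries and the per-bin counts approximate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TwoPassHistogramNdv {

    // Pass 1: derive (bins - 1) split points at equally spaced ranks.
    // Exact stand-in for querying a KLL quantiles sketch.
    // Assumes values is non-empty and bins >= 1.
    static double[] splitPoints(double[] values, int bins) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] splits = new double[bins - 1];
        for (int i = 1; i < bins; i++) {
            int idx = (int) ((long) i * sorted.length / bins);
            splits[i - 1] = sorted[idx];
        }
        return splits;
    }

    // Pass 2: route each value to its bin and track distinct values per bin.
    // Each HashSet is an exact stand-in for one HLL sketch in the array.
    // Bin i covers [splits[i-1], splits[i]); the first and last bins are open-ended.
    static long[] distinctPerBin(double[] values, double[] splits) {
        List<Set<Double>> perBin = new ArrayList<>();
        for (int i = 0; i <= splits.length; i++) {
            perBin.add(new HashSet<>());
        }
        for (double v : values) {
            int bin = 0;
            while (bin < splits.length && v >= splits[bin]) {
                bin++;
            }
            perBin.get(bin).add(v);
        }
        long[] ndv = new long[perBin.size()];
        for (int i = 0; i < ndv.length; i++) {
            ndv[i] = perBin.get(i).size();
        }
        return ndv;
    }

    public static void main(String[] args) {
        double[] data = {1, 1, 2, 3, 3, 3, 4, 5, 6, 6};
        double[] splits = splitPoints(data, 2);      // first pass
        long[] ndv = distinctPerBin(data, splits);   // second pass
        System.out.println("splits = " + Arrays.toString(splits));
        System.out.println("ndv per bin = " + Arrays.toString(ndv));
    }
}
```

   The point of the structure is that the two accuracy knobs stay separate: the first pass alone determines how good the boundaries are, and the second pass alone determines how good each bin's distinct count is.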
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


For additional commands, e-mail: commits-help@datasketches.apache.org