You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@datasketches.apache.org by "psfotis (via GitHub)" <gi...@apache.org> on 2023/01/30 18:46:02 UTC

[GitHub] [datasketches-java] psfotis opened a new issue, #425: Error Estimation on KLL sketches

psfotis opened a new issue, #425:
URL: https://github.com/apache/datasketches-java/issues/425

   Hello,
   
   I am working on a scenario where the items in an input stream are timestamps (day granularity). The stream contains 27M such items, but distinct timestamps are ~1K (spanning around three years). My goal is to estimate the #items in a given time range. To achieve this functionality, I am building a KLL sketch on the timestamps of the input stream (internally, these are converted to Unix epochs and passed in as floats to the sketch).
   
   To get then the estimate, I get the pmf from the KLL sketch by passing it as input the following split points: starting from the min timestamp the delta increments are 86400 (seconds in a day). Using the pmf, I can calculate the #items in each bin (corresponding to a day) specified by the split points. Through this histogram, I can then calculate the #items in a given time range.
   
   My issue with the KLL sketches is that the actual error can be quite large in my experiments: for KLL=200, the actual error on a 30-day time range has fluctuated between 84.5%-94% in my experiments. Increasing to K=1000, improves this performance substantially (96%-99.8% actual error). (Note that by actual error, I mean by comparing to the real #items without approximating them).
   
   I wanted to ask whether there is a way to get the error estimation from KLL sketches for a query like the above (i.e., #items in a range), and whether such errors are expected. 
   
   Happy to provide any more details if necessary. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] psfotis commented on issue #425: Error Estimation on KLL sketches

Posted by "psfotis (via GitHub)" <gi...@apache.org>.
psfotis commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1410873115

   Thanks much for the prmpt responses. I will continue checking if there is a bug on my end. In the meantime, the data that I am using is from the TPCX-AI benchmark https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp. In particular, the timestamp column of the orders datasets (scale factor = 10). The range I am looking at to estimate is the last 30 days. Regarding the pmf, I am not sure I follow why the mass of the bin should be 0.001 +- 0.0165. Care to explain this a bit more?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] leerho commented on issue #425: Error Estimation on KLL sketches

Posted by "leerho (via GitHub)" <gi...@apache.org>.
leerho commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1411026291

   By the way, if you really need to resolve 1000 SP, which is a resolution of 0.001, for generating a PMF, you would need a K of > 3910. The single-sided equivalent is > 2863,


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] AlexanderSaydakov commented on issue #425: Error Estimation on KLL sketches

Posted by "AlexanderSaydakov (via GitHub)" <gi...@apache.org>.
AlexanderSaydakov commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1410962022

   I assumed that your data spans 3 years, which is, roughly, 1000 days. Assuming approximately uniform distribution, the mass of one day should be about 1/1000. And the error with k=200 is 0.0165.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] AlexanderSaydakov commented on issue #425: Error Estimation on KLL sketches

Posted by "AlexanderSaydakov (via GitHub)" <gi...@apache.org>.
AlexanderSaydakov commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1409501180

   oh, I realized that you are talking about dividing 3 years into daily bins. that is a lot of bins for a small sketch. I think the problem is with too few samples in those bins.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] psfotis commented on issue #425: Error Estimation on KLL sketches

Posted by "psfotis (via GitHub)" <gi...@apache.org>.
psfotis commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1409365217

   I am writing the sketches using datasketches-cpp-3.5.1. The sketches are then read from the java version (using 3.3.0) (datasketches-memory version is 2.1.0). I am getting similar results when reading from the cpp version.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] AlexanderSaydakov commented on issue #425: Error Estimation on KLL sketches

Posted by "AlexanderSaydakov (via GitHub)" <gi...@apache.org>.
AlexanderSaydakov commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1409485855

   Something seems to be way off. I believe we thoroughly characterized the rank error of KLL. Of course, there is always a chance of a bug. We need some way of reproducing this preferably in a simplified scenario. Is there a chance that your error calculations are off? Perhaps you could explain how exactly you evaluate this? Could you show a simplified test?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] AlexanderSaydakov commented on issue #425: Error Estimation on KLL sketches

Posted by "AlexanderSaydakov (via GitHub)" <gi...@apache.org>.
AlexanderSaydakov commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1409521559

   Let's say, roughly we have 1000 days in your data set. So you want to split the distribution into 1000 bins and estimate the mass in each bin. Suppose your distribution is close to uniform, so we expect about 0.1% of the input to fall into one bin. So when you ask the sketch what is the estimated mass of the bin it should say 0.001 +- 0.0165 for K=200.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] leerho closed issue #425: Error Estimation on KLL sketches

Posted by "leerho (via GitHub)" <gi...@apache.org>.
leerho closed issue #425: Error Estimation on KLL sketches 
URL: https://github.com/apache/datasketches-java/issues/425


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] leerho commented on issue #425: Error Estimation on KLL sketches

Posted by "leerho (via GitHub)" <gi...@apache.org>.
leerho commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1411001020

   Fotis, after some internal discussion, we think that part of the problem here may be a misunderstanding of how to apply the single-sided and double-sided error numbers that are shown in the tables in [KLLAccuracyAndSize](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html).  (We probably need to add some text describing this in more detail, which I will attempt to do here.)  To simplify our explanation, lets assume some of the key parameters of your experiment:
   -  KLL sketch, K=200
   - 27M input items (dates)
   - 1000 evenly distributed Split Points (SP). (day dates over a period of about 3 years)
   - single-sided absolute error of 1.33%
   - double-sided absolute error of 1.65%
   
   First of all, as stated on that page the error is _absolute (or additive) rank_ error as opposed to  _multiplicative (or relative)_ error.  This is important!
   
   We use normalized ranks, which means the representation of a rank is a fraction between 0 and 1.0 or a percent between 0% and 100%.  A rank of 0.5 represents the median of all 27M items as if they were ordered.
   
   **Single-sided error query example:**
   The query r1 = getRank(q1), where q1 happened to be the median date, then r1 = 0.5 +/- 0.0133.
   
   **Double-sided error query example:**
   Now add the query r2 = getRank(q1), where q2 happened to correspond to the 90th percentile, then r2= 0.9 +/- 0.0133.
   
   But as soon as we subtract (r2-r1), the absolute error increases, because we are subtracting two random variables, to 0.0165.  The error calculation is now (r2-r1) => 0.4 +/- 0.0165.  
   
   When performing a single query the effective resolution is about 1/0.0133 or about 75 SP.
   When doing differences between two SP we have an effective resolution of 1/0.0165 about 60 SP.
   
   By specifying 1000 SP, you may be implying to yourself that you want the sketch to resolve an error of .001, which is way lower than the noise threshold of the configured sketch at K=200.  There is no harm in specifying this large number of SP, but just don't kid yourself by querying two points that are too close together!   Asking the sketch to produce the PMF histogram of all 1000 SP will surely result in a very noisy histogram that may not be useful at all.
    
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] psfotis commented on issue #425: Error Estimation on KLL sketches

Posted by "psfotis (via GitHub)" <gi...@apache.org>.
psfotis commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1409395072

   Yeah, you are right... fast typing. I meant to say that the actual error fluctuates between 6%-16%, for K=200, and improves to .2%-4% for K=1000. Numbers in the original post are 100-actual error (showing how close to the actual value the estimations are). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] jmalkin commented on issue #425: Error Estimation on KLL sketches

Posted by "jmalkin (via GitHub)" <gi...@apache.org>.
jmalkin commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1409370081

   If higher actual error is an improvement, I feel like we're using different terminology?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] leerho commented on issue #425: Error Estimation on KLL sketches

Posted by "leerho (via GitHub)" <gi...@apache.org>.
leerho commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1411047024

   There is a method in KllSketch, [getKFromEpsilon(...)](https://github.com/apache/datasketches-java/blob/3.3.X/src/main/java/org/apache/datasketches/kll/KllSketch.java#L157-L171), that you can use to calculate K from a given error epsilon and the choice of either single-sided or double-sided error.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] psfotis commented on issue #425: Error Estimation on KLL sketches

Posted by "psfotis (via GitHub)" <gi...@apache.org>.
psfotis commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1411036349

   Ah, now I see. Thanks much both for the explanation. Indeed my source of misunderstanding was the double-/single-sided error and their connections to SP. (I can also confirm that increasing the range was also providing me better accuracy, and now it makes sense why.) Now, it's also more clear on how to select K for a given number of SPs and desirable error/range. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [datasketches-java] AlexanderSaydakov commented on issue #425: Error Estimation on KLL sketches

Posted by "AlexanderSaydakov (via GitHub)" <gi...@apache.org>.
AlexanderSaydakov commented on issue #425:
URL: https://github.com/apache/datasketches-java/issues/425#issuecomment-1409284288

   What version are you using?
   Your error seems to be out of specs. For k=200 the double-sided error (for a histogram bin) should be within about 1.7%.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org