You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@datasketches.apache.org by GitBox <gi...@apache.org> on 2020/01/31 11:56:48 UTC

[GitHub] [incubator-datasketches-java] richardstartin opened a new issue #298: Alternative to reservoir sketches

richardstartin opened a new issue #298: Alternative to reservoir sketches
URL: https://github.com/apache/incubator-datasketches-java/issues/298
 
 
   I recently did a review of the literature and some experiments with reservoir sampling, and found there are lower overhead sampling algorithms equivalent to the implementation of "algorithm R" in this library. I would like to contribute the best performing implementation to the library. 
   
   I have a few questions
   
   - does this library need another reservoir sketch and is it likely to accept one? That is, is preparing a patch a worthwhile use of my time?
   - Is my testing strategy agreeable? If not, what kind of tests would I need to provide? My strategy is:
       - generate input data from a known parametric distribution
       - sort the data to introduce sequential bias which must not be evident in the sample
       - take a sample
       - evaluate the maximum likelihood estimator over the sample to check I recover the distribution parameters within some tolerance. 
   - is it better to replace or to complement `ReservoirLongsSketch`/`ReservoirItemsSketch`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on issue #298: Alternative to reservoir sketches

Posted by GitBox <gi...@apache.org>.
leerho commented on issue #298: Alternative to reservoir sketches
URL: https://github.com/apache/incubator-datasketches-java/issues/298#issuecomment-595280176
 
 
   Thank you for your thoughtful consideration of a contribution.  We have recently added a page for potential contributors, such as yourself, to help communicate what criteria we are currently using for new sketch contributions into the library.  Nevertheless, we are always open to suggestions!
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] jmalkin commented on issue #298: Alternative to reservoir sketches

Posted by GitBox <gi...@apache.org>.
jmalkin commented on issue #298: Alternative to reservoir sketches
URL: https://github.com/apache/incubator-datasketches-java/issues/298#issuecomment-580884961
 
 
   We're always interested in improved algorithms. The key question is whether the alternatives are properly mergeable, especially across different reservoir sizes.
   
   I know there are some stand-alone improvements that could be made, for instance (from memory discussing this several year sago) you can pre-compute how many items you'll skip before the next acceptance and avoid calling the random number generator each time. But validly merging the sampling state in that case is ... hard.
   
   We know that merging only preserves first-order probabilities (there's a uniform likelihood of inclusion for each item, but there can be correlations between item pairs). Since there isn't a (known?) way around that, that's ok. But mergeability is considered a key property for the library. Error bounds, too, but if the rest of the algo is valid that shouldn't matter here.
   
   So before writing new code, can we check that the alternatives can be merged? If yes, we can continue the discussion on how best to validate that and get it included. Replace vs complement comes down to things like whether there's a compatible way to migrate previously serialized binary images.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] richardstartin edited a comment on issue #298: Alternative to reservoir sketches

Posted by GitBox <gi...@apache.org>.
richardstartin edited a comment on issue #298: Alternative to reservoir sketches
URL: https://github.com/apache/incubator-datasketches-java/issues/298#issuecomment-581139532
 
 
   @jmalkin actually the idea behind the algorithms is of generating the number of elements to skip `s` rather than doing so implicitly by `s` failed trials, exactly as you mention, and I had not considered mergeability at all. There having been a previous discussion resulting in using algorithm R for the sake of mergeability gives me second thoughts about preparing a contribution.
   
   @leerho I read Vitter's paper on [reservoir sampling](http://www.cs.umd.edu/~samir/498/vitter.pdf). I wrote up my notes on the paper [here](https://richardstartin.github.io/posts/reservoir-sampling#implementations-and-evaluation-of-algorithms-r-x-and-z), and have implementations [here](https://github.com/richardstartin/reservoir-sampling), which show that the algorithms produce similar samples, and that the overhead of R is significant relative to X and Z when the work to produce the sampled value is cheap. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho edited a comment on issue #298: Alternative to reservoir sketches

Posted by GitBox <gi...@apache.org>.
leerho edited a comment on issue #298: Alternative to reservoir sketches
URL: https://github.com/apache/incubator-datasketches-java/issues/298#issuecomment-595280176
 
 
   Thank you for your thoughtful consideration of a contribution.  We have recently added a page for potential contributors, such as yourself, to help communicate what criteria we are currently using for new sketch contributions into the library.  Nevertheless, we are always open to suggestions!
   [Sketch Criteria](https://datasketches.apache.org/docs/Architecture/SketchCriteria.html)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho commented on issue #298: Alternative to reservoir sketches

Posted by GitBox <gi...@apache.org>.
leerho commented on issue #298: Alternative to reservoir sketches
URL: https://github.com/apache/incubator-datasketches-java/issues/298#issuecomment-580891930
 
 
   Richard, it would also be helpful if you could give us a reference to the paper about the algorithm you are considering.
   
     
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] richardstartin commented on issue #298: Alternative to reservoir sketches

Posted by GitBox <gi...@apache.org>.
richardstartin commented on issue #298: Alternative to reservoir sketches
URL: https://github.com/apache/incubator-datasketches-java/issues/298#issuecomment-581139532
 
 
   @jmalkin actually the idea behind the algorithms is of generating the number of elements to skip `s` rather than doing so implicitly by `s` failed trials, exactly as you mention, and I had not considered mergeability at all. There having been a previous discussion resulting in using algorithm R for the sake of mergeability gives me second thoughts about preparing a contribution.
   
   @leerho I read Vitter's paper on [reservoir sampling](http://www.cs.umd.edu/~samir/498/vitter.pdf). I wrote up my notes on the paper [here](https://richardstartin.github.io/posts/reservoir-sampling), and have implementations [here](https://github.com/richardstartin/reservoir-sampling), which show that the algorithms produce similar samples, and that the overhead of R is significant relative to X and Z when the work to produce the sampled value is cheap. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-java] leerho closed issue #298: Alternative to reservoir sketches

Posted by GitBox <gi...@apache.org>.
leerho closed issue #298: Alternative to reservoir sketches
URL: https://github.com/apache/incubator-datasketches-java/issues/298
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org