Posted to commits@datasketches.apache.org by GitBox <gi...@apache.org> on 2020/06/09 16:47:25 UTC

[GitHub] [incubator-datasketches-cpp] thvasilo opened a new issue #157: Weighted version of the KLL sketch?

thvasilo opened a new issue #157:
URL: https://github.com/apache/incubator-datasketches-cpp/issues/157


   Hello,
   
   There's a consideration at [XGBoost](https://github.com/dmlc/xgboost/issues/5746) about potentially using the KLL sketch to represent feature value histograms.
   
   One potential blocker is the need for a weighted version of the sketch. This would allow us to use weighted data points and adjust their feature contributions accordingly (see Appendix A of the [XGBoost paper](https://arxiv.org/abs/1603.02754)).
   
   I remember discussing the possibility of using data weights with KLL in the past. Is that still an option?
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@datasketches.apache.org
For additional commands, e-mail: commits-help@datasketches.apache.org


[GitHub] [incubator-datasketches-cpp] DanielTing commented on issue #157: Weighted version of the KLL sketch?

Posted by GitBox <gi...@apache.org>.
DanielTing commented on issue #157:
URL: https://github.com/apache/incubator-datasketches-cpp/issues/157#issuecomment-642152592


   @thvasilo: Is it possible to assume that all the weights in the data sum to at least 1 (or some other known constant)? This places no assumptions on individual weights or on weights being integral, but it does mean you need to know something about the overall scaling of the weights.
   
   There are several possible implementations that incorporate weighting. If you can assume this, then you get the simplest (and perhaps most performant) implementation, but it's certainly possible to handle weights without this assumption as well. 




[GitHub] [incubator-datasketches-cpp] thvasilo commented on issue #157: Weighted version of the KLL sketch?

Posted by GitBox <gi...@apache.org>.
thvasilo commented on issue #157:
URL: https://github.com/apache/incubator-datasketches-cpp/issues/157#issuecomment-641486798


   Yes, I think the outcome would be the same in this case.  I think for this to be used in XGBoost it would require real-valued weights.
   
   I found the paper that I had in mind when talking to Zohar Karnin a couple of years ago. It includes a [weighted extension for KLL](https://arxiv.org/pdf/1907.00236.pdf) (Section 4).
   
   I'll ping @trivialfis in case he wants to chime in.




[GitHub] [incubator-datasketches-cpp] leerho commented on issue #157: Weighted version of the KLL sketch?

Posted by GitBox <gi...@apache.org>.
leerho commented on issue #157:
URL: https://github.com/apache/incubator-datasketches-cpp/issues/157#issuecomment-641473089


   The only option we have thought about would be restricted to positive integer weights (>= 1).
   
   My interpretation of weights in a quantiles context is that an item with a weight of 2 is equivalent to updating the sketch with two identical items with a weight of 1.  Is this your understanding?
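
To make the interpretation above concrete, here is a minimal sketch in plain Python. It does not use the DataSketches API; `weighted_quantile` and `expand` are hypothetical helpers written only for illustration, and the quantile is computed exactly rather than with a sketch. It checks that an item with integer weight 2 gives the same quantiles as two identical items with weight 1.

```python
from bisect import bisect_left
from itertools import accumulate


def weighted_quantile(items, weights, q):
    """Exact weighted quantile (hypothetical helper, not a library call).

    Returns the smallest item whose cumulative weight reaches q * total weight.
    """
    pairs = sorted(zip(items, weights))
    cumulative = list(accumulate(w for _, w in pairs))
    target = q * cumulative[-1]
    # First position whose cumulative weight is >= target.
    idx = bisect_left(cumulative, target)
    return pairs[min(idx, len(pairs) - 1)][0]


def expand(items, weights):
    """Repeat each item `weight` times (positive integer weights only)."""
    return [x for x, w in zip(items, weights) for _ in range(w)]


items = [10, 20, 30]
weights = [1, 2, 1]  # the item 20 carries weight 2

expanded = expand(items, weights)  # [10, 20, 20, 30]
unit_weights = [1] * len(expanded)

for q in (0.0, 0.25, 0.5, 0.75, 1.0):
    weighted = weighted_quantile(items, weights, q)
    repeated = weighted_quantile(expanded, unit_weights, q)
    print(q, weighted, repeated)
```

Under this definition the two columns agree for every `q`. Real-valued weights have no such expansion, which is why they need separate treatment.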
   
   
   




[GitHub] [incubator-datasketches-cpp] thvasilo commented on issue #157: Weighted version of the KLL sketch?

Posted by GitBox <gi...@apache.org>.
thvasilo commented on issue #157:
URL: https://github.com/apache/incubator-datasketches-cpp/issues/157#issuecomment-642167879


   @DanielTing The XGBoost use case would be batch training, so we can assume that all weights are known in advance and normalize them so that they sum to 1.
   
   I'm having a chat with the first author of the linked paper tomorrow, and I'll update with any progress. I plan to work on this codebase, so I hope we can come up with something we can contribute back.
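
As a rough illustration of the batch setting described above, the following plain-Python sketch normalizes real-valued weights so that they sum to 1 and computes an exact weighted rank. `normalize` and `weighted_rank` are hypothetical names used only for this example, not part of XGBoost or DataSketches, and the rank definition (the fraction of total weight on items strictly below a value) is an assumption about what a weighted sketch would approximate.

```python
def normalize(weights):
    """Scale non-negative real-valued weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def weighted_rank(items, weights, z):
    """Fraction of the total weight carried by items strictly less than z."""
    total = sum(weights)
    below = sum(w for x, w in zip(items, weights) if x < z)
    return below / total


items = [1.0, 2.0, 3.0]
raw_weights = [0.5, 1.5, 2.0]  # known in advance in batch training

normalized = normalize(raw_weights)
print(normalized, sum(normalized))

# The normalized rank does not depend on the overall scale of the weights.
print(weighted_rank(items, raw_weights, 3.0))
print(weighted_rank(items, normalized, 3.0))
```

Because the rank is a ratio of weights, normalizing changes only the scale and not the quantiles themselves.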

