Posted to commits@druid.apache.org by "AlexanderSaydakov (via GitHub)" <gi...@apache.org> on 2023/05/24 01:38:14 UTC

[GitHub] [druid] AlexanderSaydakov commented on pull request #14334: use the latest datasketches-java-4.0.0

AlexanderSaydakov commented on PR #14334:
URL: https://github.com/apache/druid/pull/14334#issuecomment-1560339901

We hesitated for some time, but finally decided that inclusive mode is a bit better. This is a major version change with some API incompatibility, so if there is ever a right time for this change, it is now.
The difference is in the definition of rank. Suppose we are analyzing a distribution of some items exactly. The only thing required is a comparator of items (a "less than" operator). We sort the items and define the rank of an item as the fraction of all items that are strictly less than that item in exclusive mode, or less than or equal to that item in inclusive mode. The inclusive mode seems to be more common in the literature and is slightly better behaved in some edge cases.
To illustrate the difference, suppose we have just one item. Its rank is 1 in inclusive mode but 0 in exclusive mode. With millions of items the difference in rank will be tiny and most probably negligible. If we build a histogram or do partitioning, items that sit exactly on a bin or partition edge can fall to the right or to the left depending on the mode.
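
For readers who want to see the two definitions side by side, here is a minimal sketch in plain Python. It computes exact ranks over a sorted list and does not use the datasketches API; the helper names `exact_rank` and `bin_counts` are hypothetical and exist only for this illustration.

```python
from bisect import bisect_left, bisect_right


def exact_rank(sorted_items, item, inclusive=True):
    """Exact normalized rank of `item` in an already sorted list (hypothetical helper)."""
    if not sorted_items:
        raise ValueError("rank is undefined for an empty collection")
    # bisect_right counts items <= item; bisect_left counts items < item.
    cut = bisect_right if inclusive else bisect_left
    return cut(sorted_items, item) / len(sorted_items)


def bin_counts(sorted_items, split_points, inclusive=True):
    """Number of items in each bin delimited by ascending split points (hypothetical helper)."""
    cut = bisect_right if inclusive else bisect_left
    edges = [0] + [cut(sorted_items, s) for s in split_points] + [len(sorted_items)]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]


# A single item: rank 1 in inclusive mode, rank 0 in exclusive mode.
single = [42]
rank_single_inclusive = exact_rank(single, 42, inclusive=True)
rank_single_exclusive = exact_rank(single, 42, inclusive=False)

# A million distinct items: the two ranks differ by roughly one part in a million.
many = list(range(1_000_000))
gap = exact_rank(many, 500_000, True) - exact_rank(many, 500_000, False)

# Items equal to a split point move between neighboring bins.
items = [5, 10, 10, 15, 20, 25]
bins_inclusive = bin_counts(items, [10, 20], inclusive=True)   # edge items go left
bins_exclusive = bin_counts(items, [10, 20], inclusive=False)  # edge items go right

print(rank_single_inclusive, rank_single_exclusive)
print(gap)
print(bins_inclusive, bins_exclusive)
```

In this sketch, only the items that compare equal to a split point (the two 10s and the 20) change bins between modes; everything else lands in the same place either way.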

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org