You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/03/02 06:26:11 UTC

[GitHub] leerho commented on issue #7161: Indicating time trends in topN results

leerho commented on issue #7161: Indicating time trends in topN results
URL: https://github.com/apache/incubator-druid/issues/7161#issuecomment-468890906

I don't know what algorithm is being used to establish the "TopN", but I hope it isn't the old
- Sort each node
- Take the top 1000 from each node
- Merge the samples from each node together and sort
- Take the top 100 from that list and hope (yes hope) that you have captured the top 100.

This can result in unknown and significant errors, depending on how the data is distributed across the nodes, and cannot give any information about which ones in that final list actually belong there, what the error might be in its position in the list, and which ones in the list might just be noise. This technique is just a heuristic and it is not hard to prove how it can return very misleading results.

This can be can be much more efficiently, accurately, and confidently solved using the DataSketches [Frequent Items sketch (FIS)](https://datasketches.github.io/docs/FrequentItems/FrequentItemsOverview.html).

The FIS will tell you the upper and lower bounds of the actual weights of each item, which can be used to establish the range within which the true weight must lie. However, when those ranges overlap, it creates a tie situation in terms of actual order. Nonetheless, these bounds can be used to identify which items may be tied, for example, for the 3rd position. These bounds, by the way are hard bounds, so the confidence is 100%.

Also, you can query the sketch using either the NO_FALSE_POSITIVES (more strict) or a NO_FALSE_NEGATIVES (less strict) query. But under no circumstances with the sketch return items that fall below the noise floor.

Be aware that the reason that this sketch is called "Frequent Items" is that there may not be any items that are more frequent than any of the others (I.e. if they all occur with the same frequency). Thus, returning zero results can happen. Thus the name "TopN" is misleading as there may not be N items returned.

This is why I prefer to call this kind of operation "Frequent Items" or "Heavy Hitters" and not "TopN".

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org