You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/03/05 05:54:56 UTC

[GitHub] [incubator-druid] leerho edited a comment on issue #7187: Improve topN algorithm

leerho edited a comment on issue #7187: Improve topN algorithm
URL: https://github.com/apache/incubator-druid/issues/7187#issuecomment-469548325

@gianm
I think we may have some confusion here. FIS does not depend on `Comparable` at all, only `Identity`. And the weights are positive longs. (The Quantiles sketch depends on `Comparable`).

Imagine you are Apple and have a stream of song title objects, each with a price as purchased. Different instances of the same title may have a different price due to store discounts or whatever.

The task is to identify the *largest revenue* song titles, by title and total revenue.

The input stream is a stream of {Object songTitle, long purchasePrice} (Price in cents)
You create the sketch:

` ItemsSketch<SongTitle> fisSketch = new ItemSketch<SongTitle>(1024);`

Note: the sketch size is configurable, depending on how accurate you want it to be.

You update the sketch with

```
while (remainingItems > 0) {
SongTitle songTitle = next();
fisSketch.update(nextItem, (long)(songTitle.getPrice()));
}
```
Note that the sketch does the item aggregation for you, and no need to sort the input.
The purchasePrice is the "weight". When duplicate songTitles appear in the stream, their purchasePrice is accumulated in the sketch.

When all the sketches at each of the nodes are complete, you send the sketches to the Broker where they are all merged together.

The final merged sketch is then queried:

`ItemSketch.Row<T>[] rows = getFrequentItems(ErrorType.NO_FALSE_POSITIVES);`

Each row of the resulting array of rows contains:
- A reference to the frequent Item
- An Estimate of the total weight (total revenue) for that item
- The upper bound of what the true total Weight is
- The lower bound of what the true total Weight is.
- plus a few other convenience methods.

The array is sorted by the accumulated weights, so they are in approximate rank order. The sketch will return all items that have a weight above the [threshold](https://datasketches.github.io/docs/FrequentItems/FrequentItemsErrorTable.html).

This is effectively a superset of the typical "TopN" heuristic as it does a lot more than the current heuristic, and it provides accuracy guarantees that the existing heuristic cannot.

Fundamentally the Druid "TopN" heuristic is data sensitive and it can fail without warning where this will not.

So I am not clear where you think this would not replace the current functionality? What is the functionality that you think is missing?

It should certainly be worth a try :)

Lee.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org