You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Steven Phillips (JIRA)" <ji...@apache.org> on 2014/02/26 01:52:19 UTC
[jira] [Commented] (ZOOKEEPER-1889) Implement Top-N sort operator

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13912334#comment-13912334 ] 

Steven Phillips commented on ZOOKEEPER-1889:
--------------------------------------------

I want to give a brief summary of the design for this operator.

This operator maintains a hyperbatch and a SelectionVector4. The length of the selectionVector is (limit + 1). Indices [0..limit-1] are used to maintain a min-heap, with the value pointing to the batch and index of the value it represents. The last element of the selectionVector is used for staging the next incoming record. It is done this way because the generated methods for doing comparisons uses the values stored in the selection vector as a pointer to the records in the hyperbatch.

The first N records to come in are added to the heap, and the heap property is reestablished after each insertion. Once the heap has reached it's final size, each incoming record is compared to the top of the heap. The heap is a min-heap, meaning the root of the heap contains the lowest priority item in the heap. In other words, if we are looking for the top 10 values, the root points to the current 10th value. Each incoming record is compared to the root, and if necessary the two are swapped, and the heap property restored.

Periodically, there can be a purge, in which the values referenced by the selection vector are copied into new value vectors, and the old buffers are released, freeing up memory.

> Implement Top-N sort operator
> -----------------------------
>
>                 Key: ZOOKEEPER-1889
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1889
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Steven Phillips
>
> When, for example, doing an order by with a limit, if limit << total, it would be much more efficient to maintain a priority queue instead of sorting the entire data set.
> In most cases, this will greatly reduce the number of comparisons, since most incoming records will not fall in the Top N, and thus will only require a single comparison operation. Incoming records that are in the Top-N will require at most log N comparisons.
> This will also allow periodic purging of record batches, reducing memory requirements.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)