You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Joel Bernstein (Jira)" <ji...@apache.org> on 2021/01/03 15:08:00 UTC

[jira] [Commented] (SOLR-14608) Faster sorting for the /export handler

    [ https://issues.apache.org/jira/browse/SOLR-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17257783#comment-17257783 ] 

Joel Bernstein commented on SOLR-14608:
---------------------------------------

Some performance numbers for the knew sorting algorithm:

Sorting and exporting 3 million documents. Since we are isolating the performance of the sort we will only include fields in the field list that are being sorted on.

Here we sort and export 2 fields using the Solr 8.7 export handler:


{code}
joelbernstein@Joel-Bernsteins-MacBook-Pro-2 loader % time curl 'http://localhost:10000/solr/testex/export?q=*:*&fl=test_s,stock_s&sort=test_s+asc,stock_s+asc' > /dev/null 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  210M    0  210M    0     0  19.9M      0 --:--:--  0:00:10 --:--:-- 26.8M
curl  > /dev/null  0.05s user 0.09s system 1% cpu 10.587 total
{code}

Here we sort and export 2 fields using the new Solr export handler in this ticket:

{code}
joelbernstein@Joel-Bernsteins-MacBook-Pro-2 loader % time curl 'http://localhost:8983/solr/testex/export?q=*:*&fl=test_s,stock_s&sort=test_s+asc,stock_s+asc' > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  210M    0  210M    0     0  39.7M      0 --:--:--  0:00:05 --:--:-- 44.8M

{code}


> Faster sorting for the /export handler
> --------------------------------------
>
>                 Key: SOLR-14608
>                 URL: https://issues.apache.org/jira/browse/SOLR-14608
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>            Assignee: Joel Bernstein
>            Priority: Major
>
> The largest cost of the export handler is the sorting. This ticket will implement an improved algorithm for sorting that should greatly increase overall throughput for the export handler.
> *The current algorithm is as follows:*
> Collect a bitset of matching docs. Iterate over that bitset and materialize the top level oridinals for the sort fields in the document and add them to priority queue of size 30000. Then export the top 30000 docs, turn off the bits in the bit set and iterate again until all docs are sorted and sent. 
> There are two performance bottlenecks with this approach:
> 1) Materializing the top level ordinals adds a huge amount of overhead to the sorting process.
> 2) The size of priority queue, 30,000, adds significant overhead to sorting operations.
> *The new algorithm:*
> Has a top level *merge sort iterator* that wraps segment level iterators that perform segment level priority queue sorts.
> *Segment level:*
> The segment level docset will be iterated and the segment level ordinals for the sort fields will be materialized and added to a segment level priority queue. As the segment level iterator pops docs from the priority queue the top level ordinals for the sort fields are materialized. Because the top level ordinals are materialized AFTER the sort, they only need to be looked up when the segment level ordinal changes. This takes advantage of the sort to limit the lookups into the top level ordinal structures. This also eliminates redundant lookups of top level ordinals that occur during the multiple passes over the matching docset.
> The segment level priority queues can be kept smaller than 30,000 to improve performance of the sorting operations because the overall batch size will still be 30,000 or greater when all the segment priority queue sizes are added up. This allows for batch sizes much larger then 30,000 without using a single large priority queue. The increased batch size means fewer iterations over the matching docset and the decreased priority queue size means faster sorting operations.
> *Top level:*
> A top level iterator does a merge sort over the segment level iterators by comparing the top level ordinals materialized when the segment level docs are popped from the segment level priority queues. This requires no extra memory and will be very performant.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org