You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2009/05/14 23:39:45 UTC

[jira] Created: (SOLR-1169) SortedIntDocSet

SortedIntDocSet
---------------

                 Key: SOLR-1169
                 URL: https://issues.apache.org/jira/browse/SOLR-1169
             Project: Solr
          Issue Type: Improvement
            Reporter: Yonik Seeley
            Assignee: Yonik Seeley
             Fix For: 1.4


A DocSet type that can skip to support SOLR-1165

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1169) SortedIntDocSet

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709645#action_12709645 ] 

Mike Klaas commented on SOLR-1169:
----------------------------------

sweet.  intersecting sorted int dicts should be faster in the general case.  HashSet will of course win when one set is very small, but I expect this to still be pretty fast anyway.

> SortedIntDocSet
> ---------------
>
>                 Key: SOLR-1169
>                 URL: https://issues.apache.org/jira/browse/SOLR-1169
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Yonik Seeley
>            Assignee: Yonik Seeley
>             Fix For: 1.4
>
>
> A DocSet type that can skip to support SOLR-1165

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1169) SortedIntDocSet

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709605#action_12709605 ] 

Yonik Seeley commented on SOLR-1169:
------------------------------------

We need skipping in our DocSets in order to be able to pass them as filters and improve search performance.
One could Sort a HashDocSet every time it's used.... but that's not desirable.

The way Lucene now uses filters, random access performance is no longer important, but being able to efficiently skip is.
Intersection performance is very important to Solr of course, so if we can get SortedIntDocSet performance fast enough then it would make sense for it to replace HashDocSet.

I've already started working on this, and the results look promising.  I'll post a draft soon.

> SortedIntDocSet
> ---------------
>
>                 Key: SOLR-1169
>                 URL: https://issues.apache.org/jira/browse/SOLR-1169
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Yonik Seeley
>            Assignee: Yonik Seeley
>             Fix For: 1.4
>
>
> A DocSet type that can skip to support SOLR-1165

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (SOLR-1169) SortedIntDocSet

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-1169:
-------------------------------

    Attachment: SOLR-1169.patch

Draft patch attached.  It's not "hooked in" to Solr yet... just the performance tests, which I'm doing now.

> SortedIntDocSet
> ---------------
>
>                 Key: SOLR-1169
>                 URL: https://issues.apache.org/jira/browse/SOLR-1169
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Yonik Seeley
>            Assignee: Yonik Seeley
>             Fix For: 1.4
>
>         Attachments: SOLR-1169.patch
>
>
> A DocSet type that can skip to support SOLR-1165

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (SOLR-1169) SortedIntDocSet

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley resolved SOLR-1169.
--------------------------------

    Resolution: Fixed

committed.

> SortedIntDocSet
> ---------------
>
>                 Key: SOLR-1169
>                 URL: https://issues.apache.org/jira/browse/SOLR-1169
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Yonik Seeley
>            Assignee: Yonik Seeley
>             Fix For: 1.4
>
>         Attachments: SOLR-1169.patch
>
>
> A DocSet type that can skip to support SOLR-1165

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (SOLR-1169) SortedIntDocSet

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709713#action_12709713 ] 

Yonik Seeley commented on SOLR-1169:
------------------------------------

The first results are in, and it's looking good.

Algorithm:
For intersectionSize (and intersection), if the sets are near in size, we do a linear scan.  A little micro-optimization there got us an extra 13% speedup... just for rearranging the order of comparisons base on the fact that one was more likely to be true than the other (based on the set sizes).  When the set sizes differ by a factor of 8 or more, we use a modified binary search.  The first modification keeps track of the lower bound rather than resetting to 0... a big win (an obvious optimization... I didn't measure the benefit).  Then second modification probes closer to the lower bound (rather than using the midpoint) based on the relative size difference of the sets, and leads to a 40% performance increase.

Performance results of intersectionSize:

|DocSet sizes|Advantage|Percent|comments
|1-200 x 1-200                            |Int |53%| random sized DocSets from 1-200 in size intersected with other DocSets of size 1-200|
|1-3000 x 1-3000                       |    Int |33% |3000 is the current default upper bound of HashDocSet
|1-3000 x 1-3000                       |   Int |130% |-client instead of -server
|2000-3000 x 2000-3000         |  Int |60% | only big sets
|20000-30000 x 20000-30000|  Int |54% |  only bigger sets
|100-200 x 1000-2000              |  Hash |87% | small sets with big sets
|1-10000 x 20000-30000         |  Int  |74% | smaller sets intersected with BitDocSets
|1-30000 x 1-30000                  |  Int |80% | docsets over maxDoc/64 are BitDocSets (maxDoc=1M)

So to sum up, only small sets intersected with big sets are slower.... but given that big sets intersected with big sets take a majority of the time, we get a nice net win.  It gets more dramatic when intersecting a small set with a BitDocSet... these affects are probably due to nicer treatment of the memory subsystem when accessing memory in order.  I think these intersections tend to be bound by memory bandwidth.

The improvements will also allow us to bump up the max size of the "small set" implementation.  From a memory consumption point of view, the break-even point is maxDoc/32.  When I tested using SortedIntDocSets with maxDoc/64, there was always a net speedup over maxDoc/32 and maxDoc/100, so this seems to be a good balance between performance and saving memory.

Memory savings: SortedIntDocSet is more efficient than HashDocSet at storing the same amount of data, and it can be used at larger sizes (relative to maxDoc) before performance decreases (another memory win).

Other savings: Faster set creation - Lucene currently delivers docs in order, hence so sorting step is needed after collection.

Other notes: I tried a cool partitioning algorithm that I thought would be superior - take the middle element of the big set and use it to partition the small set.  Say you have set sizes of 100 and 10000... you do a single binary search on the small set, and now for all 100 elements you have half the big set size to search.  Recurse on the corresponding lower and upper partitions until they are small enough to use a different method such as a modified binary search.  This approach worked, but it wasn't able to beat the modified binary search alone once I put in all the optimizations... *shrugs*


> SortedIntDocSet
> ---------------
>
>                 Key: SOLR-1169
>                 URL: https://issues.apache.org/jira/browse/SOLR-1169
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Yonik Seeley
>            Assignee: Yonik Seeley
>             Fix For: 1.4
>
>         Attachments: SOLR-1169.patch
>
>
> A DocSet type that can skip to support SOLR-1165

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.