Posted to solr-dev@lucene.apache.org by "Koji Sekiguchi (JIRA)" <ji...@apache.org> on 2010/02/14 04:18:27 UTC
[jira] Commented: (SOLR-1773) Field Collapsing (lightweight version)

    [ https://issues.apache.org/jira/browse/SOLR-1773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833495#action_12833495 ] 

Koji Sekiguchi commented on SOLR-1773:
--------------------------------------

Random comments on the patch:

- timeAllowed is not supported
- caching is not supported
- distributed search is not supported
- the sort field is hard-coded in the patch
- collapse.type=adjacent is not supported
- collapse.aggregate is not supported (but supportable)
- collapse.sort is not supported yet (but supportable)

Supported parameters:

|collapse|set to on to use field collapsing|
|collapse.field|name of the field to collapse on (required)|
|collapse.limit|maximum number of collapsed docs to return in each collapse group|
|collapse.fl|comma- or space-delimited list of fields to return|
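
For illustration only, here is a minimal sketch of how the four parameters above might be combined into a request query string. The query term, the field names (site, id, title) and the class name are hypothetical and are not taken from the patch; only the collapse.* parameter names come from the table.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of the patch: builds a query string
// that uses the collapse parameters listed above.
class CollapseParamsExample {
    static String queryString() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "solr");               // hypothetical query
        params.put("collapse", "on");          // turn field collapsing on
        params.put("collapse.field", "site");  // hypothetical field to collapse on
        params.put("collapse.limit", "2");     // at most 2 collapsed docs per group
        params.put("collapse.fl", "id,title"); // hypothetical fields to return

        // Join as key=value pairs, URL-encoding each value.
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joined.add(e.getKey() + "="
                + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }
}
```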


> Field Collapsing (lightweight version)
> --------------------------------------
>
>                 Key: SOLR-1773
>                 URL: https://issues.apache.org/jira/browse/SOLR-1773
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.4
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: SOLR-1773.patch
>
>
> I'd like to start another approach to field collapsing, as suggested by Yonik on 19/Dec/09 in SOLR-236. Re-posting the idea:
> {code}
> =================== two pass collapsing algorithm for collapse.aggregate=max ====================
> First pass: pretend that collapseCount=1
>   - Use a TreeSet as a priority queue, since entries can be both removed and inserted.
>   - A HashMap<Key,TreeSetEntry> will be used to map from collapse group to top entry in the TreeSet
>   - Compare the new doc with the smallest element in the TreeSet.  If it is smaller, discard it and go to the next doc.
>   - If the new doc is bigger, look up its group.  Use the Map to find if the group has been added to the TreeSet and add it if not.
>   - If the group of the new, bigger doc is already in the TreeSet, compare the new doc with the document for that group.  If bigger, update the node,
>     then remove and re-add it to the TreeSet to re-sort.
> Efficiency: the TreeSet and HashMap are both only the size of the top number of docs we are looking at (10 for instance)
> We will now have the top 10 documents collapsed by the right field with a collapseCount of 1.  Put another way, we have the top 10 groups.
> Second pass (if collapseCount>1):
>  - create a priority queue for each group (10) of size collapseCount
>  - re-execute the query (or if the sort within the collapse groups does not involve score, we could just use the docids gathered during phase 1)
>  - for each document, find its appropriate priority queue and insert
>  - optimization: we can use the previous info from phase1 to even avoid creating a priority queue if no other items matched.
> So instead of creating collapse groups for every group in the set (as is done now?), we create it for only 10 groups.
> Instead of collecting the score for every document in the set (40MB per request for a 10M doc index is *big*) we re-execute the query if needed.
> We could optionally store the score as is done now... but I bet aggregate throughput on large indexes would be better by just re-executing.
> Other thought: we could also cache the first phase in the query cache which would allow one to quickly move to the 2nd phase for any collapseCount.
> {code}
> The restriction is:
> {quote}
> one would not be able to tell the total number of collapsed docs, or the total number of hits (or the DocSet) after collapsing. So only collapse.facet=before would be supported.
> {quote}
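
The two passes described in the quoted idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in SOLR-1773.patch: the Hit class, the method names and the in-memory list of hits are all hypothetical stand-ins for Lucene doc ids, the collapse field value and the score, and "re-executing the query" is simulated by iterating over the same list a second time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeSet;

class TwoPassCollapse {
    // Hypothetical stand-in for a matching document.
    static final class Hit {
        final int doc; final String group; final double score;
        Hit(int doc, String group, double score) {
            this.doc = doc; this.group = group; this.score = score;
        }
    }

    // Worst hit first: lower score is worse; on ties the higher doc id is worse.
    // Distinct docs never compare equal, so the TreeSet never drops one silently.
    static final Comparator<Hit> WORST_FIRST = (a, b) -> {
        int c = Double.compare(a.score, b.score);
        return c != 0 ? c : Integer.compare(b.doc, a.doc);
    };

    // First pass: find the best doc of each of the top n groups.
    static List<Hit> topGroups(Iterable<Hit> hits, int n) {
        TreeSet<Hit> queue = new TreeSet<>(WORST_FIRST); // used as a priority queue
        Map<String, Hit> bestOfGroup = new HashMap<>();  // group -> its entry in queue
        for (Hit h : hits) {
            // Not better than the smallest element of a full queue: discard.
            if (queue.size() >= n && WORST_FIRST.compare(h, queue.first()) <= 0) continue;
            Hit current = bestOfGroup.get(h.group);
            if (current != null) {
                // Group already present: keep only the better doc.
                if (WORST_FIRST.compare(h, current) <= 0) continue;
                queue.remove(current);
            } else if (queue.size() >= n) {
                // New group and the queue is full: evict the worst group.
                bestOfGroup.remove(queue.pollFirst().group);
            }
            queue.add(h);
            bestOfGroup.put(h.group, h);
        }
        return new ArrayList<>(queue.descendingSet()); // best group first
    }

    // Second pass: collect up to collapseCount docs for each top group.
    static Map<String, List<Hit>> collapse(List<Hit> hits, int n, int collapseCount) {
        Map<String, PriorityQueue<Hit>> perGroup = new LinkedHashMap<>();
        for (Hit head : topGroups(hits, n)) {
            perGroup.put(head.group, new PriorityQueue<>(WORST_FIRST));
        }
        for (Hit h : hits) { // stands in for re-executing the query
            PriorityQueue<Hit> pq = perGroup.get(h.group);
            if (pq == null) continue;                  // not one of the top groups
            pq.offer(h);
            if (pq.size() > collapseCount) pq.poll();  // drop the worst doc
        }
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (Map.Entry<String, PriorityQueue<Hit>> e : perGroup.entrySet()) {
            List<Hit> docs = new ArrayList<>(e.getValue());
            docs.sort(WORST_FIRST.reversed());         // best doc first
            result.put(e.getKey(), docs);
        }
        return result;
    }
}
```

In this sketch both structures of the first pass hold at most n entries, which is the efficiency point made in the description; only the second pass touches per-group queues, and only for the n surviving groups.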
