Posted to solr-dev@lucene.apache.org by "Michael Gundlach (JIRA)" <ji...@apache.org> on 2009/11/10 00:47:32 UTC

[jira] Issue Comment Edited: (SOLR-236) Field collapsing

    [ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775192#action_12775192 ] 

Michael Gundlach edited comment on SOLR-236 at 11/9/09 11:45 PM:
-----------------------------------------------------------------

I've found an NPE that occurs when performing quasi-distributed field collapsing.

My company only has one use case for field collapsing: collapsing on email address.  Our index is spread across multiple cores.  We found that if we shard by email address, so that all documents with a given email address are guaranteed to appear on the same core, then we can do distributed field collapsing.
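
As an illustration only, routing by email address at index time could look like the sketch below. The class name, the CRC32 hash, and the lower-casing of the address are all assumptions made for the example; the message does not say how the documents were actually assigned to cores.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Solr: picks a core for a document at index time.
final class EmailShardRouter {
    private EmailShardRouter() {}

    /** Returns a core index in [0, coreCount) that depends only on the normalized email. */
    static int coreFor(String email, int coreCount) {
        if (coreCount <= 0) {
            throw new IllegalArgumentException("coreCount must be positive");
        }
        // Assumption: addresses are compared case-insensitively and without surrounding spaces.
        String key = email.trim().toLowerCase(Locale.ROOT);
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is a non-negative long, so the remainder is already in range.
        return (int) (crc.getValue() % coreCount);
    }

    public static void main(String[] args) {
        // The same address always maps to the same core index.
        System.out.println(coreFor("alice@example.com", 4));
        System.out.println(coreFor(" Alice@Example.com ", 4));
    }
}
```

Any deterministic function of the email address would do; the only property that matters here is that two documents with the same address never land on different cores.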

We add &collapse.field=email and &shards=core1,core2,... to a regular query.  Each core collapses on email and sends the results back to the requestor.  Since no emails appear on more than one core, we've accomplished distributed search.  We do lose the <collapse_count> section, but that's not needed for our purpose -- we just need an accurate total document count, and to have no more than one document for a given email address in the results.
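
For illustration, the request parameters described above could be assembled as in the following sketch. Only the parameter names collapse.field and shards come from this message; the class name, the query, and the core names are placeholders.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: builds the query string for a collapsed, sharded request.
final class CollapseQuery {
    private CollapseQuery() {}

    static String build(String q, String collapseField, List<String> shards) {
        return "q=" + enc(q)
            + "&collapse.field=" + enc(collapseField)
            + "&shards=" + enc(String.join(",", shards));
    }

    // Percent-encodes one parameter value (Charset overload needs Java 10 or later).
    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "core1" and "core2" stand in for real shard addresses.
        System.out.println(build("name:smith", "email", List.of("core1", "core2")));
    }
}
```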

Unfortunately, this throws an NPE when searching on a tokenized field.  Searching string fields is fine.  I don't understand exactly why the NPE appears, but I did bandaid over it by checking explicitly for nulls at the appropriate line in the code.  No more NPE.

There's a downside, which is that if we attempt to collapse on a field other than email -- one which has documents appearing in multiple cores -- the results are buggy: the first search returns few documents, and the number of documents actually displayed doesn't always match the "numFound" value.  Then upon refresh we get what we think is the correct numFound, and the correct list of documents.  This doesn't bother me too much, as you're guaranteed to get incorrect answers from the collapse code anyway when collapsing on a field that you didn't use as your key for sharding.
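
The last point can be shown with a toy model that has nothing to do with Solr's actual code: each shard keeps the first document per field value, and the merge step simply concatenates the shard results. The class, the "company" field, and the sample data are made up for the example. It shows only why per-shard collapsing leaves duplicates for a field that is not the sharding key; it does not try to reproduce the numFound mismatch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model: a document is {email, company}; shards were split by email.
final class ShardCollapseDemo {
    static final int EMAIL = 0;
    static final int COMPANY = 1;

    private ShardCollapseDemo() {}

    /** Keeps the first document seen for each value of the chosen field. */
    static List<String[]> collapse(List<String[]> docs, int field) {
        Map<String, String[]> firstByValue = new LinkedHashMap<>();
        for (String[] doc : docs) {
            firstByValue.putIfAbsent(doc[field], doc);
        }
        return new ArrayList<>(firstByValue.values());
    }

    /** Collapses every shard on its own and concatenates the results. */
    static List<String[]> collapsePerShardAndMerge(List<List<String[]>> shards, int field) {
        List<String[]> merged = new ArrayList<>();
        for (List<String[]> shard : shards) {
            merged.addAll(collapse(shard, field));
        }
        return merged;
    }

    static long distinctValues(List<String[]> docs, int field) {
        return docs.stream().map(d -> d[field]).distinct().count();
    }

    /** Made-up data: no email appears on both shards, but "acme" does. */
    static List<List<String[]>> sampleShards() {
        List<String[]> shard1 = List.of(
            new String[] {"a@x.com", "acme"},
            new String[] {"a@x.com", "acme"},
            new String[] {"b@x.com", "globex"});
        List<String[]> shard2 = List.of(
            new String[] {"c@x.com", "acme"},
            new String[] {"d@x.com", "initech"});
        return List.of(shard1, shard2);
    }

    public static void main(String[] args) {
        for (int field : new int[] {EMAIL, COMPANY}) {
            List<String[]> merged = collapsePerShardAndMerge(sampleShards(), field);
            System.out.println("field " + field + ": " + merged.size() + " docs, "
                + distinctValues(merged, field) + " distinct values");
        }
    }
}
```

Collapsing on the sharding key yields as many merged documents as distinct values; collapsing on the other field yields more documents than distinct values, which means duplicates survived the merge.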

In the spirit of Yonik's law of patches, I have made two imperfect patches attempting to contribute the fix, or at least point out the error:

1. I pulled trunk, applied the latest SOLR-236 patch, made my two-line change, and created a patch file.  The resultant patch file looks very different from the latest SOLR-236 patchfile, so I assume I did something wrong.

2. I pulled trunk, made my two-line change, and created another patch file.  This file is tiny but of course is missing all of the field collapsing changes.

Would you like me to post either of these patchfiles to this issue?  Or is it sufficient to just tell you that the NPE occurred in QueryComponent.java on line 556? ("rb._responseDocs.set(sdoc.positionInResponse, doc);" where sdoc was null.)  Perhaps my use case is extraordinary enough that you're happy leaving the NPE in place and telling other users to not do what I'm doing?
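
The actual two-line change is not included in this message. As a rough sketch of the kind of null check described, using stand-in types rather than the real Solr classes (only the names sdoc and positionInResponse are taken from the quoted line; everything else is hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in model of the failing statement; this is not QueryComponent itself.
final class NullGuardSketch {
    private NullGuardSketch() {}

    /** Stand-in for the per-document bookkeeping entry. */
    static final class ShardDoc {
        final int positionInResponse;

        ShardDoc(int positionInResponse) {
            this.positionInResponse = positionInResponse;
        }
    }

    /**
     * Puts each returned id at its recorded position. Ids with no bookkeeping
     * entry are skipped instead of causing a NullPointerException.
     * Returns the number of skipped ids.
     */
    static int place(Map<String, ShardDoc> byId, List<String> returnedIds, List<String> responseDocs) {
        int skipped = 0;
        for (String id : returnedIds) {
            ShardDoc sdoc = byId.get(id);
            if (sdoc == null) {
                // The guard: without it, sdoc.positionInResponse below would throw.
                skipped++;
                continue;
            }
            responseDocs.set(sdoc.positionInResponse, id);
        }
        return skipped;
    }

    public static void main(String[] args) {
        Map<String, ShardDoc> byId = new HashMap<>();
        byId.put("d1", new ShardDoc(0));
        byId.put("d2", new ShardDoc(1));
        List<String> response = new ArrayList<>(Arrays.asList(null, null));
        int skipped = place(byId, Arrays.asList("d1", "ghost", "d2"), response);
        System.out.println(response + " skipped=" + skipped);
    }
}
```

A guard like this hides the symptom without explaining why the entry is missing, which is why it is described above as a band-aid.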

Thanks!
Michael

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>             Fix For: 1.5
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch includes a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation adds 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)
