Posted to dev@lucene.apache.org by "Stephen Weiss (JIRA)" <ji...@apache.org> on 2010/07/11 03:09:38 UTC

[jira] Commented: (SOLR-236) Field collapsing

    [ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887130#action_12887130 ] 

Stephen Weiss commented on SOLR-236:
------------------------------------

Oh Martijn, I hope you're reading. After a few months of calm we had some OOMs again on our production servers. So I tried your latest patch with the Solr 1.4.1 release, since it bundles various fixes for memory leaks. The performance difference is great - far less CPU and RAM usage all around. But there's a catch! Something was introduced that changes the "numFound" that is reported. After we noticed this, I found your comment and removed these lines from NonAdjacentDocumentCollapser.java:

+        if (collapsedGroupPriority.size() > maxNumberOfGroups) {
+          NonAdjacentCollapseGroup inferiorGroup = collapsedGroupPriority.first();
+          collapsedDocs.remove(inferiorGroup.fieldValue);
+          collapsedGroupPriority.remove(inferiorGroup);
+        }

We did *NOT* remove line 99 as suggested because this caused compiler problems:


    [javac] /home/sweiss/apache-solr-1.4.1/src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java:99: cannot find symbol
    [javac] symbol  : variable collapseDoc
    [javac] location: class org.apache.solr.search.fieldcollapse.NonAdjacentDocumentCollapser
    [javac]       if (collapseDoc == null) {

After doing this, I noticed a *huge* performance drop - far worse than what we had even with 1.4 and your patch from December. Searches were taking >10s to complete (before, we were at just over 1s for the worst searches). So, I went back and tried to find a way to get the "numFound" through other means - and I figured I could just facet on the same field we're collapsing on, and then count the number of facet values returned. Looks good - the count of the facets is the right count, and it would appear to be working.
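For reference, the facet workaround can be sketched roughly as follows. This is only an illustration, not code from the patch: the class FacetGroupCount, its two methods, and the field name "site" are hypothetical, and it assumes the facet counts have already been parsed out of the response into a map. The parameters facet, facet.field, facet.mincount, facet.limit and rows are standard Solr request parameters, and collapse.field is the one this issue adds.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper: facet on the collapse field and count the facet values
// to estimate the number of collapsed groups.
public class FacetGroupCount {

    // Builds the request parameters. facet.limit=-1 asks for every facet value,
    // facet.mincount=1 drops values that match no document, and rows=0 skips
    // the document list because only the counts are needed here.
    static String buildParams(String query, String collapseField) {
        try {
            String q = URLEncoder.encode(query, "UTF-8");
            String f = URLEncoder.encode(collapseField, "UTF-8");
            return "q=" + q
                    + "&rows=0"
                    + "&collapse.field=" + f
                    + "&facet=true"
                    + "&facet.field=" + f
                    + "&facet.mincount=1"
                    + "&facet.limit=-1";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Counts the facet values that match at least one document.
    // The map (facet value -> document count) is assumed to be parsed already.
    static int countGroups(Map<String, Integer> facetCounts) {
        int groups = 0;
        for (int count : facetCounts.values()) {
            if (count > 0) {
                groups++;
            }
        }
        return groups;
    }
}
```

Note that facet.limit=-1 returns every distinct value of the field for the query, so on a high-cardinality field this has its own memory and response-size cost.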

But, there's a snag. It seems that the results being returned by your patch, unaltered, are incorrect. For example - my search for "orange" returns 7200 collapsed results, either using the real numFound from the altered patch, or using the facet method with the new patch. This equates to 160 pages of results. However, with the unaltered patch, if we actually try to retrieve page 158, or really any result over 130 or so, we get the exact same results. With the altered patch (removing those few lines), page 158 actually is page 158. Basically, it seems like your patch throws away good results - and I get the feeling that it throws away those good results somewhere in those 5 lines.
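To illustrate the suspicion, here is a simplified model of what those 5 lines appear to do. It is a sketch under assumptions, not the patch's actual code: the class GroupEvictionSketch, the Group type, and the collapse method are hypothetical, and only the names collapsedDocs, collapsedGroupPriority, maxNumberOfGroups and fieldValue are borrowed from the snippet above. How the real patch computes maxNumberOfGroups is not shown here, so the sketch simply takes it as a parameter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Simplified, hypothetical model of bounded group tracking during collapsing.
public class GroupEvictionSketch {

    static final class Group implements Comparable<Group> {
        final String fieldValue;
        final float score;

        Group(String fieldValue, float score) {
            this.fieldValue = fieldValue;
            this.score = score;
        }

        // Lowest score sorts first, so TreeSet.first() is the weakest group.
        @Override
        public int compareTo(Group other) {
            int byScore = Float.compare(score, other.score);
            return byScore != 0 ? byScore : fieldValue.compareTo(other.fieldValue);
        }
    }

    // Collapses documents on a field and returns the surviving group values,
    // best first. maxNumberOfGroups <= 0 means "no bound" in this sketch.
    static List<String> collapse(String[] values, float[] scores, int maxNumberOfGroups) {
        Map<String, Group> collapsedDocs = new HashMap<String, Group>();
        TreeSet<Group> collapsedGroupPriority = new TreeSet<Group>();

        for (int i = 0; i < values.length; i++) {
            Group existing = collapsedDocs.get(values[i]);
            if (existing != null) {
                if (scores[i] <= existing.score) {
                    continue; // the current group head is already better
                }
                collapsedGroupPriority.remove(existing);
            }
            Group group = new Group(values[i], scores[i]);
            collapsedDocs.put(group.fieldValue, group);
            collapsedGroupPriority.add(group);

            // The step in question: once the bound is exceeded, the weakest
            // group is dropped and can no longer appear on any page.
            if (maxNumberOfGroups > 0 && collapsedGroupPriority.size() > maxNumberOfGroups) {
                Group inferiorGroup = collapsedGroupPriority.first();
                collapsedDocs.remove(inferiorGroup.fieldValue);
                collapsedGroupPriority.remove(inferiorGroup);
            }
        }

        List<String> result = new ArrayList<String>();
        for (Group group : collapsedGroupPriority.descendingSet()) {
            result.add(group.fieldValue);
        }
        return result;
    }
}
```

In this model, five distinct groups with a bound of 3 leave only the top three; any request that starts past the third group has nothing new to return. If the bound in the patch is derived from the requested start and rows, that would be consistent with deep pages coming back identical, but that is a guess on my part.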

Now, I'm stuck.  I really don't know what to do... I don't want the OOMs to continue, but it looks like they will regardless because both the old version (1.4 + December patch) and the new, altered patched version are using too many resources.  But if I used the latest patch without changing it, I'm not getting the right results all the way through.

Is there anything we can do?  I appreciate your help... :-)

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>            Assignee: Shalin Shekhar Mangar
>             Fix For: Next
>
>         Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, DocSetScoreCollector.java, field-collapse-3.patch, field-collapse-4-with-solrj.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-5.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, NonAdjacentDocumentCollapser.java, NonAdjacentDocumentCollapserTest.java, quasidistributed.additional.patch, SOLR-236-1_4_1.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236-trunk.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, SOLR-236.patch, solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch
>
>
> This patch include a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation add 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

