Posted to solr-dev@lucene.apache.org by "Brian Mertens (JIRA)" <ji...@apache.org> on 2007/09/07 18:05:31 UTC

[jira] Commented: (SOLR-236) Field collapsing

    [ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525761 ] 

Brian Mertens commented on SOLR-236:
------------------------------------

Imagine a case where a Solr database contains news stories from many newspapers and some wire services.

A single wire story will typically be picked up and reprinted in many different papers, ranging from national papers like the NYTimes to small-town papers. My database will have all of them, and possibly also the original from the wire service. Each paper will choose its own headline and will edit the story differently for length to fill a hole on the printed page, so the copies cannot be trivially detected as duplicates, but to my users they basically are.

I need to detect and group together these "duplicates" when displaying search results.

So let's say every story has had an integer hash value calculated from the first X words of the lead paragraph, and that value is indexed and stored (e.g. as "similarity_hash") as a way to detect duplicate stories.
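
As a rough illustration of such a hash, here is a minimal Python sketch. It is not part of Solr or of the SOLR-236 patch; the function name, the default of 20 words, and the use of CRC32 are all assumptions made for the example. It only matches copies whose first X words survive editing intact, apart from case and punctuation.

```python
import re
import zlib


def similarity_hash(lead_paragraph: str, num_words: int = 20) -> int:
    """Hash the first `num_words` words of a lead paragraph (hypothetical helper).

    Case and punctuation are ignored so that trivial formatting
    differences between reprints do not change the value.
    """
    words = re.findall(r"[a-z0-9]+", lead_paragraph.lower())
    key = " ".join(words[:num_words])
    # CRC32 is an arbitrary choice here; any stable integer hash would do.
    return zlib.crc32(key.encode("utf-8"))
```

The value would be computed at index time and stored alongside each story, so the search side only ever compares integers.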

I would want to Field Collapse my results on that hash value, so that all occurrences of the same story are lumped together.

Also, my users would much prefer the most "authoritative" version of the story to be displayed as the primary result, with a count and link to the collapsed results. Authoritativeness could be coded as simply as 1) Wire Service, 2) National Paper, 3) Regional Paper, 4) Small Town Paper, which could be indexed and stored as an integer "authority". (For finer-grained authority we could store each newspaper's circulation numbers.)

Then I could display to users:
"Dog Bites Man" 
New York Times, _link to see 77 other duplicates_

So, finally getting to the point: would it be possible to make this feature work such that it collapses results on one field ("similarity_hash") but selects the one to return based on another field ("authority" or "circulation")? (While still allowing the results to be sorted by a third field, e.g. date or relevance.)

Perhaps by a new parameter?
 collapse.authority=[field] // indexed field used for selecting which result from collapsed group to return, default being... ?
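
To make the request concrete, the behaviour can be sketched in a few lines of Python over an in-memory result list. This is only an illustration of the semantics being asked for, not Solr code and not how the patch works; the `collapse` function, the dictionary layout, and the sample documents are all made up for the example. It assumes the results arrive already sorted (by date or relevance) and that a lower "authority" value means a more authoritative source.

```python
def collapse(results, collapse_field="similarity_hash", authority_field="authority"):
    """Collapse sorted results on one field, keeping the most authoritative doc.

    Each group keeps the rank of its first (best-sorted) member, but the
    document shown is the one with the lowest authority value. On a tie,
    the higher-ranked document is kept.
    """
    groups = {}   # collapse key -> {"doc": representative, "collapsed": hidden count}
    order = []    # collapse keys in order of first appearance
    for doc in results:
        key = doc[collapse_field]
        if key not in groups:
            groups[key] = {"doc": doc, "collapsed": 0}
            order.append(key)
            continue
        group = groups[key]
        group["collapsed"] += 1
        if doc[authority_field] < group["doc"][authority_field]:
            group["doc"] = doc
    return [groups[key] for key in order]


# Hypothetical results, already sorted by relevance.
results = [
    {"id": "a", "similarity_hash": 111, "authority": 4},  # small town paper
    {"id": "b", "similarity_hash": 222, "authority": 2},  # national paper
    {"id": "c", "similarity_hash": 111, "authority": 1},  # wire service
    {"id": "d", "similarity_hash": 111, "authority": 3},  # regional paper
]
collapsed = collapse(results)
```

Here the first group is represented by document "c" (the wire service copy) with two collapsed duplicates, even though the small town paper's copy "a" ranked highest, and the second group is document "b" with none.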

If this sounds familiar, it is somewhat similar to what Google News is doing:
  http://www.pcworld.com/article/id,136680/article.html

Final question: Do you think Field Collapse could work nicely with SOLR-303 Federated Search, or is that a bridge too far?

> Field collapsing
> ----------------
>
>                 Key: SOLR-236
>                 URL: https://issues.apache.org/jira/browse/SOLR-236
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>    Affects Versions: 1.3
>            Reporter: Emmanuel Keller
>         Attachments: field_collapsing_1.1.0.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch
>
>
> This patch include a new feature called "Field collapsing".
> "Used in order to collapse a group of results with similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site is collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also Duplicate detection."
> http://www.fastsearch.com/glossary.aspx?m=48&amid=299
> The implementation add 3 new query parameters (SolrParams):
> "collapse.field" to choose the field used to group results
> "collapse.type" normal (default value) or adjacent
> "collapse.max" to select how many continuous results are allowed before collapsing
> TODO (in progress):
> - More documentation (on source code)
> - Test cases
> Two patches:
> - "field_collapsing.patch" for current development version
> - "field_collapsing_1.1.0.patch" for Solr-1.1.0
> P.S.: Feedback and misspelling correction are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.