Posted to solr-user@lucene.apache.org by Charles Hornberger <ch...@gmail.com> on 2008/01/07 00:25:35 UTC

eliminating "too many results from the same source"

I've got a problem that I'm not quite sure how to solve and am wondering if
anyone has any insight or similar experience to share.

Here's the situation: Documents in our Solr index include a field
identifying their author (we have 1000s of authors). When displaying an
individual document, we also want to display a list of related documents by
other authors*, so we do a search using the current document's title, author
name, summary, and keywords as the query. Sometimes the search yields a
result set in which all of the top n documents (in reality, n is ~10) are
from one author.

Apparently, people don't like this.

So what is being asked for is a result set in which no more than m (where m
is probably 3) of the top n are from any single author. (It's not that we
want to exclude documents m+1, m+2, etc. by each author from the result set
entirely; we just don't want them in the top n.)
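
[Editor's sketch] One way to get this behavior without any change to Solr is
to re-rank on the client. The following is only a minimal illustration, not a
Solr feature: it assumes the results have already been fetched as a
score-ordered list of dicts with an "author" key, and the function name
cap_per_author and its parameters are hypothetical.

```python
from collections import Counter


def cap_per_author(ranked, n=10, m=3, key="author"):
    """Reorder `ranked` so that at most m of the first n docs share a value of `key`.

    Documents over the cap are not dropped. They are moved to just after
    the top n, and relative order is preserved everywhere else.
    """
    top, deferred, rest = [], [], []
    seen = Counter()
    for doc in ranked:
        if len(top) >= n:
            # The top n is already full; everything else keeps its order.
            rest.append(doc)
        elif seen[doc[key]] < m:
            seen[doc[key]] += 1
            top.append(doc)
        else:
            # This author already holds m of the top slots; push the doc down.
            deferred.append(doc)
    return top + deferred + rest
```

The cap can only be honored if the input holds enough documents by other
authors, so the search would need to request noticeably more than n rows
before re-ranking.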

More generically, I can imagine this as a feature that might be occasionally
useful, e.g. as a kind of "diversity boost function" to be used when scoring
results, where you specify the fields for which you want to enforce
diversity (e.g., author name, genre, color, etc.), and provide your values
for n and m, and Solr, uhm, obliges. :-)
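
[Editor's sketch] The multi-field version of the idea can be tried outside
Solr first. This is a rough client-side illustration under assumptions, not
an actual scoring function: documents are plain dicts in score order, and
the name diversify and its fields argument are made up for this example.

```python
from collections import Counter


def diversify(ranked, fields, n=10, m=3):
    """Allow at most m docs per value of each field in `fields` within the top n.

    `ranked` is a list of dicts in descending score order. Docs that would
    break a cap are moved to just below the top n rather than removed.
    """
    counts = {f: Counter() for f in fields}
    top, deferred = [], []
    for i, doc in enumerate(ranked):
        if len(top) == n:
            # Top n is full: append the demoted docs, then the untouched tail.
            return top + deferred + ranked[i:]
        if all(counts[f][doc[f]] < m for f in fields):
            for f in fields:
                counts[f][doc[f]] += 1
            top.append(doc)
        else:
            deferred.append(doc)
    return top + deferred
```

A call such as diversify(results, fields=("author", "genre"), n=10, m=3)
would enforce both caps at once; a document is admitted to the top n only if
it passes the cap for every listed field.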

Any tips or ideas on how to proceed? (We're using Solr 1.2 so we don't have
MoreLikeThis, but we can upgrade to a newer version if it's likely that
MoreLikeThis can provide what we're looking for.)

-Charlie

* In fact, we wouldn't mind if additional documents by the same author were
included, but we found that when we didn't exclude the original author from
the result set, we almost always had the same problem: The first n documents
were always by the original author.

Re: eliminating "too many results from the same source"

Posted by Charles Hornberger <ch...@gmail.com>.
Of course -- and now I feel silly for not having thought of that :-).
Thanks!

On Jan 6, 2008 4:37 PM, Walter Underwood <wu...@netflix.com> wrote:

> Field collapsing might work for you. I haven't looked at the details
> of the implementation and it is still in development, but it is the
> right sort of feature. You'd like to see the top N matches for
> each value of the author field, right?
>
> wunder

Re: eliminating "too many results from the same source"

Posted by Walter Underwood <wu...@netflix.com>.
Field collapsing might work for you. I haven't looked at the details
of the implementation and it is still in development, but it is the
right sort of feature. You'd like to see the top N matches for
each value of the author field, right?

wunder
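
[Editor's sketch] To make "top N matches for each value of the author field"
concrete, here is what collapsing does to a ranked list, written as plain
client-side Python. It is not the Solr implementation, which the message
above describes as still in development; the function name
collapse_by_field and its parameters are assumptions for illustration only.

```python
def collapse_by_field(ranked, field="author", limit=3):
    """Group score-ordered docs by `field`, keeping the best `limit` per group.

    Returns a list of (value, docs) pairs. Groups appear in the order of
    their highest-ranked document, because dicts preserve insertion order
    (Python 3.7+).
    """
    groups = {}
    for doc in ranked:
        bucket = groups.setdefault(doc[field], [])
        if len(bucket) < limit:
            bucket.append(doc)
    return list(groups.items())
```

Unlike a cap on the top n, collapsing discards the documents beyond the
per-group limit from the displayed list, so it fits best when the page only
ever shows a few entries per author.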
