Posted to solr-user@lucene.apache.org by fbrisbart <fb...@bestofmedia.com> on 2012/03/21 13:51:14 UTC

Disseminate results from different sources

Hi all,

My dataset contains documents from different sources (forum, news,
reviews, ...), and I'd like to get a mix of them in my search results.


The problem is that, when ranking on relevance alone, the results are
often clustered by source (e.g. 50 'forum' docs before the first
'review' doc).
So I am looking for a way to spread the sources out a little and avoid
this behaviour.

I could run 1 search per source and manually do the mix. But, I have ~10
different sources, and I'm afraid this will be too slow.

Is there a clean and fast way to do that? If need be, I am thinking
about implementing a custom Scorer.



Thanks,
Franck


Re: Disseminate results from different sources

Posted by fbrisbart <fb...@bestofmedia.com>.
Thanks for the replies,

they helped me make up my mind, and I now have something to implement :o)
I will try to do that with 2 requests:
- 1 grouped by source, to retrieve the documents to boost
- 1 with a FunctionQuery, to add the boosts computed during the first
request

It won't be easy to do that with 1 request because my results must be
grouped by a field other than source (a 'forum' doc corresponds to a
post and will be grouped by topic, ...).

As for the random field, I suspect that if there are 1000 forum docs
for 10 news docs, there will be 100 boosted forum docs for 1 boosted
news doc, so my problem would remain.
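
To make the arithmetic above concrete, here is a tiny sketch. It assumes
(my simplification, not something stated in the thread) that a random
boost ends up favouring roughly one document in every `one_in`,
regardless of its source. The function name and the value 10 are
hypothetical.

```python
def boosted_per_source(counts, one_in):
    """Number of boosted docs per source if 1 doc in `one_in` is boosted."""
    return {source: n // one_in for source, n in counts.items()}


# Counts taken from the example in the message above.
counts = {"forum": 1000, "news": 10}
boosted = boosted_per_source(counts, one_in=10)

# The forum/news ratio is the same before and after the random boost.
ratio_before = counts["forum"] / counts["news"]
ratio_after = boosted["forum"] / boosted["news"]
```

Under that assumption the imbalance between sources is unchanged, which
is why a source-aware boost is needed rather than a purely random one.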


Thanks,
Franck





On Wednesday 21 March 2012 at 14:24 -0300, Emmanuel Espina wrote:
> In general the algorithm considers what is more relevant and probably
> you should check why one kind of result is giving always higher scores
> than the others. Are you using norms (not setting omitNorms = true).
> With debugQuery=true you can get a detail of how the score is
> calculated.
> 
> That would be solving the cause. To solve the symptom probably you can
> use FieldCollapsing : http://wiki.apache.org/solr/FieldCollapsing
> You group by source and set the number of documents per group to 10
> and in that way you would get results in all the categories. Then
> implement some criteria in your app to select documents from those
> sources based on the score.
> 
> Thanks
> Emmanuel



Re: Disseminate results from different sources

Posted by Emmanuel Espina <es...@gmail.com>.
In general the algorithm ranks by what is most relevant, so you should
probably check why one kind of result always gets higher scores than
the others. Are you using norms (i.e. not setting omitNorms = true)?
With debugQuery=true you can get the details of how the score is
calculated.

That would be solving the cause. To solve the symptom, you can probably
use FieldCollapsing: http://wiki.apache.org/solr/FieldCollapsing
Group by source and set the number of documents per group to 10; that
way you get results from all the categories. Then implement some
criteria in your app to select documents from those sources based on
the score.
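
A rough sketch of that last step, in Python. The group.* names are the
standard result-grouping parameters; the query text, the field name
"source", the helper names and the sample data are assumptions made up
for illustration, and the round-robin rule is only one possible
selection criterion.

```python
from itertools import zip_longest

# Request parameters for a grouped search (query text is a placeholder).
GROUP_PARAMS = {
    "q": "some query",
    "fl": "*,score",
    "group": "true",
    "group.field": "source",
    "group.limit": 10,
    "wt": "json",
}


def docs_by_source(grouped_response, field="source"):
    """Map each group value to its docs, in the order Solr returned them."""
    groups = grouped_response["grouped"][field]["groups"]
    return {g["groupValue"]: g["doclist"]["docs"] for g in groups}


def interleave(per_source):
    """Round-robin over sources: the best doc of each source first, then
    the second best of each, and so on. Within a round, higher scores
    come first."""
    mixed = []
    for round_docs in zip_longest(*per_source.values()):
        present = [d for d in round_docs if d is not None]
        mixed.extend(sorted(present, key=lambda d: d["score"], reverse=True))
    return mixed


# Hand-written sample shaped like a grouped JSON response (not real output).
sample = {
    "grouped": {
        "source": {
            "groups": [
                {"groupValue": "forum", "doclist": {"docs": [
                    {"id": "f1", "score": 9.0},
                    {"id": "f2", "score": 8.0},
                    {"id": "f3", "score": 7.0},
                ]}},
                {"groupValue": "review", "doclist": {"docs": [
                    {"id": "r1", "score": 5.0},
                ]}},
                {"groupValue": "news", "doclist": {"docs": [
                    {"id": "n1", "score": 6.0},
                    {"id": "n2", "score": 4.0},
                ]}},
            ]
        }
    }
}

mixed_ids = [d["id"] for d in interleave(docs_by_source(sample))]
```

With this rule every source that matched shows up in the first round,
so a single strong source can no longer fill the whole first page.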

Thanks
Emmanuel



Re: Disseminate results from different sources

Posted by Tanguy Moal <ta...@gmail.com>.
Hello Franck,

I've had the same issue in the past.

I addressed that by adding a random value to each document.
I use this value in the "bf" parameter, so that the random value alters
the documents' scores to a greater or lesser degree.

This results in a natural shuffling of documents that had the same
score before.

I think you can also use a random field (the random sort field type, see
http://lucene.apache.org/solr/api/org/apache/solr/schema/RandomSortField.html).
A random sort field gives each doc a unique random value per requested
field name, i.e. random_1234() gives a different distribution of random
values than random_4321(). This is helpful for giving documents a
different random value without reindexing everything. Additionally, you
can change the random_call() every day to make sure the results order
changes from time to time, but not at each query :-)

The only reason why I chose not to use random sort fields is very
personal: I needed to box the random values (using
scale(random_whatever(),0,1)) so that the random tie breaker doesn't
take precedence over the natural scoring of documents, and that scale
function needs to compute the min and max random values for the
selected documents, which seemed to be costly for large sets (query
time multiplied by 10 for a docset of about 100k docs) -- but I might
be wrong here.
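
To illustrate the idea of a boxed random tie breaker that changes once
a day, here is a small client-side sketch in Python. It is not the Solr
"bf" mechanism itself, only a simulation of its effect on a list of
scored docs; the function names, the 0.01 weight and the way the seed
is built are all assumptions.

```python
import random
from datetime import date


def daily_jitter(doc_id, day, weight=0.01):
    """Pseudo-random value in [0, weight), stable for a given doc and day."""
    rng = random.Random(f"{day.isoformat()}:{doc_id}")
    return weight * rng.random()


def rerank(docs, day, weight=0.01):
    """Sort by score plus a small bounded jitter, highest first."""
    return sorted(
        docs,
        key=lambda d: d["score"] + daily_jitter(d["id"], day, weight),
        reverse=True,
    )


# Made-up docs: three of them tie on score, one is clearly better.
docs = [
    {"id": "a", "score": 1.0},
    {"id": "b", "score": 1.0},
    {"id": "c", "score": 1.0},
    {"id": "top", "score": 2.0},
]
day = date(2012, 3, 21)
ranked = rerank(docs, day)
```

Because the jitter stays below the weight, it only reorders docs whose
scores are closer than 0.01 to each other, and the order is stable
within one day because the seed depends only on the date and the doc id.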

I hope this helps,

--
Tanguy
