Posted to solr-user@lucene.apache.org by Axel Tetzlaff <ax...@freiheit.com> on 2009/01/15 14:15:29 UTC

Re: Unwanted clustering of search results after sorting by score

Hi,

I'm working on the problem Max described as well. We did try omitting the
norms, which led to products with a very extensive description being more
likely to get a higher score, since they contained the word more often. Due
to the many expansions by the SynonymFilter at index time this grew
especially ugly. But as you already pointed out, we should have a deeper
look at how the score is assembled.

Nevertheless, the second problem of getting a good mix of shops can be
discussed separately. Say we have 5 products per result page and the 10 best
matches for a search all have the same score. 8 of the products are from one
shop (A), and the other two are from two other shops (B, C).

What we often get is this (each letter indicates a product of that shop):
1. A
2. A
3. A
4. A
5. A
---- second result page ----
6. A
7. B
8. A
9. C
10. A

but what we want to get is something like this:

1. A
2. C
3. B
4. A
5. A
---- second result page ----
6. A
7. A
8. A
9. A
10. A

As you can imagine, products are not uniformly distributed over shops.
So sorting by a random field does not work out, since there are shops with
tens of thousands of products and shops with fewer than 100 products.

So theoretically I would sort by score and then by a magic factor which gets
greater the fewer products of this shop (possibly with that same score) are
already in the search result. As an alternative to a second sort criterion,
the score itself could be diminished as well, I guess...

What really bothers me is that this requirement seems to need an extra
iteration over the search result which keeps track of the distribution of
products and shops in the search result.
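
To make the idea concrete, here is a minimal sketch of such an extra pass. It
is plain Python for illustration only, not Solr code. The tuple layout
(doc_id, shop, score) and the name diversify are assumptions made up for this
example, and it assumes that tied hits carry exactly equal score values.

```python
from collections import Counter
from itertools import groupby


def diversify(hits):
    """Re-order hits so that, among equal scores, under-represented shops come first.

    hits: list of (doc_id, shop, score) tuples, already sorted by score descending.
    Hits with different scores never change their relative order.
    """
    seen = Counter()  # how many products of each shop were emitted so far
    result = []
    for _, group in groupby(hits, key=lambda h: h[2]):
        pending = list(group)
        while pending:
            # min() returns the first minimal element, so the original
            # result order breaks ties between equally represented shops.
            best = min(pending, key=lambda h: seen[h[1]])
            pending.remove(best)
            seen[best[1]] += 1
            result.append(best)
    return result


# The ten equally scored hits from the example above.
hits = [(i + 1, shop, 1.0) for i, shop in enumerate("AAAAAABACA")]
print("".join(shop for _, shop, _ in diversify(hits)))  # prints ABCAAAAAAA
```

The inner loop is quadratic in the size of each group of tied hits, which is
exactly the extra cost over the result set described above.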

We'd be really thankful for any hint on how to tackle this problem,
Axel
--
View this message in context: http://www.nabble.com/Unwanted-clustering-of-search-results-after-sorting-by-score-tp20977761p21477387.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Unwanted clustering of search results after sorting by score

Posted by Axel Tetzlaff <ax...@freiheit.com>.

Hi Otis,

thanks for your input. Although I agree that we may have to go over the
search result once more, I don't think that doing so for the first result
page only is sufficient.
In the first example I showed before, you can see that some of the desired
products (of shops B and C) in fact occur on later pages - and the example
is heavily simplified. With over half a million products, searches for
single words (which are most common) can easily have a huge set of matching
documents.

Otis Gospodnetic wrote:
> 
>   This should be doable with a function query, too.
> 
I had a look at function queries as well, and couldn't figure out how to
incorporate them for this purpose. AFAIK one can only operate on numeric
fields, which have to be set up at index time. But the distribution of the
shops to which the products in the search result belong can only be
determined at search time.
Can you give me a more specific hint on how you would aggregate this
information with a function query?

thanks,
    Axel
-- 
View this message in context: http://www.nabble.com/Unwanted-clustering-of-search-results-after-sorting-by-score-tp20977761p21495453.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Unwanted clustering of search results after sorting by score

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Axel,

Others may have better ideas, but the simplest idea that occurs to me right now is to really just go over the search results and re-sort them the way you described.  However, I don't think this is as scary as it sounds.  You don't really have to go through the whole result set - you only need to do this for the N hits you are displaying (10 in your example).  All of the data you need to access will already be in memory and cached, so this should be cheap, quick, and easy.  The magic factor that's inversely proportional to the number of products in a shop could be stored in a separate field at index time.

This should be doable with a function query, too.
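
As a rough sketch of the re-sorting idea (plain Python for illustration, not Solr code): the names shop_factor and resort_top_n, the (doc_id, shop, score) tuple layout, and the shop sizes below are all made up for this example. The factor is simply 1 / (number of products in the shop), standing in for the value that would be stored in a field at index time.

```python
def shop_factor(products_in_shop):
    """Hypothetical index-time value: larger for shops with fewer products."""
    return 1.0 / products_in_shop


def resort_top_n(hits, shop_sizes, n=10):
    """Re-sort only the first n hits by (score desc, shop factor desc).

    hits: list of (doc_id, shop, score) tuples, sorted by score descending.
    shop_sizes: dict mapping shop -> number of products in that shop.
    Hits beyond position n are left untouched.
    """
    top, rest = hits[:n], hits[n:]
    # sorted() is stable, so hits with equal score and factor keep their order.
    top = sorted(top, key=lambda h: (-h[2], -shop_factor(shop_sizes[h[1]])))
    return top + rest


# Assumed example data: ten equally scored hits and invented shop sizes.
hits = [(i + 1, shop, 1.0) for i, shop in enumerate("AAAAAABACA")]
shop_sizes = {"A": 20000, "B": 80, "C": 50}
print("".join(shop for _, shop, _ in resort_top_n(hits, shop_sizes, n=10)))
# prints CBAAAAAAAA
```

Note that n has to cover all the tied hits for this to help: with n=5 the example order stays AAAAAABACA, because the first five hits are all from shop A.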


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


