Posted to solr-user@lucene.apache.org by Karthick Duraisamy Soundararaj <ka...@gmail.com> on 2012/08/18 00:50:14 UTC

Diversifying Search Results - Custom Collector

Hi all,

I know this is a rather long description, so thanks in advance for reading!

I am trying to implement a custom collector whose job is to diversify the
results based on a field. Grouping cannot solve the problem because I don't
want to limit the number of results shown based on the grouping field. The
requirement is similar to the discussion here
http://grokbase.com/t/lucene/solr-user/11a60m38z4/negative-boosts-for-docs-with-common-field-value
but the problem I am trying to solve is not the same. I don't need negative
boosts, nor do I have too many documents by the same author.

My problem is that when there are a lot of documents representing products,
products from the same manufacturer tend to appear close together in the
results, so the results lack brand diversity. When you search for sofas,
sofas from manufacturer A dominate the first page, sofas from manufacturer B
dominate the second page, and so on. The cause is that a manufacturer tends
to describe all of its sofas the same way, so there is very little
difference between the documents representing two of its sofas.

I don't want to use grouping because I don't want to limit the number of
products from a manufacturer.

I am thinking about implementing a custom collector that diversifies the
results by applying a penalty to document scores before adding them to the
priority queue (FieldValueHitQueue). I am not exactly sure how I am going to
enforce this penalty, but I seem to need three things:
  1. A way to manipulate the score at collection time (TopFieldCollector)
     after the default scorer has done its job. The comparator seems to use
     getScore in its copy and setBottom methods.
  2. A reference to all the entries in the priority queue
     (HashMap<diversifyingField, Entry>).
  3. A way to reorder the heap after altering the score.

I would appreciate any suggestions on the best way to do these:
  1. What is the best way to manipulate the score from within the collector
     so that the change is reflected in the comparator?
  2. Do you think it will degrade performance terribly if I hold on to
     references to all the entries of the priority queue?
  3. Is it a terrible idea to reorder the heap? Reordering would introduce
     two operations for each duplicate field value. Based on the lookup
     from the priority queue, I change the score of an entry. Since this
     entry could be anywhere in the heap, the following needs to be done to
     restore heap ordering:

         float tmpScore = entryToBeModified.score;
         entryToBeModified.score = maxScore + (reverseMul * entryToBeModified.score);
         pq.updateTop();
         pq.top().score = tmpScore * penalty;  /* apply the penalty to the original score */
         pq.updateTop();
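
The lookup-and-reorder step above can be sketched outside Lucene as follows.
This is only an illustration under stated assumptions: it uses
java.util.PriorityQueue rather than Lucene's own PriorityQueue, and the
Entry class, the field names and the penalty factor are all hypothetical. As
far as I know, Lucene's updateTop() only re-sifts the top element, so the
sketch removes the entry, changes its score and adds it back instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class PenaltyQueueSketch {

    /** Hypothetical queue entry: doc id, diversifying field value, mutable score. */
    static final class Entry {
        final int doc;
        final String fieldValue;
        float score;

        Entry(int doc, String fieldValue, float score) {
            this.doc = doc;
            this.fieldValue = fieldValue;
            this.score = score;
        }
    }

    // Min-heap on score: the weakest entry sits on top, as in a top-N collector.
    private final PriorityQueue<Entry> pq =
            new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));

    // Latest entry per field value, i.e. the HashMap<diversifyingField, Entry>.
    private final Map<String, Entry> byFieldValue = new HashMap<>();

    void add(int doc, String fieldValue, float score) {
        Entry e = new Entry(doc, fieldValue, score);
        pq.add(e);
        byFieldValue.put(fieldValue, e);
    }

    /** Scales the tracked entry's score by penalty and restores heap order. */
    boolean penalize(String fieldValue, float penalty) {
        Entry e = byFieldValue.get(fieldValue);
        if (e == null || !pq.remove(e)) { // remove(Object) is a linear scan
            return false;
        }
        e.score *= penalty; // mutate only while the entry is outside the heap
        pq.add(e);          // re-insert so it sifts to its new position
        return true;
    }

    Entry weakest() {
        return pq.peek();
    }

    public static void main(String[] args) {
        PenaltyQueueSketch q = new PenaltyQueueSketch();
        q.add(1, "A", 3.0f);
        q.add(2, "B", 2.0f);
        q.add(3, "C", 1.0f);
        q.penalize("A", 0.25f); // 3.0 * 0.25 = 0.75, now the weakest entry
        System.out.println(q.weakest().doc); // prints 1
    }
}
```

The removal is linear in the queue size, which is one way to see the cost
behind question 3: every penalty applied to an entry that is not on top
needs either a scan like this or a heap that tracks entry positions.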

I looked at function queries, but a function query doesn't seem to have any
knowledge of ordering, so I don't see a way to create a custom function
query to achieve this.

I would like to make it as modular as possible, and I promise to contribute
it back :) ! I want it to be extensible, clean and pluggable. Please do let
me know if you need any more information or if you feel there is a better
way to achieve the functionality.

Thanks

Re: Diversifying Search Results - Custom Collector

Posted by Karthick Duraisamy Soundararaj <d....@gmail.com>.
Hi Mikhail,

You are correct. "[+] show 6 result.." will work, but it wouldn't suit my
requirements. This is a question of user experience, right?

Imagine the product manager comes to you and says: I don't want to see
"[+] show 6 result.."; I want the results to be diverse, but they should be
shown like any other search results.

I think grouping does this with a two-pass collection. In the first pass it
figures out all the groups, and in the second pass it collects the results
into these groups.
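
To make the two-pass idea concrete, here is a minimal in-memory sketch. It
is not Solr's grouping code: the Hit class, the method names and the limits
are hypothetical, and a real collector would work on doc ids and field
caches rather than on a list of objects.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwoPassGroupingSketch {

    /** Hypothetical search hit: id, grouping field value and score. */
    static final class Hit {
        final String id;
        final String group;
        final double score;

        Hit(String id, String group, double score) {
            this.id = id;
            this.group = group;
            this.score = score;
        }
    }

    /** Pass 1: find the topGroups groups, ranked by their best-scoring hit. */
    static List<String> firstPass(List<Hit> hits, int topGroups) {
        Map<String, Double> best = new HashMap<>();
        for (Hit h : hits) {
            best.merge(h.group, h.score, Double::max);
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort(Comparator.comparingDouble((String g) -> best.get(g)).reversed());
        return new ArrayList<>(groups.subList(0, Math.min(topGroups, groups.size())));
    }

    /** Pass 2: collect up to docsPerGroup best hits into each selected group. */
    static Map<String, List<Hit>> secondPass(List<Hit> hits, List<String> groups,
                                             int docsPerGroup) {
        Map<String, List<Hit>> result = new LinkedHashMap<>();
        for (String g : groups) {
            result.put(g, new ArrayList<>());
        }
        for (Hit h : hits) {
            List<Hit> bucket = result.get(h.group);
            if (bucket != null) { // hits of unselected groups are skipped
                bucket.add(h);
            }
        }
        for (List<Hit> bucket : result.values()) {
            bucket.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            if (bucket.size() > docsPerGroup) {
                bucket.subList(docsPerGroup, bucket.size()).clear();
            }
        }
        return result;
    }
}
```

The docsPerGroup cap in the second pass is the per-group limit that I want
to avoid for the flat result list.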


Thanks,
Karthick

On Mon, Aug 20, 2012 at 3:24 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Hello,
>
> I don't believe your task can be solved by playing with scoring, collectors
> or shuffling. To me it is clearly a Grouping use case (although I don't
> really know this feature well).
>
> > Grouping cannot solve the problem because I dont want to limit the
> > number of results showed based on the grouping field.
>
> I'm not really getting it. Why can't you set the limit to 11 and just show
> labels like "[+] show 6 result..", or, if you get 11, "[+] show more than
> 10 .."?
>
> If you have trouble constructing the search result page, I can suggest
> submitting a search request with rows=0&facet.field=BRAND. Your algorithm
> can then choose the number of items it needs for each brand and submit
> rows=X&fq=BRAND:Y, which gives you arbitrarily sized "groups".
>
> Will this work for you?

Re: Diversifying Search Results - Custom Collector

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello,

I don't believe your task can be solved by playing with scoring, collectors
or shuffling. To me it is clearly a Grouping use case (although I don't
really know this feature well).

> Grouping cannot solve the problem because I dont want to limit the number
> of results showed based on the grouping field.

I'm not really getting it. Why can't you set the limit to 11 and just show
labels like "[+] show 6 result..", or, if you get 11, "[+] show more than 10
.."?

If you have trouble constructing the search result page, I can suggest
submitting a search request with rows=0&facet.field=BRAND. Your algorithm
can then choose the number of items it needs for each brand and submit
rows=X&fq=BRAND:Y, which gives you arbitrarily sized "groups".

Will this work for you?
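
A rough sketch of the client-side step, assuming the facet counts per brand
have already been read from the first response (as far as I know, facet=true
must also be sent for facet.field to take effect). The class and method
names are hypothetical, and the round-robin split is only one possible way
to choose the number of items per brand.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class BrandQuotaSketch {

    /**
     * Splits pageSize slots across brands round-robin, never giving a brand
     * more slots than it has matching documents.
     */
    static Map<String, Integer> allocate(Map<String, Integer> facetCounts, int pageSize) {
        Map<String, Integer> quota = new LinkedHashMap<>();
        for (String brand : facetCounts.keySet()) {
            quota.put(brand, 0);
        }
        int remaining = pageSize;
        boolean progressed = true;
        while (remaining > 0 && progressed) {
            progressed = false; // stops the loop once every brand is exhausted
            for (Map.Entry<String, Integer> e : facetCounts.entrySet()) {
                if (remaining == 0) {
                    break;
                }
                int given = quota.get(e.getKey());
                if (given < e.getValue()) {
                    quota.put(e.getKey(), given + 1);
                    remaining--;
                    progressed = true;
                }
            }
        }
        return quota;
    }

    /** Builds the rows=X&fq=BRAND:Y part of the follow-up request. */
    static String followUpParams(String brand, int rows) {
        // Quote the value so a brand with spaces stays one term, then URL-encode.
        String fq = "BRAND:\"" + brand.replace("\"", "\\\"") + "\"";
        return "rows=" + rows + "&fq=" + URLEncoder.encode(fq, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("A", 7);
        counts.put("B", 1);
        counts.put("C", 3);
        // Six slots end up as A=3, B=1, C=2.
        allocate(counts, 6).forEach((brand, rows) ->
                System.out.println(followUpParams(brand, rows)));
    }
}
```

Note that this costs one extra request per brand shown on the page, and the
client still has to interleave the per-brand lists itself.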

On Mon, Aug 20, 2012 at 8:28 PM, Karthick Duraisamy Soundararaj <
d.s.karthick@gmail.com> wrote:

> Tanguy,
>
> Your idea is perfect for cases where 80-90% of the documents have the same
> value for a particular field. As an example, your idea is ideal if, let's
> say, we have 10 documents in total like this:
>
>  doc1 : <merchantName> Kellog's </merchantName>
>  doc2 : <merchantName> Kellog's </merchantName>
>  doc3 : <merchantName> Kellog's </merchantName>
>  doc4 : <merchantName> Kellog's </merchantName>
>  doc5 : <merchantName> Kellog's </merchantName>
>  doc6 : <merchantName> Kellog's </merchantName>
>  doc7 : <merchantName> Kellog's </merchantName>
>  doc8 : <merchantName> Nestle </merchantName>
>  doc9 : <merchantName> Kellog's </merchantName>
>  doc10 : <merchantName> Kellog's </merchantName>
>
> But I have
>  doc1 : <merchantName> Maggi </merchantName>
>  doc2 : <merchantName> Maggi  </merchantName>
>  doc3 : <merchantName> M&M's </merchantName>
>  doc4 : <merchantName> M&M's </merchantName>
>  doc5 : <merchantName> Hershey's </merchantName>
>  doc6 : <merchantName> Hershey's </merchantName>
>  doc7 : <merchantName> Nestle </merchantName>
>  doc8 : <merchantName> Nestle </merchantName>
>  doc9 : <merchantName> Kellog's </merchantName>
>  doc10 : <merchantName> Kellog's </merchantName>
>
>
> Thanks,
> Karthick


-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Diversifying Search Results - Custom Collector

Posted by Karthick Duraisamy Soundararaj <d....@gmail.com>.
Tanguy,

Your idea is perfect for cases where 80-90% of the documents have the same
value for a particular field. As an example, your idea is ideal if, let's
say, we have 10 documents in total like this:

 doc1 : <merchantName> Kellog's </merchantName>
 doc2 : <merchantName> Kellog's </merchantName>
 doc3 : <merchantName> Kellog's </merchantName>
 doc4 : <merchantName> Kellog's </merchantName>
 doc5 : <merchantName> Kellog's </merchantName>
 doc6 : <merchantName> Kellog's </merchantName>
 doc7 : <merchantName> Kellog's </merchantName>
 doc8 : <merchantName> Nestle </merchantName>
 doc9 : <merchantName> Kellog's </merchantName>
 doc10 : <merchantName> Kellog's </merchantName>

But I have
 doc1 : <merchantName> Maggi </merchantName>
 doc2 : <merchantName> Maggi  </merchantName>
 doc3 : <merchantName> M&M's </merchantName>
 doc4 : <merchantName> M&M's </merchantName>
 doc5 : <merchantName> Hershey's </merchantName>
 doc6 : <merchantName> Hershey's </merchantName>
 doc7 : <merchantName> Nestle </merchantName>
 doc8 : <merchantName> Nestle </merchantName>
 doc9 : <merchantName> Kellog's </merchantName>
 doc10 : <merchantName> Kellog's </merchantName>


Thanks,
Karthick

On Mon, Aug 20, 2012 at 12:01 PM, Tanguy Moal <ta...@gmail.com> wrote:

> Hello,
>
> I don't know if this will help, but if I understand your issue, you have
> a lot of documents with the same or very close scores. Moreover, I think
> you get your matches in merchant order (more or less) because they were
> indexed in that very same order, and Solr returns documents with equal
> scores in insertion order (although there is no contract specifying this).
>
> You could work around that issue by:
> 1/ Turning off tf/idf, because you're searching in documents with little
> text where only the match counts and frequencies obviously aren't helping.
> 2/ Adding a random number to each document at index time and boosting on
> that random value at query time. This will shuffle your results, and it is
> probably the simplest thing to do.
>
> Hope this helps,
>
> Tanguy

Re: Diversifying Search Results - Custom Collector

Posted by Karthick Duraisamy Soundararaj <d....@gmail.com>.
Hello Mikhail,

Thank you for the reply. In terms of user experience, I want to spread
products from the same brand farther apart from each other, *at least* in
the first 50-100 results we display. I am thinking about two different
approaches as a solution.

  1. For the first few results, display one top-scoring product per
     manufacturer (for a given field, display the top-scoring result of
     each unique field value for the first N matches). This N could be
     either a percentage of the total matches or a configurable absolute
     value.
  2. Enforce a penalty on the score of results that have duplicate field
     values. The penalty can be enforced in such a way that results with
     higher scores are not affected, while those with lower scores are.

Both solutions can be implemented while sorting the documents with
TopFieldCollector / TopScoreDocCollector.
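
As a minimal sketch of the second approach, here is the penalty applied as a
post-processing step over an already collected list rather than inside a
Lucene collector. The Hit class, the rerank method and the multiplicative
penalty are assumptions made for illustration: the best hit of each brand
keeps its score, and the n-th further hit of the same brand is multiplied by
penalty^n.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DiversifySketch {

    /** Hypothetical search hit: id, diversifying field value and score. */
    static final class Hit {
        final String id;
        final String brand;
        final double score;

        Hit(String id, String brand, double score) {
            this.id = id;
            this.brand = brand;
            this.score = score;
        }
    }

    /** Returns the hits re-sorted after penalizing repeated brands. */
    static List<Hit> rerank(List<Hit> hits, double penalty) {
        Comparator<Hit> byScoreDesc =
                Comparator.comparingDouble((Hit h) -> h.score).reversed();

        // Walk the hits from best to worst so the top hit of a brand is untouched.
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(byScoreDesc);

        Map<String, Integer> seen = new HashMap<>();
        List<Hit> adjusted = new ArrayList<>();
        for (Hit h : sorted) {
            // Number of better-scoring hits already seen for this brand.
            int duplicates = seen.merge(h.brand, 1, Integer::sum) - 1;
            adjusted.add(new Hit(h.id, h.brand, h.score * Math.pow(penalty, duplicates)));
        }
        adjusted.sort(byScoreDesc);
        return adjusted;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("doc1", "Maggi", 1.00));
        hits.add(new Hit("doc2", "Maggi", 0.99));
        hits.add(new Hit("doc3", "M&M's", 0.98));
        hits.add(new Hit("doc4", "M&M's", 0.97));
        for (Hit h : rerank(hits, 0.9)) {
            System.out.println(h.id + " " + h.brand); // order: doc1, doc3, doc2, doc4
        }
    }
}
```

Doing the same thing inside the collector is harder because documents are
collected in doc id order, not score order, so the best hit of a brand is
not known when its first document arrives.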

Does this answer your question?  Please let me know if you have any more
questions.

Thanks,
Karthick

On Mon, Aug 20, 2012 at 3:26 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Hello,
>
> I've got the problem description below. Can you explain the expected user
> experience, and/or solution approach before diving into the algorithm
> design?
>
> Thanks

Re: Diversifying Search Results - Custom Collector

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello,

I've got the problem description below. Can you explain the expected user
experience, and/or solution approach before diving into the algorithm
design?

Thanks

On Sat, Aug 18, 2012 at 2:50 AM, Karthick Duraisamy Soundararaj <
karthick.soundararaj@gmail.com> wrote:

> My problem is that when there are a lot of documents representing products,
> products from the same manufacturer tend to appear close together in the
> results, so the results lack brand diversity. When you search for sofas,
> sofas from manufacturer A dominate the first page, sofas from manufacturer B
> dominate the second page, and so on. The cause is that a manufacturer tends
> to describe all of its sofas the same way, so there is very little
> difference between the documents representing two of its sofas.
>



-- 
Sincerely yours
Mikhail Khludnev
Tech Lead
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>