Posted to solr-user@lucene.apache.org by Lance Norskog <go...@gmail.com> on 2009/02/02 05:56:25 UTC

Re: Optimizing & Improving results based on user feedback

To avoid the "users only see the first page" problem, one solution is this:
if the result set spans more than one page and the high scores are near each
other, scramble them.

That is, if the top 20 results range in score from 19.0 to 20.0, they really
are all about the same relevance, so just card-shuffle them. This way the
same search by 500 users will give a statistically more valid click dataset.
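
A minimal sketch of that shuffle (plain Java, outside of Solr; the ScoredDoc
type and the epsilon band are invented here for illustration):

import java.util.Collections;
import java.util.List;

public class ScoreBandShuffler {
    public static class ScoredDoc {
        public final String id;
        public final float score;
        public ScoredDoc(String id, float score) { this.id = id; this.score = score; }
    }

    /** Shuffles the leading run of results whose scores lie within
     *  epsilon of the top score; everything below that band keeps its
     *  original order. Assumes results are sorted by score, descending. */
    public static void shuffleTopBand(List<ScoredDoc> results, float epsilon) {
        if (results.isEmpty()) return;
        float top = results.get(0).score;
        int end = 1;
        while (end < results.size() && top - results.get(end).score <= epsilon) {
            end++;
        }
        if (end > 1) {
            Collections.shuffle(results.subList(0, end));
        }
    }
}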

On Fri, Jan 30, 2009 at 8:25 AM, Ryan McKinley <ry...@gmail.com> wrote:

> yes, applying a boost would be a good addition.
>
> patches are always welcome ;)
>
>
>
> On Jan 30, 2009, at 10:56 AM, Matthew Runo wrote:
>
>>  I've thought about patching the QueryElevationComponent to apply boosts
>> rather than a specific sort. Then the file might look like:
>>
>> <query text="AAA">
>>   <doc id="A" boost="5" />
>>   <doc id="B" boost="4" />
>> </query>
>>
>> And I could write a script that looks at click data once a day to fill out
>> this file.
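>>
>> A rough sketch of that daily script (the tab-separated "query<TAB>docId"
>> click-log format and the log2 boost curve are just assumptions for
>> illustration):
>>
>> import java.io.*;
>> import java.nio.file.*;
>> import java.util.*;
>>
>> public class ClickBoostFileWriter {
>>     public static void main(String[] args) throws IOException {
>>         // Aggregate clicks per (query, docId) from yesterday's log.
>>         Map<String, Map<String, Integer>> clicks = new TreeMap<>();
>>         for (String line : Files.readAllLines(Paths.get("clicks.log"))) {
>>             String[] parts = line.split("\t");
>>             clicks.computeIfAbsent(parts[0], q -> new TreeMap<>())
>>                   .merge(parts[1], 1, Integer::sum);
>>         }
>>         // Emit the boost file; log2(1 + clicks) keeps boosts growing
>>         // slowly. (Real code would XML-escape the query text.)
>>         try (PrintWriter out = new PrintWriter("elevate-boosts.xml")) {
>>             out.println("<elevate>");
>>             for (Map.Entry<String, Map<String, Integer>> q : clicks.entrySet()) {
>>                 out.printf("  <query text=\"%s\">%n", q.getKey());
>>                 for (Map.Entry<String, Integer> d : q.getValue().entrySet()) {
>>                     double boost = Math.log(1 + d.getValue()) / Math.log(2);
>>                     out.printf("    <doc id=\"%s\" boost=\"%.1f\" />%n",
>>                                d.getKey(), boost);
>>                 }
>>                 out.println("  </query>");
>>             }
>>             out.println("</elevate>");
>>         }
>>     }
>> }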
>> Thanks for your time!
>>
>> Matthew Runo
>> Software Engineer, Zappos.com
>> mruno@zappos.com - 702-943-7833
>>
>> On Jan 30, 2009, at 6:37 AM, Ryan McKinley wrote:
>>
>>>  It may not be as fine-grained as you want, but also check the
>>> QueryElevationComponent.  This takes a preconfigured list of what the top
>>> results should be for a given query and makes those documents the top
>>> results.
>>>
>>> Presumably, you could use click logs to determine what the top result
>>> should be.
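>>>
>>> For reference, its elevate.xml maps a query string to the doc ids to pin
>>> at the top (the ids below are placeholders):
>>>
>>> <elevate>
>>>   <query text="rain boots">
>>>     <doc id="A" />
>>>     <doc id="B" />
>>>   </query>
>>> </elevate>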
>>>
>>>
>>> On Jan 29, 2009, at 7:45 PM, Walter Underwood wrote:
>>>
>>>>  "A Decision Theoretic Framework for Ranking using Implicit Feedback"
>>>> uses clicks, but the best part of that paper is all the side comments
>>>> about difficulties in evaluation. For example, if someone clicks on
>>>> three results, is that three times as good or two failures and a
>>>> success? We have to know the information need to decide. That paper
>>>> is in the LR4IR 2008 proceedings.
>>>>
>>>> Both Radlinski and Joachims seem to be focusing on click data.
>>>>
>>>> I'm thinking of something much simpler, like taking the first
>>>> N hits and reordering those before returning. Brute force, but
>>>> would get most of the benefit. Usually, you only have reliable
>>>> click data for a small number of documents on each query, so
>>>> it is a waste of time to rerank the whole list. Besides, if you
>>>> need to move something up 100 places on the list, you should
>>>> probably be tuning your regular scoring rather than patching
>>>> it with click data.
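>>>>
>>>> A minimal sketch of that top-N reordering (the ClickStore interface and
>>>> the 0.1 blend weight are invented for this sketch, not an existing Solr
>>>> API):
>>>>
>>>> import java.util.Comparator;
>>>> import java.util.List;
>>>>
>>>> public class TopNClickReranker {
>>>>     public static class ScoredDoc {
>>>>         public final String id;
>>>>         public final float score;
>>>>         public ScoredDoc(String id, float score) { this.id = id; this.score = score; }
>>>>     }
>>>>
>>>>     /** Hypothetical source of per-query click counts. */
>>>>     public interface ClickStore { long count(String query, String docId); }
>>>>
>>>>     /** Reorders only the first n results by blending the original score
>>>>      *  with a damped click count; the tail is left untouched. */
>>>>     public static void rerank(List<ScoredDoc> results, ClickStore clicks,
>>>>                               String query, int n) {
>>>>         int end = Math.min(n, results.size());
>>>>         results.subList(0, end).sort(
>>>>             Comparator.comparingDouble((ScoredDoc d) ->
>>>>                 d.score + 0.1 * Math.log(1 + clicks.count(query, d.id)))
>>>>             .reversed());
>>>>     }
>>>> }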
>>>>
>>>> wunder
>>>>
>>>> On 1/29/09 3:43 PM, "Matthew Runo" <mr...@zappos.com> wrote:
>>>>
>>>>>  Agreed, it seems that a lot of the algorithms in these papers would
>>>>> almost be a whole new RequestHandler a la Dismax. Luckily a lot of them
>>>>> seem to be built on Lucene (at least the ones that I looked at that
>>>>> had code samples).
>>>>>
>>>>> Which papers did you see that actually talked about using clicks? I
>>>>> don't see those, beyond "Addressing Malicious Noise in Clickthrough
>>>>> Data" by Filip Radlinski and also his "Query Chains: Learning to Rank
>>>>> from Implicit Feedback" - but neither is really on topic.
>>>>>
>>>>> Thanks for your time!
>>>>>
>>>>> Matthew Runo
>>>>> Software Engineer, Zappos.com
>>>>> mruno@zappos.com - 702-943-7833
>>>>>
>>>>> On Jan 29, 2009, at 11:36 AM, Walter Underwood wrote:
>>>>>
>>>>>>  Thanks, I didn't know there was so much research in this area.
>>>>>> Most of the papers at those workshops are about tuning the
>>>>>> entire ranking algorithm with machine learning techniques.
>>>>>>
>>>>>> I am interested in adding one more feature, click data, to an
>>>>>> existing ranking algorithm. In my case, I have enough data to
>>>>>> use query-specific boosts instead of global document boosts.
>>>>>> We get about 2M search clicks per day from logged in users
>>>>>> (little or no click spam).
>>>>>>
>>>>>> I'm checking out some papers from Thorsten Joachims and from
>>>>>> Microsoft Research that are specifically about clickthrough
>>>>>> feedback.
>>>>>>
>>>>>> wunder
>>>>>>
>>>>>> On 1/27/09 11:15 PM, "Neal Richter" <nr...@gmail.com> wrote:
>>>>>>
>>>>>>>  OK, I've implemented this before and have written academic papers and patents
>>>>>>> related to this task.
>>>>>>>
>>>>>>> Here are some hints:
>>>>>>> - you're on the right track with the editorial boosting elevators
>>>>>>> - http://wiki.apache.org/solr/UserTagDesign
>>>>>>> - be darn careful about assuming that one click is enough evidence
>>>>>>>   to boost a long 'distance'
>>>>>>> - first-page effects will skew the learning badly if you don't
>>>>>>>   compensate (see the sketch after this list). Roughly 95% of users
>>>>>>>   never go past the first page of results and only 1% go past the
>>>>>>>   second page, so perfectly good results on the second page get
>>>>>>>   permanently locked out
>>>>>>> - consider forgetting what you learn under some condition
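>>>>>>>
>>>>>>> A sketch of those last two hints combined (the propensity table and
>>>>>>> the half-life are invented numbers, not taken from any paper cited
>>>>>>> here):
>>>>>>>
>>>>>>> public class ClickWeighting {
>>>>>>>     // Rough probability that a user even examines each rank;
>>>>>>>     // rank 0 is the top hit.
>>>>>>>     private static final double[] PROPENSITY =
>>>>>>>         { 1.0, 0.8, 0.6, 0.45, 0.35, 0.28, 0.22, 0.18, 0.15, 0.12 };
>>>>>>>
>>>>>>>     /** Inverse-propensity weighting: a click on a rarely examined
>>>>>>>      *  position counts for more than a click on the top hit. */
>>>>>>>     public static double positionWeight(int rank) {
>>>>>>>         double p = rank < PROPENSITY.length ? PROPENSITY[rank] : 0.1;
>>>>>>>         return 1.0 / p;
>>>>>>>     }
>>>>>>>
>>>>>>>     /** "Forgetting": exponentially decay old click evidence. */
>>>>>>>     public static double decayedWeight(double weight, double ageDays,
>>>>>>>                                        double halfLifeDays) {
>>>>>>>         return weight * Math.pow(0.5, ageDays / halfLifeDays);
>>>>>>>     }
>>>>>>> }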
>>>>>>>
>>>>>>> In fact this whole area is called 'learning to rank' and is a hot
>>>>>>> research topic in IR.
>>>>>>> http://web.mit.edu/shivani/www/Ranking-NIPS-05/
>>>>>>> http://research.microsoft.com/en-us/um/people/lr4ir-2007/
>>>>>>> https://research.microsoft.com/en-us/um/people/lr4ir-2008/
>>>>>>>
>>>>>>> - Neal Richter
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 27, 2009 at 2:06 PM, Matthew Runo <mr...@zappos.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello folks!
>>>>>>>>
>>>>>>>> We've been thinking about ways to improve organic search results for
>>>>>>>> a while (really, who hasn't?) and I'd like to get some ideas on ways
>>>>>>>> to implement a feedback system that uses user behavior as input.
>>>>>>>> Basically, it'd work on the premise that what the user actually
>>>>>>>> clicked on is probably a really good match for their search, and
>>>>>>>> should be boosted up in the results for that search.
>>>>>>>>
>>>>>>>> For example, if I search for "rain boots", and really love the 10th
>>>>>>>> result down (and show it by clicking on it), then we'd like to
>>>>>>>> capture this and use the data to boost up that result //for that
>>>>>>>> search//. We've thought about using index-time boosts for the
>>>>>>>> documents, but that'd boost them regardless of the search terms,
>>>>>>>> which isn't what we want. We've thought about using the Elevator
>>>>>>>> handler, but we don't really want to force a product to the top -
>>>>>>>> we'd prefer it slowly rises over time as more and more people click
>>>>>>>> it from the same search terms. Another way might be to stuff the
>>>>>>>> keyword into the document - the more times it's in the document, the
>>>>>>>> higher it'd score - but there's gotta be a better way than that.
>>>>>>>>
>>>>>>>> Obviously this can't be done 100% in Solr - but if anyone has some
>>>>>>>> clever ideas about how this might be possible, it'd be interesting
>>>>>>>> to hear them.
>>>>>>>>
>>>>>>>> Thanks for your time!
>>>>>>>>
>>>>>>>> Matthew Runo
>>>>>>>> Software Engineer, Zappos.com
>>>>>>>> mruno@zappos.com - 702-943-7833
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Lance Norskog
goksron@gmail.com
650-922-8831 (US)