Posted to openrelevance-dev@lucene.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2011/03/02 03:39:23 UTC

Query & click logs for custom Lucene relevance models

Hello,

I'm helping out a student interested in using query and click logs to build 
custom relevance models for Lucene.  Step #1 is finding a good dataset that 
contains the needed data.  I've looked around, found a few things, but nothing 
that looks very good.

I was wondering if anyone has any dataset suggestions?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


Re: Query & click logs for custom Lucene relevance models

Posted by Tommaso Teofili <to...@gmail.com>.
Hi Otis,
you may find some resources (mainly code, not datasets) for your student's
use case at [1].
I also know that LWE from LucidImagination [2] has a click & scoring framework,
but that component is not open source at the moment, and I don't know whether
they also used publicly available datasets to build that feature.
My 0.2 cents,
Tommaso

[1] : http://code.google.com/p/oluolu
[2] : http://www.lucidimagination.com/enterprise-search-solutions/lucidworks/1.6

2011/3/2 Otis Gospodnetic <ot...@yahoo.com>

> Hello,
>
> I'm helping out a student interested in using query and click logs to build
> custom relevance models for Lucene.  Step #1 is finding a good dataset that
> contains the needed data.  I've looked around, found a few things, but nothing
> that looks very good.
>
> I was wondering if anyone has any dataset suggestions?
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>

Re: Query & click logs for custom Lucene relevance models

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Andrzej,

Thanks for bringing up the AOL dataset (I've got a copy of it stashed away). 
The person I'm helping looked at it, and we thought it didn't have all the data 
one needs to build custom relevance models.
Here is a small sample:

AnonID  Query                  QueryTime            ItemRank  ClickURL
217     lottery                2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery                2006-03-01 11:58:51  1         http://www.calottery.com
217     ameriprise.com         2006-03-01 14:06:23  1         http://www.ameriprise.com
217     susheme                2006-03-02 12:31:08
217     united.com             2006-03-03 14:54:13
217     mizuno.com             2006-03-07 22:41:17  1         http://www.mizuno.com
217     p; .; p;' p; ' ;' ;';  2006-03-09 12:09:27
217     p; .; p;' p; ' ;' ;';  2006-03-09 12:09:35
217     buddylis               2006-03-16 15:23:33
217     bestasiancompany.com   2006-03-20 15:15:43  1         http://www.bestasiancompany.com
217     lottery                2006-03-27 14:10:38  1         http://www.calottery.com
217     lottery                2006-03-27 16:34:59  1         http://www.calottery.com
217     ask.com                2006-03-31 14:31:10  1         http://www.ask.com
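
To make the shape of this data concrete, here is a minimal Python sketch that 
reads rows in the layout above. It assumes the fields are tab-separated and 
that rows without a click simply omit ItemRank and ClickURL; the names LogRow 
and parse_log are made up for illustration and are not part of any dataset 
tooling.

```python
from typing import List, NamedTuple, Optional


class LogRow(NamedTuple):
    anon_id: str
    query: str
    query_time: str
    item_rank: Optional[int]   # None when the query produced no click
    click_url: Optional[str]   # None when the query produced no click


def parse_log(text: str) -> List[LogRow]:
    """Parse rows in the AnonID/Query/QueryTime/ItemRank/ClickURL layout.

    Assumption: fields are separated by tabs, and click-less rows stop
    after QueryTime.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = [f.strip() for f in line.split("\t")]
        if fields[0] == "AnonID":
            continue  # skip the header row
        fields += [""] * (5 - len(fields))  # pad click-less rows
        anon_id, query, query_time, rank, url = fields[:5]
        rows.append(
            LogRow(anon_id, query, query_time,
                   int(rank) if rank else None,
                   url or None)
        )
    return rows


# Two rows taken from the sample above, re-typed with explicit tabs.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "217\tlottery\t2006-03-01 11:58:51\t1\thttp://www.calottery.com\n"
    "217\tsusheme\t2006-03-02 12:31:08\n"
)

for row in parse_log(SAMPLE):
    print(row.query, row.item_rank, row.click_url)
```

Note that each row only records the clicked result and its rank, not the other 
results that were shown for the query.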

For instance, in order to build custom relevance models, wouldn't we need to 
have the actual corpus/index associated with this data in order to get the base 
relevance scores first?

Or could one just look at clicks on results that ranked low (a large ItemRank, 
meaning they were not close to the top of the search results) and apply some 
algorithm that essentially produces a boost score that stands on its own and is 
applied on top of the relevance score at search time?
Would it make sense to have a global boost score for each document, or would 
that need to be query-specific and thus applied at query time rather than at 
index time?
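
To illustrate the two options, here is a rough Python sketch that derives both 
a query-specific and a global boost from (query, ItemRank, ClickURL) triples. 
The weighting log2(1 + rank), which gives more credit to clicks far from the 
top, is an arbitrary heuristic chosen for this example, and click_boosts is a 
hypothetical name; neither comes from the AOL data or from Lucene.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def click_boosts(
    clicks: Iterable[Tuple[str, int, str]]
) -> Tuple[Dict[Tuple[str, str], float], Dict[str, float]]:
    """Return (per-query boosts, global boosts) from (query, rank, url) clicks.

    Assumption: a click on a result shown further down the list is stronger
    evidence of relevance, so each click is weighted by log2(1 + rank).
    """
    per_query = defaultdict(float)  # keyed by (query, url)
    per_url = defaultdict(float)    # keyed by url only
    for query, rank, url in clicks:
        weight = math.log2(1 + rank)  # rank 1 -> 1.0, rank 3 -> 2.0
        per_query[(query, url)] += weight
        per_url[url] += weight
    return dict(per_query), dict(per_url)


# Clicked rows from the sample above, reduced to (query, ItemRank, ClickURL).
clicks = [
    ("lottery", 1, "http://www.calottery.com"),
    ("lottery", 1, "http://www.calottery.com"),
    ("ask.com", 1, "http://www.ask.com"),
]
per_query, per_url = click_boosts(clicks)
print(per_query)
print(per_url)
```

The per-URL table depends only on the document, so it could in principle be 
computed offline and stored alongside the document, while the per-query table 
has to be looked up once the query is known. Either way, these numbers are 
independent of the engine's base relevance score and would need to be combined 
with it somehow.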

If you have an idea how one could/should go about using just the above to build 
custom relevance models for Lucene, I'm all eyeballs.

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Andrzej Bialecki <ab...@getopt.org>
> To: openrelevance-dev@lucene.apache.org
> Sent: Wed, March 2, 2011 5:07:55 AM
> Subject: Re: Query & click logs for custom Lucene relevance models
> 
> On 3/2/11 3:39 AM, Otis Gospodnetic wrote:
> > Hello,
> > 
> > I'm helping out a student interested in using query and click logs to build
> > custom relevance models for Lucene.  Step #1 is finding a good dataset that
> > contains the needed data.  I've looked around, found a few things, but nothing
> > that looks very good.
> > 
> > I was wondering if anyone has any dataset suggestions?
> 
> The (in)famous AOL dataset comes to my mind, and it's very good, maybe even
> too good :) AOL officially pulled it back, but it's still available and IMHO
> legitimate to use - it was a blunder all right but it carried a suitable
> license and things can't be un-published ...
> 
> -- Best regards,
> Andrzej Bialecki      <><
>  ___. ___ ___ ___ _ _    __________________________________
> [__ || __|__/|__||\/|  Information  Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded  Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 

Re: Query & click logs for custom Lucene relevance models

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 3/2/11 3:39 AM, Otis Gospodnetic wrote:
> Hello,
>
> I'm helping out a student interested in using query and click logs to build
> custom relevance models for Lucene.  Step #1 is finding a good dataset that
> contains the needed data.  I've looked around, found a few things, but nothing
> that looks very good.
>
> I was wondering if anyone has any dataset suggestions?

The (in)famous AOL dataset comes to my mind, and it's very good, maybe 
even too good :) AOL officially pulled it back, but it's still available 
and IMHO legitimate to use - it was a blunder all right but it carried a 
suitable license and things can't be un-published ...

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com