Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2008/05/19 21:58:09 UTC

Open Source Relevance

Copied from http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/

For a while now, I have been trying to get my hands on TREC data for  
the Lucene project.  For those who aren’t familiar, TREC is an annual
competition for search engines that provides a common set of documents
to index, queries to execute, and judgments against which to check your
answers, so you can see how well an engine performs.  While it isn’t
the be-all, end-all for relevance, it is a pretty good sanity check on
how you are doing.  For instance, many search engines do OK on it out
of the box, but once
you tune them, they can do much better.  Of course, you risk  
overtuning to TREC as well.

In TREC, the queries and the judgments are provided for free, but one  
has to pay for the data, or at least most of it, since it is usually  
owned by Reuters or some other organization.  It isn’t expensive or
anything, but it is a barrier nonetheless, especially for an open
source project.  Furthermore, the whole notion of paying for data in
this day and age of open source and Creative Commons just doesn’t sit
right with me.  Don’t get me wrong: I’m a big fan of TREC, having
participated in the past, and it provides a valuable service to the
proprietary/academic IR community.

So, what does this have to do with Lucene?  When I say I am trying to  
get my hands on TREC data, I don’t mean just for me, I literally mean  
obtaining TREC data for Lucene.  That is, I want the data to be made  
available, ideally, for all Lucene (and, for that matter, all open  
source search engine) users to use and run experiments on so as to  
spur on innovation in Lucene’s scoring algorithms, etc.  Now, I know  
the copyright owners will never allow this, as I have asked.  So, my  
next thought was let’s just get it for internal use by committers at  
Apache.  So, I went back to TREC and we have an agreement to do this,  
more or less.  The problem, however, is that they say we can only use  
the data on ASF (Apache) machines.  Not a big deal, right?  Kind of.   
The ASF doesn’t really have the hardware to run TREC-style
experiments.  We pretty much have one Solaris “zone” allotted to us
(a “zone” is a running virtual machine guest image).  Furthermore, the
ASF is pretty much an all-volunteer, worldwide-distributed
organization.  We do almost all of our work on our own machines as
VOLUNTEERS.  Practically speaking, the best way for any of us to take
advantage of the data is to have it locally, which, I am told, isn’t
going to happen.

So, what’s the point?  I think it is time the open source search  
community (and I don’t mean just Lucene) develop and publish a set of  
TREC-style relevance judgments for freely available data that is  
easily obtained from the Internet.  Simply put, I am wondering if  
there are volunteers out there who would be willing to develop a  
practical set of queries and judgments for datasets like Wikipedia,  
iBiblio, the Internet Archive, etc.  We wouldn’t host these datasets,  
we would just provide the queries and judgments, as well as the info  
on how to obtain the data.  Then, it is easy enough to provide simple  
scripts that do things like run Lucene’s contrib/benchmark Quality  
tasks against said data.
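
To make that concrete, here is a minimal sketch of driving the
contrib/benchmark Quality classes from Java against a TREC-format
topics file and qrels (judgments) file.  The class names are the real
org.apache.lucene.benchmark.quality ones, but the file names, index
path, and field names ("title", "body", "docname") are placeholders,
and exact method signatures have shifted a bit between Lucene
releases, so treat this as a sketch rather than the canonical driver:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;

    import org.apache.lucene.benchmark.quality.Judge;
    import org.apache.lucene.benchmark.quality.QualityBenchmark;
    import org.apache.lucene.benchmark.quality.QualityQuery;
    import org.apache.lucene.benchmark.quality.QualityQueryParser;
    import org.apache.lucene.benchmark.quality.QualityStats;
    import org.apache.lucene.benchmark.quality.trec.TrecJudge;
    import org.apache.lucene.benchmark.quality.trec.TrecTopicsReader;
    import org.apache.lucene.benchmark.quality.utils.SimpleQQParser;
    import org.apache.lucene.benchmark.quality.utils.SubmissionReport;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class RunQuality {
      public static void main(String[] args) throws Exception {
        PrintWriter log = new PrintWriter(System.out, true);

        // Read TREC-style topics (the queries) and qrels (the judgments).
        // "topics.txt" and "qrels.txt" are placeholder file names.
        TrecTopicsReader topicsReader = new TrecTopicsReader();
        QualityQuery[] queries = topicsReader.readQueries(
            new BufferedReader(new FileReader("topics.txt")));
        Judge judge = new TrecJudge(
            new BufferedReader(new FileReader("qrels.txt")));
        judge.validateData(queries, log);

        // Turn each topic's "title" into a query against the "body" field.
        QualityQueryParser qqParser = new SimpleQQParser("title", "body");

        IndexSearcher searcher =
            new IndexSearcher(FSDirectory.getDirectory("/path/to/index"));

        // Run the queries and score the results against the judgments;
        // "docname" is the stored field holding each document's name.
        QualityBenchmark qrun =
            new QualityBenchmark(queries, qqParser, searcher, "docname");
        QualityStats[] stats =
            qrun.execute(judge, new SubmissionReport(log, "lucene"), log);

        // Average the precision/recall figures over all queries.
        QualityStats.average(stats).log("SUMMARY", 2, log, "  ");
      }
    }

Point something like that at any of the datasets above, plus a
published topics/qrels pair, and you have a repeatable relevance run.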

Practically speaking, I don’t think we even need to go as deep as  
TREC.  I think we would find the most use in making judgments on the  
top 10 or 20 results for any given query.
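
For illustration, judgments in TREC’s qrels format are just one line
per (query, document) pair: query ID, an unused iteration column,
document name, and a relevance label.  A shallow, top-10-style file
for, say, Wikipedia might look like this (all IDs and labels here are
invented):

    1 0 Apache_Lucene 1
    1 0 Apache_HTTP_Server 0
    2 0 Information_retrieval 1
    2 0 Golden_retriever 0

That format is what the TrecJudge class in contrib/benchmark already
parses, so shallow judgments would plug straight into the sketch above.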

So, what do others think?  Am I off my rocker?  Are there any  
volunteers out there?  I think we could do this pretty simply through  
some scripts, and the effective use of a wiki.  I don’t think our goal  
is, in the short run, to be scientifically rigorous, but it should be  
over time.  Instead, I think our goal is to run a practical relevance  
test like any organization should when deploying search: take 50 (top)  
queries and judge them, as well as 20 or so random queries and judge  
them.  (I wonder if Wikipedia would give us their top 50 queries, or
maybe it is already available.)  Over time, we can add queries, and  
refine judgments using the web 2.0 mentality of the wisdom of crowds.

FWIW, there is probably some alignment with the Wikia search project.

Cheers,
Grant


Re: Open Source Relevance

Posted by Grant Ingersoll <gs...@apache.org>.
Cool, hadn't seen that.

-Grant

On May 20, 2008, at 1:01 PM, Steven A Rowe wrote:

> On 05/19/2008 at 3:58 PM, Grant Ingersoll wrote:
>> I think it is time the open source search community (and
>> I don’t mean just Lucene) develop and publish a set of
>> TREC-style relevance judgments for freely available data
>> that is easily obtained from the Internet.
>
> Stephen Green, Minion developer at Sun, whose posts comparing Minion  
> and Lucene were recently mentioned on the solr-user mailing list[1],  
> has similar ideas.  From
> <http://blogs.sun.com/searchguy/entry/minion_and_lucene_performance>:
>
>   I think it would be a good idea for all of the open
>   source engines to get together, find a nice open document
>   collection (the Apache mailing list archives and their
>   associated searches?) and build a nice set of regression
>   tests and some pooled relevance sets so that we can test
>   retrieval performance without having to rely on the TREC
>   data.
>
> Steve
>
> [1] Solr += Minion? on solr-user:
> <http://www.nabble.com/Minion%2C-anyone--td17344160.html>
>




RE: Open Source Relevance

Posted by Steven A Rowe <sa...@syr.edu>.
On 05/19/2008 at 3:58 PM, Grant Ingersoll wrote:
> I think it is time the open source search community (and
> I don’t mean just Lucene) develop and publish a set of
> TREC-style relevance judgments for freely available data
> that is easily obtained from the Internet.

Stephen Green, Minion developer at Sun, whose posts comparing Minion and Lucene were recently mentioned on the solr-user mailing list[1], has similar ideas.  From <http://blogs.sun.com/searchguy/entry/minion_and_lucene_performance>:

   I think it would be a good idea for all of the open
   source engines to get together, find a nice open document
   collection (the Apache mailing list archives and their
   associated searches?) and build a nice set of regression
   tests and some pooled relevance sets so that we can test
   retrieval performance without having to rely on the TREC
   data.

Steve

[1] Solr += Minion? on solr-user: <http://www.nabble.com/Minion%2C-anyone--td17344160.html>
