Posted to dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/08/07 23:57:28 UTC

TREC Collection, NIST and Lucene

DISCLAIMER: Just to be clear, what follows is my personal opinion and  
in no way, shape or form reflects an official position from the  
Lucene project:

So, now that we have all this great stuff for running TREC  
experiments in contrib/benchmark, I am wondering if people think it  
would be useful to send an open letter (and I mean an official letter
from the ASF in a similar vein to
http://apache.org/jcp/sunopenletter.html) to the people that run TREC
(i.e. NIST) inquiring if there is a way in which we can let our users
obtain one or more TREC collections for running experiments.

It seems to me that having the Lucene community (and other Open
Source search projects if they want) involved in TREC would be a real
plus for the competition, since we could serve as a baseline AND, since
we are transparent in what we do (i.e. our algorithms are open for
public scrutiny), we can truly encourage open research.  After all,
the whole goal of TREC (according to their website) is:
        "...to encourage research in information retrieval from large text collections."

By lowering the cost of entry, we truly could further this goal.   
After all, isn't furthering research about others being able to  
repeat experiments?  If you don't have the data, you can't repeat the  
experiment.

As for benefits to us, it allows us to do direct comparisons of
Lucene and gives us some data points about how Lucene performs in
terms of precision and recall (not that TREC is the be-all and end-all
for measuring relevance, but...).  Furthermore, I think it would
encourage Lucene users/developers to think about relevance as much as
we think about speed.  Also, I think it would help us think about how
to make Lucene scoring more pluggable (and still fast) such that we
could make alternate relevancy models available, similar to the
Axiomatic Retrieval Scoring that was recently proposed.

Currently, the data is copyrighted and you pay to gain access, as I
understand it (it has been a while since I ran TREC).  So, do people
have suggestions on ways we could address this?  Maybe people have to
sign a waiver or something, or maybe the ASF could work something out
with NIST, or maybe the license could allow for personal use?  Really
speculating here...  The key is that the data needs to be free for open
source use.  I don't think it needs to be ASF licensed.  Perhaps if
we can present some possible solutions to the problem, our proposal
will be more likely to be accepted.

Obviously, we would have to discuss this with the ASF as well to see  
if they would support it.

So, is this worthwhile to people?  Am I barking up the wrong tree?  I  
am willing to write up the letter and do the legwork, but I want to  
know the community is behind it as well.  Perhaps it would be better  
to do this informally?  Maybe just send an inquiry saying "Hey, we're  
from the Lucene project and we would love to be able to do TREC runs,  
can you help us out? Yada, yada, yada..."

Just thinking out loud,
Grant



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: TREC Collection, NIST and Lucene

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
It's more of a chicken-and-egg problem I guess; it's the same with E.U. grants 
and local science grants over here (Poland) -- the government funds some 
projects, but who if not us funds the government? I am a strong believer that 
the results of public grants should be open and available for everyone. Maybe 
I'm idealistic though.

Fingers crossed it will work out as expected.

Dawid

> Obviously, I hope the result is positive, but I don't know if precluding 
> open source is against their mission.  After all, I do understand that 
> it takes time and money to create these collections and that should be 
> valued.  By the same token, it is the Federal government that is 
> sponsoring the competition and doing a lot of the work, so it seems one 
> could argue for making a collection that is free from copyright 
> restrictions so that anyone can participate.  It is a tricky issue, 
> however, so we can hope for the best.
> 
> I will wait a couple more days to solicit comments and then submit it to 
> the appropriate people at NIST.
> 
> -Grant



Re: TREC Collection, NIST and Lucene

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 20, 2007, at 3:01 PM, Dawid Weiss wrote:

>
> I like it too. And I'm wondering what the response to this will be  
> -- it will in a way show if TREC really stands up to their mission,  
> won't it?
>

Obviously, I hope the result is positive, but I don't know if  
precluding open source is against their mission.  After all, I do  
understand that it takes time and money to create these collections  
and that should be valued.  By the same token, it is the Federal  
government that is sponsoring the competition and doing a lot of the  
work, so it seems one could argue for making a collection that is  
free from copyright restrictions so that anyone can participate.  It  
is a tricky issue, however, so we can hope for the best.

I will wait a couple more days to solicit comments and then submit it  
to the appropriate people at NIST.

-Grant



--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: TREC Collection, NIST and Lucene

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
I like it too. And I'm wondering what the response to this will be -- it will 
in a way show if TREC really stands up to their mission, won't it?

D.




Re: TREC Collection, NIST and Lucene

Posted by José Ramón Pérez Agüera <jo...@fdi.ucm.es>.
It is perfect :-) I think it might be worth sending a CC to the LDC,
because I think they hold some kind of rights on the TREC collections.

http://trec.nist.gov/data/docs_eng.html

http://www.ldc.upenn.edu/

jose



-- 
José Ramón Pérez Agüera

Dept. de Ingeniería del Software e Inteligencia Artificial
Despacho 411 tlf. 913947599
Facultad de Informática
Universidad Complutense de Madrid

Re: TREC Collection, NIST and Lucene

Posted by Grant Ingersoll <gs...@apache.org>.
How does this sound:

Dear ----,

My name is Grant Ingersoll and I am a committer on the Lucene Java
search library (http://lucene.apache.org) at the Apache Software
Foundation (ASF).  I am not, however, writing in any official
capacity as a representative of the ASF.  Perhaps at a later date,
this will change, but for now I just want to keep things informal.

I am, however, interested in starting a discussion about how open
source projects like Lucene could participate in future TREC
evaluations, or at least gain access to TREC data resources.  While
the people involved in Lucene feel we have built a top-notch search
system, one of the things the community as a whole lacks is the
ability to do formal evaluations like those TREC offers, and thus
research and development of new algorithms is hindered.  Granted,
individuals may perform TREC evaluations if they have purchased a
license to the data, but the community as a whole does not have this
ability.

I am wondering if there is some way in which we can arrange for open  
source projects to obtain access to the TREC collections.  The  
biggest barrier for projects like Lucene, obviously, is the fee that  
needs to be paid.  Furthermore, there are undoubtedly distribution  
and copyright concerns.  Yet, a part of me feels that we can work  
something out through creative licensing or some other novel approach  
that protects the appropriate interests, furthers TREC's mission and  
supports the vibrant Open Source community around Lucene and other  
search engines.  Perhaps it would be possible to require that any
participant who wants the TREC data must prove that they are
appropriately affiliated with an official open source project, as
defined by the Open Source Initiative (http://www.opensource.org).
Many tool vendors have similar licenses that allow open source
participants to use their tools while working on open source
projects [1].  Perhaps we could take a similar approach to the TREC data.

I feel this would benefit TREC substantially by providing an open
baseline system for all the world to see, and I see that it fits very
much with the motto of TREC: "...to encourage research in information
retrieval from large text collections."  Naturally, it benefits
Lucene by allowing Lucene to undertake more formal evaluation of
relevance, etc.

If you are interested in more background on this on the Lucene Java
developers mailing list, please refer to
http://www.gossamer-threads.com/lists/lucene/java-dev/52022?search_string=TREC;#52022

I look forward to hearing back from you and I would be more than  
happy to answer any questions you have.

Sincerely,
Grant Ingersoll

[1] JetBrains, Atlassian, Clover Test Coverage, etc.

-------

-Grant





On Aug 10, 2007, at 4:52 AM, Tom White wrote:

>> Furthermore, I think it would
>> encourage Lucene users/developers to think about relevance as much as
>> we think about speed.
>
> +1
>
> However I think it would be much better to start by making informal
> approaches as you suggest - the open letter seems to me to be
> appropriate only as a last resort.
>
> Tom





Re: TREC Collection, NIST and Lucene

Posted by Tom White <to...@gmail.com>.
> Furthermore, I think it would
> encourage Lucene users/developers to think about relevance as much as
> we think about speed.

+1

However I think it would be much better to start by making informal
approaches as you suggest - the open letter seems to me to be
appropriate only as a last resort.

Tom
