Posted to general@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/05/11 18:07:41 UTC

Open Relevance Project?

A few of us who are interested in an Open Relevance assessment project  
(ala TREC) have started to put some thoughts down on "paper" over at http://wiki.apache.org/lucene-java/OpenRelevance

Thus, if you'd like to somehow participate (TBD what that actually  
means just yet) in developing a set of open collections, queries and  
assessments for relevance testing, let's discuss here and on that Wiki  
page.

The basic gist of it is, we'd like to crawl Creative Commons and/or  
other free content and redistribute it along with queries and judgments,  
thus fueling the testing capabilities to further improve Lucene's  
search quality as well as, of course, providing the means for a  
completely open assessment process whereby anyone can participate  
without having to fork over money to license 20-year-old copyrighted  
news articles that have no value other than testing.

At this point, we're open to a lot of ideas.  Once we solidify a bit,  
then we'd like to make it an official Lucene subproject and get our  
own resources as well as figure out how to crawl and host the content  
using ASF infrastructure (without making the ASF infra. team upset!)

Cheers,
Grant

Re: Open Relevance Project?

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, May 11, 2009 at 12:07:41PM -0400, Grant Ingersoll wrote:
> Thus, if you'd like to somehow participate (TBD what that actually  
> means just yet) in developing a set of open collections, queries and  
> assessments for relevance testing, let's discuss here and on that Wiki  
> page.

I won't be able to contribute directly in the near to medium term, but I look
forward to participating as a user.  Sounds like a great project.

Marvin Humphrey


Re: Open Relevance Project?

Posted by Ted Dunning <te...@gmail.com>.
Sounds good to me.  I would be able to help in a few different ways.

On Mon, May 11, 2009 at 9:07 AM, Grant Ingersoll <gs...@apache.org> wrote:

> A few of us who are interested in an Open Relevance assessment project (ala
> TREC) have started to put some thoughts down on "paper" over at
> http://wiki.apache.org/lucene-java/OpenRelevance
>
> Thus, if you'd like to somehow participate (TBD what that actually means
> just yet) in developing a set of open collections, queries and assessments
> for relevance testing, let's discuss here and on that Wiki page.
>
>

Re: Open Relevance Project?

Posted by André Warnier <aw...@ice-sa.com>.
Ted Dunning wrote:
> Even if the corpus is very large, I doubt there will be all that much
> aggregate bandwidth.  The audience for this is relatively small.

+1
(I mean, count me in as 1 for the audience)

As for the corpus, why not start with all the Apache projects' 
documentation?
It is relatively homogeneous, free, it is there, it is close, and would 
ensure some audience.



Re: Open Relevance Project?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I agree!  As a matter of fact, that is exactly what I just wrote here:
http://www.jroller.com/otis/entry/followup_open_relevance_project#comment-1242703187000

"....For example, couldn't a vendor use it to compare old implementation to
new implementation and provide some kind of metric showing improvements
in new version?...."

The first "vendor" in ORP's case might be Lucene.  My hope would be that others could and would take what ORP builds and apply it to their implementations.  My next wish after that would be to see others publish the results.  But, I think we'll never see any results from commercial vendors - I have a feeling they don't have much to gain by exposing their results to the competition and to the public.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Ted Dunning <te...@gmail.com>
> To: general@lucene.apache.org
> Sent: Monday, May 18, 2009 11:12:41 PM
> Subject: Re: Open Relevance Project?
> 
> I completely agree with this.  In practice, search engines and to a larger
> extent recommendation engines shape user behavior and are, in turn, shaped
> by user behavior so that static relevancy tests are of only very limited
> value in the end game.
> 
> But it is still *very* nice to have them.
> 
> On Mon, May 18, 2009 at 8:00 PM, Mark Miller wrote:
> 
> > Grant Ingersoll wrote:
> >
> >> Some interesting discussion at
> >> 
> http://thenoisychannel.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise/
> >>
> > That was an interesting read. I think a lot of the argument misses the
> > point. It doesn't seem to me that the main benefit or intent comes from
> > 'bake offs' with other search engines ("Selling search applications to
> > enterprises isn't, in my experience, about winning relevance bake-offs.") -
> > the main benefit is allowing us to measure changes and improvements to
> > Lucene's relevancy calculations and to make judgments about how Lucene
> > currently performs. I see it as easily as important as the Lucene benchmark
> > contrib. It's not going to be a secret sauce, just like the benchmarker has
> > been no secret sauce - but it's going to make it easier to reliably improve
> > Lucene in the future.
> >
> >


Re: Open Relevance Project?

Posted by Ted Dunning <te...@gmail.com>.
I completely agree with this.  In practice, search engines and to a larger
extent recommendation engines shape user behavior and are, in turn, shaped
by user behavior so that static relevancy tests are of only very limited
value in the end game.

But it is still *very* nice to have them.

On Mon, May 18, 2009 at 8:00 PM, Mark Miller <ma...@gmail.com> wrote:

> Grant Ingersoll wrote:
>
>> Some interesting discussion at
>> http://thenoisychannel.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise/
>>
> That was an interesting read. I think a lot of the argument misses the
> point. It doesn't seem to me that the main benefit or intent comes from
> 'bake offs' with other search engines ("Selling search applications to
> enterprises isn't, in my experience, about winning relevance bake-offs.") -
> the main benefit is allowing us to measure changes and improvements to
> Lucene's relevancy calculations and to make judgments about how Lucene
> currently performs. I see it as easily as important as the Lucene benchmark
> contrib. It's not going to be a secret sauce, just like the benchmarker has
> been no secret sauce - but it's going to make it easier to reliably improve
> Lucene in the future.
>
>

Re: Open Relevance Project?

Posted by Michael McCandless <lu...@mikemccandless.com>.
I think for now I won't add myself as committer.  I'm plenty swamped :)

I'll try to keep close tabs though.  This is an important effort!

Mike

On Wed, May 27, 2009 at 11:15 AM, Grant Ingersoll <gs...@apache.org> wrote:
> So, of those who have expressed interest, who is willing to step up and be a
> committer?  Right now, we have me, Andrzej, Simon and Otis who have put
> their name on the wiki, but Ted and Mike have also implied they are
> interested.    Please add your name if you think you can work on it and can
> fulfill the obligations of being a committer
> (http://www.apache.org/dev/#committers).
>
> I'm going to call a vote on adding ORP as a subproject of Lucene very soon
> and would like to finalize the proposal.
>
> -Grant
>

Re: Open Relevance Project?

Posted by Grant Ingersoll <gs...@apache.org>.
So, of those who have expressed interest, who is willing to step up  
and be a committer?  Right now, we have me, Andrzej, Simon and Otis  
who have put their name on the wiki, but Ted and Mike have also  
implied they are interested.  Please add your name if you think you  
can work on it and can fulfill the obligations of being a committer  
(http://www.apache.org/dev/#committers).

I'm going to call a vote on adding ORP as a subproject of Lucene very  
soon and would like to finalize the proposal.

-Grant

Re: Open Relevance Project?

Posted by Grant Ingersoll <gs...@apache.org>.
+1.  Let's not get ahead of ourselves w/ changing the world or  
anything like that.  First and foremost, we need this for Lucene, if  
others benefit, so be it.  You are right on in that we need a shared,  
free way of judging whether Lucene is improving on relevance (even if  
it is already very good out of the box).  Otherwise, we can't even  
have the conversation.  For instance, it would help in evaluating the  
Axiomatic patch in JIRA or the SweetSpot stuff or a whole host of  
things (for instance, our current length norm tends to favor shorter  
docs - is this the right default?)
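To make that concrete, here's a rough sketch of the kind of comparison I mean - Python, with invented judgments and rankings, since ORP hasn't settled on any formats yet - comparing two ranking configurations by Mean Average Precision over a shared set of judged queries:

```python
# Hedged sketch: compare two hypothetical ranking configurations ("baseline"
# vs. "sweet-spot") on shared relevance judgments using Mean Average Precision.
# The judgments and rankings below are invented for illustration only.

def average_precision(ranking, relevant):
    """AP for one query: mean of precision@k taken at each relevant hit."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(run, judgments):
    """MAP over all judged queries for one system's run."""
    return sum(average_precision(run[q], rel)
               for q, rel in judgments.items()) / len(judgments)

# Invented judgments: query -> set of relevant doc ids.
judgments = {"q1": {"d1", "d3"}, "q2": {"d2"}}
baseline  = {"q1": ["d5", "d1", "d3"], "q2": ["d2", "d4"]}
sweetspot = {"q1": ["d1", "d3", "d5"], "q2": ["d4", "d2"]}

print(mean_average_precision(baseline, judgments))
print(mean_average_precision(sweetspot, judgments))
```

Even something this crude would let us say whether a change like the SweetSpot or Axiomatic work moves the needle on a fixed collection.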


On May 18, 2009, at 11:00 PM, Mark Miller wrote:

> Grant Ingersoll wrote:
>> Some interesting discussion at http://thenoisychannel.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise/
> That was an interesting read. I think a lot of the argument misses  
> the point. It doesn't seem to me that the main benefit or intent  
> comes from 'bake offs' with other search engines ("Selling search  
> applications to enterprises isn't, in my experience, about winning  
> relevance bake-offs.") - the main benefit is allowing us to measure  
> changes and improvements to Lucene's relevancy calculations and to  
> make judgments about how Lucene currently performs. I see it as easily  
> as important as the Lucene benchmark contrib. It's not going to be a  
> secret sauce, just like the benchmarker has been no secret sauce -  
> but it's going to make it easier to reliably improve Lucene in the  
> future.
>
> - Mark
>>
>> On May 18, 2009, at 1:57 PM, Grant Ingersoll wrote:
>>
>>>
>>> On May 18, 2009, at 11:41 AM, Ted Dunning wrote:
>>>
>>>> On the other hand, it is likely that we could find query and  
>>>> click logs for
>>>> the documentation.
>>>
>>> Only if they are redacted/aggregated first.  ASF Members have  
>>> access, but we'd need to get permission to distribute (after  
>>> redaction/aggregation) I suspect.   Given the AOL marketing  
>>> fiasco, we'd have to go over them in pretty good detail before  
>>> releasing to make sure there is no personal information.  AFAIK,  
>>> I'm the only ASF Member who has so far volunteered on this thread  
>>> and I highly doubt I have the time for what I imagine to be a  
>>> pretty decent sized endeavor.
>>>
>>> Stripping IP addresses is pretty straightforward, but the query  
>>> terms might be a bit more involved.
>>>
>>> Still, can't hurt to find out what's involved.
>>>
>>> -Grant
>>
>>
>
>
> -- 
> - Mark
>
> http://www.lucidimagination.com
>
>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Open Relevance Project?

Posted by Mark Miller <ma...@gmail.com>.
Grant Ingersoll wrote:
> Some interesting discussion at 
> http://thenoisychannel.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise/ 
>
That was an interesting read. I think a lot of the argument misses the 
point. It doesn't seem to me that the main benefit or intent comes from 
'bake offs' with other search engines ("Selling search applications to 
enterprises isn't, in my experience, about winning relevance 
bake-offs.") - the main benefit is allowing us to measure changes and 
improvements to Lucene's relevancy calculations and to make judgments 
about how Lucene currently performs. I see it as easily as important as the 
Lucene benchmark contrib. It's not going to be a secret sauce, just like 
the benchmarker has been no secret sauce - but it's going to make it 
easier to reliably improve Lucene in the future.

- Mark
>
> On May 18, 2009, at 1:57 PM, Grant Ingersoll wrote:
>
>>
>> On May 18, 2009, at 11:41 AM, Ted Dunning wrote:
>>
>>> On the other hand, it is likely that we could find query and click 
>>> logs for
>>> the documentation.
>>
>> Only if they are redacted/aggregated first.  ASF Members have access, 
>> but we'd need to get permission to distribute (after 
>> redaction/aggregation) I suspect.   Given the AOL marketing fiasco, 
>> we'd have to go over them in pretty good detail before releasing to 
>> make sure there is no personal information.  AFAIK, I'm the only ASF 
>> Member who has so far volunteered on this thread and I highly doubt I 
>> have the time for what I imagine to be a pretty decent sized endeavor.
>>
>> Stripping IP addresses is pretty straightforward, but the query terms 
>> might be a bit more involved.
>>
>> Still, can't hurt to find out what's involved.
>>
>> -Grant
>
>


-- 
- Mark

http://www.lucidimagination.com




Re: Open Relevance Project?

Posted by Grant Ingersoll <gs...@apache.org>.
Some interesting discussion at http://thenoisychannel.com/2009/05/18/copying-trec-is-the-wrong-track-for-the-enterprise/

On May 18, 2009, at 1:57 PM, Grant Ingersoll wrote:

>
> On May 18, 2009, at 11:41 AM, Ted Dunning wrote:
>
>> On the other hand, it is likely that we could find query and click  
>> logs for
>> the documentation.
>
> Only if they are redacted/aggregated first.  ASF Members have  
> access, but we'd need to get permission to distribute (after  
> redaction/aggregation) I suspect.   Given the AOL marketing fiasco,  
> we'd have to go over them in pretty good detail before releasing to  
> make sure there is no personal information.  AFAIK, I'm the only ASF  
> Member who has so far volunteered on this thread and I highly doubt  
> I have the time for what I imagine to be a pretty decent sized  
> endeavor.
>
> Stripping IP addresses is pretty straightforward, but the query terms  
> might be a bit more involved.
>
> Still, can't hurt to find out what's involved.
>
> -Grant



Re: Open Relevance Project?

Posted by Grant Ingersoll <gs...@apache.org>.
On May 18, 2009, at 11:41 AM, Ted Dunning wrote:

> On the other hand, it is likely that we could find query and click  
> logs for
> the documentation.

Only if they are redacted/aggregated first.  ASF Members have access,  
but we'd need to get permission to distribute (after redaction/ 
aggregation) I suspect.   Given the AOL marketing fiasco, we'd have to  
go over them in pretty good detail before releasing to make sure there  
is no personal information.  AFAIK, I'm the only ASF Member who has so  
far volunteered on this thread and I highly doubt I have the time for  
what I imagine to be a pretty decent sized endeavor.

Stripping IP addresses is pretty straightforward, but the query terms  
might be a bit more involved.

Still, can't hurt to find out what's involved.

-Grant
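For what it's worth, the straightforward part could look something like this sketch - the log line format here is invented and is not the actual ASF log format, and the hard part (scrubbing the query terms themselves, per the AOL incident) is deliberately not attempted:

```python
import hashlib
import re

# Sketch of the "straightforward" part: replace IPv4 addresses in query-log
# lines with a salted one-way hash, so sessions can still be grouped without
# exposing who issued the queries. Log format below is invented, not the
# actual ASF search log format.

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SALT = b"replace-with-a-secret-salt"

def redact_ip(line):
    def pseudonym(match):
        digest = hashlib.sha256(SALT + match.group(0).encode()).hexdigest()
        return "ip-" + digest[:12]  # stable pseudonym per address
    return IPV4.sub(pseudonym, line)

line = '192.168.1.77 [18/May/2009] "GET /search?q=lucene+scoring"'
print(redact_ip(line))
```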

Re: Open Relevance Project?

Posted by Ted Dunning <te...@gmail.com>.
On the other hand, it is likely that we could find query and click logs for
the documentation.

On Mon, May 18, 2009 at 3:59 AM, Grant Ingersoll <gs...@apache.org> wrote:

> Mail archives are likely useful for a mail based corpus.  I agree with
> Andrzej about the rest of the docs, though.
>
>
>
> On May 18, 2009, at 5:25 AM, André Warnier wrote:
>
>  Hi.
>> There has been an earlier suggestion here, later endorsed by someone else,
>> to use the documentation of the Apache projects as a corpus.
>> Being far from an expert, I am just naively wondering why the experts on
>> this list seem to totally ignore it, without providing any argument.
>> Is it somehow unsuitable, impractical, inappropriate, bad, infeasible,
>> useless, uninteresting or ... ?
>>
>>
>
-- 
Ted Dunning, CTO
DeepDyve

Re: Open Relevance Project?

Posted by Grant Ingersoll <gs...@apache.org>.
Mail archives are likely useful for a mail based corpus.  I agree with  
Andrzej about the rest of the docs, though.


On May 18, 2009, at 5:25 AM, André Warnier wrote:

> Hi.
> There has been an earlier suggestion here, later endorsed by someone  
> else, to use the documentation of the Apache projects as a corpus.
> Being far from an expert, I am just naively wondering why the  
> experts on this list seem to totally ignore it, without providing  
> any argument.
> Is it somehow unsuitable, impractical, inappropriate, bad,  
> infeasible, useless, uninteresting or ... ?
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Open Relevance Project?

Posted by Andrzej Bialecki <ab...@getopt.org>.
André Warnier wrote:
> Hi.
> There has been an earlier suggestion here, later endorsed by someone 
> else, to use the documentation of the Apache projects as a corpus.
> Being far from an expert, I am just naively wondering why the experts on 
> this list seem to totally ignore it, without providing any argument.
> Is it somehow unsuitable, impractical, inappropriate, bad, infeasible, 
> useless, uninteresting or ... ?

The documentation is mostly on a single topic - programming. The 
vocabulary is, let's not deceive ourselves, limited ;) Pages contain a 
lot of noise (Forrest navigation, javadoc dressing, common class names, 
code snippets, etc).

For a general-purpose corpus you would want to have several topics, with 
a well-balanced representation, and using a broad vocabulary and low 
level of noise.

Additionally, this collection gets relatively little endorsement (links 
with meaningful anchors) from within apache.org, so the typical PageRank 
scoring wouldn't work too well (on the other hand, it resembles intranet 
linkage, so it could be useful for studying scoring algos for enterprise 
search).

So, while this collection is not useless, it's not the best fit either.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Open Relevance Project?

Posted by André Warnier <aw...@ice-sa.com>.
Hi.
There has been an earlier suggestion here, later endorsed by someone 
else, to use the documentation of the Apache projects as a corpus.
Being far from an expert, I am just naively wondering why the experts on 
this list seem to totally ignore it, without providing any argument.
Is it somehow unsuitable, impractical, inappropriate, bad, infeasible, 
useless, uninteresting or ... ?


Re: Open Relevance Project?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Not sure if this was mentioned before, but .... hm, I was going to point out http://index.isc.org/ (see http://ioiblog.wordpress.com/2008/11/07/kicking-off-the-ioi-blog/ ), but the server doesn't seem to be listening.... aha, here: http://ioiblog.wordpress.com/2009/02/

Perhaps we can get data from Dennis and Jeremie?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Ted Dunning <te...@gmail.com>
> To: general@lucene.apache.org
> Sent: Wednesday, May 13, 2009 2:48:43 PM
> Subject: Re: Open Relevance Project?
> 
> Crawling a reference dataset requires essentially one-time bandwidth.
> 
> Also, it is possible to download, say, wikipedia in a single go.  Likewise
> there are various web-crawls that are available for research purposes (I
> think).  See http://webascorpus.org/ for one example.  These would be single
> downloads.
> 
> I don't entirely see the point of redoing the spidering.
> 
> On Wed, May 13, 2009 at 10:56 AM, Grant Ingersoll wrote:
> 
> > Good point, although you never know.  We also will have some bandwidth reqs
> > for crawling.
> >
> >
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve


Re: Open Relevance Project?

Posted by Ted Dunning <te...@gmail.com>.
Very good point.

For that matter, the suggestion of apache docs is a trenchant one because
there is some possibility of getting a sample of queries (and if feasible,
session associated clicks).  Having real users provide feedback for real
queries like that would be *vastly* more useful than just dead documents.

On Wed, May 13, 2009 at 1:38 PM, Simon Willnauer <simon.willnauer@googlemail.com> wrote:

> define WHAT kind of corpus




-- 
Ted Dunning, CTO
DeepDyve

Re: Open Relevance Project?

Posted by Simon Willnauer <si...@googlemail.com>.
I followed the whole discussion going on in this thread about how to
obtain a certain corpus of documents. I personally think that we should
first define WHAT kinds of corpora should be included in this new
OpenRelevance project and not HOW those corpora are collected /
aggregated. IR is not just about having a huge corpus of full-text
documents / web pages, especially when it comes to ranking.

My understanding of OpenRelevance is to provide a set of corpora and
measurement procedures for various use cases, not just to compete with
TREC. Please correct me if I'm wrong.
Beyond that, the project should help to improve Lucene's ranking itself,
or at least be helpful in obtaining a measurement reference for more
than just web search.

Anyway, I personally feel that the discussion about how to obtain a
certain corpus is out of scope at this stage of the project.

Simon

On Wed, May 13, 2009 at 9:13 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> On May 13, 2009, at 2:48 PM, Ted Dunning wrote:
>
>> Crawling a reference dataset requires essentially one-time bandwidth.
>>
>
> True, but we will likely evolve over time to have multiple datasets, but no
> reason to get ahead of ourselves.
>
>
>> Also, it is possible to download, say, wikipedia in a single go.
>
> Wikipedia isn't always that interesting from a relevance testing standpoint,
> for IR at least (QA, machine learning, etc. it is more so).  A lot of
> queries simply have only one or two relevant results.  While that is useful,
> it is not often the whole picture of what one needs for IR.
>
>> Likewise
>> there are various web-crawls that are available for research purposes (I
>> think).  See http://webascorpus.org/ for one example.  These would be
>> single
>> downloads.
>>
>> I don't entirely see the point of redoing the spidering.
>
> I think we have to be able to control the spidering, so that we can say
> we've vetted what's in it, due to copyright, etc.  But, maybe not.  I've
> talked with quite a few people who have corpora available, and it always
> comes down to copyright for redistribution in a public way.  No one wants to
> assume the risk, even though they all crawl and redistribute (for money).
>
> For instance, the Internet Archive even goes so far as to apply robots.txt
> retroactively.  We probably could do the same thing, but I'm not sure if it
> is necessary.
>
>

Re: Open Relevance Project?

Posted by Grant Ingersoll <gs...@apache.org>.
On May 13, 2009, at 2:48 PM, Ted Dunning wrote:

> Crawling a reference dataset requires essentially one-time bandwidth.
>

True, but we will likely evolve over time to have multiple datasets,  
but no reason to get ahead of ourselves.


> Also, it is possible to download, say, wikipedia in a single go.

Wikipedia isn't always that interesting from a relevance testing  
standpoint, for IR at least (QA, machine learning, etc. it is more  
so).  A lot of queries simply have only one or two relevant results.   
While that is useful, it is not often the whole picture of what one  
needs for IR.

> Likewise
> there are various web-crawls that are available for research  
> purposes (I
> think).  See http://webascorpus.org/ for one example.  These would  
> be single
> downloads.
>
> I don't entirely see the point of redoing the spidering.

I think we have to be able to control the spidering, so that we can  
say we've vetted what's in it, due to copyright, etc.  But, maybe  
not.  I've talked with quite a few people who have corpora available,  
and it always comes down to copyright for redistribution in a public  
way.  No one wants to assume the risk, even though they all crawl and  
redistribute (for money).

For instance, the Internet Archive even goes so far as to apply  
robots.txt retroactively.  We probably could do the same thing, but  
I'm not sure if it is necessary.


Re: Open Relevance Project?

Posted by Ted Dunning <te...@gmail.com>.
Crawling a reference dataset requires essentially one-time bandwidth.

Also, it is possible to download, say, wikipedia in a single go.  Likewise
there are various web-crawls that are available for research purposes (I
think).  See http://webascorpus.org/ for one example.  These would be single
downloads.

I don't entirely see the point of redoing the spidering.

On Wed, May 13, 2009 at 10:56 AM, Grant Ingersoll <gs...@apache.org> wrote:

> Good point, although you never know.  We also will have some bandwidth reqs
> for crawling.
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: Open Relevance Project?

Posted by Grant Ingersoll <gs...@apache.org>.
Good point, although you never know.  We also will have some bandwidth  
reqs for crawling.

On May 13, 2009, at 1:36 PM, Ted Dunning wrote:

> Even if the corpus is very large, I doubt there will be all that much
> aggregate bandwidth.  The audience for this is relatively small.
>
> On Wed, May 13, 2009 at 8:56 AM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>
>> help overcome the bandwidth problem.
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Open Relevance Project?

Posted by Ted Dunning <te...@gmail.com>.
Even if the corpus is very large, I doubt there will be all that much
aggregate bandwidth.  The audience for this is relatively small.

On Wed, May 13, 2009 at 8:56 AM, Grant Ingersoll <gs...@apache.org> wrote:

> help overcome the bandwidth problem.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Open Relevance Project?

Posted by Grant Ingersoll <gs...@apache.org>.
So, I suppose the next steps are to formalize this project a little  
more.  I'll call a vote on a separate thread to add it as a Lucene  
sub.  I figured I would contact infrastructure to see what they  
think.  Was also thinking that maybe we should talk with iBiblio or  
some other content repository to see if they can help overcome the  
bandwidth problem.

-Grant

On May 11, 2009, at 6:12 PM, Grant Ingersoll wrote:

>
> On May 11, 2009, at 4:01 PM, Michael McCandless wrote:
>
>> I'd love to see a resource like this (it's high time!), and I'll try
>> to help when/where I can, starting with some initial
>> comments/questions:
>>
>> I think it's actually quite a challenge to do well.  EG it's easy to
>> make a corpus that's too easy because it's highly diverse (and thus
>> most search engines have no trouble pulling back relevant results).
>> Instead, I think the content set should be well & tightly scoped to a
>> certain topic, and not necessarily that large (ie we don't need a  
>> huge
>> number of documents).  It would help if that scoping is towards
>> content that many people find "of interest" so we get "accurate"
>> judgements by as wide an audience as possible.
>
> I think we will want a generic one, and then focused ones, but we  
> should start with generic at first.
>
>>
>>
>> EG how about coverage of the 2009 H1N1 outbreak (that's licensed
>> appropriately)?  Or... the 2008 US presidential election?  Or...
>> research on Leukemia (but I fear such content is not typically
>> licensed appropriately, nor will it have wide interest).
>>
>> What does "using Nutch to crawl Creative Commons" actually mean?  Can
>> I browse the content that's being crawled?
>
> Nutch has a CC plugin that allows it to filter out non-CC content,  
> AIUI.
>
>>
>>
>> Also, to help us build up the relevance judgements, I think we should
>> build a basic custom app for collecting queries as well as annotating
>> them.  I should be able to go to that page and run my own queries,
>> which are collected.  Then, I should be able to browse previously
>> collected queries, click on them, and add my own judgement.  The site
>> should try to offer up queries that are "in need" of judgements.  It
>> should run the search and let me step through the results, marking
>> those that are relevant; but we would then bias the results to that
>> search engine; maybe under the hood we rotate through search engines
>> each time?
>>
>> Do we have anyone involved who's built similar corpora before?  Or  
>> has
>> anyone read papers on how prior corpora were designed/created?
>
> This is all good, but here I'm thinking simpler, at least at first.   
> I don't know that we need to be writing apps, although feel free,  
> since it is O/S after all.  :-)  I was wondering if we couldn't  
> handle this wiki style (how is still not clear) whereby we simply  
> have pages that contain the queries and judgments and over time the  
> wisdom of the crowds will work to maintain standards, fill in gaps,  
> etc.    Maybe, in regards to judgments, we allow people to vote for  
> them, which over time will yield an appropriate result (but is  
> subject to early issues).  Not sure what all that means just yet,  
> but the wiki approach allows us to get going with minimal resources  
> while still delivering value.  Hmm, now it's starting to sound like  
> an app...  ;-)
>
> As opposed to TREC style stuff, I don't think we need the top 1000  
> (although it could work).  Just the top ten or twenty.  Sometimes,  
> it can even be useful to just rate a whole page of results at once,  
> even at the cost of granularity.   Basically, what I'm proposing we  
> do is carry out a pragmatic relevance test out in the open, just as  
> people should do in house.  I think this fits with Lucene's model of  
> operation quite well: be practical by focusing on real data and real  
> feedback as opposed to obsessing over theory.  (Not that you were  
> suggesting otherwise, I'm just stating it)
>
> I need to find the reference, but I recall the last edition of SIGIR  
> having a discussion on crowdsourcing relevance judgments.
>
> -Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Open Relevance Project?

Posted by Grant Ingersoll <gs...@apache.org>.
On May 11, 2009, at 4:01 PM, Michael McCandless wrote:

> I'd love to see a resource like this (it's high time!), and I'll try
> to help when/where I can, starting with some initial
> comments/questions:
>
> I think it's actually quite a challenge to do well.  EG it's easy to
> make a corpus that's too easy because it's highly diverse (and thus
> most search engines have no trouble pulling back relevant results).
> Instead, I think the content set should be well & tightly scoped to a
> certain topic, and not necessarily that large (ie we don't need a huge
> number of documents).  It would help if that scoping is towards
> content that many people find "of interest" so we get "accurate"
> judgements by as wide an audience as possible.

I think we will want a generic one, and then focused ones, but we  
should start with generic at first.

>
>
> EG how about coverage of the 2009 H1N1 outbreak (that's licensed
> appropriately)?  Or... the 2008 US presidential election?  Or...
> research on Leukemia (but I fear such content is not typically
> licensed appropriately, nor will it have wide interest).
>
> What does "using Nutch to crawl Creative Commons" actually mean?  Can
> I browse the content that's being crawled?

Nutch has a CC plugin that allows it to filter out non-CC content, AIUI.
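For the curious: machine-readable CC licensing is usually advertised via rel="license" links pointing at creativecommons.org, so a crawler-side filter could look roughly like this sketch.  To be clear, this is not the actual Nutch creativecommons plugin, just an illustration of the idea:

```python
# Illustrative sketch only: recognize Creative Commons content via the
# rel="license" link convention that CC publishes. NOT the actual Nutch
# creativecommons plugin.
from html.parser import HTMLParser

class CCLicenseFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.license_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower()
        href = attrs.get("href") or ""
        # A CC-marked page links its license with rel="license".
        if tag in ("a", "link") and "license" in rel and "creativecommons.org" in href:
            self.license_url = href

def cc_license(html):
    """Return the CC license URL if the page declares one, else None."""
    finder = CCLicenseFinder()
    finder.feed(html)
    return finder.license_url

page = '<html><body><a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC BY</a></body></html>'
print(cc_license(page))
```

A crawl-time filter would then drop any fetched page for which this returns None.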

>
>
> Also, to help us build up the relevance judgements, I think we should
> build a basic custom app for collecting queries as well as annotating
> them.  I should be able to go to that page and run my own queries,
> which are collected.  Then, I should be able to browse previously
> collected queries, click on them, and add my own judgement.  The site
> should try to offer up queries that are "in need" of judgements.  It
> should run the search and let me step through the results, marking
> those that are relevant; but we would then bias the results to that
> search engine; maybe under the hood we rotate through search engines
> each time?
>
> Do we have anyone involved who's built similar corpora before?  Or has
> anyone read papers on how prior corpora were designed/created?

This is all good, but here I'm thinking simpler, at least at first.  I
don't know that we need to be writing apps, although feel free, since
it is O/S after all.  :-)  I was wondering if we couldn't handle this
wiki style (exactly how is still not clear), whereby we simply have
pages that contain the queries and judgments, and over time the wisdom
of the crowds will work to maintain standards, fill in gaps, etc.
Maybe, in regards to judgments, we allow people to vote on them, which
over time will yield an appropriate result (but is subject to early
issues).  Not sure what all that means just yet, but the wiki approach
lets us get going with minimal resources while still delivering
value.  Hmm, now it's starting to sound like an app...  ;-)

As opposed to TREC style stuff, I don't think we need the top 1000
(although it could work).  Just the top ten or twenty.  Sometimes, it
can even be useful to rate a whole page of results at once, even at
the cost of granularity.  Basically, what I'm proposing is that we
carry out a pragmatic relevance test out in the open, just as people
should do in house.  I think this fits with Lucene's model of
operation quite well: be practical by focusing on real data and real
feedback as opposed to obsessing over theory.  (Not that you were
suggesting otherwise, I'm just stating it.)

I need to find the reference, but I recall the last edition of SIGIR  
having a discussion on crowdsourcing relevance judgments.

-Grant

Re: Open Relevance Project?

Posted by Ted Dunning <te...@gmail.com>.
I was involved in TREC-1 through 5 or so as a researcher.  That means that I
didn't actually create the corpus but I certainly had to deal with the
results and see how things turned out.


On Mon, May 11, 2009 at 1:01 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

>
> Do we have anyone involved who's built similar corpora before?  Or has
> anyone read papers on how prior corpora were designed/created?




-- 
Ted Dunning, CTO
DeepDyve

Re: Open Relevance Project?

Posted by Andrzej Bialecki <ab...@getopt.org>.
Michael McCandless wrote:

> I think it's actually quite a challenge to do well.  EG it's easy to
> make a corpus that's too easy because it's highly diverse (and thus
> most search engines have no trouble pulling back relevant results).
> Instead, I think the content set should be well & tightly scoped to a
> certain topic, and not necessarily that large (ie we don't need a huge
> number of documents).  It would help if that scoping is towards
> content that many people find "of interest" so we get "accurate"
> judgements by as wide an audience as possible.
> 
> EG how about coverage of the 2009 H1N1 outbreak (that's licensed
> appropriately)?  Or... the 2008 US presidential election?  Or...
> research on Leukemia (but I fear such content is not typically
> licensed appropriately, nor will it have wide interest).

These are good ideas. It's difficult not only to collect a meaningful 
corpus, but also to distribute it later if it weighs in at a hundred 
GB or more.


> 
> What does "using Nutch to crawl Creative Commons" actually mean?  Can
> I browse the content that's being crawled?

Yes. It's easy to collect a lot of web pages starting from a seed list 
and expanding the crawling frontier to linked resources, while applying 
CC license filters. Nutch provides a lot of tools out of the box that we 
need anyway, such as keeping track of page status, following outlinks, 
parsing, working with web graph (important for scoring web documents), 
indexing, searching and content browsing.
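As a rough sketch of what such a license filter looks for (an
illustration only, not the actual Nutch plugin code): CC-licensed
pages commonly declare their license with a rel="license" link
pointing at creativecommons.org, which a crawler can check per page.

```python
import re

# Match <a> or <link> tags that carry rel="license" and an href into
# the creativecommons.org license namespace.  Attribute order is
# assumed fixed here; a real filter would parse the HTML properly.
CC_LINK = re.compile(
    r'<(?:a|link)\b[^>]*rel=["\']license["\'][^>]*'
    r'href=["\'](https?://creativecommons\.org/licenses/[^"\']+)["\']',
    re.IGNORECASE)

def cc_license_url(html):
    """Return the CC license URL declared by the page, or None."""
    match = CC_LINK.search(html)
    return match.group(1) if match else None

page = ('<html><head><link rel="license" '
        'href="http://creativecommons.org/licenses/by/3.0/">'
        '</head></html>')
```

A crawler would drop (or down-score) any fetched page for which this
check returns None before the page ever enters the corpus.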


> Also, to help us build up the relevance judgements, I think we should
> build a basic custom app for collecting queries as well as annotating
> them.  I should be able to go to that page and run my own queries,
> which are collected.  Then, I should be able to browse previously
> collected queries, click on them, and add my own judgement.  The site
> should try to offer up queries that are "in need" of judgements.  It
> should run the search and let me step through the results, marking
> those that are relevant; but we would then bias the results to that
> search engine; maybe under the hood we rotate through search engines
> each time?

Comparing results across search engines is clearly a challenge. Among 
other things, this requires that the corpus we use with the engines we 
operate (Lucene? KinoSearch? other open source engines?) contains at 
least the top-X (where X > N) URLs returned from external engines for 
every query - otherwise we won't be able to compare the results.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Open Relevance Project?

Posted by Ted Dunning <te...@gmail.com>.
The standard technique for this is what's called "pooled relevance", where
the top results from all the search engines are combined into a pool for
judging.

In our case, we should probably make the pool dynamic so that tests on new
search engines will enlarge the pool.

Related to that, we should not pretend that we can measure recall for any
meaningful sized corpus.
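A minimal sketch of the pooling idea (simplified relative to how TREC
actually builds its pools; the names and depth are illustrative):

```python
def build_pool(runs, depth=10):
    """Union of the top-`depth` results from every engine's run.

    `runs` maps engine name -> ranked list of doc ids; the returned
    pool is the set of documents assessors would be asked to judge.
    Anything outside the pool is treated as unjudged, which is why
    recall over the full corpus can't honestly be measured this way.
    """
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])
    return pool

runs = {
    "lucene":  ["d1", "d2", "d3"],
    "engineB": ["d3", "d4", "d5"],
}
pool = build_pool(runs, depth=3)

# A new engine's run simply enlarges the existing pool:
pool |= set(["d6", "d7"][:3])
```

The dynamic-pool suggestion above amounts to re-running this union and
judging only the documents a new engine contributes.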

On Mon, May 11, 2009 at 1:01 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> It should run the search and let me step through the results, marking
> those that are relevant; but we would then bias the results to that
> search engine; maybe under the hood we rotate through search engines
> each time?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Open Relevance Project?

Posted by Michael McCandless <lu...@mikemccandless.com>.
I'd love to see a resource like this (it's high time!), and I'll try
to help when/where I can, starting with some initial
comments/questions:

I think it's actually quite a challenge to do well.  EG it's easy to
make a corpus that's too easy because it's highly diverse (and thus
most search engines have no trouble pulling back relevant results).
Instead, I think the content set should be well & tightly scoped to a
certain topic, and not necessarily that large (ie we don't need a huge
number of documents).  It would help if that scoping is towards
content that many people find "of interest" so we get "accurate"
judgements by as wide an audience as possible.

EG how about coverage of the 2009 H1N1 outbreak (that's licensed
appropriately)?  Or... the 2008 US presidential election?  Or...
research on Leukemia (but I fear such content is not typically
licensed appropriately, nor will it have wide interest).

What does "using Nutch to crawl Creative Commons" actually mean?  Can
I browse the content that's being crawled?

Also, to help us build up the relevance judgements, I think we should
build a basic custom app for collecting queries as well as annotating
them.  I should be able to go to that page and run my own queries,
which are collected.  Then, I should be able to browse previously
collected queries, click on them, and add my own judgement.  The site
should try to offer up queries that are "in need" of judgements.  It
should run the search and let me step through the results, marking
those that are relevant; but we would then bias the results to that
search engine; maybe under the hood we rotate through search engines
each time?

Do we have anyone involved who's built similar corpora before?  Or has
anyone read papers on how prior corpora were designed/created?

Mike

On Mon, May 11, 2009 at 12:07 PM, Grant Ingersoll <gs...@apache.org> wrote:
> A few of us who are interested in an Open Relevance assessment project (ala
> TREC) have started to put some thoughts down on "paper" over at
> http://wiki.apache.org/lucene-java/OpenRelevance
>
> Thus, if you'd like to somehow participate (TBD what that actually means
> just yet) in developing a set of open collections, queries and assessments
> for relevance testing, let's discuss here and on that Wiki page.
>
> The basic gist of it is, we'd like to crawl Creative Commons and/or other
> free content, redistribute it along with queries and judgments, thus fueling
> the testing capabilities to further improve Lucene's search quality as well
> as, of course, providing the means for a completely open assessment process
> whereby anyone can participate without having to fork up money to license 20
> year old copyrighted news articles that are of no other value whatsoever
> other than testing.
>
> At this point, we're open to a lot of ideas.  Once we solidify a bit, then
> we'd like to make it an official Lucene subproject and get our own resources
> as well as figure out how to crawl and host the content using ASF
> infrastructure (without making the ASF infra. team upset!)
>
> Cheers,
> Grant
>