Posted to dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2007/03/18 16:03:09 UTC

[jira] Created: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)

Benchmarks Enhancements (precision/recall, TREC, Wikipedia)
-----------------------------------------------------------

                 Key: LUCENE-836
                 URL: https://issues.apache.org/jira/browse/LUCENE-836
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Other
            Reporter: Grant Ingersoll
            Priority: Minor


Would be great if the benchmark contrib had a way of providing precision/recall benchmark information a la TREC. I don't know what the copyright issues are for the TREC queries/data (I think the queries are available, but not sure about the data), so I am not sure if this is even feasible, but I imagine we could at least incorporate support for it for those who have access to the data. It has been a long time since I have participated in TREC, so perhaps someone more familiar w/ the latest can fill in the blanks here.

Another option is to ask for volunteers to create queries and make judgments for the Reuters data, but that is a bit more complex and probably not necessary. Even so, an Apache-licensed set of benchmarks may be useful for the community as a whole. Hmmm....

Wikipedia might be another option instead of Reuters to set up as a download for benchmarking, as it is quite large and I believe the licensing terms are quite amenable. Having a larger collection would be good for stressing Lucene more and would give many users a demonstration of how Lucene handles large collections.

At any rate, this kind of information could be useful for people looking at different indexing schemes, formats, payloads and different query strategies.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-836:
-------------------------------

    Attachment: lucene-836.benchmark.quality.patch

A ready-to-commit patch for search quality benchmarking. 

Javadocs can be reviewed at http://people.apache.org/~doronc/api/ - see the benchmark.quality package for a code sample to run the quality benchmark with your input index, queries, judgments, etc.
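
For a feel of the API without going to the javadocs, here is a rough sketch of such a run, pieced together from the benchmark.quality javadocs linked above; the class and method names (TrecTopicsReader, TrecJudge, SimpleQQParser, QualityBenchmark, QualityStats, SubmissionReport) are assumed from there, and the exact signatures in the committed patch may differ:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import org.apache.lucene.benchmark.quality.*;
import org.apache.lucene.benchmark.quality.trec.TrecJudge;
import org.apache.lucene.benchmark.quality.trec.TrecTopicsReader;
import org.apache.lucene.benchmark.quality.utils.SimpleQQParser;
import org.apache.lucene.benchmark.quality.utils.SubmissionReport;
import org.apache.lucene.search.IndexSearcher;

public class QualityRunSketch {
  public static void main(String[] args) throws Exception {
    PrintWriter logger = new PrintWriter(new OutputStreamWriter(System.out), true);

    // read TREC-style topics (queries) and qrels (relevance judgments)
    QualityQuery[] qqs = new TrecTopicsReader().readQueries(
        new BufferedReader(new FileReader("topics.txt")));
    Judge judge = new TrecJudge(new BufferedReader(new FileReader("qrels.txt")));
    judge.validateData(qqs, logger);

    // turn each topic's "title" into a query against the index's "body" field
    QualityQueryParser qqParser = new SimpleQQParser("title", "body");

    // "docname" stands for whatever stored field identifies a document in your index
    IndexSearcher searcher = new IndexSearcher("path/to/index");
    QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, "docname");
    SubmissionReport submitLog = new SubmissionReport(logger, "lucene");

    // run the benchmark, then print per-query and averaged precision/recall
    // (the exact execute() signature may differ in this patch)
    QualityStats[] stats = qrun.execute(judge, submitLog, logger);
    QualityStats.average(stats).log("SUMMARY", 2, logger, "  ");
  }
}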

I would like to commit this in a day or two, to make it easier to proceed with LUCENE-965 and the other search quality ideas - comments (especially on the API) are most welcome...


[jira] Updated: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-836:
-------------------------------

    Attachment: lucene-836.benchmark.quality.patch

The updated patch is cleaner and almost ready to commit: the interfaces are cleaner now, and most of the Javadocs are in place. Package Javadocs are still missing. 


[jira] Updated: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-836:
-------------------------------

    Attachment: lucene-836.benchmark.quality.patch

lucene-836.benchmark.quality.patch adds a new package "quality" under o.a.l.benchmark. 

This is also a follow-up to some of http://www.mail-archive.com/java-dev@lucene.apache.org/msg10851.html

The patch is based on the trunk folder. 
The fastest way to test it: "ant test" from the contrib/benchmark dir.
To see more output in this run, try "ant test -Dtests.verbose=true".

This is early code, not ready to commit - I wanted to show it sooner for feedback, especially on the API. 

For a quick view of the API see benchmark.quality at http://people.apache.org/~doronc/api (note that there are not many javadocs yet - I would rather wait with those until the API is settled.)

Code in this patch:
  - is extensible;
  - can run a quality benchmark;
  - can report quality results, compared against given judgments (optional);
  - can create a submission log (optional);
  - allows the format of the submission log to be modified, by extending a logger class;
  - allows the format of the inputs - queries, judgments - to be modified, by extending the default readers, or by providing pre-read ones.

There is a general "Judge" interface - answering whether a given doc name is valid for a given "QualityQuery" - and one implementation of it, based on TREC's QRels. An alternative judgment format, TRels for instance, would simply mean another implementation of the "Judge" interface. (I would love a better name for it, btw...)
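
For example, plugging in a homegrown judgment format would only take a small class along these lines - a sketch, assuming "Judge" exposes isRelevant/maxRecall/validateData as in the javadocs linked above (exact method signatures may differ in the patch):

import java.io.PrintWriter;
import java.util.Map;
import java.util.Set;

// Hypothetical Judge backed by an in-memory map: query id -> names of relevant docs.
public class MapBackedJudge implements Judge {
  private final Map<String, Set<String>> relevantDocs;

  public MapBackedJudge(Map<String, Set<String>> relevantDocs) {
    this.relevantDocs = relevantDocs;
  }

  public boolean isRelevant(String docName, QualityQuery query) {
    Set<String> docs = relevantDocs.get(query.getQueryID());
    return docs != null && docs.contains(docName);
  }

  public int maxRecall(QualityQuery query) {
    Set<String> docs = relevantDocs.get(query.getQueryID());
    return docs == null ? 0 : docs.size();
  }

  public boolean validateData(QualityQuery[] qqs, PrintWriter logger) {
    // flag queries that have no judged-relevant documents at all
    boolean allJudged = true;
    for (int i = 0; i < qqs.length; i++) {
      if (maxRecall(qqs[i]) == 0) {
        allJudged = false;
        if (logger != null) {
          logger.println("no judgments for query " + qqs[i].getQueryID());
        }
      }
    }
    return allJudged;
  }
}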

A new TestQualityRun tests this package on the Reuters collection, so that test's source is a good place to start to see how to run a quality test.


[jira] Assigned: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen reassigned LUCENE-836:
----------------------------------

    Assignee: Doron Cohen


[jira] Resolved: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen resolved LUCENE-836.
--------------------------------

       Resolution: Fixed
    Lucene Fields: [Patch Available]  (was: [New])

Committed. 


[jira] Commented: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516004 ] 

Grant Ingersoll commented on LUCENE-836:
----------------------------------------

+1

Applies cleanly, and I like the API, but I think you should have a Jury object too...

I can't actually run it w/o TREC, but the tests pass. I think I might have TREC Arabic lying around somewhere; maybe I will give it a run w/ that some day, but don't wait on me to apply this.


[jira] Commented: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516084 ] 

Doron Cohen commented on LUCENE-836:
------------------------------------

Thanks for the review Grant!

Note that you can see the output of the default Logger and default SubmissionReport by running the TestQualityRun JUnit test with -Dtests.verbose=true. The submission report and the quality log both go to stdout, so they will be intermixed, but you'll at least get to see what gets printed.


Re: [jira] Commented: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)

Posted by Grant Ingersoll <gr...@gmail.com>.
I think the Reuters corpus is pretty good and it is pretty well known in the community. Probably the most important part would be to build up a set of judgments. I don't think it is too hard to come up w/ 50-100 questions/queries, but creating the relevance pool will be more difficult. I suppose we could set up a social networking site to harvest judgments... :-)

The 4M queries would be good for load testing.

Wikipedia stuff is good, but you need to be able to handle/remove the redirects, otherwise you have a tendency to get redirect pages as your top matches due to length normalization. Plus it is really big to download.
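
A minimal sketch of that kind of redirect filter (a hypothetical helper, not part of any attached patch): Wikipedia redirect stubs can be recognized from their wiki source, which begins with "#REDIRECT [[Target]]", and skipped before indexing.

// Hypothetical helper: drop redirect stubs before indexing, so the nearly
// empty redirect pages do not float to the top of results via length
// normalization.
static boolean isRedirect(String wikiText) {
  // redirect pages start with "#REDIRECT [[Target title]]", case-insensitive
  String t = wikiText.trim();
  return t.regionMatches(true, 0, "#REDIRECT", 0, "#REDIRECT".length());
}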


On Mar 20, 2007, at 6:58 AM, Karl Wettin (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482367 ]
>
> Karl Wettin commented on LUCENE-836:
> ------------------------------------
>
> Regarding data and user queries, I have a 150 000 document corpus with 4 000 000 queries that I might be able to convince the owners to release. It is great data, but a bit politically incorrect (torrents).
>
> There is some simple Wikipedia harvesting in LUCENE-826, and I'm in the middle of rewriting it to a more general Wikipedia library for text mining purposes. Perhaps you have some ideas you want to put in there? I plan something like this:
>
> public class WikipediaCorpus {
>   Map<String, String> wikipediaDomainPrefixByLanguageISO;
>   Map<URL, WikipediaArticle> harvestedArticle;
>
>   public WikipediaArticle getArticle(String languageISO, String title) {
>     ..
>   }
> }
>
> public class WikipediaArticle {
>   WikipediaArticle(URL url) {
>     ..
>   }
>
>   String languageISO;
>   String title;
>   String[] contentParagraphs;
>
>   Date[] modified;
>
>   Map<String, String> articleInOtherLanguagesByLanguageISO;
>
> }

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



[jira] Commented: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482367 ] 

Karl Wettin commented on LUCENE-836:
------------------------------------

Regarding data and user queries, I have a 150 000 document corpus with 4 000 000 queries that I might be able to convince the owners to release. It is great data, but a bit politically incorrect (torrents). 

There is some simple Wikipedia harvesting in LUCENE-826, and I'm in the middle of rewriting it to a more general Wikipedia library for text mining purposes. Perhaps you have some ideas you want to put in there? I plan something like this:

public class WikipediaCorpus {
  Map<String, String> wikipediaDomainPrefixByLanguageISO;
  Map<URL, WikipediaArticle> harvestedArticle;

  public WikipediaArticle getArticle(String languageISO, String title) {
    ..
  }
}

public class WikipediaArticle {
  WikipediaArticle(URL url) {
    ..
  }

  String languageISO;
  String title;
  String[] contentParagraphs;

  Date[] modified;

  Map<String, String> articleInOtherLanguagesByLanguageISO;

}
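
Usage of the planned API would presumably look something like this (hypothetical, just following the sketch above):

// fetch (or reuse the already harvested) English article by title
WikipediaCorpus corpus = new WikipediaCorpus();
WikipediaArticle article = corpus.getArticle("en", "Lucene");
String firstParagraph = article.contentParagraphs[0];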


