Posted to dev@lucene.apache.org by "Karl Wettin (JIRA)" <ji...@apache.org> on 2007/03/20 11:58:32 UTC
[jira] Commented: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)
[ https://issues.apache.org/jira/browse/LUCENE-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482367 ]
Karl Wettin commented on LUCENE-836:
------------------------------------
Regarding data and user queries, I have a 150 000 document corpus with 4 000 000 queries that I might be able to convince the owners to release. It is great data, but a bit politically incorrect (torrents).
There is some simple Wikipedia harvesting in LUCENE-826, and I'm in the middle of rewriting it into a more general Wikipedia library for text mining purposes. Perhaps you have some ideas you want to put in there? I plan something like this:
public class WikipediaCorpus {
    Map<String, String> wikipediaDomainPrefixByLanguageISO;
    Map<URL, WikipediaArticle> harvestedArticle;

    public WikipediaArticle getArticle(String languageISO, String title) {
        ..
    }
}

public class WikipediaArticle {
    WikipediaArticle(URL url) {
        ..
    }

    String languageISO;
    String title;
    String[] contentParagraphs;

    Date[] modified;

    Map<String, String> articleInOtherLanguagesByLanguageISO;
}
> Benchmarks Enhancements (precision/recall, TREC, Wikipedia)
> -----------------------------------------------------------
>
> Key: LUCENE-836
> URL: https://issues.apache.org/jira/browse/LUCENE-836
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Other
> Reporter: Grant Ingersoll
> Priority: Minor
>
> Would be great if the benchmark contrib had a way of providing precision/recall benchmark information à la TREC. I don't know what the copyright issues are for the TREC queries/data (I think the queries are available, but not sure about the data), so not sure if this is even feasible, but I could imagine we could at least incorporate support for it for those who have access to the data. It has been a long time since I have participated in TREC, so perhaps someone more familiar w/ the latest can fill in the blanks here.
> Another option is to ask for volunteers to create queries and make judgments for the Reuters data, but that is a bit more complex and probably not necessary. Even so, an Apache licensed set of benchmarks may be useful for the community as a whole. Hmmm....
> Wikipedia might be another option instead of Reuters to set up as a download for benchmarking, as it is quite large and I believe the licensing terms are quite amenable. Having a larger collection would be good for stressing Lucene more and would give many users a demonstration of how Lucene handles large collections.
> At any rate, this kind of information could be useful for people looking at different indexing schemes, formats, payloads and different query strategies.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: [jira] Commented: (LUCENE-836) Benchmarks Enhancements (precision/recall, TREC, Wikipedia)
Posted by Grant Ingersoll <gr...@gmail.com>.
I think the Reuters corpus is pretty good and is pretty well known in
the community. Probably the most important part would be to build up
a set of judgments. I don't think it is too hard to come up w/
50-100 questions/queries, but creating the relevance pool will be
more difficult. I suppose we could set up a social networking site to
harvest judgments... :-)
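
To make the role of the judgments concrete, here is a minimal sketch of how precision and recall could be computed for one query once a set of judged-relevant documents exists. The class name PrecisionRecall and the use of plain string document ids are assumptions for illustration only; this is not an existing Lucene or benchmark contrib API.

```java
import java.util.List;
import java.util.Set;

/**
 * Hypothetical helper (not part of Lucene): scores one ranked result
 * list against the set of document ids judged relevant for a query.
 * Assumes each document id appears at most once in the ranked list.
 */
final class PrecisionRecall {

    /** Counts the retrieved documents that were judged relevant. */
    static int hits(List<String> ranked, Set<String> relevant) {
        int hits = 0;
        for (String docId : ranked) {
            if (relevant.contains(docId)) {
                hits++;
            }
        }
        return hits;
    }

    /** Fraction of retrieved documents that are relevant; 0.0 if nothing was retrieved. */
    static double precision(List<String> ranked, Set<String> relevant) {
        if (ranked.isEmpty()) {
            return 0.0;
        }
        return (double) hits(ranked, relevant) / ranked.size();
    }

    /** Fraction of relevant documents that were retrieved; 0.0 if there are no judgments. */
    static double recall(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) hits(ranked, relevant) / relevant.size();
    }
}
```

To score only the top k results, the caller would pass ranked.subList(0, k). Averaging these values over the 50-100 queries would give one benchmark figure per run.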
The 4M queries would be good for load testing.
Wikipedia stuff is good, but you need to be able to handle/remove the
redirects, otherwise you have a tendency to get redirect pages as
your top matches due to length normalization. Plus it is really big
to download.
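
A rough sketch of how redirect pages might be detected before indexing, assuming the harvester has the raw wikitext and that a redirect page begins with "#REDIRECT [[Target]]" (the English-language form; localized keywords are not handled here). The class and method names are hypothetical and not taken from LUCENE-826.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical helper: recognizes Wikipedia redirect pages from raw
 * wikitext so they can be skipped or resolved before indexing.
 */
final class WikipediaRedirects {

    // Matches "#REDIRECT [[Target]]" at the start of the text, ignoring case.
    // The target stops at "]", "|" or "#" so section anchors are dropped.
    private static final Pattern REDIRECT = Pattern.compile(
            "^\\s*#REDIRECT\\s*:?\\s*\\[\\[([^\\]|#]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns true if the wikitext is a redirect page. */
    static boolean isRedirect(String wikiText) {
        return redirectTarget(wikiText) != null;
    }

    /** Returns the title the page redirects to, or null if it is not a redirect. */
    static String redirectTarget(String wikiText) {
        if (wikiText == null) {
            return null;
        }
        Matcher m = REDIRECT.matcher(wikiText);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

Skipping pages for which isRedirect returns true keeps these very short documents out of the index, so they cannot be boosted to the top by length normalization.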
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/