You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2006/11/06 04:24:19 UTC

[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12447346 ] 
            
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

OK, here is a first crack at a standard benchmark contribution based on Andrzej original contribution and some updates/changes by me.  I wasn't nearly as ambitious  as some of the comments attached here, but I think most of them are good things to strive for and will greatly benefit Lucene.

I checked in the basic contrib directory structure, plus some library dependencies, as I wasn't sure how svn diff handles those.  I am posting this in patch format to solicit comments first instead of just committing and accepting patches.  My thoughts are I'll take a round of comments and make updates as warranted and then make an initial commit.  

I am particularly interested in the interface/Driver specification and whether people think this approach is useful or not.  My thoughts behind it were it might be nice to have a standard way of creating/running benchmarks that could be driven by XML configuration files (some examples are in the conf directory).  I am not 100% sold on this and am open to compelling arguments why we should just have each benchmark have it's own main() method.

As for the actual Benchmarker, I have created a "standard" version, which runs off the Reuters collection that is downloaded automatically by the ANT task.  There are two ANT targets for the two benchmarks: run-micro-standard and run-standard.  The micro version takes a few minutes to run on my machine (it indexes 2000 docs), the other one takes a lot longer.

There are several support classes in the stats and util packages.  The stats package supports building and maintaining information about benchmarks.  The utils package contains one class for extracting information out of the Reuters documents for indexing.

The ReutersQueries class contains a set of Queries I created by looking at some of the docs in the collection and are a myriad of term, phrase, span, wildcard and other types of queries.  They aren't exhaustive by any means.

It should be stressed that these benchmarks are best used in gathering before and after numbers.  Furthermore, these aren't the be all end all of benchmarking for Lucene.  I hope the interface nature will encourage others to submit benchmarks for specific areas of Lucene not covered by this version.

Thanks to all who contributed their code/thoughts.  Patch to follow

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org