Posted to dev@lucene.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/09/21 07:16:22 UTC

[jira] Created: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Lucene benchmark: objective performance test for Lucene
-------------------------------------------------------

                 Key: LUCENE-675
                 URL: http://issues.apache.org/jira/browse/LUCENE-675
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Andrzej Bialecki 


We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.

Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.
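
For illustration only, a rough sketch of the retrieve-and-cache idea in Java; the class name and cache layout are made up, and unpacking the tar.gz archive is left as a separate step:

  import java.io.File;
  import java.io.FileOutputStream;
  import java.io.InputStream;
  import java.io.OutputStream;
  import java.net.URL;

  // Hypothetical sketch: download a corpus archive once and reuse the cached copy.
  public class CorpusFetcher {
    public static File fetch(String url, File cacheDir) throws Exception {
      cacheDir.mkdirs();
      String name = url.substring(url.lastIndexOf('/') + 1);
      File cached = new File(cacheDir, name);
      if (cached.exists() && cached.length() > 0) {
        return cached;                      // cache hit: no download needed
      }
      InputStream in = new URL(url).openStream();
      OutputStream out = new FileOutputStream(cached);
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      out.close();
      in.close();
      return cached;                        // still a .tar.gz; unpacking is a separate step
    }
  }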

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Grant Ingersoll updated LUCENE-675:
-----------------------------------

    Attachment: benchmark.patch

Initial benchmark code based on Andrzej's original contribution, plus some changes by me to use the Reuters "standard" collection maintained at http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz



> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436518 ] 
            
Andrzej Bialecki  commented on LUCENE-675:
------------------------------------------

The dependency on commons-compress could be avoided - I used it just to be able to unpack tar.gz files, and we can use Ant for that. If you meant the dependency on the corpus - can't Ant download that too as a dependency?

Re: Project Gutenberg - good point, this is a good source for multi-lingual documents. The "Europarl" collection is another, although a bit heftier, so it could be suitable for running large-scale benchmarks, with texts from Project Gutenberg for running small-scale tests.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436949 ] 
            
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

My comments are marked by GSI
-----------

In the mean time I've been using Europarl for my testing.

GSI: perhaps you can contribute once this is setup

Also important to realize is that there are many dimensions to test. With
lock-less I'm focusing entirely on "wall clock time to open readers
and writers" in different use cases like pure indexing, pure
searching, highly interactive mixed indexing/searching, etc. And this
is actually hard to test cleanly because in certain cases (highly
interactive case, or many readers case), the current Lucene hits many
"commit lock" retries and/or timeouts (whereas lock-less doesn't). So
what's a "fair" comparison in this case?

GSI: I am planning on taking Andrzej's contribution and refactoring it into components that can be reused, as well as creating a "standard" benchmark which will be easy to run through a simple Ant task, e.g. ant run-baseline

GSI: From here, anybody can contribute their own benchmarks (I will provide interfaces to facilitate this), which others can choose to run.
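
GSI: To make that concrete, the interface could look roughly like this - purely a sketch, these names are not final:

  import java.util.Properties;

  // Rough sketch only - not the committed API.
  public interface Benchmarker {
    /** Runs the benchmark with the given options and returns the collected stats. */
    BenchmarkResult runBenchmark(Properties options) throws Exception;
  }

  // Placeholder result holder; real code would track per-operation timings.
  class BenchmarkResult {
    long elapsedMillis;
    int recordCount;
  }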


In addition to standardizing on the corpus I think we ideally need
standardized hardware / OS / software configuration as well, so the
numbers are easily comparable across time. 

GSI: Not really feasible unless you are proposing to buy us machines :-)  I think more important is the ability to do a before and after evaluation (that runs each test several times) as you make changes.  Anybody should be able to do the same.  Run the benchmark, apply the patch and then rerun the benchmark.


> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Doron Cohen updated LUCENE-675:
-------------------------------

    Attachment: timedata.zip

I tried it and it is working nicely! -
The 1st run downloaded the documents from the Web before starting to index.
The 2nd run started right off - as the input docs were already in place - great.

Seems the only output is what is printed to stdout, right? 

I got something like this: 
----------------------------
     [echo] Working Directory: work
     [java] Testing 4 different permutations.
     [java] #-- ID: td-00_10_10, Sun Nov 05 22:40:49 PST 2006, heap=1065484288 --
     [java] # source=work\reuters-out, directory=org.apache.lucene.store.FSDirectory@D:\devoss\lucene\java\trunk\contrib\benchmark\work\index
     [java] # maxBufferedDocs=10, mergeFactor=10, compound=true, optimize=true
     [java] # Query data: R-reopen, W-warmup, T-retrieve, N-no
     [java] # qd-0110 R W NT [body:salomon]
     [java] # qd-0111 R W T [body:salomon]
     [java] # qd-0100 R NW NT [body:salomon]
...
     [java] # qd-14011 NR W T [body:fo*]
     [java] # qd-14000 NR NW NT [body:fo*]
     [java] # qd-14001 NR NW T [body:fo*]

     [java] Start Time: Sun Nov 05 22:41:38 PST 2006
     [java]  - processed 500, run id=0
     [java]  - processed 1000, run id=0
     [java]  - processed 1500, run id=0
     [java]  - processed 2000, run id=0
     [java] End Time: Sun Nov 05 22:41:48 PST 2006
     [java] warm = Warm Index Reader
     [java] srch = Search Index
     [java] trav = Traverse Hits list, optionally retrieving document

     [java] # testData id	operation	runCnt	recCnt	rec/s	avgFreeMem	avgTotalMem
     [java] td-00_100_100	addDocument	1	2000	472.0321	4493681	22611558
     [java] td-00_100_100	optimize	1	1	2.857143	4229488	22716416
     [java] td-00_100_100	qd-0110-warm	1	2000	40000.0	4250992	22716416
     [java] td-00_100_100	qd-0110-srch	1	1	Infinity	4221288	22716416
...
     [java] td-00_100_100	qd-4110-srch	1	1	Infinity	3993624	22716416
     [java] td-00_100_100	qd-4110-trav	1	0	NaN	3993624	22716416
     [java] td-00_100_100	qd-4111-warm	1	2000	50000.0	3853192	22716416
...
BUILD SUCCESSFUL
Total time: 1 minute 0 seconds
----------------------------

I think the "infinity" and "NAN" are caused by op time too short for divide-by-sec.
This can be avoided by modifying getRate() in TimeData:
  public double getRate() {
    // guard against elapsed == 0, which otherwise produces Infinity or NaN
    double rps = (double) count * 1000.0 / (double) (elapsed > 0 ? elapsed : 1);
    return rps;
  }

I very much like the logic of loading test data from the Web, and the scaleUp and maximumDocumentsToIndex params are handy.

It seems that all the test logic and some of its data (queries) are Java-coded. I initially thought of a setting where we define tasks/jobs that are parameterized, like:

- createIndex(params)
- writeToIndex(params):
  - addDocs()
  - optimize()
- readFromIndex(params):
  - searchIndex()
  - fetchData()

...and compose a test via an XML file that says which of these simple jobs to run, with what params, in which order, serial or parallel, how long/often, etc.
Then creating a different test is as easy as creating a different XML file that configures that test.
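
Just to illustrate what I mean (names are invented; this is not working benchmark code), something along these lines:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Properties;

  // A test could then be an XML file such as:
  //   <test>
  //     <task name="createIndex"/>
  //     <task name="addDocs" count="2000"/>
  //     <task name="optimize"/>
  //     <task name="searchIndex" query="body:salomon"/>
  //   </test>
  interface Task {
    void run(Properties params) throws Exception;   // e.g. createIndex, addDocs, optimize, searchIndex
  }

  class TaskSequence {
    private final List tasks = new ArrayList();     // pre-generics style

    void add(Task task) { tasks.add(task); }

    // Runs the configured tasks in order; a parallel variant could spawn threads.
    void run(Properties params) throws Exception {
      for (int i = 0; i < tasks.size(); i++) {
        ((Task) tasks.get(i)).run(params);
      }
    }
  }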

On the other hand, chances are, I know, that the most useful cases would be those already defined here - standard and micro-standard - so one could ask "why bother defining these building blocks". I am not sure here, but thought I'd bring it up.

About using the driver - it seems nice and clean to me. I don't know Digester, but it seems to read the config from the XML correctly.

Other comments:
1. I think there is a redundant call to params.showRunData(params.getId()) in runBenchmark(File, Options).
2. It seems that rec/s would be computed a bit more accurately by aggregating elapsed times (instead of rates) in showRunData() - a small sketch of this appears after the example output below.
3. If TimeData is not found (only memData), I think an additional 0.0 should be printed.
4. Column alignment with tabs and floats is imperfect. :-)
5. It would be nice, I think, to also get a summary of the results by "task" - e.g. srch, optimize - something like:
     [java] # testData id     operation           runCnt     recCnt          rec/s       avgFreeMem      avgTotalMem
     [java]                   warm                    60       2000       42,628.8        8,235,758       23,048,192
     [java]                   srch                   120          1          571.4        8,300,613       23,048,192
     [java]                   optimize                 1          1            2.9        9,375,732       23,048,192
     [java]                   trav                   120        107       30,517.8        8,326,046       23,048,192
     [java]                   addDocument              1       2000          441.8        7,310,929       22,206,872
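
And here is a tiny sketch of the aggregation idea from point 2 above (field names are only illustrative):

  // Sum counts and elapsed times per operation, then divide once; very short
  // runs no longer dominate the reported rec/s, and Infinity/NaN disappear.
  class OpSummary {
    long totalRecords;
    long totalElapsedMillis;

    void add(long records, long elapsedMillis) {
      totalRecords += records;
      totalElapsedMillis += elapsedMillis;
    }

    double recsPerSecond() {
      long elapsed = totalElapsedMillis > 0 ? totalElapsedMillis : 1;
      return totalRecords * 1000.0 / elapsed;
    }
  }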

Attached timedata.zip has modified TimeData.java and TestData.java for [1 to 5] above, and for the NaN/Infinity issue.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12447346 ] 
            
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

OK, here is a first crack at a standard benchmark contribution based on Andrzej's original contribution and some updates/changes by me.  I wasn't nearly as ambitious as some of the comments attached here, but I think most of them are good things to strive for and will greatly benefit Lucene.

I checked in the basic contrib directory structure, plus some library dependencies, as I wasn't sure how svn diff handles those.  I am posting this in patch format to solicit comments first instead of just committing and accepting patches.  My thoughts are I'll take a round of comments and make updates as warranted and then make an initial commit.  

I am particularly interested in the interface/Driver specification and whether people think this approach is useful or not.  My thoughts behind it were that it might be nice to have a standard way of creating/running benchmarks that could be driven by XML configuration files (some examples are in the conf directory).  I am not 100% sold on this and am open to compelling arguments why we should just have each benchmark have its own main() method.

As for the actual Benchmarker, I have created a "standard" version, which runs off the Reuters collection that is downloaded automatically by the Ant task.  There are two Ant targets for the two benchmarks: run-micro-standard and run-standard.  The micro version takes a few minutes to run on my machine (it indexes 2000 docs); the other one takes a lot longer.

There are several support classes in the stats and util packages.  The stats package supports building and maintaining information about benchmarks.  The utils package contains one class for extracting information out of the Reuters documents for indexing.

The ReutersQueries class contains a set of queries I created by looking at some of the docs in the collection; they include a mix of term, phrase, span, wildcard and other types of queries.  They aren't exhaustive by any means.
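
For illustration only, the mix is roughly of this shape; the terms below are made up for this comment and are not the actual contents of ReutersQueries:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.PhraseQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.WildcardQuery;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  // Illustrative query mix against a "body" field; the terms here are invented.
  class SampleQueryMix {
    static Query[] queries() {
      TermQuery term = new TermQuery(new Term("body", "salomon"));

      PhraseQuery phrase = new PhraseQuery();
      phrase.add(new Term("body", "interest"));
      phrase.add(new Term("body", "rates"));

      WildcardQuery wildcard = new WildcardQuery(new Term("body", "fo*"));

      SpanNearQuery span = new SpanNearQuery(new SpanQuery[] {
          new SpanTermQuery(new Term("body", "crude")),
          new SpanTermQuery(new Term("body", "oil")) }, 5, true);

      return new Query[] { term, phrase, wildcard, span };
    }
  }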

It should be stressed that these benchmarks are best used for gathering before and after numbers.  Furthermore, these aren't the be-all and end-all of benchmarking for Lucene.  I hope the interface nature will encourage others to submit benchmarks for specific areas of Lucene not covered by this version.

Thanks to all who contributed their code/thoughts.  Patch to follow.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462117 ] 

Grant Ingersoll commented on LUCENE-675:
----------------------------------------

Doron,

When I apply your patch, I am getting strange errors.  It seems to go through cleanly, but then the new files (for instance, byTask.stats.Report.java) each have their whole contents occurring twice, thus causing duplicate class exceptions.  This happens for all the files in the byTask package.  The changes in the other files apply cleanly.

I applied the patch as: patch -p0 -i <patch file> as I always do on a clean version.

I suspect that your last comment may be at the root of the issue. Can you try applying this again to a clean version and see if you still have issues or whether it is something I am missing?  Can you regenerate this patch, perhaps using a command line tool?  Looking at the patch file, I am not sure what the issue is.  

Otherwise, based on the documentation, this sounds really interesting and useful.  Based on some of your other patches, I assume you are using this to do benchmarking, no?

Thanks,
Grant

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: https://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>            Priority: Minor
>         Attachments: benchmark.byTask.patch, benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Closed: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll closed LUCENE-675.
----------------------------------


This has been committed and is available for use.  New issues can be opened on specific problems.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: https://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>            Priority: Minor
>         Attachments: benchmark.byTask.patch, benchmark.patch, BenchmarkingIndexer.pm, byTask.2.patch.txt, byTask.jre1.4.patch.txt, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436979 ] 
            
Michael McCandless commented on LUCENE-675:
-------------------------------------------

I agree: a simple ant-accessible benchmark to enable "before and
after" runs is an awesome step forward.  And that a standardized HW/SW
testing environment is not really realistic now.

> GSI: perhaps you can contribute once this is setup 

I will try!

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436980 ] 
            
Doron Cohen commented on LUCENE-675:
------------------------------------

A few things that would be nice to have in this performance package/framework:

() indexing only overall time.
() indexing only time changes as the index grows (might be the case that indexing performance starts to misbehave from a certain size or so).
() search single user while indexing
() search only single user
() search only concurrent users
() short queries
() long queries
() wild card queries
() range queries
() queries with rare words
() queries with common words
() tokenization/analysis only (above indexing measurements include tokenization, but it would be important to be able to "prove" to oneself that tokenization/analysis time is not hurt by  a recent change).

() parametric control over:
() () location of test input data.
() () location of output index.
() () location of output log/results.
() ()  total collection size (total number of bytes/characters read from collection)
() () document (average) size (bytes/chars) - test can break input data and recompose it into documents of desired size.
() () "implicit iteration size" - merge-factor, max-buffered-docs
() () "explicit iteration size" - how often the perf test calls
() () long queries text
() () short queries text
() () which parts of the test framework capabilities to run
() () number of users / threads.
() () queries pace - how many queries are fired in, say, a minute.

Additional points:
() Would help if all test run parameters are maintained in a properties (or XML config) file, so one can easily modify the test input/output without having to recompile the code (a tiny sketch of this follows the list below).
() Output to allow easy creation of graphs or so - perhaps best would be to have a result object, so others can easily extend with additional output formats.
() index size as part of output.
() number of index files as part of output (?)
() indexing input module that can loop over the input collection. This allows testing indexing of a collection larger than the actual input collection being used.
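
For instance, the properties-driven setup could be as simple as this (property names are invented here, just to illustrate):

  import java.io.FileInputStream;
  import java.util.Properties;

  // Load run parameters from a file so tests can change without recompiling.
  class BenchmarkConfig {
    static Properties load(String path) throws Exception {
      Properties props = new Properties();
      FileInputStream in = new FileInputStream(path);
      props.load(in);
      in.close();
      return props;
    }
  }

  // Example usage (names are made up):
  //   Properties p = BenchmarkConfig.load("bench.properties");
  //   String inputDir = p.getProperty("input.dir", "work/reuters-out");
  //   int mergeFactor = Integer.parseInt(p.getProperty("merge.factor", "10"));
  //   int maxBuffered = Integer.parseInt(p.getProperty("max.buffered.docs", "10"));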



> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


RE: file format inconsistency (any answer?) IMPORTANT

Posted by Samir Abdou <Sa...@unine.ch>.
When a field stores the offsets and positions of its terms within term vectors
(in the .tvf file), these are not described in the file format documentation.

But if you look at the writeField() method in TermVectorsWriter, you'll see
that when offsets and positions are requested, they are written to the .tvf
file.
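
For example (just a sketch - the field name and setup are only for illustration), a field indexed with positions and offsets can be read back like this:

  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.TermPositionVector;

  // Sketch: index a field with WITH_POSITIONS_OFFSETS, then read them back.
  class TermVectorExample {
    static Field bodyField(String text) {
      return new Field("body", text, Field.Store.YES, Field.Index.TOKENIZED,
                       Field.TermVector.WITH_POSITIONS_OFFSETS);
    }

    static void dump(IndexReader reader, int docId) throws Exception {
      TermPositionVector tv =
          (TermPositionVector) reader.getTermFreqVector(docId, "body");
      String[] terms = tv.getTerms();
      for (int i = 0; i < terms.length; i++) {
        System.out.println(terms[i]
            + " positions=" + tv.getTermPositions(i).length
            + " offsets=" + tv.getOffsets(i).length);
      }
    }
  }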

Hope this will help you,

Samir
 

-----Original Message-----
From: Chris Hostetter [mailto:hossman_lucene@fucit.org]
Sent: Wednesday, 15 November 2006 19:36
To: java-dev@lucene.apache.org; samir.abdou@unine.ch
Subject: Re: file format inconsistency

: There is an inconsistency between the files format page (from Lucene
: website) and the source code. It concerns the positions and offsets of term
: vectors. It seems that documentation (website) is not up to date. According
: to the file format page, offsets and positions are not stored! Is that
: correct?

can you cite exactly what about the fileformats doc leads you to believe
this? ... a quick search for "offsets" and "positions" finds these lines
for me...

 If the third lowest-order bit is set (0x04), term positions are stored with
the term vectors.
 If the fourth lowest-order bit is set (0x08), term offsets are stored with
the term vectors.

...and that's just to start with.

-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: file format inconsistency

Posted by Chris Hostetter <ho...@fucit.org>.
: There is an inconsistency between the files format page (from Lucene
: website) and the source code. It concerns the positions and offsets of term
: vectors. It seems that documentation (website) is not up to date. According
: to the file format page, offsets and positions are not stored! Is that
: correct?

can you cite exactly what about the fileformats doc leads you to believe
this? ... a quick search for "offsets" and "positions" finds these lines
for me...

 If the third lowest-order bit is set (0x04), term positions are stored with the term vectors.
 If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors.

...and that's just to start with.

-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


file format inconsistency

Posted by Samir Abdou <Sa...@unine.ch>.
Hi, 

There is an inconsistency between the files format page (from Lucene
website) and the source code. It concerns the positions and offsets of term
vectors. It seems that documentation (website) is not up to date. According
to the file format page, offsets and positions are not stored! Is that
correct?

Many thanks,

Samir


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Doron Cohen updated LUCENE-675:
-------------------------------

    Attachment: taskBenchmark.zip

Attached taskBenchmark.zip as described earlier.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449779 ] 
            
Doron Cohen commented on LUCENE-675:
------------------------------------

Good point on names with numbers - I'm renaming the package to taskBenchmark, as I think of it as "task sequence" based, more than properties based.


> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Paul Smith (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436443 ] 
            
Paul Smith commented on LUCENE-675:
-----------------------------------

From a strict performance point of view, a standard set of documents is important, but don't forget other languages.

From a tokenization point of view (separate from this issue), perhaps the Gutenberg project would be useful for testing the correctness of the analysis phase.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12447348 ] 
            
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

To run, check out contrib/benchmark and then apply the benchmark.patch in the contrib/benchmark directory.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436934 ] 
            
Mike Klaas commented on LUCENE-675:
-----------------------------------

A few notes on benchmarks:

First, it is important to realize that no benchmark will ever fully capture all aspects of Lucene performance, particularly since real-world data distributions are so varied.  That said, benchmarks are useful tools, especially if they are componentized to measure various aspects of Lucene performance (the narrower the goal of the benchmark, the better the benchmark can be).

It is rather unrealistic to expect to standardize hardware / OS ... better to compare before/after numbers on a single configuration, rather than comparing numbers across configurations.  The test process _is_ important, but anything crucial should be built into the test (like the number of iterations, taking the average, etc.).  Concerning the specifics of this: requiring reboots is onerous and not an important criterion (at least for Unix systems--I'm not sufficiently familiar with Windows to comment).  Better to stipulate a relatively quiescent machine.  Or perhaps not--it might be useful to see how machine load affects Lucene performance.  Also, the arithmetic mean is a terrible way of combining results due to its emphasis on outliers.  Better is the average over minimum times of small sets of runs.
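
As a sketch of that combining rule (names are only illustrative):

  // Split the run times into small sets, take each set's minimum, average those.
  class Timing {
    static double averageOfSetMinimums(long[] runTimesMillis, int setSize) {
      double sumOfMins = 0;
      int sets = 0;
      for (int start = 0; start < runTimesMillis.length; start += setSize) {
        long min = Long.MAX_VALUE;
        int end = Math.min(start + setSize, runTimesMillis.length);
        for (int i = start; i < end; i++) {
          min = Math.min(min, runTimesMillis[i]);
        }
        sumOfMins += min;
        sets++;
      }
      return sets == 0 ? 0.0 : sumOfMins / sets;
    }
  }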

Of course, any scheme has its problems.  In general, the most important thing when using benchmarks is being aware of the limitations of the benchmark and methodology used.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449450 ] 
            
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

I'm not a big fan of tacking a number onto the end of Java names, as it doesn't tell you much about what's in the file or package.  How about ConfigurableBenchmarker or PropertyBasedBenchmarker or something along those lines, since what you are proposing is a property-based one?  I think it can just go in the benchmark package, or you could make a subpackage under there that is more descriptive.

I will try to commit tonight or tomorrow morning.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12441281 ] 
            
Doug Cutting commented on LUCENE-675:
-------------------------------------

As Marvin points out, quick micro-benchmarks are great to have.  But other effects only show up when things get very large.  So I think we need at least two baselines: micro and macro.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Marvin Humphrey (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12440994 ] 
            
Marvin Humphrey commented on LUCENE-675:
----------------------------------------

The indexing benchmarking apps I wrote take command line arguments for how many docs and how many reps.  My standard test is to do 1000 docs and 6 reps.  Within a couple seconds the first rep is done and the app is printing out results.  For rapid development, having something that speedy is really handy.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by Doron Cohen <DO...@il.ibm.com>.
Steven Rowe <sa...@syr.edu> wrote on 16/11/2006 14:21:37:
> I haven't tried to use TortoiseSVN to create patches, but my experience
> with it for other purposes has been negative enough, especially in
> trying to use it on the same working copy on which I use a Cygwin
> command-line version of SVN, that I have kicked TortoiseSVN off my
> computer, and only use the command-line client.

Thanks Steven.
I have to agree with you. Much as I liked the Windows Explorer integration,
the cost in Windows Explorer refresh delays and the fishy behavior of not
being able to apply a patch that it itself created are too much. I am
following you and moving to the svn & patch command-line utilities within
Cygwin - this is working!

Thanks for your help,
Doron

>
> The cute little icon overlays never seemed to be in sync with reality
> anyway, even after refreshing Windows Explorer, and GUIs over
> command-line functionality always make me a little paranoid about what
> seems like a loss of control, so I haven't mourned its loss too much.
>
> Steve


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by Steven Rowe <sa...@syr.edu>.
Doron Cohen (JIRA) wrote:
>      [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]
> (I used Tortoise SVN to create the patch).

I haven't tried to use TortoiseSVN to create patches, but my experience
with it for other purposes has been negative enough, especially in
trying to use it on the same working copy on which I use a Cygwin
command-line version of SVN, that I have kicked TortoiseSVN off my
computer, and only use the command-line client.

The cute little icon overlays never seemed to be in sync with reality
anyway, even after refreshing Windows Explorer, and GUIs over
command-line functionality always make me a little paranoid about what
seems like a loss of control, so I haven't mourned its loss too much.

Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by Daniel John Debrunner <dj...@apache.org>.
Doron Cohen wrote:

> And, for applying the patch, again from cygwin shell:
> - patch  -p0  -i  patchFile.patch
> 
> The -p0 was also something that I was missing.

FYI, I had endless issues using 'patch' on patches from contributors on 
the Derby project, depending on whether they were running Windows or Linux etc.

Once I switched to applying patches from Eclipse (and the Subclipse 
plugin) all such problems went away.

(right click on a project/folder/file), Team->Apply patch

Dan.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by Doron Cohen <DO...@il.ibm.com>.
Paul Elschot <pa...@xs4all.nl> wrote on 16/11/2006 14:03:23:

> Did you "svn add" the new files locally before doing "svn diff"?
>
> It took me a quite while to get the hang of that. It is mentioned here:
> http://wiki.apache.org/jakarta-lucene/HowToContribute

Thanks Paul,

In fact TortoiseSVN does the Add for you automatically when you patch
"unspecified" files. But since it later shows confusing warnings, and fails
to apply the patch that it(self) created, I decided to stop using it.

For the benefit of others, here is what finally works for me on Windows,
not only for modifying existing files but also for adding files and
directories:
- install cygwin (one-time setup), including its svn and patch packages.
- checkout the Lucene sources within the cygwin shell using its svn (current
version is 1.3.2):
  svn co http://svn.apache.org/repos/asf/lucene/java/trunk
- ... work the code ...
- svn status - to see what has changed
- svn add file/dir - to add the new files and directories (that I created)
to my local working copy
- svn diff > patchFile.patch

And, for applying the patch, again from cygwin shell:
- patch  -p0  -i  patchFile.patch

The -p0 was also something that I was missing.

Thanks,
Doron


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by Paul Elschot <pa...@xs4all.nl>.
Doron,

On Thursday 16 November 2006 21:17, Doron Cohen (JIRA) wrote:
> ...
> PS. Before submitting the patch file, I tried to apply it myself on a clean 
version of the code, just to make sure that it works. But I got errors like 
this -- Could not retrieve revision 0 of "...\byTask\.." -- for every file 
under a new folder. So I am not sure if it is just my (Windows) svn patch 
applying utility, or is it really impossible to apply a patch that creates 
files in (yet) nonexistent directories.  I searched Lucene mailing lists and 
SVN mailing lists and went again through the SVN book again but nowhere could 
I find what is the expected behavior for applying a patch containing new 
directories. In fact, "svn diff" would not even show you files that are new

Did you "svn add" the new files locally before doing "svn diff"?

It took me quite a while to get the hang of that. It is mentioned here:
http://wiki.apache.org/jakarta-lucene/HowToContribute

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Doron Cohen updated LUCENE-675:
-------------------------------

    Attachment: benchmark.byTask.patch

I am attaching benchmark.byTask.patch - to be applied in the contrib/benchmark directory. 

The root package of the byTask classes was modified to org.apache.lucene.benchmark.byTask, along the lines of Grant's suggestion - this seems better because it keeps all benchmark classes under 
lucene.benchmark.

I added a sample .alg under conf and added some documentation. 

Entry point - documentation wise - is the package doc for org.apache.lucene.benchmark.byTask.

Thanks for any comments on this!

PS. Before submitting the patch file, I tried to apply it myself on a clean version of the code, just to make sure that it works. But I got errors like this -- Could not retrieve revision 0 of "...\byTask\.." -- for every file under a new folder. So I am not sure whether it is just my (Windows) svn patch applying utility, or whether it is really impossible to apply a patch that creates files in (yet) nonexistent directories.  I searched the Lucene mailing lists and the SVN mailing lists and went through the SVN book again, but nowhere could I find what the expected behavior is for applying a patch containing new directories. In fact, "svn diff" would not even show you files that are new (again, this is the Windows svn 1.4.2 version). (I used Tortoise SVN to create the patch). This is rather annoying and I might be misunderstanding something basic about SVN, but I thought it'd be better to share this experience here - it might save some time for others trying to apply this patch or other patches
 ...

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.byTask.patch, benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436502 ] 
            
Karl Wettin commented on LUCENE-675:
------------------------------------

It is also interesting to know how much time is consumed assembling an instance of Document from storage. According to my own tests this is the major reason why InstantiatedIndex is so much faster than a FS/RAMDirectory. I also presume it to be the bottleneck of any RDBMS-, RMI- or other "proxy"-based storage. 

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12447781 ] 
            
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

1st run downloaded the documents from the Web before starting to index. 
2nd run started right off - as input docs are already in place - great. 

Seems the only output is what is printed to stdout, right? 


GSI: The Benchmarker interface does return the TimeData, so other implementations, etc. could use the results programmatically.



I very much like the logic of loading test data from the Web, and the scaleUp and maximumDocumentsToIndex params are handy. 

It seems that all the test logic and some of its data (queries) are coded in Java. I initially thought of a setting where we define tasks/jobs that are parameterized, like:

- createIndex(params)
- writeToIndex(params):
  - addDocs()
  - optimize()
- readFromIndex(params):
  - searchIndex()
  - fetchData()


GSI: I definitely agree that we want a more flexible one to meet people's benchmarking needs.  I wanted at least one test that is "standard" in that you can't change the parameters and test cases, so that we can all be on the same page on a run.  Then, when people are having discussions on performance they can say "I ran the standard benchmark before and after and here are the results" and we all know what they are talking about.  I think all the components are there for a parameterized version; all it takes is for someone to extend the Standard one or implement their own that reads in a config file.  I will try to put in a fully parameterized version soon.  


GSI: Thanks for the fixes, I will incorporate into my version and post another patch soon.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Dawid Weiss (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436972 ] 
            
Dawid Weiss commented on LUCENE-675:
------------------------------------

First -- I think it's a good initiative. Grant, when you're thinking about the infrastructure, it would be pretty neat to have a way of logging performance so that one could draw charts from the results. You know, for the visual folks :)

Anyway, my other idea is that benchmarking Lucene can be performed on two levels: one is the user level, where the entire operation counts (such as indexing, searching etc). Another aspect is measurement of atomic parts _within_ the big operation, so that you know how much of the whole thing each subpart takes. I once wrote an interesting piece of code that allows measuring times for named operations (per-thread) in a recursive way. It looks something like this:

perfLogger.start("indexing");
try {
  .. code (with recursion etc)  ...
  perfLogger.start("subpart");
  try { 

  } finally {
     perfLogger.stop();
  }
} finally {
  perfLogger.stop();
}

in the output you get something like this:

indexing: 5 seconds;
   -> subpart: 2 seconds;
   -> ...

Of course everything comes at a price and the above logging costs some CPU cycles (my implementation stored a nesting stack in ThreadLocals).

One can always put that code in 'if' clauses attached to final variables and enable logging only for benchmarking targets (the compiler will get rid of logging statements then).

If folks are interested I can dig out that performance logger and maybe adapt it to what Grant comes up with.
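
To make the per-thread nesting idea concrete, here is a minimal sketch of such a logger (class and method names are assumptions, not the actual code; a real implementation might buffer the frames and print the whole tree at the end rather than on each stop()):

import java.util.ArrayList;
import java.util.List;

public class PerfLogger {

    // compile-time constant guard: code wrapped in "if (ENABLED) ..." can be
    // dropped entirely by the compiler when this is set to false
    public static final boolean ENABLED = true;

    // per-thread stack of open {name, startTime} frames
    private static final ThreadLocal STACK = new ThreadLocal() {
        protected Object initialValue() { return new ArrayList(); }
    };

    public static void start(String name) {
        List stack = (List) STACK.get();
        stack.add(new Object[] { name, new Long(System.currentTimeMillis()) });
    }

    public static void stop() {
        List stack = (List) STACK.get();
        Object[] frame = (Object[]) stack.remove(stack.size() - 1);
        long elapsed = System.currentTimeMillis() - ((Long) frame[1]).longValue();
        StringBuffer indent = new StringBuffer();
        for (int i = 0; i < stack.size(); i++) indent.append("   ->");
        System.out.println(indent + (String) frame[0] + ": " + elapsed + " ms");
    }
}

Guarding the calls, e.g. if (PerfLogger.ENABLED) PerfLogger.start("indexing"), is the 'final variable' trick mentioned above: with ENABLED set to false the compiler removes the guarded statements.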

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-675:
-------------------------------

    Attachment: byTask.jre1.4.patch.txt

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: https://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>            Priority: Minor
>         Attachments: benchmark.byTask.patch, benchmark.patch, BenchmarkingIndexer.pm, byTask.2.patch.txt, byTask.jre1.4.patch.txt, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12450041 ] 
            
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

Committed the benchmark patch plus Doron's update to TestData and TimeData

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: [jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by Chris Hostetter <ho...@fucit.org>.
: Oops... I had the impression that compiling with compliance level 1.4 is
: sufficient to prevent this, but guess I need to read again what that
: compliance level setting guarantees exactly.

NOTE: see LUCENE-718 for an explanation of your problem, and a possible
solution i've been toying with.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463830 ] 

Doron Cohen commented on LUCENE-675:
------------------------------------

Oops... I had the impression that compiling with compliance level 1.4 is sufficient to prevent this, but I guess I need to read again what that compliance level setting guarantees exactly. 

Anyhow there are 3 things that require 1.5:
 - Boolean.parseBoolean() --> Boolean.valueOf().booleanValue()
 - String.contains() --> indexOf()
 - Class.getSimpleName() --> ?

Modifying Class.getSimpleName() to Class.getName() would not be very nice - query printouts and task name printouts would be quite ugly. To fix that I added a method simpleName(Class) to byTask.util.Format. I am attaching an updated patch - byTask.jre1.4.patch.txt - that includes this method and removes the Java 1.5 dependency.
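
A sketch of what the 1.4-compatible replacements could look like (the helper class name here is an assumption; the actual byTask.util.Format code may differ):

public class Jdk14Compat {

    // replacement for Class.getSimpleName()
    public static String simpleName(Class cls) {
        String name = cls.getName();
        int cut = Math.max(name.lastIndexOf('.'), name.lastIndexOf('$'));
        return cut < 0 ? name : name.substring(cut + 1);
    }

    // replacement for Boolean.parseBoolean(s)
    public static boolean parseBoolean(String s) {
        return Boolean.valueOf(s).booleanValue();
    }

    // replacement for String.contains(sub)
    public static boolean contains(String s, String sub) {
        return s.indexOf(sub) >= 0;
    }
}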

Thanks for catching this!
Doron

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: https://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>            Priority: Minor
>         Attachments: benchmark.byTask.patch, benchmark.patch, BenchmarkingIndexer.pm, byTask.2.patch.txt, byTask.jre1.4.patch.txt, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Assigned: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Grant Ingersoll reassigned LUCENE-675:
--------------------------------------

    Assignee: Grant Ingersoll

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449947 ] 
            
Doron Cohen commented on LUCENE-675:
------------------------------------

Would be nice to get some feedback on what I already have at this point for the "task based benchmark framework for Lucene".  

So I am packing it as a zip file. I would probably resubmit as a patch when Grant commits the current benchmark code.
See attached taskBenchmark.zip.

To try out taskBenchmark, unzip under contrib/benchmark, on top of Grant's benchmark.patch.
This makes 3 changes:

1. replace build.xml - only change there is adding two targets: run-task-standard and run-task-micro-standard.

2. add 4 new files under conf:
 - task-standard.properties
 - task-standard.alg
 - task-micro-standard.properties
 - task-micro-standard.alg

3. add a src package 'taskBenchmark' side by side with current 'benchmark' package.

To try it out, go to contrib/benchmark and try 'ant run-task-standard' or 'ant run-task-micro-standard'. 

See inside the .alg files for how a test is specified.

The algorithm syntax and the entire package are documented in the package javadoc for taskBenchmark (package.html). 

Regards,
Doron

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436516 ] 
            
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

Since this has dependencies, do you think we should put it under contrib?  I would be for a Performance directory and we could then organize it from there.  Perhaps into packages for quantitative and qualitative performance.  

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449117 ] 
            
Doron Cohen commented on LUCENE-675:
------------------------------------

I looked at extending the benchmark with:
- different test "scenarios", i.e. other sequences of operations.
- multithreaded tests, e.g. several queries in parallel.
- rate of events, e.g. "2 queries arriving per second", or "one query per second in parallel with 20 new documents in a minute".
- different data sources (input documents, queries).

For this I made lots of changes to the benchmark code, using parts of it and rewriting other parts. 
I would like to submit this code in a few days - it is running already but some functionality is missing.

I would like to describe how it works to hopefully get early feedback. 

There are several "basic tasks" defined - all extending an (abstract) class PerfTask:
- AddDocTask
- OptimizeTask
- CreateIndexTask
etc. 

To further extend the benchmark 'framework', new tasks can be added. Each task must implement the abstract method: doLogic(). For instance, in AddDocTask this method (doLogic) would call indexWriter.addDocument().
There are also setup() and tearDown() methods for performing work that should not be timed for that task. 
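
A rough sketch of that task shape (the exact signatures, and how the writer and documents are obtained, are assumptions here - the actual byTask classes may differ):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// the abstract base: setup()/tearDown() are untimed, doLogic() is timed
abstract class PerfTask {
    public void setup() throws Exception {}
    public abstract int doLogic() throws Exception;
    public void tearDown() throws Exception {}
}

public class AddDocTask extends PerfTask {
    private final IndexWriter writer;
    private Document doc;

    public AddDocTask(IndexWriter writer) { this.writer = writer; }

    public void setup() throws Exception {
        doc = new Document();        // placeholder: a real task would pull the next doc from a source
    }

    public int doLogic() throws Exception {
        writer.addDocument(doc);     // the timed work: one addDocument() call
        return 1;
    }
}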

A special TaskSequence task contains other tasks. It is either parallel or sequential, which tells whether it executes its child tasks serially or in parallel. 
TaskSequence also supports "rate": the pace at which its child tasks are "fired" can be controlled.

With these tasks, it is possible to describe a performance test 'algorithm' in a simple syntax.
('algorithm' may be too big a word for this...?)

A test invocation takes two parameters: 
- test.properties - file with various config properties.
- test.alg - file with the algorithm.

By convention, for each task class "OpNameTask", the command "OpName" is valid in test.alg.

Adding a single document is done by:
    AddDoc

Adding 3 documents:
   AddDoc
   AddDoc
   AddDoc

Or, alternatively:
   { AddDoc } : 3

So, '{' and '}' indicate a serial sequence of (child) tasks. 

To fire 100 queries in a row:
  { Search } : 100

To fire 100 queries in parallel:
  [ Search ] : 100

So, '[' and ']' indicate a parallel group of tasks. 

To fire 100 queries in a row, 2 queries per second (120 per minute):
  { Search } : 100 : 120

Similar, but in parallel:
  [ Search ] : 100 : 120

A sequence task can be named for identifying it in reports:
  { "QueriesA" Search } : 100 : 120

And there are tasks that create reports. 

There are more tasks, and more to tell on the alg syntax, but this post is already long..

I find this quite powerful for perf testing.
What do you (and you) think?

- Doron


> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462402 ] 

Doron Cohen commented on LUCENE-675:
------------------------------------

This update of the byTask package includes:
- allowing a perf test to be tailored "programmatically" (without an .alg file).
- maintaining both the "algorithm" and the run-properties in a single .alg file - this is easier to maintain in my opinion.
- some code cleanup.
- build.xml has a single "task related" target now: run-task. An ant property is used to invoke other .alg files.
- documentation updated (package docs under byTask).

To apply the patch from the trunk dir:   patch -p0 -i <byTask.2.patch.txt>
To test it, cd to contrib/benchmark and type:  ant run-task

Grant, I noticed that the patch file contains EOL characters - a Unix/DOS thing I guess.
But 'patch' works cleanly for me either with or without these characters, so I am leaving them there.
I hope this patch applies cleanly for you.


> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: https://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>            Priority: Minor
>         Attachments: benchmark.byTask.patch, benchmark.patch, BenchmarkingIndexer.pm, byTask.2.patch.txt, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449409 ] 
            
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

OK, how about I commit my changes, then you can add a patch that shows your ideas?


> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12449419 ] 
            
Doron Cohen commented on LUCENE-675:
------------------------------------

Sounds good.

In this case I will add my stuff under a new package: org.apache.lucene.benchmark2 (this package would have no dependencies on org.apache.lucene.benchmark). I will also add targets in build.xml, and add .alg and .properties files under conf.
Makes sense?

Do you already know when you are going to commit it?

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436519 ] 
            
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

Yeah, ANT can do this, I think.  Take a look at the DB contrib package, it downloads.  I think I can set up the necessary stuff in contrib, if people think that is a good idea.  First contribution will be this file and then we can go from there.  I think Otis has run some perf. stuff too, but I am not sure if it can be contributed.  I think someone else has really studied query perf. so it would be cool if that was added too.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Paul Smith (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436437 ] 
            
Paul Smith commented on LUCENE-675:
-----------------------------------

If you're looking for freely available text in bulk, what about:

http://www.gutenberg.org/wiki/Main_Page

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436442 ] 
            
Andrzej Bialecki  commented on LUCENE-675:
------------------------------------------

Yes, that could be a good additional source. However, IMHO the primary corpus should be widely known and standardized, hence my proposal of the Reuters collection.

(I mistakenly copy&paste-d the urls in the comment above - of course the corpus they're pointing at is the "20 Newsgroups", not the Reuters one. Correct url for the Reuters corpus is  http://www.daviddlewis.com/resources/testcollections/reuters21578/ ).

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Attachments: LuceneBenchmark.java
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Doron Cohen updated LUCENE-675:
-------------------------------

    Attachment: tiny.alg
                tiny.properties

I am attaching a sample tiny.* - the .alg and .properties files I currently use - I think they may help to understand how this works.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: http://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>         Attachments: benchmark.patch, BenchmarkingIndexer.pm, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved LUCENE-675.
------------------------------------

    Resolution: Fixed

I have committed a baseline benchmarking suite, thanks to Doron and Andrzej.  Bugs specific to the code in the contrib area can now be opened.

> Lucene benchmark: objective performance test for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-675
>                 URL: https://issues.apache.org/jira/browse/LUCENE-675
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>         Assigned To: Grant Ingersoll
>            Priority: Minor
>         Attachments: benchmark.byTask.patch, benchmark.patch, BenchmarkingIndexer.pm, byTask.2.patch.txt, byTask.jre1.4.patch.txt, extract_reuters.plx, LuceneBenchmark.java, LuceneIndexer.java, taskBenchmark.zip, timedata.zip, tiny.alg, tiny.properties
>
>
> We need an objective way to measure the performance of Lucene, both indexing and querying, on a known corpus. This issue is intended to collect comments and patches implementing a suite of such benchmarking tests.
> Regarding the corpus: one of the widely used and freely available corpora is the original Reuters collection, available from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz or http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/20news-18828.tar.gz. I propose to use this corpus as a base for benchmarks. The benchmarking suite could automatically retrieve it from known locations, and cache it locally.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463792 ] 

Grant Ingersoll commented on LUCENE-675:
----------------------------------------

Hey Doron, 

Your patch uses JDK 1.5.  I am assuming it is safe to use Class.getName in place of Class.getSimpleName, right?  I think once I do that plus change the String.contains calls to String.indexOf it should all be fine, right?  I have it compiling and running, so that is a good sign.  I will look to commit soon.

-Grant



[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464410 ] 

Grant Ingersoll commented on LUCENE-675:
----------------------------------------

Doron, 

I have committed your additions.  This truly is great stuff.  Thank you so much for contributing.  The documentation (code and package level) is well done, and the output is very readable.  The alg language is a bit cryptic and takes a little deciphering, but you document it quite nicely.  I like the extensibility, and I think it will make it easier for people to contribute benchmarking capabilities.

I would love to see someone modify the reporting mechanism in the future to allow printing info to something other than System.out, as I know people have expressed interest in being able to pull the output into Excel or similar number-crunching tools.  This could also open up the possibility of running some of the algorithms nightly and then integrating with JUnitPerf or some other performance unit testing approach.
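
As a rough illustration only (the file name and columns below are made up, not the benchmark's actual report format), the same kind of report could be written straight to a CSV file that Excel can open:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class CsvReport {
        public static void main(String[] args) throws IOException {
            // Write to a file instead of System.out; column names and the
            // placeholder row below are illustrative, not real measurements.
            PrintWriter out = new PrintWriter(new FileWriter("benchmark-report.csv"));
            try {
                out.println("operation,runCount,recsPerRun,recPerSec,avgUsedMemMB");
                out.println("AddDoc,1,0,0.0,0.0"); // placeholder values only
            } finally {
                out.close();
            }
        }
    }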

We may want to consider deprecating the other benchmarking stuff, although I suppose it can't hurt to have multiple opinions in this area.

At any rate, this is very much appreciated.  I would encourage everyone who is interested in benchmarking to take a look and provide feedback.  I'm going to mark this bug as finished for now as I think we have a good baseline for benchmarking at this point.

Thanks again,
Grant






[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-675?page=all ]

Andrzej Bialecki  updated LUCENE-675:
-------------------------------------

    Attachment: LuceneBenchmark.java

This is just a starting point for discussion - it's a pretty old file I found lying around, so it may not even compile with modern Lucene. Requires commons-compress.
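
For anyone trying it, unpacking a downloaded corpus tarball with Commons Compress looks roughly like the sketch below. Note this uses the current Commons Compress class names, which may differ from what the attached file was written against, and the paths are examples only.

    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
    import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

    public class ExtractCorpus {
        public static void main(String[] args) throws IOException {
            // Example paths; point these at the downloaded corpus archive.
            File tarGz = new File("reuters21578.tar.gz");
            File outDir = new File("work/reuters");
            outDir.mkdirs();

            InputStream in = new GzipCompressorInputStream(
                    new BufferedInputStream(new FileInputStream(tarGz)));
            TarArchiveInputStream tar = new TarArchiveInputStream(in);
            try {
                TarArchiveEntry entry;
                byte[] buf = new byte[8192];
                while ((entry = tar.getNextTarEntry()) != null) {
                    File target = new File(outDir, entry.getName());
                    if (entry.isDirectory()) {
                        target.mkdirs();
                        continue;
                    }
                    target.getParentFile().mkdirs();
                    OutputStream out = new FileOutputStream(target);
                    try {
                        int n;
                        while ((n = tar.read(buf)) != -1) {
                            out.write(buf, 0, n);
                        }
                    } finally {
                        out.close();
                    }
                }
            } finally {
                tar.close();
            }
        }
    }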



[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436858 ] 
            
Michael McCandless commented on LUCENE-675:
-------------------------------------------

I think this is an incredibly important initiative: with every
non-trivial change to Lucene (eg lock-less commits) we must verify
performance did not get worse.  But, as things stand now, it's an
ad-hoc thing that each developer needs to do.

So (as a consumer of this), I would love to have a ready-to-use
standard test that I could run to check if I've slowed things down
with lock-less commits.

In the meantime I've been using Europarl for my testing.

It is also important to realize that there are many dimensions to test.  With
lock-less I'm focusing entirely on "wall clock time to open readers
and writers" in different use cases like pure indexing, pure
searching, highly interactive mixed indexing/searching, etc.  And this
is actually hard to test cleanly because in certain cases (highly
interactive case, or many readers case), the current Lucene hits many
"commit lock" retries and/or timeouts (whereas lock-less doesn't).  So
what's a "fair" comparison in this case?

In addition to standardizing on the corpus, I think we ideally need
standardized hardware / OS / software configuration as well, so the
numbers are easily comparable across time.  Even the test process
itself is important, eg details like "you should reboot the box before
each run" and "discard the results from the first run, then take the average
of the next 3 runs as your result".  It would be wonderful if
we could get this into a nightly automated regression test so we could
track over time how the performance has changed (and, for example,
quickly detect accidental regressions).  We should probably open this
as a separate issue which depends first on this issue being complete.
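
As a sketch of that last point (names here are hypothetical, just to illustrate "discard the first run, average the rest"), the harness could wrap whatever indexing or searching job is being measured like so:

    public class TimedRuns {

        // Run the job (runs + 1) times, discard the first (warm-up) run, and
        // report the average wall-clock time of the remaining runs.
        static double averageMillis(Runnable job, int runs) {
            job.run(); // warm-up run, result discarded
            long total = 0;
            for (int i = 0; i < runs; i++) {
                long start = System.currentTimeMillis();
                job.run();
                total += System.currentTimeMillis() - start;
            }
            return (double) total / runs;
        }

        public static void main(String[] args) {
            Runnable job = new Runnable() { // stand-in for a real indexing/searching job
                public void run() {
                    for (int i = 0; i < 1000000; i++) { Math.sqrt(i); }
                }
            };
            System.out.println("avg ms over 3 measured runs: " + averageMillis(job, 3));
        }
    }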




[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462287 ] 

Doron Cohen commented on LUCENE-675:
------------------------------------

Grant, thanks for trying this out - I will update the patch shortly. 
I am using this for benchmarking - it is quite easy to add new stuff - and in fact I added some things lately but did not update here because I wasn't sure whether others were interested.
I will verify what I have against svn head and attach it here as an updated patch.
Regards,
Doron



[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-675:
-----------------------------------

    Priority: Minor  (was: Major)



[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12440990 ] 
            
Grant Ingersoll commented on LUCENE-675:
----------------------------------------

OK, I have a preliminary implementation based on adapting Andrzej's approach.  The interesting thing about this approach is that it is easy to make more or less exhaustive (i.e. how many of the parameters one wishes to have the system alter as it runs).  Thus, you can have it change the merge factors, max buffered docs, number of documents indexed, number of different queries run, etc.  The tradeoff, of course, is the length of time it takes to run these combinations.
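
Roughly, the sweep looks like the sketch below; indexAndTime() is a hypothetical stand-in for the actual indexing run, and the parameter values are arbitrary:

    public class ParameterSweep {

        public static void main(String[] args) throws Exception {
            int[] mergeFactors    = { 10, 100, 1000 };
            int[] maxBufferedDocs = { 10, 100, 1000 };
            int[] docCounts       = { 2000, 20000 };

            // Try every combination and report the elapsed time for each.
            for (int m = 0; m < mergeFactors.length; m++) {
                for (int b = 0; b < maxBufferedDocs.length; b++) {
                    for (int d = 0; d < docCounts.length; d++) {
                        long ms = indexAndTime(mergeFactors[m], maxBufferedDocs[b], docCounts[d]);
                        System.out.println("mergeFactor=" + mergeFactors[m]
                                + " maxBufferedDocs=" + maxBufferedDocs[b]
                                + " docs=" + docCounts[d] + " -> " + ms + " ms");
                    }
                }
            }
        }

        // Hypothetical helper: build an index with the given settings over
        // 'docs' documents and return the elapsed wall-clock milliseconds.
        static long indexAndTime(int mergeFactor, int maxBufferedDocs, int docs) throws Exception {
            long start = System.currentTimeMillis();
            // ... IndexWriter setup and addDocument() loop would go here ...
            return System.currentTimeMillis() - start;
        }
    }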

So my question to those interested is: what is a good baseline running time for testing in a standard way?  My initial thought is to have something that takes between 15 and 30 minutes to run, but I am not sure about this.  Another approach would be to have three "baselines":  1. quick validation (5 minutes to run...), 2. standard (15-45 minutes), 3. exhaustive (1-10 hours).

I know several others have built benchmarking suites for their internal use; what has been your strategy?

Thoughts, ideas, insights?

Thanks,
Grant



[jira] Updated: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-675:
-------------------------------

    Attachment: byTask.2.patch.txt



[jira] Commented: (LUCENE-675) Lucene benchmark: objective performance test for Lucene

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/LUCENE-675?page=comments#action_12436587 ] 
            
Otis Gospodnetic commented on LUCENE-675:
-----------------------------------------

I still haven't gotten my employer to sign and fax the CCLA, so I'm stuck and can't contribute my search benchmark.

I have a suggestion for a name for this - Lube, for Lucene Benchmark - contrib/lube.
