
Posted to java-user@lucene.apache.org by Kelvin Tan <ke...@relevanz.com> on 2002/05/03 08:47:48 UTC

Indexing performance benchmarks

I think it would be really useful if users could post performance
benchmarks for their use of Lucene in their apps. I know it's been done
informally on an ad hoc basis by various people in the past, but I'd like to
propose a standardized format:

Number of source documents:
Total filesize of source documents:
Average filesize of source documents (in KB/MB):
Source documents storage location (filesystem, DB, http, etc.):
File type of source documents:
Parser(s) used, if any:
Time taken (in ms/s as an average of at least 3 indexing runs):
Notes (any special tuning/strategies):

This will really help users know what performance to expect when indexing,
and should help raise warning flags when their indexing times aren't similar
to the benchmarks. Anyone want to start? :)

Regards,
Kelvin


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: Indexing performance benchmarks

Posted by Kelvin Tan <ke...@relevanz.com>.

Excellent. Otis suggested adding the number and type of fields as well. I'd
really like to consolidate these figures and put them somewhere on the
website, if that's OK with the people who contribute. Thanks, Peter, for
adding the hardware stats and the others.

Here's an updated and sorted list:

<benchmark>
Hardware environment
Dedicated machine for indexing (yes/no):
CPU (Type, Speed and Quantity):
RAM:
Drive configuration (IDE, SCSI, RAID-1, RAID-5):

Software environment
Java Version:
OS Version:
Location of index directory (local/network):

Lucene indexing variables
Number of source documents:
Total filesize of source documents:
Average filesize of source documents (in KB/MB):
Source documents storage location (filesystem, DB, http, etc.):
File type of source documents:
Parser(s) used, if any:
Analyzer(s) used:
Number of fields per document:
Type of fields:
Index persistence (FSDirectory, SqlDirectory, etc):

Time taken (in ms/s as an average of at least 3 indexing runs):
Time taken / 1000 docs indexed:
Memory consumption:

Notes (any special tuning/strategies):
</benchmark>
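
To make the two timing fields above concrete, here is a minimal Java sketch of how they could be computed. It is only an illustration under stated assumptions: the class and method names (IndexingBenchmark, timeRun, averageSeconds, secondsPer1000Docs) are made up for this example, the indexing run is an empty placeholder, the document count is hypothetical, and it assumes a modern JDK (8 or later) rather than anything Lucene provides.

```java
import java.util.Arrays;

// Hypothetical helper for filling in the "Time taken" fields of the form.
final class IndexingBenchmark {

    /** Wall-clock seconds taken by one indexing run. */
    static double timeRun(Runnable indexingRun) {
        long start = System.nanoTime();
        indexingRun.run();
        return (System.nanoTime() - start) / 1e9;
    }

    /** Average over the runs; the form asks for at least 3 of them. */
    static double averageSeconds(double[] runSeconds) {
        if (runSeconds.length < 3) {
            throw new IllegalArgumentException("need at least 3 indexing runs");
        }
        return Arrays.stream(runSeconds).average().getAsDouble();
    }

    /** "Time taken / 1000 docs indexed": total seconds per thousand documents. */
    static double secondsPer1000Docs(double totalSeconds, long docCount) {
        if (docCount <= 0) {
            throw new IllegalArgumentException("docCount must be positive");
        }
        return totalSeconds / (docCount / 1000.0);
    }

    public static void main(String[] args) {
        // Placeholder: a real run would build the index from the source documents.
        Runnable indexingRun = () -> { };
        long docCount = 50_000; // hypothetical number of source documents

        double[] runs = new double[3];
        for (int i = 0; i < runs.length; i++) {
            runs[i] = timeRun(indexingRun);
        }
        double avg = averageSeconds(runs);
        System.out.printf("Time taken (avg of %d runs): %.3f s%n", runs.length, avg);
        System.out.printf("Time taken / 1000 docs indexed: %.3f s%n",
                secondsPer1000Docs(avg, docCount));
    }
}
```

Averaging several runs smooths out effects like disk caching, and the per-1000-docs figure lets people compare collections of different sizes.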

If you'd like to contribute these stats but wish to remain anonymous, that's
cool too. You can mail me offline or something, and your boss will never
know...:)

----- Original Message -----
From: "Peter Carlson" <ca...@bookandhammer.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Friday, May 03, 2002 9:21 PM
Subject: Re: Indexing performance benchmarks


> I like this idea too.
>
> It would also be good to know about system used
>
> Java Version
> OS Version
> CPU (Type, Speed and Quantity)
> RAM
> Drive configuration (IDE, SCSI, RAID-1, RAID-5)
>
>
>
> On 5/2/02 11:47 PM, "Kelvin Tan" <ke...@relevanz.com> wrote:
>
> >
> > Number of source documents:
> > Total filesize of source documents:
> > Average filesize of source documents (in KB/MB):
> > Source documents storage location (filesystem, DB, http,etc):
> > File type of source documents:
> > Parser(s) used, if any:
> > Time taken (in ms/s as an average of at least 3 indexing runs):
> > Notes (any special tuning/strategies):
>
>



Re: Indexing performance benchmarks

Posted by Peter Carlson <ca...@bookandhammer.com>.

I like this idea too.

It would also be good to know about the system used:

Java Version
OS Version
CPU (Type, Speed and Quantity)
RAM
Drive configuration (IDE, SCSI, RAID-1, RAID-5)



On 5/2/02 11:47 PM, "Kelvin Tan" <ke...@relevanz.com> wrote:

> 
> Number of source documents:
> Total filesize of source documents:
> Average filesize of source documents (in KB/MB):
> Source documents storage location (filesystem, DB, http,etc):
> File type of source documents:
> Parser(s) used, if any:
> Time taken (in ms/s as an average of at least 3 indexing runs):
> Notes (any special tuning/strategies):



Re: Performance benchmarks

Posted by Kelvin Tan <ke...@relevanz.com>.

Great Peter. I've posted a new set of attributes based on your submission
and Otis' feedback. Let me think about the best way to consolidate these
numbers and stick them somewhere accessible for all.

----- Original Message -----
From: "Peter Carlson" <ca...@bookandhammer.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Friday, May 03, 2002 9:50 PM
Subject: Performance benchmarks


> Some performance numbers
>
> Java Version: 1.3_01
> OS Version: Windows 2000
> CPU (Type, Speed and Quantity): Pentium 4, 1.5 GHz, 1 CPU
> RAM: 512 MB
> Drive configuration (IDE, SCSI, RAID-1, RAID-5): IDE (single)
> Number of source documents: 103009
> Total filesize of source documents: 430MB
> Average filesize of source documents (in KB/MB): 4.3KB
> Source documents storage location (filesystem, DB, http,etc): Filesystem
> File type of source documents: xml
> Parser(s) used, if any: Standard Analyzer
> Number of Fields per document: 8
> Time taken (in ms/s as an average of at least 3 indexing runs): 8387 sec
> (139 min)
> Time taken / 1000 docs indexed: 81 sec / 1000 docs
> Notes (any special tuning/strategies):
> I convert each document to a DOM, and use xpath to get the fields.
> I perform validation on the data and make sure that it meets certain
> criteria like total size > 150 characters, and verify there are no
> duplicates using a Hashmap. Without these checks, the indexing goes faster
> (about 60 seconds/1000 docs).
>
>
> I hope this is helpful.
> --Peter
>
>



Performance benchmarks

Posted by Peter Carlson <ca...@bookandhammer.com>.

Some performance numbers

Java Version: 1.3_01
OS Version: Windows 2000
CPU (Type, Speed and Quantity): Pentium 4, 1.5 GHz, 1 CPU
RAM: 512 MB
Drive configuration (IDE, SCSI, RAID-1, RAID-5): IDE (single)
Number of source documents: 103009
Total filesize of source documents: 430MB
Average filesize of source documents (in KB/MB): 4.3KB
Source documents storage location (filesystem, DB, http,etc): Filesystem
File type of source documents: xml
Parser(s) used, if any: Standard Analyzer
Number of Fields per document: 8
Time taken (in ms/s as an average of at least 3 indexing runs): 8387 sec (139 min)
Time taken / 1000 docs indexed: 81 sec / 1000 docs
Notes (any special tuning/strategies):
I convert each document to a DOM and use XPath to get the fields.
I validate the data to make sure it meets certain criteria, such as a
total size > 150 characters, and verify there are no duplicates using a
HashMap. Without these checks, indexing goes faster
(about 60 seconds/1000 docs).
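
For readers who want to picture that pre-indexing step, here is a rough Java sketch of DOM parsing, XPath field extraction, a minimum-size check, and duplicate detection. It is not the code used for the numbers above. The class name, the field names, the XPath expressions, and the choice of a key field for duplicate detection are all assumptions made for illustration, and it uses a HashSet where the notes mention a HashMap. It relies only on the JDK's javax.xml APIs and stops short of handing the fields to Lucene.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical pre-indexing filter: parse, extract, validate, de-duplicate.
final class XmlFieldExtractor {
    static final int MIN_TOTAL_CHARS = 150;            // "total size > 150 characters"

    private final Map<String, String> fieldXPaths;     // field name -> XPath (assumed)
    private final Set<String> seenKeys = new HashSet<>();
    private final DocumentBuilder builder;
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    XmlFieldExtractor(Map<String, String> fieldXPaths) {
        this.fieldXPaths = fieldXPaths;
        try {
            this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no XML parser available", e);
        }
    }

    /** Returns the extracted fields, or empty if the document is too small or a duplicate. */
    Optional<Map<String, String>> extract(String xml, String keyField) {
        try {
            Document dom = builder.parse(new InputSource(new StringReader(xml)));
            Map<String, String> fields = new LinkedHashMap<>();
            int totalChars = 0;
            for (Map.Entry<String, String> e : fieldXPaths.entrySet()) {
                String value = xpath.evaluate(e.getValue(), dom).trim();
                fields.put(e.getKey(), value);
                totalChars += value.length();
            }
            if (totalChars <= MIN_TOTAL_CHARS) {
                return Optional.empty();               // fails the size check
            }
            if (!seenKeys.add(fields.get(keyField))) {
                return Optional.empty();               // key already seen: duplicate
            }
            return Optional.of(fields);
        } catch (SAXException | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("could not process document", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> paths = new LinkedHashMap<>();
        paths.put("id", "/doc/id");                    // hypothetical document layout
        paths.put("body", "/doc/body");
        XmlFieldExtractor extractor = new XmlFieldExtractor(paths);
        String body = new String(new char[200]).replace('\0', 'x');
        String xml = "<doc><id>1</id><body>" + body + "</body></doc>";
        System.out.println(extractor.extract(xml, "id").isPresent()); // accepted
        System.out.println(extractor.extract(xml, "id").isPresent()); // duplicate, rejected
    }
}
```

Each accepted field map would then be turned into a Lucene document and added to the index. Parsing every file into a DOM is convenient but costs time, which fits the observation that indexing is faster without the checks.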


I hope this is helpful.
--Peter

