Posted to java-user@lucene.apache.org by Hamish Carpenter <ha...@catalyst.net.nz> on 2002/11/29 02:41:30 UTC

Benchmarking Results

Hi Everyone,

I've been lurking on this list for a couple of weeks now.  I thought I 
would contribute my experiences (with timings) of using Lucene.

Our main question: why does performance decrease significantly 
when searching with multiple threads?

I hope this helps people starting out with Lucene to compare against 
their own performance.

Hamish

BTW: Optimizing took 3.5 minutes at 500,000 documents and 4.7 minutes at 
1,000,000 documents.  Sorry I don't have memory usage figures.

<benchmark>
Hardware environment
--------------------
Dedicated machine for indexing (yes/no): yes
CPU (Type, Speed and Quantity): Intel x86 P4 1.5GHz
RAM: 512 DDR
Drive configuration (IDE, SCSI, RAID-1, RAID-5): IDE 7200rpm RAID-1

Software environment
--------------------
Java Version: 1.3.1 IBM JITC Enabled
OS Version: Debian Linux 2.4.18-686
Location of index directory (local/network): local

Lucene indexing variables
-------------------------
Number of source documents: Random generator. Set to make 1M documents
in 2x500,000 batches.
Total filesize of source documents: > 1GB if stored.
Average filesize of source documents (in KB/MB): 1KB
Source documents storage location (filesystem, DB, http,etc): fs
File type of source documents: generated.
Parser(s) used, if any: default
Analyzer(s) used: default
Number of fields per document: 11
Type of fields: 1 date, 1 id, 9 text
Index persistence (FSDirectory, SqlDirectory, etc): FSDirectory

Time taken (in ms/s as an average of at least 3 indexing runs):
Time taken / 1000 docs indexed: 49 seconds
Memory consumption: unsure
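
For comparison with other setups, the figure above can be turned into a
throughput and a projected total. The sketch below is only arithmetic on
the reported 49 seconds per 1000 documents; the class and method names
are made up for illustration, and the projection assumes the rate stays
constant over the whole run, which the post does not state. Under that
assumption it works out to about 20 documents per second and roughly
13.6 hours for 1M documents.

```java
final class IndexingEstimate {

    /** Documents indexed per second, given seconds spent per 1000 documents. */
    static double docsPerSecond(double secondsPerThousandDocs) {
        return 1000.0 / secondsPerThousandDocs;
    }

    /** Projected hours to index `docs` documents, assuming a constant rate. */
    static double totalHours(long docs, double secondsPerThousandDocs) {
        return docs / 1000.0 * secondsPerThousandDocs / 3600.0;
    }

    public static void main(String[] args) {
        double secondsPerThousand = 49.0; // reported figure
        long docs = 1_000_000L;           // size of the generated collection
        System.out.printf("%.1f docs/s, about %.1f hours for %d docs%n",
                docsPerSecond(secondsPerThousand),
                totalHours(docs, secondsPerThousand),
                docs);
    }
}
```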

Notes (any special tuning/strategies):
--------------------------------------
A Windows client ran a random document generator, which created
documents from some arrays of values and an excerpt (approx. 1KB)
from a text file of the Bible (King James Version).
These were submitted via a socket connection (kept open throughout
the indexing process).
The index writer was not closed between index calls.
This created a 400MB index in 23 files (after optimization).

Query details:
--------------
I set up a threaded class that starts x simultaneous threads to
search the index created above.
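
The threaded class itself is not included in the post. A minimal sketch
of that kind of harness is shown below, assuming a modern JDK (the
original ran on Java 1.3, which predates java.util.concurrent). The
name ConcurrentQueryTimer and the fake CPU-bound workload are
hypothetical stand-ins; in a real run the Runnable would execute the
search against the shared searcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ConcurrentQueryTimer {

    /** Runs `query` once on each of `threads` threads and returns the mean latency in ms. */
    static double averageMillis(int threads, Runnable query) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch startSignal = new CountDownLatch(1);
        List<Future<Long>> timings = new ArrayList<>();
        try {
            for (int i = 0; i < threads; i++) {
                timings.add(pool.submit(() -> {
                    startSignal.await();          // wait until all tasks are submitted
                    long begin = System.nanoTime();
                    query.run();
                    return System.nanoTime() - begin;
                }));
            }
            startSignal.countDown();              // release every thread at once
            long totalNanos = 0;
            for (Future<Long> timing : timings) {
                totalNanos += timing.get();
            }
            return totalNanos / 1_000_000.0 / threads;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("benchmark run failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical CPU-bound stand-in for a real search call.
        Runnable fakeQuery = () -> {
            long sum = 0;
            for (int i = 0; i < 50_000_000; i++) {
                sum += i;
            }
            if (sum == 42) {
                System.out.println("unreachable; keeps the loop from being optimized away");
            }
        };
        for (int threads = 1; threads <= 4; threads++) {
            System.out.printf("%d thread(s): %.1f ms avg%n",
                    threads, averageMillis(threads, fakeQuery));
        }
    }
}
```

Releasing all threads from a single latch keeps thread start-up time
out of the measurement, so the average reflects time spent in the
query itself.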

Query:  +Domain:sos
        +(+((Name:goo*^2.0 Name:plan*^2.0) (Teaser:goo* Teaser:plan*)
            (Details:goo* Details:plan*)) -Cancel:y)
        +DisplayStartDate:[mkwsw2jk0-mq3dj1uq0]
        +EndDate:[mq3dj1uq0-ntlxuggw0]

This query matched 34,000 documents, and I limited the returned documents
to 5.

This uses Peter Halacsy's IndexSearcherCache, slightly modified to
be a singleton that returns cached searchers for a given directory. This
solved an initial problem with too many open files, where we ran out of
Linux file handles.
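
The cache code is not shown in the post, so the following is only a
sketch of the general idea, not Peter Halacsy's actual class: one
shared object that opens a searcher the first time a directory is
requested and hands back the same instance afterwards. The class name,
the generic parameter S and the opener function are all assumptions
made for illustration; with Lucene, the opener would wrap whatever call
opens a searcher on that directory. It assumes a modern JDK.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

final class SearcherCache<S> {

    private final ConcurrentMap<String, S> byDirectory = new ConcurrentHashMap<>();
    private final Function<String, S> opener;

    /** `opener` creates a searcher for an index directory; it is called once per directory. */
    SearcherCache(Function<String, S> opener) {
        this.opener = opener;
    }

    /** Returns the cached searcher for `indexDir`, opening it on first use. */
    S get(String indexDir) {
        return byDirectory.computeIfAbsent(indexDir, opener);
    }

    /** Number of directories that currently have an open searcher. */
    int size() {
        return byDirectory.size();
    }

    public static void main(String[] args) {
        // Hypothetical opener: a String stands in for a real searcher object.
        SearcherCache<String> cache = new SearcherCache<>(dir -> "searcher for " + dir);
        String first = cache.get("/data/index");
        String second = cache.get("/data/index");
        System.out.println(first == second); // same instance, so the index is opened only once
    }
}
```

Because every search thread shares one searcher per directory, the
number of open index files stays fixed no matter how many threads are
started.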

Threads | Avg time per query (ms)
--------+------------------------
1       | 1009
2       | 2043
3       | 3087
4       | 4045
...     | ...
10      | 10091
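
One way to read these timings is to convert them to overall
throughput, i.e. threads divided by average time per query. The sketch
below does only that arithmetic on the reported figures (the class and
method names are made up). Every row comes out between roughly 0.97
and 0.99 queries per second, so adding threads did not increase the
total work done. That is consistent with the queries effectively
running one at a time, as might be expected from a CPU-bound query on
a single processor, though the post does not confirm the cause.

```java
final class ThroughputCheck {

    /** Overall queries completed per second when `threads` queries run concurrently. */
    static double queriesPerSecond(int threads, double avgMillisPerQuery) {
        return threads * 1000.0 / avgMillisPerQuery;
    }

    public static void main(String[] args) {
        // Reported rows only; the elided rows (5 to 9 threads) are not filled in.
        int[] threads = {1, 2, 3, 4, 10};
        double[] avgMillis = {1009, 2043, 3087, 4045, 10091};
        for (int i = 0; i < threads.length; i++) {
            System.out.printf("%2d thread(s): %.2f queries/s%n",
                    threads[i], queriesPerSecond(threads[i], avgMillis[i]));
        }
    }
}
```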

I removed the two date range terms from the query and it made a HUGE
difference in performance. With 4 threads the avg time dropped to 900ms!

Other query optimizations made little difference.
</benchmark>



--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: Benchmarking Results

Posted by Kelvin Tan <ke...@relevanz.com>.

Thanks for posting your benchmark results, Hamish. A while ago, I 
started collecting similar posts from a couple of folks who have been 
generous enough to share their results. 

Let me see if I can find those numbers again, and this time, maybe 
they'll make it onto the website or FAQ, with each author's explicit 
blessing of course...

Regards,
Kelvin

--------
The book giving manifesto     - http://how.to/sharethisbook

