Posted to java-user@lucene.apache.org by Scott Smith <SS...@MainstreamData.com> on 2004/01/06 04:26:03 UTC

Performance question

I have an application that is reading in XML files and indexing them.  Each
XML file is 3K-6K bytes.  This application preloads a database that I will
add to "on the fly" later.  However, all I want it to do initially is take
some existing files and create the initial index as quickly as I can.

Since I want to index "on the fly" later, I set the merge factor to 10.  I'm
assuming that I can't create the index initially with one merge factor
(e.g., 100) and then change the merge factor later (true?).

What I see is that it takes 1-3 seconds per XML file to do the indexing.
This means I'm indexing around 150K bytes per minute.  I also notice that
the CPU utilization rarely exceeds 5% (looking at Task Manager on a Windows
box).  I use Xerces to read in the files (SAX interface), and I don't close
or optimize the index between stories, nor do I sleep anywhere.  I've looked
at the page fault numbers and they aren't changing much.  I would have
expected to pretty much peg the CPU and see much faster indexing.

Any ideas/suggestions? 

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Performance question

Posted by Terry Steichen <te...@net-frame.com>.
Scott,

Here are some figures to use for comparison.  Using the latest Lucene
release, I index about 200 similar-sized XML files at a time, on a Windows
XP machine (2 GHz).  First I create a new index, which adds the documents at
a rate of about 8 per second (I don't recall what the CPU % is during this).
Then I merge this new index with the master one (using, I think, the default
merge factor), which takes about 4.5 minutes (during which time the CPU
utilization stays near 100%).  The master index currently holds about
115,000 such documents.
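
For a rough side-by-side with the "around 150k bytes per minute" in the
original post, the arithmetic can be sketched as below.  This assumes that
"3K-6K bytes" means 3,000-6,000 bytes per file; the class and method names
are made up for illustration and are not part of Lucene.

```java
public class IndexingThroughput {

    /** Bytes indexed per minute, given a file size and the seconds spent per file. */
    static double bytesPerMinute(double bytesPerFile, double secondsPerFile) {
        return bytesPerFile * 60.0 / secondsPerFile;
    }

    public static void main(String[] args) {
        // Original post: 1-3 seconds per file, 3,000-6,000 bytes per file (assumed K = 1,000).
        double scottLow = bytesPerMinute(3_000, 3.0);
        double scottHigh = bytesPerMinute(6_000, 1.0);

        // This reply: about 8 documents per second, i.e. 1/8 second per file.
        double terryLow = bytesPerMinute(3_000, 1.0 / 8);
        double terryHigh = bytesPerMinute(6_000, 1.0 / 8);

        System.out.println("1-3 s per file:  " + scottLow + " to " + scottHigh + " bytes/minute");
        System.out.println("8 files per sec: " + terryLow + " to " + terryHigh + " bytes/minute");
    }
}
```

Under those assumptions, 1-3 seconds per file works out to somewhere between
60,000 and 360,000 bytes per minute, so the 150k figure sits inside that
range, while 8 documents per second is roughly 1.4 to 2.9 million bytes per
minute.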

HTH,

Regards,

Terry





Re: Performance question

Posted by Otis Gospodnetic <ot...@yahoo.com>.
--- Scott Smith <SS...@MainstreamData.com> wrote:
> I have an application that is reading in XML files and indexing them.
> Each XML file is 3K-6K bytes.  This application preloads a database
> that I will add to "on the fly" later.  However, all I want it to do
> initially is take some existing files and create the initial index as
> quick as I can.
>
> Since I want to index "on the fly" later, I set the merge factor to
> 10.  I'm assuming that I can't create the index initially with one
> merge factor (e.g., 100) and then change the merge factor later
> (true?).

I believe this assumption is wrong.  You can change the merge factor at
any time.  I haven't tested this, though.

> What I see is that it takes 1-3 seconds per xml file to do the index.
> This means I'm indexing around 150k bytes per minute.  I also notice
> that the CPU utilization rarely exceeds 5% (looking at task manager on
> a Windows box).  I use Xerces to read in the files (SAX interface) and
> I don't close or optimize the index between stories nor do I sleep
> anyplace.  I've looked at the page fault numbers and they aren't
> changing much.  I guess I would have expected that I would have pretty
> much pegged the CPU and seen much faster indexing.
>
> Any ideas/suggestions?

Check how much time the XML parsing is taking, and how much the actual
indexing takes.  Lucene indexing is IO bound, not CPU bound, so what you
are seeing (5% CPU usage) sounds like Lucene may be the bottleneck.  But
check your XML parsing code.
Post the code, if you want.
In version 1.3 there are two other indexing parameters that you can use
for tuning.  You can try playing with those.  You can also give the JVM
more memory.  One of my articles on the Resources page of Lucene's site
mentions this type of stuff.
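
One way to split the time between parsing and indexing is to put a timer
around each step, roughly as in the sketch below.  It uses only the JDK's
SAX parser; the Indexer interface is a hypothetical stand-in for whatever
call actually adds the document to the index, and the class and field names
are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ParseVsIndexTimer {

    /** Hypothetical stand-in for the real indexing call. */
    interface Indexer {
        void index(String text) throws Exception;
    }

    /** Collects element text so there is something to hand to the indexer. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    long parseNanos;  // total time spent in the SAX parser
    long indexNanos;  // total time spent in the indexer
    int docs;         // number of documents processed

    /** Parses one XML document, hands its text to the indexer, and times both steps. */
    void process(byte[] xml, SAXParser parser, Indexer indexer) throws Exception {
        TextCollector handler = new TextCollector();
        long t0 = System.nanoTime();
        parser.parse(new ByteArrayInputStream(xml), handler);
        long t1 = System.nanoTime();
        indexer.index(handler.text.toString());
        long t2 = System.nanoTime();
        parseNanos += t1 - t0;
        indexNanos += t2 - t1;
        docs++;
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ParseVsIndexTimer timer = new ParseVsIndexTimer();
        byte[] xml = "<story><title>Sample</title><body>Some text</body></story>"
                .getBytes(StandardCharsets.UTF_8);

        // Placeholder indexer that does nothing; swap in the real indexing call here.
        Indexer noOp = text -> { };

        for (int i = 0; i < 100; i++) {
            timer.process(xml, parser, noOp);
        }
        System.out.println("docs:     " + timer.docs);
        System.out.println("parse ms: " + timer.parseNanos / 1_000_000.0);
        System.out.println("index ms: " + timer.indexNanos / 1_000_000.0);
    }
}
```

If the parse total dominates, the XML side is the place to look; if the
index total dominates, the tuning parameters and JVM memory are the next
things to try.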

Otis

