Posted to java-user@lucene.apache.org by "Murdoch, Paul" <PA...@saic.com> on 2010/03/15 15:41:47 UTC

Batch Indexing - best practice?

Hi,

 

I'm using Lucene 2.9.2.  Currently, when creating my index, I'm calling
indexWriter.addDocument(doc) for each Document I want to index.  The
Documents aren't large and I'm averaging indexing about 500 documents
every 90 seconds.  I'd like to try and speed this up....unless 90
seconds for 500 Documents is reasonable.  I have the merge factor set to
1000.  Do you have any suggestions for batch indexing?  Is there
something like indexWriter.addDocuments(Document[] docs) in the API?

 

Thanks.

Paul 

 


Re: Batch Indexing - best practice?

Posted by Mark Miller <ma...@gmail.com>.
Really depends - StandardAnalyzer is probably a slower analyzer. But for
example, with my quad core desktop machine, indexing with 3 or 4
threads, I can do at least a couple hundred Wikipedia docs per second
(though I'm not using StandardAnalyzer). I'm indexing 10,000 docs in
about a minute.

Does depend on your analyzers, and the size of the docs, but that looks
pretty slow to me. Giving the JVM enough heap?
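
A minimal sketch of the "3 or 4 threads" setup, assuming all workers share a single writer. IndexWriter instances are documented as thread safe, so this is the usual arrangement, but the code below uses a hypothetical DocSink interface as a stand-in so that it runs without Lucene on the classpath. The class and method names (ThreadedIndexingSketch, indexAll) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ThreadedIndexingSketch {

    /** Hypothetical stand-in for a thread-safe writer such as Lucene's IndexWriter. */
    interface DocSink {
        void addDocument(String doc);
    }

    /** Sends every document to ONE shared sink from a fixed pool of worker threads. */
    static void indexAll(DocSink sharedSink, List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            pool.execute(() -> sharedSink.addDocument(doc));
        }
        pool.shutdown(); // accept no new tasks; already queued ones still run
        try {
            if (!pool.awaitTermination(10, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
    }

    public static void main(String[] args) {
        AtomicInteger added = new AtomicInteger();
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            docs.add("doc " + i);
        }
        // The lambda plays the role of indexWriter.addDocument(doc).
        indexAll(doc -> added.incrementAndGet(), docs, 4);
        System.out.println("documents added: " + added.get());
    }
}
```

The point of the design is that the pool is created per batch while the sink (the writer) outlives it, so threads never open or close a writer of their own.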



-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Batch Indexing - best practice?

Posted by "Murdoch, Paul" <PA...@saic.com>.
Thanks.  Timing the different parts of the indexing process led me to
the real cause of the problem.  I wasn't reusing my threaded
indexWriter.  By keeping the indexWriter open, I'm now able to index 500
documents in less than 1 second.  That's a huge improvement.

Thanks again,

Paul
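
The fix described above can be sketched as follows. This is an illustration of the pattern only, not Paul's code: DocWriter, WriterFactory and CountingFactory are hypothetical stand-ins so the example runs without Lucene. With Lucene, open() would correspond to constructing an IndexWriter and close() to IndexWriter.close().

```java
import java.util.Arrays;
import java.util.List;

class WriterReuseSketch {

    /** Hypothetical stand-in for the slice of IndexWriter used here. */
    interface DocWriter {
        void addDocument(String doc);
        void close();
    }

    /** Hypothetical factory; with Lucene this is where the IndexWriter would be created. */
    interface WriterFactory {
        DocWriter open();
    }

    /** Slow pattern: a fresh writer is opened and closed around every document. */
    static void indexOpeningPerDocument(WriterFactory factory, List<String> docs) {
        for (String doc : docs) {
            DocWriter writer = factory.open();
            try {
                writer.addDocument(doc);
            } finally {
                writer.close(); // the open/close overhead is paid once per document
            }
        }
    }

    /** Faster pattern: one writer stays open for the whole batch. */
    static void indexWithSharedWriter(WriterFactory factory, List<String> docs) {
        DocWriter writer = factory.open();
        try {
            for (String doc : docs) {
                writer.addDocument(doc);
            }
        } finally {
            writer.close(); // the open/close overhead is paid once per batch
        }
    }

    /** Counts calls so the difference is visible without a real index. */
    static class CountingFactory implements WriterFactory {
        int opens, closes, adds;

        public DocWriter open() {
            opens++;
            return new DocWriter() {
                public void addDocument(String doc) { adds++; }
                public void close() { closes++; }
            };
        }
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c");
        CountingFactory perDoc = new CountingFactory();
        indexOpeningPerDocument(perDoc, docs);
        CountingFactory shared = new CountingFactory();
        indexWithSharedWriter(shared, docs);
        System.out.println("per-document opens: " + perDoc.opens);
        System.out.println("shared-writer opens: " + shared.opens);
    }
}
```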




Re: Batch Indexing - best practice?

Posted by Erick Erickson <er...@gmail.com>.
What's a document? What's indexing?

Here's what I'd do as a very first step. Time the actual
indexing and report it out. By that I mean how long does
IndexWriter.addDocument() take? If you actually get the
document from wherever first then add all the fields
and add the document, I'd time adding the fields too. The point
is to separate the Lucene stuff from whatever else you do
before trying to fix anything.
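
One way to do that timing is a small accumulator that records elapsed time per named phase. The sketch below is only an illustration: PhaseTimer and the phase names are hypothetical, and the two timed lambdas are stand-ins for building a document's fields and for the real indexWriter.addDocument(doc) call.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Accumulates wall-clock time per named phase. Not thread safe: use one per thread. */
class PhaseTimer {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();
    private final Map<String, Integer> callsByPhase = new LinkedHashMap<>();

    /** Runs the task and adds its elapsed time to the named phase. */
    void time(String phase, Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
            callsByPhase.merge(phase, 1, Integer::sum);
        }
    }

    long totalNanos(String phase) {
        return nanosByPhase.getOrDefault(phase, 0L);
    }

    int calls(String phase) {
        return callsByPhase.getOrDefault(phase, 0);
    }

    /** One line per phase: name, number of calls, total milliseconds. */
    String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : nanosByPhase.entrySet()) {
            sb.append(String.format("%-12s %6d calls %10.3f ms%n",
                    e.getKey(), calls(e.getKey()), e.getValue() / 1e6));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PhaseTimer timer = new PhaseTimer();
        StringBuilder fakeIndex = new StringBuilder(); // stand-in for the real index
        for (int i = 0; i < 500; i++) {
            final String[] fields = new String[1];
            final int id = i;
            // Stand-in for fetching the source record and adding its fields.
            timer.time("build", () -> fields[0] = "title-" + id);
            // Stand-in for indexWriter.addDocument(doc).
            timer.time("addDocument", () -> fakeIndex.append(fields[0]));
        }
        System.out.print(timer.report());
    }
}
```

If most of the time lands outside the "addDocument" phase, the slowness is not inside Lucene and tuning the writer will not help much.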

The first point of the link Ian provided has the easily-overlooked
phrase "and the slowness is indeed inside Lucene"...

Best
Erick




RE: Batch Indexing - best practice?

Posted by "Murdoch, Paul" <PA...@saic.com>.
Thanks.  I'll try lowering the merge factor and see if speed increases.
The indexing is threaded....similar to the utility class in Listing 10.1
from Lucene in Action.  Search speed is great once the index is
built....close to real time.  So my main problem is getting the indexing
speed fixed.  I do use the StandardAnalyzer for most of my fields.  What
type of performance level should I be trying to hit for indexing
(docs/sec)...just to give me an idea of what to shoot for?

Paul 



Re: Batch Indexing - best practice?

Posted by Mark Miller <ma...@gmail.com>.
You should lower that merge factor - that's *really* high.

You shouldn't really need much more than 50 or so ... and for search
speed you're going to want fewer segments anyway -
if you're just going to end up optimizing at the end, there is no reason
for such a large merge factor - you will pay for most of what
you saved when you optimize.

That is very slow by the way. Should be much faster - especially if you
are using multiple threads.
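
To make the segment-count argument concrete, here is a deliberately simplified model, not Lucene's actual merge policy (which also takes segment sizes into account). It assumes every flush writes one new segment and that whenever mergeFactor segments pile up at one level they are merged into a single segment at the next level. The class name and the figure of 500 flushes are hypothetical.

```java
class MergeFactorModel {

    /**
     * Segments left after `flushes` new segments have been written, assuming a
     * merge happens each time `mergeFactor` segments pile up at one level.
     */
    static int segmentsAfter(int flushes, int mergeFactor) {
        if (flushes < 0 || mergeFactor < 2) {
            throw new IllegalArgumentException("need flushes >= 0 and mergeFactor >= 2");
        }
        int segments = 0;
        // The survivors at each level are the digits of `flushes` in base `mergeFactor`.
        for (int n = flushes; n > 0; n /= mergeFactor) {
            segments += n % mergeFactor;
        }
        return segments;
    }

    public static void main(String[] args) {
        int flushes = 500; // hypothetical number of flushed segments
        for (int mergeFactor : new int[] {10, 50, 1000}) {
            System.out.println("mergeFactor=" + mergeFactor
                    + " -> segments=" + segmentsAfter(flushes, mergeFactor));
        }
    }
}
```

In this model 500 flushes leave 5 segments at a merge factor of 10, 10 segments at 50, and 500 segments at 1000. At 1000 no merging has happened at all, so searches see hundreds of segments and the whole merge cost lands on the final optimize. In Lucene 2.9 the setting is changed with IndexWriter.setMergeFactor(int).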

-- 
- Mark

http://www.lucidimagination.com






Re: Batch Indexing - best practice?

Posted by Ian Lea <ia...@gmail.com>.
See http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for plenty
of tips. Suggested by Mike just a few hours ago in another thread ...


--
Ian.

