Posted to java-user@lucene.apache.org by Chris Kimm <ch...@seeqa.com> on 2004/03/11 18:35:41 UTC

update performance

The standard pattern for updating an index - removing a document then 
re-adding the modified document to the index - is currently a 
significant performance bottleneck in my application.  I sometimes need 
to update ~1000 documents at a time.  The major cost of this pattern as 
far as I can see is IndexWriter.close().  Average times for an update 
to an FSDirectory look like this:

delete document: 7 ms
create document: 6 ms
add document: 11 ms
IndexWriter.close: 59 ms

Is there a way to synchronize IndexWriter and IndexReader so that a call 
to IndexWriter.close is not required for each update?  I guess I mean to 
ask if there is a *simple* way to do this.  I imagine that one could 
write an IndexUpdater class which manages the synchronization of Locks, 
temp files, etc.

Thanks,

Chris
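
Adding up the averages above makes the imbalance concrete: one update costs 7 + 6 + 11 + 59 = 83 ms, of which the close accounts for 59 ms, roughly 71%. The Java sketch below does the same arithmetic for 1000 documents. It assumes the step costs are simply additive and, more questionably, that a single close after 1000 adds would still take about 59 ms; neither assumption was measured in this thread. The class and method names are made up for illustration.

```java
// Back-of-the-envelope estimate from the averages reported in the post above.
// Assumptions (not measured): the step costs add up linearly, and
// IndexWriter.close costs the same 59 ms after one add or after a thousand.
final class UpdateCostEstimate {
    static final long DELETE_MS = 7;
    static final long CREATE_MS = 6;
    static final long ADD_MS = 11;
    static final long CLOSE_MS = 59;

    /** Total time when the writer is closed after every single update. */
    static long closePerDocumentMs(long docs) {
        return docs * (DELETE_MS + CREATE_MS + ADD_MS + CLOSE_MS);
    }

    /** Total time when all documents are processed and the writer is closed once. */
    static long closeOncePerBatchMs(long docs) {
        return docs * (DELETE_MS + CREATE_MS + ADD_MS) + CLOSE_MS;
    }

    public static void main(String[] args) {
        long docs = 1000;
        System.out.println("close per document:   " + closePerDocumentMs(docs) + " ms");
        System.out.println("close once per batch: " + closeOncePerBatchMs(docs) + " ms");
    }
}
```

Under those assumptions, closing after every update costs 83,000 ms for 1000 documents, while closing once costs 24,059 ms.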




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: update performance

Posted by "Kevin A. Burton" <bu...@newsmonster.org>.
Chris Kimm wrote:

> Unfortunately, I'm not able to batch the updates.  The application 
> needs to make some decisions based on what each document looks like 
> before and after the update, so I have to do it one at a time.  I 
> guess this is not a common usage scenario for Lucene.  Otherwise, an 
> update() might already be built in somewhere.
>
> Is there anything in the locking/sync framework which precludes saving 
> the cost of closing the Directory object and deleting the temp lock 
> file each time an update is made?
>
Use a RAM directory... then when you're pretty sure you're done call 
IndexWriter.addIndexes() on the disk index.

Will that work for you?

You can also do this every N documents, every N minutes, or once memory 
usage passes some threshold, and have a synchronized thread do the commit.

Kevin

-- 

Please reply using PGP.

    http://peerfear.org/pubkey.asc    
    
    NewsMonster - http://www.newsmonster.org/
    
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
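
Kevin's suggestion above (buffer in RAM, then merge into the disk index every N documents) can be sketched roughly as follows. This is not Lucene code: the two lists are hypothetical stand-ins for a RAM-resident index and the on-disk index, and commit() marks the place where the IndexWriter.addIndexes() call he mentions would go. The class name, the flushEvery parameter, and the counters are assumptions made for illustration. The sketch covers only the add side; deletes against the disk index would still need to be handled separately.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: "ram" plays the RAM-resident buffer, "disk" the
// on-disk index. Only the buffering and flushing logic is modelled here.
final class BufferedIndexSketch {
    private final List<String> ram = new ArrayList<>();
    private final List<String> disk = new ArrayList<>();
    private final int flushEvery;
    private int merges = 0;

    BufferedIndexSketch(int flushEvery) {
        this.flushEvery = flushEvery;
    }

    /** Buffer a document in memory; merge once N documents have accumulated. */
    synchronized void add(String doc) {
        ram.add(doc);
        if (ram.size() >= flushEvery) {
            commit();
        }
    }

    /** Move everything buffered so far to the disk-side list in one step. */
    synchronized void commit() {
        if (ram.isEmpty()) {
            return; // nothing to merge
        }
        disk.addAll(ram);
        ram.clear();
        merges++; // the expensive merge is paid once per batch of buffered documents
    }

    synchronized int diskSize() {
        return disk.size();
    }

    synchronized int merges() {
        return merges;
    }

    public static void main(String[] args) {
        BufferedIndexSketch index = new BufferedIndexSketch(3);
        for (int i = 0; i < 7; i++) {
            index.add("doc-" + i);
        }
        index.commit(); // flush the remainder when the run is finished
        System.out.println(index.diskSize() + " documents on disk after "
                + index.merges() + " merges");
    }
}
```

The methods are synchronized so that a timer thread could call commit() while another thread keeps adding, which is one way to read the "synchronized thread" remark.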


Re: update performance

Posted by Doug Cutting <cu...@apache.org>.
Chris Kimm wrote:
> Unfortunately, I'm not able to batch the updates.  The application needs 
> to make some decisions based on what each document looks like before 
> and after the update, so I have to do it one at a time.

Are these decisions dependent on other documents?  If not, you should be 
able to queue the updates and apply them as a batch, no?

> I guess this 
> is not a common usage scenario for Lucene.  Otherwise, an update() 
> might already be built in somewhere.

Rather, Lucene's API makes it convenient to do what is efficient, and 
less convenient to do what is inefficient.  Batching is inherently more 
efficient.

> Is there anything in the locking/sync framework which precludes saving 
> the cost of closing the Directory object and deleting the temp lock file 
> each time an update is made?

You could disable locking, but I doubt it will make it much faster.

Doug
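
Doug's point about queueing can be sketched as below, assuming (as he asks) that each decision depends only on the document being updated. The Map is a hypothetical stand-in for the index, and the "changed/unchanged" decision rule is invented purely for illustration; none of the names come from Lucene. The before/after comparison happens immediately, while the actual index change is deferred to a single flush.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the Map stands in for the index, and the decision
// rule is invented. Only the "decide now, write later" shape matters.
final class QueuedUpdates {
    /** One pending change: the document id and its replacement text. */
    static final class Pending {
        final String id;
        final String newText;

        Pending(String id, String newText) {
            this.id = id;
            this.newText = newText;
        }
    }

    final Map<String, String> index = new HashMap<>();
    final List<Pending> queue = new ArrayList<>();
    final List<String> decisions = new ArrayList<>();

    /** Compare the before and after versions right away, but defer the index change. */
    void update(String id, String newText) {
        String before = index.get(id); // null if the document is not indexed yet
        decisions.add(id + (newText.equals(before) ? ":unchanged" : ":changed"));
        queue.add(new Pending(id, newText));
    }

    /** Apply every queued change in one pass and return how many were applied. */
    int flush() {
        int applied = queue.size();
        for (Pending p : queue) {
            index.put(p.id, p.newText); // replaces the old version, if any
        }
        queue.clear();
        return applied;
    }

    public static void main(String[] args) {
        QueuedUpdates updates = new QueuedUpdates();
        updates.index.put("a", "x");
        updates.update("a", "y");
        updates.update("b", "z");
        System.out.println(updates.decisions + ", applied " + updates.flush());
    }
}
```

One caveat: if the same document is queued twice before a flush, the second comparison sees the stale indexed version, so this only fits the independent-documents case Doug describes.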



Re: update performance

Posted by Chris Kimm <ch...@seeqa.com>.
Unfortunately, I'm not able to batch the updates.  The application needs 
to make some decisions based on what each document looks like before 
and after the update, so I have to do it one at a time.  I guess this 
is not a common usage scenario for Lucene.  Otherwise, an update() 
might already be built in somewhere.

Is there anything in the locking/sync framework which precludes saving 
the cost of closing the Directory object and deleting the temp lock file 
each time an update is made?

-Chris

Doug Cutting wrote:

> It sounds like you're not batching your updates.
> 
> The most efficient approach to update 1000 documents would be to:
> 
>   1. Open an IndexReader;
>   2. Delete all 1000 documents;
>   3. Close the reader;
>   4. Open an IndexWriter;
>   5. Add all 1000 updated documents;
>   6. Close the IndexWriter.
> 
> Is that what you're doing?
> 
> Doug


Re: update performance

Posted by Doug Cutting <cu...@apache.org>.
It sounds like you're not batching your updates.

The most efficient approach to update 1000 documents would be to:

   1. Open an IndexReader;
   2. Delete all 1000 documents;
   3. Close the reader;
   4. Open an IndexWriter;
   5. Add all 1000 updated documents;
   6. Close the IndexWriter.

Is that what you're doing?

Doug
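
The six steps above can be sketched as two phases: one delete pass, then one add pass with a single close. The class below is a hypothetical in-memory stand-in, not Lucene code; the Map plays the index, and writerCloses only counts how often the expensive close would be paid. In real code the first phase would go through an IndexReader and the second through an IndexWriter, as Doug lists them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for an index: it stores id -> text and
// counts how often a "writer" is closed. It is not Lucene code.
final class BatchUpdateSketch {
    final Map<String, String> docs = new LinkedHashMap<>();
    int writerCloses = 0;

    /** Steps 1-3: one "reader" session that deletes every stale document. */
    void deleteAll(List<String> ids) {
        for (String id : ids) {
            docs.remove(id);
        }
    }

    /** Steps 4-6: one "writer" session that adds every updated document, then closes once. */
    void addAll(Map<String, String> updated) {
        docs.putAll(updated);
        writerCloses++; // the close is paid once per batch, not once per document
    }

    /** The whole update: delete the old versions first, then add the new ones. */
    void updateBatch(Map<String, String> updated) {
        deleteAll(new ArrayList<>(updated.keySet()));
        addAll(updated);
    }

    public static void main(String[] args) {
        BatchUpdateSketch index = new BatchUpdateSketch();
        index.docs.put("1", "old text");
        index.docs.put("2", "old text");

        Map<String, String> updated = new LinkedHashMap<>();
        updated.put("1", "new text");
        updated.put("2", "newer text");
        index.updateBatch(updated);

        System.out.println(index.docs + ", writer closes: " + index.writerCloses);
    }
}
```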
