Posted to user@lucy.apache.org by Nick Wellnhofer <we...@aevum.de> on 2012/12/17 20:52:55 UTC

[lucy-user] Reindexing and concurrent updates

Hello lucy-user,

what's the best way to reindex the database completely while still allowing concurrent updates? My current plan is to start indexing documents until half of write_lock_timeout has passed and then sleep for half of write_lock_timeout. Does this make sense? Is there a better way?
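In rough Perl, the plan would look something like this (a sketch only: next_doc_to_reindex(), the 'id' field, and the index path are placeholders, and 5000 ms stands in for whatever write_lock_timeout the index is configured with):

    use strict;
    use warnings;
    use Time::HiRes qw( time sleep );
    use Lucy::Index::Indexer;

    my $write_lock_timeout = 5000;                  # milliseconds
    my $slice = $write_lock_timeout / 2 / 1000;     # work/sleep interval, in seconds

    while ( my $doc = next_doc_to_reindex() ) {     # placeholder data source
        my $indexer = Lucy::Index::Indexer->new( index => '/path/to/index' );
        my $started = time();
        do {
            # Remove the stale copy so the reindexed doc doesn't appear twice.
            $indexer->delete_by_term( field => 'id', term => $doc->{id} );
            $indexer->add_doc($doc);
        } while ( time() - $started < $slice
                  and $doc = next_doc_to_reindex() );
        $indexer->commit;
        sleep($slice);                              # hand the write lock back
    }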

Nick


Re: [lucy-user] Reindexing and concurrent updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Dec 17, 2012 at 2:50 PM, Nick Wellnhofer <we...@aevum.de> wrote:
> Another question regarding LightMergeManager: The cookbook entry recommends
> using a ceiling of 10 documents per segment. That seems a bit low to me.
> Shouldn't something like 50, 100 or even more docs be OK on typical
> hardware?

Hmm, could be.  It's tricky to get this right.  That algo can't absolutely
guarantee great worst-case performance no matter what max seg size you
accept -- you get pathological behavior for an index with many segments right
at the threshold size.  However, that would only happen once...
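
For reference, the ceiling is just the document-count cutoff in the
LightMergeManager's recycle() override from the cookbook, so experimenting
with a higher value is a one-line change -- roughly like this, with 100
picked purely for illustration:

    package LightMergeManager;
    use base qw( Lucy::Index::IndexManager );

    # Only hand back very small segments for recycling, so no single commit
    # ever takes on a big merge.  Raising the cutoff trades slightly slower
    # commits for fewer leftover segments between BackgroundMerger runs.
    sub recycle {
        my $self        = shift;
        my $seg_readers = $self->SUPER::recycle(@_);
        @$seg_readers = grep { $_->doc_max < 100 } @$seg_readers;
        return $seg_readers;
    }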

It's probably time we encapsulated this behavior into a MergePolicy class and
started collecting this wisdom into MergePolicy subclasses.

> And regarding the BackgroundMerger: Can the background merging simply be run
> periodically as a cron job?

Under many circumstances, yes -- it depends on your usage pattern.

If updates arrive at a reasonably steady pace, a cron will work fine.

On the other hand, if updates are bursty and you get a lot of them in a short
interval before the cron has a chance to run, the index may get awfully
fragmented for a little while -- in which case you might have been better off
running the BackgroundMerger after N inserts instead of on a cron timer.
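
Whichever trigger you choose, the merge pass itself is only a few lines;
something like this could be run from cron or after every N inserts (the
index path is a placeholder):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Lucy::Index::BackgroundMerger;

    # Consolidate the small segments left behind by the LightMergeManager.
    # Most of the merge happens without holding the write lock; the lock is
    # only needed briefly at commit time.
    my $bg_merger = Lucy::Index::BackgroundMerger->new(
        index => '/path/to/index',
    );
    $bg_merger->commit;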

Marvin Humphrey

Re: [lucy-user] Reindexing and concurrent updates

Posted by Nick Wellnhofer <we...@aevum.de>.
On Dec 17, 2012, at 23:21, Marvin Humphrey <ma...@rectangular.com> wrote:

>    http://lucy.apache.org/docs/perl/Lucy/Docs/Cookbook/FastUpdates.html#ABSTRACT
> 
>    While index updates are fast on average, worst-case update performance may
>    be significantly slower. To make index updates consistently quick, we must
>    manually intervene to control the process of index segment consolidation.
> 
> To guarantee good responsiveness by the "indexer" process, both "indexer" and
> "updater" need to limit the amount of existing content that they will recycle
> and you need an additional BackgroundMerger process as described in
> Lucy::Docs::Cookbook::FastUpdates to keep the number of segments from growing
> out of control.

My "indexer" process already uses the LightMergeManager as described in the FastUpdates cookbook entry. But you're right, the "updater" also has to use a LightMergeManager.

Another question regarding LightMergeManager: The cookbook entry recommends using a ceiling of 10 documents per segment. That seems a bit low to me. Shouldn't something like 50, 100, or even more docs be OK on typical hardware?

And regarding the BackgroundMerger: Can the background merging simply be run periodically as a cron job?

Nick




Re: [lucy-user] Reindexing and concurrent updates

Posted by Marvin Humphrey <ma...@rectangular.com>.
On Mon, Dec 17, 2012 at 11:52 AM, Nick Wellnhofer <we...@aevum.de> wrote:
> what's the best way to reindex the database completely while still allowing
> concurrent updates? My current plan is to start indexing documents until
> half of write_lock_timeout has passed and then sleep for half of
> write_lock_timeout. Does this make sense?

Let's assume that there are a maximum of two processes contending for write
access to a single index:

*   The "indexer", which accepts new content.
*   The "updater", which reindexes documents which were already in the system.

We'll ignore the potential issue of multiple new documents arriving
simultaneously -- we assume that adds are serialized through the "indexer"
process somehow, perhaps by queueing.

To guarantee that the "indexer" never times out waiting for a write lock, the
"updater" needs good worst-case performance.  The algorithm you describe will
give good average performance, but not good enough worst-case performance.

    http://lucy.apache.org/docs/perl/Lucy/Docs/Cookbook/FastUpdates.html#ABSTRACT

    While index updates are fast on average, worst-case update performance may
    be significantly slower. To make index updates consistently quick, we must
    manually intervene to control the process of index segment consolidation.

To guarantee good responsiveness by the "indexer" process, both "indexer" and
"updater" need to limit the amount of existing content that they will recycle
and you need an additional BackgroundMerger process as described in
Lucy::Docs::Cookbook::FastUpdates to keep the number of segments from growing
out of control.
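
Concretely, both processes would open their Indexer with the cookbook's
LightMergeManager, roughly like this (index path and document fields are
placeholders):

    use Lucy::Index::Indexer;

    # Both the "indexer" and the "updater" open their Indexer this way.
    # LightMergeManager is the IndexManager subclass from the FastUpdates
    # cookbook.
    my $indexer = Lucy::Index::Indexer->new(
        index   => '/path/to/index',
        manager => LightMergeManager->new,
    );
    $indexer->add_doc( { title => 'example', content => 'example' } );
    $indexer->commit;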

> Is there a better way?

One alternative is to reindex into a new directory off to the side, queueing
new adds as they come in and adding them after reindexing finishes.  When the
new index is caught up, swap it into place.
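
Sketched out, that might look something like the following, where
build_schema(), all_source_documents(), replay_queued_adds(), and the paths
are placeholders, and searchers are assumed to open the index through an
'index-current' symlink:

    use strict;
    use warnings;
    use Lucy::Index::Indexer;

    # Rebuild into a side directory while the live index keeps serving.
    my $side    = '/search/index-rebuild';
    my $schema  = build_schema();           # placeholder: your Lucy::Plan::Schema
    my $indexer = Lucy::Index::Indexer->new(
        index  => $side,
        schema => $schema,
        create => 1,
    );
    $indexer->add_doc($_) for all_source_documents();   # placeholder iterator
    $indexer->commit;

    # Replay the adds queued during the rebuild, then repoint the symlink
    # that searchers open.  Renaming a fresh symlink over the old one makes
    # the cutover atomic on POSIX filesystems.
    replay_queued_adds($side);                           # placeholder
    symlink $side, '/search/index-current.tmp' or die "symlink: $!";
    rename '/search/index-current.tmp', '/search/index-current'
        or die "rename: $!";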

Disk space is cheap, so that's generally not an issue.  However, you may have
to watch out for IO cache memory usage by the side process if a production
searcher depends on having most of the live index cached in RAM to achieve
good search-time performance.

Marvin Humphrey