Posted to java-user@lucene.apache.org by Ram Peters <ra...@gmail.com> on 2007/05/09 00:33:10 UTC

Periodic Indexing DESIGN QUESTION

I am indexing documents periodically, every hour.  I have a scenario.
For example, when you are indexing every hour and a large document set
is present, it takes >1 hr to index the documents.  Now you are
already behind on indexing for the next hour.  How do you design
something that is robust?

thanks.



Re: Periodic Indexing DESIGN QUESTION

Posted by Chris Hostetter <ho...@fucit.org>.
: For example, when you are indexing every hour and a large document set
: is present, it takes >1 hr to index the documents.  Now you are
: already behind on indexing for the next hour.  How do you design
: something that is robust?

Fundamentally, this question is really more about issues in a producer/consumer
model than it is specifically about indexing... given a situation where data
comes into a queue (from some set of producers) and you want to process
that data (by some set of consumers), what do you do if the producers
produce faster than the consumers consume?

I know of 7 options:
  1) decrease the number of producers
  2) make the producers produce slower
  3) make the queue infinitely large
  4) make the queue block
  5) make the consumers consume faster
  6) increase the number of consumers
  7) throw away data

#1, #2 and #3 are not usually practical, but are listed for completeness.
#4 may be practical in some situations, but there are no easy rules to
know when.  #6 tends to be very feasible in a well-designed system where
things can be parallelized, while #5 can frequently be achieved either
by profiling and optimizing your code, or by making your code do
less; which segues nicely into #7 -- it may sound like a joke, but frequently
big throughput gains can be made by reducing the amount of data being sent
to the consumers ... sometimes it's a matter of taking some work that the
consumers do and making the producers do it (ie: eliminating data that
you know you aren't going to index); in other cases it may truly be
throwing away data because you can see that your queue is so full that you
switch into "critical info only" mode, where you don't bother to process
every little bit of data -- just the important stuff.  You make the
conscious choice that it's better to be caught up on the big stuff than to
fall way, way behind dealing with the little stuff.
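
If you go the #4/#6 route, a minimal sketch of the shape of it (plain
java.util.concurrent, nothing Lucene-specific; the queue size and thread
count below are arbitrary) would be a bounded queue that blocks the producer
whenever the consumers fall behind, drained by a small pool of consumer
threads:

  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;

  public class BoundedPipeline {
      public static void main(String[] args) {
          // option #4: a bounded queue -- put() blocks once 1000 items are pending
          final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(1000);

          // option #6: several consumers draining the same queue
          for (int i = 0; i < 4; i++) {
              new Thread(new Runnable() {
                  public void run() {
                      try {
                          while (true) {
                              String doc = queue.take(); // blocks while the queue is empty
                              // indexing of 'doc' would happen here
                          }
                      } catch (InterruptedException e) {
                          Thread.currentThread().interrupt();
                      }
                  }
              }).start();
          }

          // the producer blocks (instead of silently falling behind) when consumers are slow
          try {
              for (int i = 0; i < 10000; i++) {
                  queue.put("document " + i);
              }
          } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
          }
      }
  }

Swapping put() for offer() with a timeout gives you the #7 behaviour of
dropping the little stuff whenever the queue stays full.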



-Hoss




Re: Periodic Indexing DESIGN QUESTION

Posted by Erick Erickson <er...@gmail.com>.
Don't do it that way <G>? Is this an actual or theoretical
scenario? And do you reasonably expect it to become actual?
Otherwise, why bother?

And you've got other problems here. If you're indexing that
much data, you'll soon outgrow your disk. Unless you're
replacing most of the documents.

But assuming that all this is somehow not a problem, I'd
consider something like indexing by directory. That is, for an
hour, collect all the incoming documents in directory d1. Then
turn an indexer process loose on d1 and start collecting docs in
d2. At the end of the next hour, start indexing d2 and collecting d3.

When each indexing process finishes, you can use
IndexWriter.addIndexes. Or you could batch them up and add
all the indexes that have been created in the last, say, 6 hours
at once. You could even split this across multiple machines if
you get CPU bound.
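
The merge step would look roughly like this (a sketch against the Lucene
2.x-era API; the directory paths are just placeholders):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class MergeHourlyIndexes {
      public static void main(String[] args) throws Exception {
          // the main index that searches run against
          Directory mainDir = FSDirectory.getDirectory("/indexes/main");
          IndexWriter writer = new IndexWriter(mainDir, new StandardAnalyzer(), false);

          // indexes built from the hourly directories (d1, d2, ...) by separate
          // indexer runs; these paths are placeholders
          Directory[] hourly = new Directory[] {
              FSDirectory.getDirectory("/indexes/d1"),
              FSDirectory.getDirectory("/indexes/d2")
          };

          writer.addIndexes(hourly);   // merge the batch indexes into the main index
          writer.close();
      }
  }

Batching several hourly indexes into one addIndexes call, as suggested above,
amortizes the merge cost.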

That said, I can't stress enough that you really need to consider
how long you can keep indexing data at that rate and have any
performance to speak of at search time.

If you're not indexing that much data, *and* you still have speed
problems, I'd look long and hard at my code to see why indexing
is taking so long. Are you closing/reopening the IndexWriter? Are
you optimizing too often? Is the way you access the data (perhaps
querying a database) painful?

Some real numbers would help. Things like:
How many documents are in your index?
How many arrive each hour?
How long does it take to index, say, 100 docs?
How big are the docs upon input? How much bigger do they make the index?

Have you measured *any* of these things? If so, please post
the numbers. Think about doing everything *except* indexing and see
if your bottleneck is somewhere unexpected.
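
A throwaway timing harness is usually enough to get those numbers (a sketch;
the field name, sample text and scratch directory are placeholders):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class IndexTiming {
      public static void main(String[] args) throws Exception {
          IndexWriter writer = new IndexWriter(
                  FSDirectory.getDirectory("/tmp/timing-index"),
                  new StandardAnalyzer(), true);   // create a scratch index

          long start = System.currentTimeMillis();
          for (int i = 0; i < 100; i++) {
              Document doc = new Document();
              // "body" and the text are stand-ins; feed it representative real documents
              doc.add(new Field("body", "sample text for document " + i,
                      Field.Store.NO, Field.Index.TOKENIZED));
              writer.addDocument(doc);
          }
          long elapsed = System.currentTimeMillis() - start;
          System.out.println("Indexed 100 docs in " + elapsed + " ms");

          writer.close();
      }
  }

Run it once as-is and once with the addDocument() call commented out (or with
your real data-fetching code in the loop) and you'll see quickly whether
Lucene or the data access is the bottleneck.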

Anyway, hope this helps
Erick

On 5/8/07, Ram Peters <ra...@gmail.com> wrote:
>
> I am indexing documents periodically, every hour.  I have a scenario.
> For example, when you are indexing every hour and a large document set
> is present, it takes >1 hr to index the documents.  Now you are
> already behind on indexing for the next hour.  How do you design
> something that is robust?
>
> thanks.
>