Posted to java-user@lucene.apache.org by Aleksey <bi...@gmail.com> on 2013/04/26 05:10:44 UTC

Optimizing NRT search

Hey guys,

I'm new to Lucene and I was trying to estimate how fast I can make updates
to the index and reopen it. The behavior I'm seeing seems odd.
I'm using Lucene 4.2 and a SearcherManager instance constructed from an IndexWriter.

I run a loop that updates 1 document, then calls maybeRefresh, acquires a
new searcher, and runs a search to verify that the update is there. On my
laptop this does about 100 iterations per second.

Then I run another loop that makes 10 updates before reopening the index, and
this only does 10 iterations per second, proportionally less. I was
expecting that batching the updates would give me higher overall throughput,
but that does not seem to be the case. The size of the index I'm updating
doesn't make a difference either: I tried 3K and 100K document sets, 1-2 KB
per doc, and both produce the same update speed (though I'm not calling commit
in these runs).
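
For concreteness, here is a minimal sketch of the kind of loop described above,
assuming Lucene 4.x's SearcherManager API (the field names, document contents
and loop bounds are only illustrative):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TermQuery;

    public class NrtLoopSketch {
        // writer: an already-open IndexWriter on the test index
        static void run(IndexWriter writer) throws IOException {
            SearcherManager manager = new SearcherManager(writer, true, null);
            for (int i = 0; i < 1000; i++) {
                String id = "doc-" + (i % 100);
                Document doc = new Document();
                doc.add(new StringField("id", id, Field.Store.YES));
                doc.add(new TextField("body", "iteration " + i, Field.Store.NO));
                writer.updateDocument(new Term("id", id), doc); // replace by unique id

                manager.maybeRefresh();                         // reopen the NRT searcher
                IndexSearcher searcher = manager.acquire();
                try {
                    // verify the update is visible
                    searcher.search(new TermQuery(new Term("id", id)), 1);
                } finally {
                    manager.release(searcher);
                }
            }
        }
    }

Batching would simply move updateDocument into an inner loop and call
maybeRefresh once per batch instead of once per document.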

Can anyone point me in the right direction to investigate this, or hint at
how to maximize write throughput to the index while still keeping the delay
in seeing the updates under 0.5 seconds?

Thank you in advance,

Aleksey

Re: Optimizing NRT search

Posted by Aleksey <bi...@gmail.com>.
Yes, GC gets pretty bad even with only 8 GB of RAM. I also tried putting the
index on a RAM disk and using SimpleFSDirectory, which performs well and keeps
the Java heap small, but with this many indices it ends up keeping hundreds of
thousands of files open. This is not really a Lucene question, but could that
cause problems down the line? I haven't yet run it that way for an extended
period of time.


I'm using an individual reader for each index, because I don't really need
to search across them, so no need for MultiReader.

I was actually going to ask about filters in general. I'm unclear on how they
work. They look very similar to queries, but on the web some say they are
used to narrow down search results, while others say they can limit the
search space, and those sound like opposite things to me.
Also this ticket https://issues.apache.org/jira/browse/LUCENE-3212 confuses
me a little, as it says a filtered reader "hides filtered documents by
returning them in getDeletedDocs()". Why "deleted" as opposed to
"filtered"? Are the docs really deleted when a filter is applied?
In what kind of scenario will filters give better performance than plain
queries? How about "recycled" docs: say an application could move docs into
the "trash" and restore them, so that the main searches run over a smaller
set. Is that a good use?
And for the docs that are filtered out, can searches/sorting be done over
those, or would I need a second, negated filter?
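
For reference, this is roughly what applying a filter looks like in Lucene 4.x:
the filter is passed alongside the query rather than folded into it. The field
and term values below are only illustrative:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    static TopDocs activeMatches(IndexSearcher searcher) throws IOException {
        // Restrict the candidate set to docs whose status field is "active",
        // then run the actual query over only that subset.
        Filter activeOnly =
            new QueryWrapperFilter(new TermQuery(new Term("status", "active")));
        return searcher.search(new TermQuery(new Term("body", "lucene")), activeOnly, 10);
    }

A filter built this way can also be wrapped in CachingWrapperFilter so the
matching doc set is computed once per segment and reused across searches.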

Aleksey


On Sat, Apr 27, 2013 at 5:02 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> On Fri, Apr 26, 2013 at 5:04 PM, Aleksey <bi...@gmail.com> wrote:
> > Thanks for the response, Mike. Yes, I've come across your blog before, it's
> > very helpful.
> >
> > I tried bigger batches; it seems the highest throughput I can get is
> > roughly 250 docs a second. From your blog, you updated your index at about
> > 1 MB per second with 1 KB documents, which is 1000 docs/sec, but you had a
> > 24-core machine, while my laptop has 2 cores (and an SSD). So does that mean
> > the performance I'm seeing is actually better than back in 2011? (By the way
> > I'm using RAMDirectory rather than MMap, but MMap seems similar.)
>
> Be careful with RAMDir ... it's very GC heavy as the index gets larger
> since it breaks each file into 1K byte[]s.  It's best for smallish
> indices.
>
> Your tests are all with one thread?  (My tests were using multiple
> threads on the 24 core machine).  So on a laptop with one thread, 250
> docs/sec where each doc is 1-2 KB seems reasonable.
>
> Still it's odd you don't see larger gains from batching up the changes
> between reopens.
>
> > Interesting thing is that NRTCachingDirectory is about 2x faster when I'm
> > updating one document at a time, but batches of 250 take about 1 second
> > for both. I have not tried tuning any components yet because I don't yet
> > understand what exactly all the knobs do.
>
> Well if you're using RAMDir then NRTCachingDir really should not be
> helping much at all!
>
> > Actually, perhaps I should describe my overall use case to see if I
> > should be using Lucene in this way at all.
> > My searches never need to be over the entire data set, only over a tiny
> > portion at a time, so I was prototyping a solution that acts kind of like
> > a cache. The search fleet holds lots of small Directory instances that
> > can be quickly loaded up when necessary and evicted when not in use.
> > Each one is 200 to 200K docs in size. Updates also go to individual
> > directories, and they are typically tens of docs rather than hundreds or
> > thousands.
> > I know that having lots of separate directories and searchers is an
> > overhead, but if I had everything in one index, I suppose it would be
> > harder to load and evict portions of it. So am I structuring my
> > application in a reasonable way, or is there a better way to go about it?
>
> This approach should work.  You use MultiReader to search across them?
>
> You could also use a single reader + filter, or a single reader and
> periodically delete the docs to be evicted.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>

Re: Optimizing NRT search

Posted by Michael McCandless <lu...@mikemccandless.com>.
On Fri, Apr 26, 2013 at 5:04 PM, Aleksey <bi...@gmail.com> wrote:
> Thanks for the response, Mike. Yes, I've come across your blog before, it's
> very helpful.
>
> I tried bigger batches; it seems the highest throughput I can get is
> roughly 250 docs a second. From your blog, you updated your index at about
> 1 MB per second with 1 KB documents, which is 1000 docs/sec, but you had a
> 24-core machine, while my laptop has 2 cores (and an SSD). So does that mean
> the performance I'm seeing is actually better than back in 2011? (By the way
> I'm using RAMDirectory rather than MMap, but MMap seems similar.)

Be careful with RAMDir ... it's very GC heavy as the index gets larger
since it breaks each file into 1K byte[]s.  It's best for smallish
indices.

Your tests are all with one thread?  (My tests were using multiple
threads on the 24 core machine).  So on a laptop with one thread, 250
docs/sec where each doc is 1-2 KB seems reasonable.

Still it's odd you don't see larger gains from batching up the changes
between reopens.

> Interesting thing is that NRTCachingDirectory is about 2x faster when I'm
> updating one document at a time, but batches of 250 take about 1 second
> for both. I have not tried tuning any components yet because I don't yet
> understand what exactly all the knobs do.

Well if you're using RAMDir then NRTCachingDir really should not be
helping much at all!
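
To illustrate the intended pairing: NRTCachingDirectory is normally wrapped
around an on-disk directory, so small, newly flushed NRT segments stay in RAM
while the stable index lives on disk. A minimal sketch, with an illustrative
path and example cache sizes:

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;
    import org.apache.lucene.store.NRTCachingDirectory;

    static Directory openNrtCachingDir(File indexPath) throws IOException {
        // Cache newly flushed segments up to 5 MB each, at most 60 MB total in RAM;
        // everything else is read/written through the underlying MMapDirectory.
        return new NRTCachingDirectory(new MMapDirectory(indexPath), 5.0, 60.0);
    }

The returned Directory is the one handed to the IndexWriter; NRT readers opened
from that writer then see the recently flushed segments from the RAM cache.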

> Actually, perhaps I should describe my overall use case to see if I should
> be using Lucene in this way at all.
> My searches never need to be over the entire data set, only over a tiny
> portion at a time, so I was prototyping a solution that acts kind of like a
> cache. The search fleet holds lots of small Directory instances that can be
> quickly loaded up when necessary and evicted when not in use. Each one is
> 200 to 200K docs in size. Updates also go to individual directories, and
> they are typically tens of docs rather than hundreds or thousands.
> I know that having lots of separate directories and searchers is an
> overhead, but if I had everything in one index, I suppose it would be
> harder to load and evict portions of it. So am I structuring my application
> in a reasonable way, or is there a better way to go about it?

This approach should work.  You use MultiReader to search across them?

You could also use a single reader + filter, or a single reader and
periodically delete the docs to be evicted.
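
If the delete-to-evict route were taken, it could look roughly like the sketch
below, under the assumption that every doc carries the id of the logical
partition it belongs to (the "partition" field name is made up here):

    import java.io.IOException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Evict one logical partition from the single shared index.
    // The deletes are buffered by the writer and become visible to searches
    // on the next NRT reopen (SearcherManager.maybeRefresh).
    static void evictPartition(IndexWriter writer, String partitionId) throws IOException {
        writer.deleteDocuments(new Term("partition", partitionId));
    }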

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Optimizing NRT search

Posted by Aleksey <bi...@gmail.com>.
Thanks for the response, Mike. Yes, I've come across your blog before, it's
very helpful.

I tried bigger batches; it seems the highest throughput I can get is
roughly 250 docs a second. From your blog, you updated your index at about
1 MB per second with 1 KB documents, which is 1000 docs/sec, but you had a
24-core machine, while my laptop has 2 cores (and an SSD). So does that mean
the performance I'm seeing is actually better than back in 2011? (By the way
I'm using RAMDirectory rather than MMap, but MMap seems similar.)
Interesting thing is that NRTCachingDirectory is about 2x faster when I'm
updating one document at a time, but batches of 250 take about 1 second for
both. I have not tried tuning any components yet because I don't yet
understand what exactly all the knobs do.

Actually, perhaps I should describe my overall use case to see if I should
be using Lucene in this way at all.
My searches never need to be over the entire data set, only over a tiny
portion at a time, so I was prototyping a solution that acts kind of like a
cache. The search fleet holds lots of small Directory instances that can be
quickly loaded up when necessary and evicted when not in use. Each one is
200 to 200K docs in size. Updates also go to individual directories, and
they are typically tens of docs rather than hundreds or thousands.
I know that having lots of separate directories and searchers is an
overhead, but if I had everything in one index, I suppose it would be
harder to load and evict portions of it. So am I structuring my application
in a reasonable way, or is there a better way to go about it?
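
As a rough sketch of that "cache of small indices" shape, assuming one
SearcherManager per directory and simple LRU eviction (the class, names and
eviction policy here are illustrative, not part of Lucene):

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.lucene.search.SearcherManager;

    // LRU cache of per-partition SearcherManagers; evicted entries are closed.
    // In a real setup the matching IndexWriter and Directory would be closed too.
    class SearcherCache extends LinkedHashMap<String, SearcherManager> {
        private final int maxOpen;

        SearcherCache(int maxOpen) {
            super(16, 0.75f, true);              // access order, for LRU behavior
            this.maxOpen = maxOpen;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, SearcherManager> eldest) {
            if (size() > maxOpen) {
                try {
                    eldest.getValue().close();   // release the evicted reader
                } catch (IOException e) {
                    // log and keep going; eviction should not fail the caller
                }
                return true;
            }
            return false;
        }
    }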

Thank you in advance,

Aleksey





On Fri, Apr 26, 2013 at 3:46 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Batching the updates really ought to improve overall throughput.  Have you
> tried with even bigger batches (100, 1000 docs)?
>
> But, how large is each update?  Are you changing any IndexWriter settings,
> e.g. ramBufferSizeMB?
>
> Using threads should help too: at least a separate thread doing the indexing
> from the one calling SearcherManager.maybeRefresh (and separate threads doing
> the searching).
>
> You can also check out
> http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html
> where I go into some detail on speeding up indexing rate and refresh speed
> with near-real-time ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Apr 25, 2013 at 11:10 PM, Aleksey <bi...@gmail.com> wrote:
>
> > Hey guys,
> >
> > I'm new to Lucene and I was trying to estimate how fast I can make updates
> > to the index and reopen it. The behavior I'm seeing seems odd.
> > I'm using Lucene 4.2 and a SearcherManager instance constructed from an
> > IndexWriter.
> >
> > I run a loop that updates 1 document, then calls maybeRefresh, acquires a
> > new searcher, and runs a search to verify that the update is there. On my
> > laptop this does about 100 iterations per second.
> >
> > Then I run another loop that makes 10 updates before reopening the index,
> > and this only does 10 iterations per second, proportionally less. I was
> > expecting that batching the updates would give me higher overall
> > throughput, but that does not seem to be the case. The size of the index
> > I'm updating doesn't make a difference either: I tried 3K and 100K
> > document sets, 1-2 KB per doc, and both produce the same update speed
> > (though I'm not calling commit in these runs).
> >
> > Can anyone point me in the right direction to investigate this, or hint at
> > how to maximize write throughput to the index while still keeping the
> > delay in seeing the updates under 0.5 seconds?
> >
> > Thank you in advance,
> >
> > Aleksey
> >
>

Re: Optimizing NRT search

Posted by Michael McCandless <lu...@mikemccandless.com>.
Batching the updates really ought to improve overall throughput.  Have you
tried with even bigger batches (100, 1000 docs)?

But, how large is each update?  Are you changing any IndexWriter settings,
e.g. ramBufferSizeMB?

Using threads should help too: at least a separate thread doing the indexing
from the one calling SearcherManager.maybeRefresh (and separate threads doing
the searching).
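
A minimal sketch of those two suggestions together, i.e. a larger indexing RAM
buffer plus a dedicated refresh thread, assuming Lucene 4.x APIs (the 256 MB
buffer and the 100 ms refresh interval are only example values):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.util.Version;

    static SearcherManager startNrt(Directory dir) throws IOException {
        IndexWriterConfig iwc =
            new IndexWriterConfig(Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42));
        iwc.setRAMBufferSizeMB(256);                 // buffer more updates before flushing segments
        IndexWriter writer = new IndexWriter(dir, iwc);

        final SearcherManager manager = new SearcherManager(writer, true, null);

        // Dedicated refresh thread: indexing threads keep calling writer.updateDocument(...)
        // while this thread reopens the NRT searcher on a fixed cadence.
        Thread refresher = new Thread() {
            @Override public void run() {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        manager.maybeRefresh();
                        Thread.sleep(100);           // keep visibility delay well under 0.5 s
                    }
                } catch (Exception e) {
                    // interrupted or writer closed; stop refreshing
                }
            }
        };
        refresher.setDaemon(true);
        refresher.start();
        return manager;
    }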

You can also check out
http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html
where I go into some detail on speeding up indexing rate and refresh speed
with near-real-time ...

Mike McCandless

http://blog.mikemccandless.com


On Thu, Apr 25, 2013 at 11:10 PM, Aleksey <bi...@gmail.com> wrote:

> Hey guys,
>
> I'm new to Lucene and I was trying to estimate how fast I can make updates
> to the index and reopen it. The behavior I'm seeing seems odd.
> I'm using Lucene 4.2 and a SearcherManager instance constructed from an
> IndexWriter.
>
> I run a loop that updates 1 document, then calls maybeRefresh, acquires a
> new searcher, and runs a search to verify that the update is there. On my
> laptop this does about 100 iterations per second.
>
> Then I run another loop that makes 10 updates before reopening the index, and
> this only does 10 iterations per second, proportionally less. I was
> expecting that batching the updates would give me higher overall throughput,
> but that does not seem to be the case. The size of the index I'm updating
> doesn't make a difference either: I tried 3K and 100K document sets, 1-2 KB
> per doc, and both produce the same update speed (though I'm not calling commit
> in these runs).
>
> Can anyone point me in the right direction to investigate this, or hint at
> how to maximize write throughput to the index while still keeping the delay
> in seeing the updates under 0.5 seconds?
>
> Thank you in advance,
>
> Aleksey
>