Posted to java-user@lucene.apache.org by Arvind Kalyan <ba...@gmail.com> on 2014/03/05 10:18:07 UTC

Lucene 4 single segment performance improvement tips?

Hi folks,

We are currently using Lucene 4.5 and we are hitting some bottlenecks and
appreciate some input from the community.

This particular index (the disk size for which is about 10GB) is guaranteed
to not have any updates, so we made it a single segment index by doing a
forceMerge(1). The index is guaranteed to be in-memory as well: we use the
MMapDirectory and the whole thing is mlocked after load. So there is no
disk I/O.
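
For reference, the build and open path is roughly the sketch below
(simplified; the path and analyzer are placeholders, Lucene imports are
omitted, and the mlock step is done outside Java at the OS level, so it
isn't shown):

    // build once, collapse to a single segment, never update again
    Directory writeDir = FSDirectory.open(new File("/data/index"));
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45,
        new StandardAnalyzer(Version.LUCENE_45));
    IndexWriter writer = new IndexWriter(writeDir, iwc);
    // ... addDocument() calls ...
    writer.forceMerge(1);
    writer.close();

    // at search time, mmap the whole index and share one searcher
    Directory readDir = new MMapDirectory(new File("/data/index"));
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(readDir));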

Our runtime/search use-case is very simple: run filters to select all docs
that match some conditions specified in a filter query (we do not use
Lucene scoring) and return the first 100 docs that match (this is an
over-simplification)
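
For concreteness, a typical search is shaped roughly like this (field and
term names are made up for illustration; Lucene imports omitted):

    // no scoring, just a filter; return the first 100 matching docs
    Filter filter = new QueryWrapperFilter(
        new TermQuery(new Term("category", "books")));
    TopFieldDocs hits = searcher.search(new MatchAllDocsQuery(), filter, 100,
        Sort.INDEXORDER);
    for (ScoreDoc sd : hits.scoreDocs) {
      Document doc = searcher.doc(sd.doc);
      // ... read stored fields ...
    }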

On a machine with nothing else running, we are unable to move the needle on
CPU utilization to serve higher QPS. We see that most of the time is spent
in BlockTreeTermsReader.FieldReader.iterator() when we run profiling tools
to see where time is being spent. The CPU usage doesn't cross 30% (we have
multiple threads, one per client connected over a Jetty connection, all
taken from a bounded thread-pool). We tried the usual suspects like
tweaking the size of the thread-pool and changing some JVM parameters: new
size, heap size, CMS for the old gen, ParNew for the new gen, etc.

Does anyone here have any pointers or general suggestions on how we can get good
performance out of Lucene 4.x? Specifically IndexSearcher performance
improvements for large, single-segment, atomicreaders.

I'll share more specifics if necessary but I'd like to hear from folks here
what your experience has been and what you did to speed up your
IndexSearchers to improve throughput *and/or* latency.

Thanks!

-- 
Arvind Kalyan
http://www.linkedin.com/in/base16

Re: Lucene 4 single segment performance improvement tips?

Posted by Arvind Kalyan <ba...@gmail.com>.
Thanks Mike, good idea. We have a pretty thick stack and I had pared it down
to jetty+lucene, thinking that was barebones enough, but good call on running
it purely on Lucene. I'll see if that moves the needle (hopefully it does).


On Wed, Mar 5, 2014 at 4:25 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> What sorts of queries are you running?  It seems like they must be
> very terms-dict intensive, e.g. primary key lookups or multi-term
> queries, and maybe not matching too many documents?
>
> It's strange you can't get CPU usage up, as you add threads.  Maybe
> simplify the test to remove Jetty?  Ie, a standalone test just
> invoking Lucene APIs directly using multiple threads.
>
> Does the profiler reveal any hot locks, where threads are having to
> wait to acquire the lock?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Mar 5, 2014 at 4:18 AM, Arvind Kalyan <ba...@gmail.com> wrote:
> > Hi folks,
> >
> > We are currently using Lucene 4.5 and we are hitting some bottlenecks and
> > appreciate some input from the community.
> >
> > This particular index (the disk size for which is about 10GB) is
> guaranteed
> > to not have any updates, so we made it a single segment index by doing a
> > forceMerge(1). The index is guaranteed to be in-memory as well: we use
> the
> > MMapDirectory and the whole thing is mlocked after load. So there is no
> > disk I/O.
> >
> > Our runtime/search use-case is very simple: run filters to select all
> docs
> > that match some conditions specified in a filter query (we do not use
> > Lucene scoring) and return the first 100 docs that match (this is an
> > over-simplification)
> >
> > On a machine with nothing else running, we are unable to move the needle
> on
> > CPU utilization to serve higher QPS. We see that most of the time is
> spent
> > in BlockTreeTermsReader.FieldReader.iterator() when we run profiling
> tools
> > to see where time is being spent. The CPU usage doesn't cross 30% (we
> have
> > multiple threads one per each client connected over a Jetty connection
> all
> > taken from a bounded thread-pool). We tried the usual suspects like
> > tweaking size of the threadpool, changing some jvm parameters like
> newsize,
> > heapsize, using cms for old gen, parnew for newgen, etc.
> >
> > Does anyone here have any pointers or general suggestions on how we can get
> good
> > performance out of Lucene 4.x? Specifically IndexSearcher performance
> > improvements for large, single-segment, atomicreaders.
> >
> > I'll share more specifics if necessary but I'd like to hear from folks
> here
> > what your experience has been and what you did to speed up your
> > IndexSearchers to improve throughput *and/or* latency.
> >
> > Thanks!
> >
> > --
> > Arvind Kalyan
> > http://www.linkedin.com/in/base16
>


-- 
Arvind Kalyan
http://www.linkedin.com/in/base16
cell: (408) 761-2030

Re: Lucene 4 single segment performance improvement tips?

Posted by Michael McCandless <lu...@mikemccandless.com>.
What sorts of queries are you running?  It seems like they must be
very terms-dict intensive, e.g. primary key lookups or multi-term
queries, and maybe not matching too many documents?

It's strange you can't get CPU usage up, as you add threads.  Maybe
simplify the test to remove Jetty?  Ie, a standalone test just
invoking Lucene APIs directly using multiple threads.
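
Something along these lines would do it (just a sketch; the index path and
the query are stand-ins for whatever you actually run):

    import java.io.File;
    import java.util.concurrent.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.*;

    public class SearchBench {
      public static void main(String[] args) throws Exception {
        Directory dir = new MMapDirectory(new File("/data/index"));
        final IndexSearcher searcher =
            new IndexSearcher(DirectoryReader.open(dir));
        final int nThreads = 16;            // vary this, watch CPU and QPS
        final int queriesPerThread = 10000;
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        long start = System.nanoTime();
        for (int i = 0; i < nThreads; i++) {
          pool.submit(new Callable<Void>() {
            public Void call() throws Exception {
              for (int j = 0; j < queriesPerThread; j++) {
                // stand-in query; replay your real filters here
                searcher.search(new TermQuery(new Term("category", "books")), 100);
              }
              return null;
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        double sec = (System.nanoTime() - start) / 1e9;
        System.out.println(nThreads * queriesPerThread / sec + " queries/sec");
      }
    }

If CPU scales as you add threads there, the bottleneck is above Lucene; if
it still tops out, the profiler output from this bare test will be much
easier to interpret.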

Does the profiler reveal any hot locks, where threads are having to
wait to acquire the lock?

Mike McCandless

http://blog.mikemccandless.com


On Wed, Mar 5, 2014 at 4:18 AM, Arvind Kalyan <ba...@gmail.com> wrote:
> Hi folks,
>
> We are currently using Lucene 4.5 and we are hitting some bottlenecks and
> appreciate some input from the community.
>
> This particular index (the disk size for which is about 10GB) is guaranteed
> to not have any updates, so we made it a single segment index by doing a
> forceMerge(1). The index is guaranteed to be in-memory as well: we use the
> MMapDirectory and the whole thing is mlocked after load. So there is no
> disk I/O.
>
> Our runtime/search use-case is very simple: run filters to select all docs
> that match some conditions specified in a filter query (we do not use
> Lucene scoring) and return the first 100 docs that match (this is an
> over-simplification)
>
> On a machine with nothing else running, we are unable to move the needle on
> CPU utilization to serve higher QPS. We see that most of the time is spent
> in BlockTreeTermsReader.FieldReader.iterator() when we run profiling tools
> to see where time is being spent. The CPU usage doesn't cross 30% (we have
> multiple threads one per each client connected over a Jetty connection all
> taken from a bounded thread-pool). We tried the usual suspects like
> tweaking size of the threadpool, changing some jvm parameters like newsize,
> heapsize, using cms for old gen, parnew for newgen, etc.
>
> Does anyone here have any pointers or general suggestions on how we can get good
> performance out of Lucene 4.x? Specifically IndexSearcher performance
> improvements for large, single-segment, atomicreaders.
>
> I'll share more specifics if necessary but I'd like to hear from folks here
> what your experience has been and what you did to speed up your
> IndexSearchers to improve throughput *and/or* latency.
>
> Thanks!
>
> --
> Arvind Kalyan
> http://www.linkedin.com/in/base16



Re: Lucene 4 single segment performance improvement tips?

Posted by Arvind Kalyan <ba...@gmail.com>.
On Wed, Mar 5, 2014 at 8:14 AM, Chris Hostetter <ho...@fucit.org> wrote:

> : Our runtime/search use-case is very simple: run filters to select all
> docs
> : that match some conditions specified in a filter query (we do not use
> : Lucene scoring) and return the first 100 docs that match (this is an
> : over-simplification)
>
> "first" as defined how? in order collected by a custom collector, or via
> some sort?
>


We sort the docs in some order and freeze the single segment index.
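
(For concreteness, it is conceptually along these lines; the writer setup,
docs list, and comparator below are placeholders rather than our exact code:)

    // Add documents in the chosen order, keep segments in insertion order,
    // then collapse to one segment: docID order == sort order, so "first
    // 100" just means the 100 lowest docIDs that pass the filter.
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45,
        new KeywordAnalyzer());
    iwc.setMergePolicy(new LogByteSizeMergePolicy()); // preserves add order
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/data/index")), iwc);
    Collections.sort(docs, ourSortOrder);  // docs, ourSortOrder: placeholders
    for (Document d : docs) {
      writer.addDocument(d);
    }
    writer.forceMerge(1);
    writer.close();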




>
> : On a machine with nothing else running, we are unable to move the needle
> on
> : CPU utilization to serve higher QPS. We see that most of the time is
> spent
> : in BlockTreeTermsReader.FieldReader.iterator() when we run profiling
> tools
> : to see where time is being spent. The CPU usage doesn't cross 30% (we
> have
> : multiple threads one per each client connected over a Jetty connection
> all
> : taken from a bounded thread-pool). We tried the usual suspects like
> : tweaking size of the threadpool, changing some jvm parameters like
> newsize,
> : heapsize, using cms for old gen, parnew for newgen, etc.
>
> You said you have one thread per client, but you didn't mention anything
> about varying the number of clients -- did you try increasing the number
> of clients hitting your application concurrently?  It's possible that your
> box is "beefy" enough that 30% of the available CPU is all that's needed
> for the number of active concurrent threads you are using (increasing the
> size of the threadpool isn't going to affect anything if there aren't more
> clients utilizing those threads)
>
>

Yes, the number of threads is bounded (we varied this to see how things
change), and we increased the QPS from the client side. The client requests
essentially pile up and we do not go beyond 300 QPS. The fact that we are
unable to go beyond that QPS while still not utilizing more than 30% CPU is
what's concerning. No monitors/locks come up during profiling either; only
ReferenceQueue.poll() <
http://docs.oracle.com/javase/7/docs/api/java/lang/ref/ReferenceQueue.html>
comes up. There's still enough memory available in the heap (we allocated a
6 GB heap, 2 GB newgen, 1:8 survivor ratio, 70% CMS threshold, ParNew GC) and
the machine has 64 GB of RAM.
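
(Those settings correspond to JVM flags roughly like the following; this is
from memory and the jar name is a stand-in, so it may not match our exact
command line:)

    java -Xms6g -Xmx6g -Xmn2g -XX:SurvivorRatio=8 \
         -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
         -XX:CMSInitiatingOccupancyFraction=70 \
         -jar search-server.jar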

I'm going to repeat the experiment with just Lucene (and no Jetty) as Mike
suggested, but meanwhile if any of you have any other pointers, that would
be great.

Thanks

-- 
Arvind Kalyan
http://www.linkedin.com/in/base16

Re: Lucene 4 single segment performance improvement tips?

Posted by Chris Hostetter <ho...@fucit.org>.
: Our runtime/search use-case is very simple: run filters to select all docs
: that match some conditions specified in a filter query (we do not use
: Lucene scoring) and return the first 100 docs that match (this is an
: over-simplification)

"first" as defined how? in order collected by a custom collector, or via 
some sort?

: On a machine with nothing else running, we are unable to move the needle on
: CPU utilization to serve higher QPS. We see that most of the time is spent
: in BlockTreeTermsReader.FieldReader.iterator() when we run profiling tools
: to see where time is being spent. The CPU usage doesn't cross 30% (we have
: multiple threads one per each client connected over a Jetty connection all
: taken from a bounded thread-pool). We tried the usual suspects like
: tweaking size of the threadpool, changing some jvm parameters like newsize,
: heapsize, using cms for old gen, parnew for newgen, etc.

You said you have one thread per client, but you didn't mention anything 
about varying the number of clients -- did you try increasing the number 
of clients hitting your application concurrently?  It's possible that your 
box is "beefy" enough that 30% of the available CPU is all that's needed 
for the number of active concurrent threads you are using (increasing the
size of the threadpool isn't going to affect anything if there aren't more 
clients utilizing those threads)


-Hoss
http://www.lucidworks.com/
