Posted to solr-user@lucene.apache.org by Stephen Delano <st...@gmail.com> on 2013/11/05 08:15:11 UTC

Solr 1.4 - Performance Issues

Hi all,

I wanted to share the issues we're having with Solr 1.4 and get some ideas
for short-term measures that will buy us enough time to validate Solr 4
before upgrading, without 1.4 burning to the ground before we get there.

We've been running Solr 1.4 in production for over 3 years now, but are
really starting to hit some performance bottlenecks that are beginning to
affect our users. Here are the details of our setup:

We're running 2 4-CPU Solr servers. The data is on a 4-disk RAID 10 array
and we're using block-level replication via DRBD over GigE to write to the
standby node. Only one server is serving traffic at a time.

Some tuning information:
- Merge Factor: 25
- Auto Commit: 60s / 1000 docs
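
For reference, that maps to roughly the following in our solrconfig.xml
(paraphrased from memory, so the exact layout may differ slightly):

    <!-- indexDefaults / mainIndex -->
    <mergeFactor>25</mergeFactor>

    <!-- inside <updateHandler class="solr.DirectUpdateHandler2"> -->
    <autoCommit>
      <maxDocs>1000</maxDocs>
      <maxTime>60000</maxTime> <!-- 60s, in milliseconds -->
    </autoCommit>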

What we're seeing:
In roughly 14-hour cycles, CPU usage climbs from 100% to between 200%
and 250%. At the end of each cycle we get one long commit of roughly 500
seconds, blocking all writes. Around the same time, queries become very
slow, often causing timeouts for connecting clients. The cycle repeats and
is getting progressively worse.

What is this, and what can we do about it?

I've attached relevant graphs. Apologies in advance for the obscenely large
image sizes.

Cheers,
Stephen

 client-requests-2.png<https://docs.google.com/file/d/0B7_6ZI9PZjjUN1lhd1hfSE9Jc2M/edit?usp=drive_web>

 cpu-usage.png<https://docs.google.com/file/d/0B7_6ZI9PZjjUSHpsY1B2T01iVGM/edit?usp=drive_web>

 disk-ios-2.png<https://docs.google.com/file/d/0B7_6ZI9PZjjUNEpkMGRkR3dhYVk/edit?usp=drive_web>

 mem-usage-2.png<https://docs.google.com/file/d/0B7_6ZI9PZjjUWnFVZlU3aUxYNXc/edit?usp=drive_web>

 tcp-connections-2.png<https://docs.google.com/file/d/0B7_6ZI9PZjjUYmdvMmpDSlVvQUE/edit?usp=drive_web>


Re: Solr 1.4 - Performance Issues

Posted by Erick Erickson <er...@gmail.com>.
1.4 is ancient, but you know that already :)....

Anyway, what are your autocommit settings?
That vintage of Solr blocks indexing when committing,
which may include rewriting the entire index.
So part of your regular slowdown is
likely segment merging happening with the
commit. The 14 hour cycle is a bit weird, though. One
thing I'd be curious about: when that happens, look at
your index on disk and see whether you've merged
down to just a few (or one) segment. One possible
explanation is that roughly that often, the merge that happens
rewrites the entire index and that takes 500 seconds. If that's
true, you should see a few massive segments in your index
right after it happens.
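
Something like this (adjust the data dir for your install; the path here
is just a guess) will show whether you're down to a handful of big
segment files:

    # list index files, largest first; right after a big merge you should
    # see a few huge files rather than many small ones
    ls -lhS /path/to/solr/data/index | head -20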

I'm assuming your autocommit settings aren't, like, 14 hours.....

Does anything issue an optimize command? That will also block
updates until it rewrites the entire index.

I don't know of a good stop-gap though. Even a master/slave
setup would still have this problem on the master. You might be able
to do something like stopping the indexing process, issuing a manual
optimize, and then starting indexing up again. About all that would
do, though, is make the slowdowns predictable.
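
If you go that route, the optimize itself is just a POST to the update
handler, something like this (host and core URL are a guess for your setup):

    # pause the indexer first, then:
    curl http://localhost:8983/solr/update \
         -H 'Content-Type: text/xml' \
         --data-binary '<optimize/>'
    # resume indexing once the call returns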

Not much help I know. Here's a writeup though:

http://blog.trifork.com/2011/04/01/gimme-all-resources-you-have-i-can-use-them/

Best,
Erick

