Posted to solr-user@lucene.apache.org by jimtronic <ji...@gmail.com> on 2013/03/11 16:41:20 UTC

Some nodes have all the load

I was doing some rolling updates of my cluster ( 12 cores, 4 servers ) and I
ended up in a situation where one node was elected leader by all the cores.
This seemed very taxing to that one node. It was also still trying to serve
query requests so it slowed everything down. I'm trying to do a lot of
frequent atomic updates along with some periodic DIH syncs.

My "solution" to this situation was to try to take the supreme leader out of
the cluster and let the leader election start. This was not easy as there
was so much load on it, I couldn't take it out gracefully. Some of my cores
became unreachable for a while.

This was all under simulated load, but it made me nervous about a high-load
production situation.

I'm sure there's several things I'm doing wrong in all this, so I thought
I'd see what you guys think.

Jim




Re: Some nodes have all the load

Posted by Mark Miller <ma...@gmail.com>.
On Mar 11, 2013, at 5:52 PM, jimtronic <ji...@gmail.com> wrote:

> Should I omit commitWithin and set DIH to commit=false and just let soft and
> autocommit do their jobs?

Yeah, that's one valid option. You definitely are not able to keep up with the current commit / open-searcher rate. It looks like DIH will do a hard commit, which will likely open a new searcher as well - that's not good - you should stick to soft commits and the infrequent hard commit. On top of that, the commitWithin of 500ms is fairly aggressive. Whether or not you can keep up with this depends on a lot of factors, features, and settings - but clearly you are not currently able to.
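
For reference, a minimal solrconfig.xml sketch of that arrangement, using the 3 second soft commit and 5 minute hard commit intervals mentioned elsewhere in this thread (treat the exact values as starting points rather than recommendations):

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <!-- Hard commit: flushes segments to disk and truncates the transaction
         log, but does not open a new searcher. -->
    <autoCommit>
      <maxTime>300000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <!-- Soft commit: makes recent updates visible to searches without the
         cost of a hard commit. -->
    <autoSoftCommit>
      <maxTime>3000</maxTime>
    </autoSoftCommit>
  </updateHandler>

With something like this in place, individual update requests can drop commitWithin entirely and rely on the soft commit interval for visibility.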

- Mark


Re: Some nodes have all the load

Posted by jimtronic <ji...@gmail.com>.
Thanks guys!

I'm going to try the new commit settings and pull some of the cores out
into more logical clusters. I really was pounding it pretty hard -- very
curious to see what I could get away with. For example, I was logging all the
requests to a core, too. I'm not surprised I was hitting limits; I just
wanted to know what the limits might be.

SolrCloud is really an amazing solution.

On Mon, Mar 11, 2013 at 9:41 PM, Mark Miller <ma...@gmail.com> wrote:

>
> On Mar 11, 2013, at 7:47 PM, Shawn Heisey <so...@elyograg.org>
> wrote:
>
> >>
> >
> > I've just located a previous message on this list from Mark Miller saying
> that in Solr 4, commitWithin is a soft commit.
>
> Yes, that's true.
>
> >
> > You should definitely wait for Mark or another committer to verify what
> I'm saying in the small novel I am writing below.
> >
> > My personal opinion is that you should have frequent soft commits (auto,
> manual, commitWithin, or some combination) along with less frequent (but
> not infrequent) autoCommit with openSearcher=false.  The autoCommit (which
> is a hard commit) does two things - ensures that the transaction logs do
> not grow out of control, and persists changes to disk.  If you have auto
> soft commits and updateLog is enabled, I would say that you are pretty safe
> using commit=false on your DIH updates.
>
> Right.
>
> >
> > If Mark agrees with what I have said, and your config/schema checks out
> OK with expected norms, you may be running into bugs.  It might also be a
> case of not enough CPU/RAM resources for the system load.  You never
> responded in another thread with the output of the 'free' command, or the
> size of your indexes.  Putting 13 busy Solr cores onto one box is overkill,
> unless the machine has 16-32 CPU cores *and* plenty of fast RAM to cache
> all your indexes in the OS disk cache.  Based on what you're saying here
> and in the other thread, you probably need a java heap size of 4GB or 8GB,
> heavily tuned JVM garbage collection options, and depending on the size of
> your indexes, 16GB may not be enough total system RAM.
> >
> > IMHO, you should not use trunk (5.0) for anything that you plan to one
> day run in production.  Trunk is very volatile; large-scale changes
> sometimes get committed with only minimal testing.  The dev branch named
> branch_4x (currently 4.3) is kept reasonably stable almost all of the time.
> Version 4.2 has just been released - it is already available on the faster
> mirrors and there should be a release announcement within a day from now.
> >
> > If this is not being set up in anticipation for a production deployment,
> then trunk would be fine, but bugs are to be expected.  If the same
> problems do not happen in 4.2 or branch_4x, then I would move the
> discussion to the dev list.
> >
> > Thanks,
> > Shawn
> >
>
>
>





Re: Some nodes have all the load

Posted by Mark Miller <ma...@gmail.com>.
On Mar 11, 2013, at 7:47 PM, Shawn Heisey <so...@elyograg.org> wrote:

>> 
> 
> I've just located a previous message on this list from Mark Miller saying that in Solr 4, commitWithin is a soft commit.

Yes, that's true.

> 
> You should definitely wait for Mark or another committer to verify what I'm saying in the small novel I am writing below.
> 
> My personal opinion is that you should have frequent soft commits (auto, manual, commitWithin, or some combination) along with less frequent (but not infrequent) autoCommit with openSearcher=false.  The autoCommit (which is a hard commit) does two things - ensures that the transaction logs do not grow out of control, and persists changes to disk.  If you have auto soft commits and updateLog is enabled, I would say that you are pretty safe using commit=false on your DIH updates.

Right.

> 
> If Mark agrees with what I have said, and your config/schema checks out OK with expected norms, you may be running into bugs.  It might also be a case of not enough CPU/RAM resources for the system load.  You never responded in another thread with the output of the 'free' command, or the size of your indexes.  Putting 13 busy Solr cores onto one box is overkill, unless the machine has 16-32 CPU cores *and* plenty of fast RAM to cache all your indexes in the OS disk cache.  Based on what you're saying here and in the other thread, you probably need a java heap size of 4GB or 8GB, heavily tuned JVM garbage collection options, and depending on the size of your indexes, 16GB may not be enough total system RAM.
> 
> IMHO, you should not use trunk (5.0) for anything that you plan to one day run in production.  Trunk is very volatile; large-scale changes sometimes get committed with only minimal testing.  The dev branch named branch_4x (currently 4.3) is kept reasonably stable almost all of the time.  Version 4.2 has just been released - it is already available on the faster mirrors and there should be a release announcement within a day from now.
> 
> If this is not being set up in anticipation for a production deployment, then trunk would be fine, but bugs are to be expected.  If the same problems do not happen in 4.2 or branch_4x, then I would move the discussion to the dev list.
> 
> Thanks,
> Shawn
> 


Re: Some nodes have all the load

Posted by Shawn Heisey <so...@elyograg.org>.
On 3/11/2013 3:52 PM, jimtronic wrote:
> The load test was fairly heavy (ie lots of users) and designed to mimic a
> fully operational system with lots of users doing normal things.
>
> There were two things I gleaned from the logs:
>
> PERFORMANCE WARNING: Overlapping onDeckSearchers=2 appeared for several of
> my more active cores
>
> and
>
> The non-leaders were throwing errors saying that the leader was not
> responding while trying to forward updates. (sorry, can't find that specific
> error now)
>
> My best guess is that it has something to do with the commits.
>
>   a. frequent user generated writes using
> /update?commitWithin=500&waitFlush=false&waitSearcher=false
>   b. softCommit set to 3000
>   c. autoCommit set to 300,000 and openSearcher false
>   d. I'm also doing frequent periodic DIH updates. I guess this is
> commit=true by default.
>
> Should I omit commitWithin and set DIH to commit=false and just let soft and
> autocommit do their jobs?

I've just located a previous message on this list from Mark Miller saying 
that in Solr 4, commitWithin is a soft commit.

You should definitely wait for Mark or another committer to verify what 
I'm saying in the small novel I am writing below.

My personal opinion is that you should have frequent soft commits (auto, 
manual, commitWithin, or some combination) along with less frequent (but 
not infrequent) autoCommit with openSearcher=false.  The autoCommit 
(which is a hard commit) does two things - ensures that the transaction 
logs do not grow out of control, and persists changes to disk.  If you 
have auto soft commits and updateLog is enabled, I would say that you 
are pretty safe using commit=false on your DIH updates.
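
As an illustration, a DIH import kicked off with commit=false might look like the request below, assuming the handler is registered at the usual /dataimport path (host, port, and core name are placeholders; the same parameter applies to delta-import):

  http://localhost:8983/solr/mycore/dataimport?command=full-import&commit=false

DIH then skips its own hard commit at the end of the import and leaves durability to autoCommit and visibility to the soft commits.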

If Mark agrees with what I have said, and your config/schema checks out 
OK with expected norms, you may be running into bugs.  It might also be 
a case of not enough CPU/RAM resources for the system load.  You never 
responded in another thread with the output of the 'free' command, or 
the size of your indexes.  Putting 13 busy Solr cores onto one box is 
overkill, unless the machine has 16-32 CPU cores *and* plenty of fast 
RAM to cache all your indexes in the OS disk cache.  Based on what 
you're saying here and in the other thread, you probably need a java 
heap size of 4GB or 8GB, heavily tuned JVM garbage collection options, 
and depending on the size of your indexes, 16GB may not be enough total 
system RAM.
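
On the heap and GC side, a hedged example of the kind of startup flags being described - a 4GB heap with CMS collection, which was a common starting point for Solr 4 on Java 6/7; the right numbers depend entirely on your indexes and query load:

  java -Xms4g -Xmx4g \
       -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
       -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly \
       -jar start.jar

Whatever is left over after the heap is what the OS can use to cache the index files, which is why 16GB of total RAM may still not be enough.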

IMHO, you should not use trunk (5.0) for anything that you plan to one 
day run in production.  Trunk is very volatile; large-scale changes 
sometimes get committed with only minimal testing.  The dev branch named 
branch_4x (currently 4.3) is kept reasonably stable almost all of the 
time.  Version 4.2 has just been released - it is already available on 
the faster mirrors and there should be a release announcement within a 
day from now.

If this is not being set up in anticipation for a production deployment, 
then trunk would be fine, but bugs are to be expected.  If the same 
problems do not happen in 4.2 or branch_4x, then I would move the 
discussion to the dev list.
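
If you do want to run branch_4x rather than a release, it can be checked out from the project's Subversion repository and built with ant; the path below is the one in use around the time of this thread:

  svn checkout https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x lucene-solr-4x
  cd lucene-solr-4x/solr
  ant example

That produces a runnable example directory just like the released packages.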

Thanks,
Shawn


Re: Some nodes have all the load

Posted by jimtronic <ji...@gmail.com>.
The load test was fairly heavy (ie lots of users) and designed to mimic a
fully operational system with lots of users doing normal things.

There were two things I gleaned from the logs:

PERFORMANCE WARNING: Overlapping onDeckSearchers=2 appeared for several of
my more active cores

and

The non-leaders were throwing errors saying that the leader was not
responding while trying to forward updates. (sorry, can't find that specific
error now)

My best guess is that it has something to do with the commits.

 a. frequent user generated writes using
/update?commitWithin=500&waitFlush=false&waitSearcher=false
 b. softCommit set to 3000
 c. autoCommit set to 300,000 and openSearcher false
 d. I'm also doing frequent periodic DIH updates. I guess this is
commit=true by default.

Should I omit commitWithin and set DIH to commit=false and just let soft and
autocommit do their jobs?

Cheers,
Jim






Re: Some nodes have all the load

Posted by Mark Miller <ma...@gmail.com>.
There is an open JIRA issue about trying to "spread the leader load" during elections. I was waiting for reports that it was really a problem for someone, though.

How much load were you putting on? How long were the nodes unresponsive? Unresponsive to everything? Just updates? Searches? What version of Solr? How many shards do you have? Collections?

- Mark

On Mar 11, 2013, at 11:41 AM, jimtronic <ji...@gmail.com> wrote:

> I was doing some rolling updates of my cluster ( 12 cores, 4 servers ) and I
> ended up in a situation where one node was elected leader by all the cores.
> This seemed very taxing to that one node. It was also still trying to serve
> query requests so it slowed everything down. I'm trying to do a lot of
> frequent atomic updates along with some periodic DIH syncs.
> 
> My "solution" to this situation was to try to take the supreme leader out of
> the cluster and let the leader election start. This was not easy as there
> was so much load on it, I couldn't take it out gracefully. Some of my cores
> became unreachable for a while.
> 
> This was all under simulated load, but it made me nervous about a high-load
> production situation.
> 
> I'm sure there's several things I'm doing wrong in all this, so I thought
> I'd see what you guys think.
> 
> Jim
> 
> 
> 