Posted to solr-user@lucene.apache.org by Thomas Lamy <t....@cytainment.de> on 2015/01/07 11:01:29 UTC

leader split-brain at least once a day - need help

Hi there,

we are running a 3-server cloud serving a dozen 
single-shard/replicate-everywhere collections. The two biggest 
collections are ~15M docs and about 13 GiB / 2.5 GiB in size. Solr is 
4.10.2, ZK 3.4.5, Tomcat 7.0.56, Oracle Java 1.7.0_72-b14.

10 of the 12 collections (the small ones) get filled by a DIH 
full-import once a day, starting at 1am. The second biggest collection 
is updated using a DIH delta-import every 10 minutes, and the biggest 
one gets bulk JSON updates with commits every 5 minutes.
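
Such a bulk JSON update against the collection's JSON update handler 
looks roughly like this (host, port, collection name and field below 
are only placeholders):

    curl -s 'http://solrhost:8080/solr/big_collection/update/json?commitWithin=300000' \
         -H 'Content-Type: application/json' \
         --data-binary '[{"id":"doc-1","title_t":"example"}]'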

On a regular basis, we have a leader information mismatch:
org.apache.solr.update.processor.DistributedUpdateProcessor; Request 
says it is coming from leader, but we are the leader
or the opposite
org.apache.solr.update.processor.DistributedUpdateProcessor; 
ClusterState says we are the leader, but locally we don't think so

One of these pops up once a day at around 8am, sending either some 
cores into "recovery failed" state, or all cores of at least one cloud 
node into state "gone".
This started out of the blue about 2 weeks ago, without any changes to 
software, data, or client behaviour.

Most of the time we get things going again by restarting Solr on the 
current leader node, forcing a new election - can such an election be 
triggered while keeping Solr (and its caches) up?
But sometimes this doesn't help: we had an incident last weekend where 
our admins didn't restart in time, which created millions of entries in 
/solr/overseer/queue, made ZK close the connection, and caused the 
leader re-election to fail. I had to flush ZK and re-upload the 
collection configs to get Solr up again (just like in 
https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).

We have a much bigger cloud (7 servers, ~50 GiB of data in 8 
collections, 1500 requests/s) up and running, which has not had these 
problems since upgrading to 4.10.2.


Any hints on where to look for a solution?

Kind regards
Thomas

-- 
Thomas Lamy
Cytainment AG & Co KG
Nordkanalstrasse 52
20097 Hamburg

Tel.:     +49 (40) 23 706-747
Fax:     +49 (40) 23 706-139
Sitz und Registergericht Hamburg
HRA 98121
HRB 86068
Ust-ID: DE213009476


Re: leader split-brain at least once a day - need help

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/12/2015 5:34 AM, Thomas Lamy wrote:
> I found no big/unusual GC pauses in the log (at least manually; I
> found no free solution for analyzing them that worked out of the box
> on a headless Debian wheezy box). Eventually I tried -Xmx8G (it was
> 64G before) on one of the nodes, after checking that allocation was
> at about 2-3GB after 1 hour of run time. That didn't move the time
> frame in which a restart was needed, so I don't think Solr's JVM GC
> is the problem.
> We're now trying to get all of our nodes' logs (ZooKeeper and Solr)
> into Splunk, just to get a better sorted view of what's going on in
> the cloud once a problem occurs. We're also enabling GC logging for
> ZooKeeper; maybe we were missing problems there while focussing on
> Solr logs.

If you make a copy of the gc log, you can put it on another system with
a GUI and graph it with this:

http://sourceforge.net/projects/gcviewer

Just double-click on the jar to run the program.  I find it is useful
for clarity on the graph to go to the View menu and uncheck everything
except the two "GC Times" options.  You can also change the zoom to a
lower percentage so you can see more of the graph.

That program is how I got the graph you can see on my wiki page about GC
tuning:

http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
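
Note that GCViewer can only graph what the JVM actually writes out, so 
GC logging has to be enabled on the Solr JVM first. A minimal sketch 
for Oracle Java 7, assuming Solr runs under Tomcat and bin/setenv.sh is 
used (the log path is just a placeholder):

    # $CATALINA_HOME/bin/setenv.sh - picked up by catalina.sh on startup.
    # Write a detailed, timestamped GC log that GCViewer can read.
    CATALINA_OPTS="$CATALINA_OPTS -Xloggc:/var/log/solr/gc.log \
        -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
        -XX:+PrintGCApplicationStoppedTime"
    export CATALINA_OPTS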

Another possible problem is that your install is exhausting the thread
pool.  Tomcat defaults to a maxThreads value of only 200.  There's a
good chance that your setup will need more than 200 threads at least
occasionally.  If you're near the limit, having a thread problem once
per day based on index activity seems like a good possibility.  Try
setting maxThreads to 10000 in the Tomcat config.
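
For reference, maxThreads is an attribute of the Connector element in 
Tomcat's conf/server.xml; a minimal sketch (port and protocol should 
stay whatever your existing Connector already uses):

    <!-- conf/server.xml: raise the request-processing thread limit -->
    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               maxThreads="10000" />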

Thanks,
Shawn


Re: leader split-brain at least once a day - need help

Posted by Thomas Lamy <t....@cytainment.de>.
Hi Mark,

we're currently at 4.10.2; the update to 4.10.3 is scheduled for tomorrow.

T

On 12.01.15 at 17:30, Mark Miller wrote:
> bq. ClusterState says we are the leader, but locally we don't think so
>
> Generally this is due to some bug. One bug that can lead to it was recently
> fixed in 4.10.3 I think. What version are you on?
>
> - Mark
>


Re: leader split-brain at least once a day - need help

Posted by Mark Miller <ma...@gmail.com>.
bq. ClusterState says we are the leader, but locally we don't think so

Generally this is due to some bug. One bug that can lead to it was recently
fixed in 4.10.3 I think. What version are you on?

- Mark


Re: leader split-brain at least once a day - need help

Posted by Thomas Lamy <t....@cytainment.de>.
Hi,

I found no big/unusual GC pauses in the log (at least manually; I found 
no free solution for analyzing them that worked out of the box on a 
headless Debian wheezy box). Eventually I tried -Xmx8G (it was 64G 
before) on one of the nodes, after checking that allocation was at 
about 2-3GB after 1 hour of run time. That didn't move the time frame 
in which a restart was needed, so I don't think Solr's JVM GC is the 
problem.
We're now trying to get all of our nodes' logs (ZooKeeper and Solr) 
into Splunk, just to get a better sorted view of what's going on in the 
cloud once a problem occurs. We're also enabling GC logging for 
ZooKeeper; maybe we were missing problems there while focussing on 
Solr logs.

Thomas
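
As a sketch of what "enabling GC logging for zookeeper" can look like, 
assuming a stock ZooKeeper 3.4.x install where bin/zkEnv.sh sources 
conf/java.env if it exists (the log path is a placeholder):

    # conf/java.env - sourced by zkEnv.sh before the server starts.
    # JVMFLAGS ends up on the ZooKeeper server's java command line.
    export JVMFLAGS="$JVMFLAGS -Xloggc:/var/log/zookeeper/gc.log \
        -XX:+PrintGCDetails -XX:+PrintGCDateStamps"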




Re: leader split-brain at least once a day - need help

Posted by Yonik Seeley <yo...@heliosearch.com>.
It's worth noting that those messages alone don't necessarily signify
a problem with the system (and it wouldn't be called "split brain").
The async nature of updates (and thread scheduling), along with
stop-the-world GC pauses that can change leadership, causes these
little windows of inconsistency that we detect and log.

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data



Re: leader split-brain at least once a day - need help

Posted by Thomas Lamy <t....@cytainment.de>.
Hi Alan,
thanks for the pointer; I'll take a look at our GC logs.

On 07.01.2015 at 15:46, Alan Woodward wrote:
> I had a similar issue, which was caused by https://issues.apache.org/jira/browse/SOLR-6763.  Are you getting long GC pauses or similar before the leader mismatches occur?
>
> Alan Woodward
> www.flax.co.uk
>


Re: leader split-brain at least once a day - need help

Posted by Alan Woodward <al...@flax.co.uk>.
I had a similar issue, which was caused by https://issues.apache.org/jira/browse/SOLR-6763.  Are you getting long GC pauses or similar before the leader mismatches occur?

Alan Woodward
www.flax.co.uk
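
One rough way to answer that question on a headless box, assuming the 
gc log was written with -XX:+PrintGCDetails (so each collection ends 
with a "[Times: user=... sys=..., real=... secs]" line); the 5-second 
threshold is arbitrary:

    # Print every gc log line whose wall-clock ("real") time exceeds 5 seconds
    awk '{ for (i = 1; i <= NF; i++)
             if ($i ~ /^real=/) { split($i, t, "="); if (t[2] + 0 > 5) print } }' gc.log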




Re: leader split-brain at least once a day - need help

Posted by Ugo Matrangolo <ug...@gmail.com>.
Hi Thomas,

I have not run into these split brains (probably our use case is 
simpler), but we did get the spammed ZK phenomenon.

The easiest way to fix it is to:
1. Shut down all the Solr servers in the failing cluster
2. Connect to zk using its CLI
3. rmr /overseer/queue (prefix with your chroot if you use one, e.g. /solr/overseer/queue)
4. Restart Solr

I think this is way faster than the gist you posted.

Ugo
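
A sketch of that procedure, assuming ZooKeeper's bundled zkCli.sh and 
the /solr chroot mentioned earlier in the thread (host, port and 
install path are placeholders):

    # 1. Stop all Solr servers in the failing cluster, then:
    /opt/zookeeper/bin/zkCli.sh -server zkhost:2181
    # 2. Inside the zkCli shell, delete the overseer queue recursively:
    rmr /solr/overseer/queue
    quit
    # 3. Restart the Solr servers.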