Posted to solr-user@lucene.apache.org by Martin de Vries <ma...@downnotifier.com> on 2013/11/12 09:45:25 UTC

SolrCloud unstable

Hi,

We have:

Solr 4.5.1 - 5 servers
36 cores, 2 shards each, 2 servers per shard (every core is on 4 
servers)
about 4.5 GB total data on disk per server
4GB JVM-Memory per server, 3GB average in use
Zookeeper 3.3.5 - 3 servers (one shared with Solr)
haproxy load balancing
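
For reference, a SolrJ client can also reach a cluster like this directly
through ZooKeeper instead of going through haproxy, so it only sends requests
to replicas that ZooKeeper reports as live. A minimal sketch against the
SolrJ 4.x API; the ZooKeeper addresses and collection name are placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudClientSketch {
        public static void main(String[] args) throws Exception {
            // ZooKeeper ensemble and collection name are placeholders
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("dntest");
            server.setZkClientTimeout(15000);  // ZK session timeout (ms)
            server.setZkConnectTimeout(15000); // initial ZK connect timeout (ms)

            // Index one document and commit
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "example-1");
            server.add(doc);
            server.commit();

            // Simple match-all query
            System.out.println(server.query(new SolrQuery("*:*")).getResults().getNumFound());
            server.shutdown();
        }
    }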

Our SolrCloud is very unstable. About once a week some cores go into
recovery or down state. Many timeouts occur and we have to restart
servers to get them back to work. Failover doesn't work in many cases,
because one server has the core in down state and the other in
recovering state. Other cores work fine. When the cloud is stable I
sometimes see log messages like:
- shard update error StdNode: 
http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException: 
IOException occured when talking to server at: 
http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
- forwarding update to 
http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed - 
retrying ...
- null:ClientAbortException: java.io.IOException: Broken pipe

Before the cloud problems start there are many large QTimes in the
log (sometimes over 50 seconds), but there are no other errors until
the recovery problems start.


Any clue about what can be wrong?


Kind regards,

Martin

Re: SolrCloud unstable

Posted by Yago Riveiro <ya...@gmail.com>.
Hi Martin,

I see the same behaviour you are describing, with a setup that is pretty similar.

6 machines, ~50 shards with a replicationFactor of two.

The most critical issue IMHO is that failover doesn't work when one node is down and the other is in recovery mode.

In the log I can see one replica saying "My last state is recovering, so I won't be the leader" and the other saying "ZooKeeper says I'm the leader, but internally I don't think so".

I don't know yet why in some situations 2 or 3 replicas go into recovery mode at the same time. I have a high indexing rate (>500 docs/s).

I moved ZooKeeper to its own disk to make sure disk latency is not the problem.

In my case adding another replica per shard is too expensive: my setup uses SSDs, and setting up a new replica takes more memory and resources.
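
For the down/recovering confusion it can help to dump what ZooKeeper actually
reports for each replica. A hedged sketch using the SolrJ 4.x cluster-state
classes; the ZooKeeper addresses and collection name are placeholders:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class ReplicaStateDump {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.connect(); // connect to ZK and pull the current cluster state
            ClusterState state = server.getZkStateReader().getClusterState();
            for (Slice slice : state.getSlices("dn")) { // collection name is a placeholder
                for (Replica replica : slice.getReplicas()) {
                    // Print shard name, core node name and the state ZK reports
                    System.out.println(slice.getName() + " / " + replica.getName()
                            + " -> " + replica.getStr(ZkStateReader.STATE_PROP));
                }
            }
            server.shutdown();
        }
    }

Comparing this output with /clusterstate.json during an incident shows whether
both replicas of a shard really are down/recovering at the same time.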


-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Tuesday, November 12, 2013 at 8:45 AM, Martin de Vries wrote:

> Hi,
> 
> We have:
> 
> Solr 4.5.1 - 5 servers
> 36 cores, 2 shards each, 2 servers per shard (every core is on 4 
> servers)
> about 4.5 GB total data on disk per server
> 4GB JVM-Memory per server, 3GB average in use
> Zookeeper 3.3.5 - 3 servers (one shared with Solr)
> haproxy load balancing
> 
> Our SolrCloud is very unstable. About once a week some cores go into
> recovery or down state. Many timeouts occur and we have to restart
> servers to get them back to work. Failover doesn't work in many cases,
> because one server has the core in down state and the other in
> recovering state. Other cores work fine. When the cloud is stable I
> sometimes see log messages like:
> - shard update error StdNode: 
> http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException: 
> IOException occured when talking to server at: 
> http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
> - forwarding update to 
> http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed - 
> retrying ...
> - null:ClientAbortException: java.io.IOException: Broken pipe
> 
> Before the cloud problems start there are many large QTimes in the
> log (sometimes over 50 seconds), but there are no other errors until
> the recovery problems start.
> 
> 
> Any clue about what can be wrong?
> 
> 
> Kind regards,
> 
> Martin 


Re: SolrCloud unstable

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi Martin;

You can change your Java version from 1.6 to 1.7u25 and test again to
see whether the problem is related to the Java version.
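
To confirm which Java version and VM each node actually runs, a trivial
check (the same information should also be visible on the Solr admin
dashboard):

    public class JavaVersionCheck {
        public static void main(String[] args) {
            // Prints e.g. "1.7.0_25 / Java HotSpot(TM) 64-Bit Server VM"
            System.out.println(System.getProperty("java.version") + " / "
                    + System.getProperty("java.vm.name"));
        }
    }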

Thanks;
Furkan KAMACI


2013/11/24 Lance Norskog <go...@gmail.com>

> Yes, you should use a recent Java 7. Java 6 is end-of-life and no longer
> supported by Oracle. Also, read up on the various garbage collectors. It is
> a complex topic and there are many guides online.
>
> In particular there is a problem in some Java 6 releases that causes a
> massive memory leak in Solr. The symptom is that memory use oscillates
> (normally) from, say, 1GB to 2GB. After the bug triggers, the ceiling of 2GB
> becomes the floor, and memory use oscillates from 2GB to 3GB. I'm not
> saying this is the problem you have. I'm just saying that it is important
> to read up on garbage collection.
>
> Lance
>
>
> On 11/22/2013 05:27 AM, Martin de Vries wrote:
>
>>
>> We did some more monitoring and have some new information:
>>
>> Before the issue happens, the garbage collector's "collection count"
>> increases a lot. The increase seems to start about an hour before the
>> real problem occurs:
>>
>> http://www.analyticsforapplications.com/GC.png [1]
>>
>> We tried both the G1 garbage collector and the regular one; the problem
>> happens with both of them.
>>
>> We use Java 1.6 on some servers. Will Java 1.7 be better?
>>
>> Martin
>>
>> Martin de Vries schreef op 12.11.2013 10:45:
>>
>>> Hi,
>>>
>>> We have:
>>>
>>> Solr 4.5.1 - 5 servers
>>> 36 cores, 2 shards each, 2 servers per shard (every core is on 4 servers)
>>> about 4.5 GB total data on disk per server
>>> 4GB JVM-Memory per server, 3GB average in use
>>> Zookeeper 3.3.5 - 3 servers (one shared with Solr)
>>> haproxy load balancing
>>>
>>> Our SolrCloud is very unstable. About once a week some cores go into
>>> recovery or down state. Many timeouts occur and we have to restart
>>> servers to get them back to work. Failover doesn't work in many cases,
>>> because one server has the core in down state and the other in
>>> recovering state. Other cores work fine. When the cloud is stable I
>>> sometimes see log messages like:
>>> - shard update error StdNode:
>>> http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException:
>>> IOException occured when talking to server at:
>>> http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
>>> - forwarding update to
>>> http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed -
>>> retrying ...
>>> - null:ClientAbortException: java.io.IOException: Broken pipe
>>>
>>> Before the cloud problems start there are many large QTimes in the
>>> log (sometimes over 50 seconds), but there are no other errors until
>>> the recovery problems start.
>>>
>>> Any clue about what can be wrong?
>>>
>>> Kind regards,
>>>
>>> Martin
>>
>> Links:
>> ------
>> [1] http://www.analyticsforapplications.com/GC.png
>>
>

Re: SolrCloud unstable

Posted by Lance Norskog <go...@gmail.com>.
Yes, you should use a recent Java 7. Java 6 is end-of-life and no longer 
supported by Oracle. Also, read up on the various garbage collectors. It 
is a complex topic and there are many guides online.

In particular there is a problem in some Java 6 releases that causes a
massive memory leak in Solr. The symptom is that memory use oscillates
(normally) from, say, 1GB to 2GB. After the bug triggers, the ceiling of
2GB becomes the floor, and memory use oscillates from 2GB to 3GB. I'm
not saying this is the problem you have. I'm just saying that it is
important to read up on garbage collection.
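
One way to watch for that kind of rising floor is to sample heap usage over
time with the standard MemoryMXBean. A minimal sketch; in practice these
numbers are usually read over JMX from the Solr process, and the sampling
interval is arbitrary:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapWatcher {
        public static void main(String[] args) throws InterruptedException {
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            while (true) {
                MemoryUsage heap = mem.getHeapMemoryUsage();
                // Watch whether the low points of "used" keep climbing over hours
                System.out.println("heap used=" + (heap.getUsed() >> 20) + "MB"
                        + " committed=" + (heap.getCommitted() >> 20) + "MB"
                        + " max=" + (heap.getMax() >> 20) + "MB");
                Thread.sleep(60000L); // one sample per minute
            }
        }
    }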

Lance

On 11/22/2013 05:27 AM, Martin de Vries wrote:
>   
>
> We did some more monitoring and have some new information:
>
> Before the issue happens, the garbage collector's "collection count"
> increases a lot. The increase seems to start about an hour before the
> real problem occurs:
>
> http://www.analyticsforapplications.com/GC.png [1]
>
> We tried both the G1 garbage collector and the regular one; the problem
> happens with both of them.
>
> We use Java 1.6 on some servers. Will Java 1.7 be better?
>
> Martin
>
> Martin de Vries schreef op 12.11.2013 10:45:
>
>> Hi,
>>
>> We have:
>>
>> Solr 4.5.1 - 5 servers
>> 36 cores, 2 shards each, 2 servers per shard (every core is on 4 servers)
>> about 4.5 GB total data on disk per server
>> 4GB JVM-Memory per server, 3GB average in use
>> Zookeeper 3.3.5 - 3 servers (one shared with Solr)
>> haproxy load balancing
>>
>> Our SolrCloud is very unstable. About once a week some cores go into
>> recovery or down state. Many timeouts occur and we have to restart
>> servers to get them back to work. Failover doesn't work in many cases,
>> because one server has the core in down state and the other in
>> recovering state. Other cores work fine. When the cloud is stable I
>> sometimes see log messages like:
>> - shard update error StdNode:
>> http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException:
>> IOException occured when talking to server at:
>> http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
>> - forwarding update to
>> http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed -
>> retrying ...
>> - null:ClientAbortException: java.io.IOException: Broken pipe
>>
>> Before the cloud problems start there are many large QTimes in the
>> log (sometimes over 50 seconds), but there are no other errors until
>> the recovery problems start.
>>
>> Any clue about what can be wrong?
>>
>> Kind regards,
>>
>> Martin
>
> Links:
> ------
> [1] http://www.analyticsforapplications.com/GC.png
>


Re: SolrCloud unstable

Posted by Martin de Vries <ma...@downnotifier.com>.
 

We did some more monitoring and have some new information: 

Before the issue happens, the garbage collector's "collection count"
increases a lot. The increase seems to start about an hour before the
real problem occurs:

http://www.analyticsforapplications.com/GC.png [1]

We tried both the G1 garbage collector and the regular one; the problem
happens with both of them.
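
The "collection count" here is presumably the figure exposed by the JVM's
GarbageCollectorMXBeans; a minimal sketch of sampling it directly (normally
the same numbers are read over JMX from the Solr JVM, and the interval is
arbitrary):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcCountWatcher {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    // Cumulative collection count and total collection time per collector
                    System.out.println(gc.getName()
                            + " collections=" + gc.getCollectionCount()
                            + " timeMs=" + gc.getCollectionTime());
                }
                Thread.sleep(60000L); // one sample per minute
            }
        }
    }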

We use Java 1.6 on some servers. Will Java 1.7 be
better? 

Martin 

Martin de Vries schreef op 12.11.2013 10:45: 

> Hi,
>
> We have:
>
> Solr 4.5.1 - 5 servers
> 36 cores, 2 shards each, 2 servers per shard (every core is on 4 servers)
> about 4.5 GB total data on disk per server
> 4GB JVM-Memory per server, 3GB average in use
> Zookeeper 3.3.5 - 3 servers (one shared with Solr)
> haproxy load balancing
>
> Our SolrCloud is very unstable. About once a week some cores go into
> recovery or down state. Many timeouts occur and we have to restart
> servers to get them back to work. Failover doesn't work in many cases,
> because one server has the core in down state and the other in
> recovering state. Other cores work fine. When the cloud is stable I
> sometimes see log messages like:
> - shard update error StdNode:
> http://033.downnotifier.com:8983/solr/dntest_shard2_replica1/:org.apache.solr.client.solrj.SolrServerException:
> IOException occured when talking to server at:
> http://033.downnotifier.com:8983/solr/dntest_shard2_replica1
> - forwarding update to
> http://033.downnotifier.com:8983/solr/dn_shard2_replica2/ failed -
> retrying ...
> - null:ClientAbortException: java.io.IOException: Broken pipe
>
> Before the cloud problems start there are many large QTimes in the
> log (sometimes over 50 seconds), but there are no other errors until
> the recovery problems start.
>
> Any clue about what can be wrong?
>
> Kind regards,
>
> Martin

 

Links:
------
[1] http://www.analyticsforapplications.com/GC.png

Re: SolrCloud unstable

Posted by yriveiro <ya...@gmail.com>.
Some time ago I posted this issue:
http://lucene.472066.n3.nabble.com/Leader-election-fails-in-some-point-td4096514.html

The link to the screenshot is no longer available. When a shard fails and
loses its leader, I get those exceptions.



-----
Best regards
--