You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-user@lucene.apache.org by Lindsay Martin <lm...@abebooks.com> on 2015/01/10 00:54:02 UTC

Unexplained leader initiated recovery after updates

Hi all,

I am experiencing a problem where Solr nodes go into recovery following an update cycle.

Examination of the logs indicates that the recovery is initiated by the shard master while processing regular update events, because the replica is unreachable.

For example, the following is recorded in the leader’s log file:

...
2014-12-15 05:14:03.285 [qtp2092193830-400307] INFO org.apache.solr.cloud.ZkController Put replica core=listings coreNodeName=solr12:8983_solr_listings on solr12:8983_solr into leader-initiated recovery.
2014-12-15 05:14:03.285 [qtp2092193830-400307] WARN org.apache.solr.cloud.ZkController Leader is publishing core=listings coreNodeName =solr12:8983_solr_listings state=down on behalf of un-reachable replica http://solr12:8983/solr/listings/; forcePublishState? false
2014-12-15 05:14:03.287 [zkCallback-2-thread-20] INFO org.apache.solr.cloud.DistributedQueue LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
...

However when I check, I cannot detect any connectivity problems between the leader and the replica. About 40% of the time, the nodes recover without any intervention in 4 or 5 minutes. The remaining 60% of the time however, the recovering node reports a java.lang.OutOfMemoryError and Solr needs to be restarted.

For background, here are some details about our configuration:
* Solr 4.10.2 (problem also observed with Solr 4.6.1)
* 12 shards with 2 nodes per shard
* a single updater running in a separate subnet is posting updates using the SolrJ CloudSolrServer client. Updates are triggered hourly.
* system is under continuous query load
* autoCommit is set to 821 seconds
* autoSoftCommit is set to 303 seconds

I cannot correlate these recovery events to an increase in update or query load. The query traffic is does not appear to be affected by any transient connectivity issues.

The only clear pattern is that these recovery events happen after an updater run and the cluster is busy processing the updates.

Can suggest where to look to figure out why these recovery events are occurring?

Thanks,

Lindsay Martin

Re: Unexplained leader initiated recovery after updates

Posted by Lindsay Martin <lm...@abebooks.com>.

I have uncovered some additional details in the shard leader log:

2015-01-11 09:38:00.693 [qtp268575911-3617101] INFO
org.apache.solr.update.processor.LogUpdateProcessor  – [listings]
webapp=/solr path=/update
params{distrib.from=http://solr05.search.abebooks.com:8983/solr/listings/&u
pdate.distrib=TOLEADER&wt=javabin&version=2} {add=[14065572860
(1490024273004199936)]} 0 707
2015-01-11 09:38:00.913 [updateExecutor-1-thread-35734] ERROR
org.apache.solr.update.StreamingSolrServers  – error
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:196)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at 
org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessi
onInputBuffer.java:160)
        at 
org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java
:84)
        at 
org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSession
InputBuffer.java:273)
        at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpRe
sponseParser.java:140)
        at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpRe
sponseParser.java:57)
        at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.j
ava:260)
        at 
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(Abs
tractHttpClientConnection.java:283)
        at 
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(Def
aultClientConnection.java:251)
        at 
org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader
(ManagedClientConnectionImpl.java:197)
        at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestE
xecutor.java:271)
        at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.ja
va:123)
        at 
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultReques
tDirector.java:682)
        at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDi
rector.java:486)
        at 
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient
.java:863)
        at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient
.java:82)
        at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient
.java:106)
        at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient
.java:57)
        at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(Con
currentUpdateSolrServer.java:233)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1
145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:
615)
        at java.lang.Thread.run(Thread.java:745)
2015-01-11 09:38:00.917 [qtp268575911-3616964] WARN
org.apache.solr.update.processor.DistributedUpdateProcessor  – Error
sending update
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:196)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at 
org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessi
onInputBuffer.java:160)
        at 
org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java
:84)
        at 
org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSession
InputBuffer.java:273)
        at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpRe
sponseParser.java:140)
        at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpRe
sponseParser.java:57)
        at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.j
ava:260)
        at 
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(Abs
tractHttpClientConnection.java:283)
        at 
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(Def
aultClientConnection.java:251)
        at 
org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader
(ManagedClientConnectionImpl.java:197)
        at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestE
xecutor.java:271)
        at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.ja
va:123)
        at 
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultReques
tDirector.java:682)
        at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDi
rector.java:486)
        at 
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient
.java:863)
        at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient
.java:82)
        at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient
.java:106)
        at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient
.java:57)
        at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(Con
currentUpdateSolrServer.java:233)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1
145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:
615)
        at java.lang.Thread.run(Thread.java:745)



It appears that the connection to the replica is being reset. The number
of file open files allowed on all Solr boxes is set to 65536.

There doesn’t appear to be any retries here and this triggers the recovery
of the replica. Is this expected behaviour?

Thanks,

Lindsay

On 2015-01-12, 11:11 AM, "Lindsay Martin" <lm...@abebooks.com> wrote:

>Here are more details about our setup:
>
>Zookeeper:
>* 3 separate hosts in same rack as Solr cluster
>* Zookeeper hosts do not run any other processes
>
>Solr:
>* total servers: 24 (plus 2 cold standbys in case of host failure)
>* physical memory: 65931872 kB (62 GB)
>* max JVM heap size: -Xmx10880m ( 10 GB)
>* only one Solr per host
>
>On the Œindex¹ directory size front, I am seeing some differences in the
>disk usage between leaders and replicas.
>
>In 1 / 12 shards, there is no difference in size between the leader and
>replica.
>
>In 6 / 12 shards, there 1 1 G difference in size between the leader and
>replica. Both have one index directory.
>
>In 5 / 12 shards, the replica is a multiple of the leader size, due to
>multiple index directories on disk.
>
>For example, shard 1 leader has a directory named
>'index.20140624071707699¹ 30 G in size. The replica has two directories:
>'index.20150108052156468¹ at 31G and Œindex.20140624071556270¹ at 32G.
>
>Thanks,
>
>Lindsay
>
>On 2015-01-09, 5:01 PM, "Shawn Heisey" <ap...@elyograg.org> wrote:
>
>>On 1/9/2015 4:54 PM, Lindsay Martin wrote:
>>> I am experiencing a problem where Solr nodes go into recovery following
>>>an update cycle.
>>
>><snip>
>>
>>> For background, here are some details about our configuration:
>>> * Solr 4.10.2 (problem also observed with Solr 4.6.1)
>>> * 12 shards with 2 nodes per shard
>>> * a single updater running in a separate subnet is posting updates
>>>using the SolrJ CloudSolrServer client. Updates are triggered hourly.
>>> * system is under continuous query load
>>> * autoCommit is set to 821 seconds
>>> * autoSoftCommit is set to 303 seconds
>>
>>I would suspect some kind of performance problem that likely results in
>>the zkClientTimeout expiring.  I have a standard set of questions for
>>performance problems.
>>
>>Questions about zookeeper:
>>
>>How many ZK nodes?  Is zookeeper on separate hardware?  If it's on the
>>same hardware as Solr, is its database on the same disk spindles as the
>>Solr index, or separate spindles?  Is zookeeper standalone or embedded
>>in Solr?  If it's standalone, do you happen to know the java max heap
>>for the zookeeper processes?
>>
>>Questions about Solr and the hardware:
>>
>>How many total Solr servers?  How much RAM is installed on each one?
>>What is the max size of the Java heap?  Are you running more than one
>>Solr (JVM/container) instance per machine?
>>
>>If you add up all the "index" directories on a server, how much disk
>>space does it take?  Is the amount of disk space used similar on all of
>>the servers?
>>
>>Thanks,
>>Shawn
>>
>

Re: Unexplained leader initiated recovery after updates

Posted by Lindsay Martin <lm...@abebooks.com>.

Here are more details about our setup:

Zookeeper:
* 3 separate hosts in same rack as Solr cluster
* Zookeeper hosts do not run any other processes

Solr:
* total servers: 24 (plus 2 cold standbys in case of host failure)
* physical memory: 65931872 kB (62 GB)
* max JVM heap size: -Xmx10880m ( 10 GB)
* only one Solr per host

On the Œindex¹ directory size front, I am seeing some differences in the
disk usage between leaders and replicas.

In 1 / 12 shards, there is no difference in size between the leader and
replica.

In 6 / 12 shards, there 1 1 G difference in size between the leader and
replica. Both have one index directory.

In 5 / 12 shards, the replica is a multiple of the leader size, due to
multiple index directories on disk.

For example, shard 1 leader has a directory named
'index.20140624071707699¹ 30 G in size. The replica has two directories:
'index.20150108052156468¹ at 31G and Œindex.20140624071556270¹ at 32G.

Thanks,

Lindsay

On 2015-01-09, 5:01 PM, "Shawn Heisey" <ap...@elyograg.org> wrote:

>On 1/9/2015 4:54 PM, Lindsay Martin wrote:
>> I am experiencing a problem where Solr nodes go into recovery following
>>an update cycle.
>
><snip>
>
>> For background, here are some details about our configuration:
>> * Solr 4.10.2 (problem also observed with Solr 4.6.1)
>> * 12 shards with 2 nodes per shard
>> * a single updater running in a separate subnet is posting updates
>>using the SolrJ CloudSolrServer client. Updates are triggered hourly.
>> * system is under continuous query load
>> * autoCommit is set to 821 seconds
>> * autoSoftCommit is set to 303 seconds
>
>I would suspect some kind of performance problem that likely results in
>the zkClientTimeout expiring.  I have a standard set of questions for
>performance problems.
>
>Questions about zookeeper:
>
>How many ZK nodes?  Is zookeeper on separate hardware?  If it's on the
>same hardware as Solr, is its database on the same disk spindles as the
>Solr index, or separate spindles?  Is zookeeper standalone or embedded
>in Solr?  If it's standalone, do you happen to know the java max heap
>for the zookeeper processes?
>
>Questions about Solr and the hardware:
>
>How many total Solr servers?  How much RAM is installed on each one?
>What is the max size of the Java heap?  Are you running more than one
>Solr (JVM/container) instance per machine?
>
>If you add up all the "index" directories on a server, how much disk
>space does it take?  Is the amount of disk space used similar on all of
>the servers?
>
>Thanks,
>Shawn
>

Re: Unexplained leader initiated recovery after updates

Posted by Shawn Heisey <ap...@elyograg.org>.

On 1/9/2015 4:54 PM, Lindsay Martin wrote:
> I am experiencing a problem where Solr nodes go into recovery following an update cycle.

<snip>

> For background, here are some details about our configuration:
> * Solr 4.10.2 (problem also observed with Solr 4.6.1)
> * 12 shards with 2 nodes per shard
> * a single updater running in a separate subnet is posting updates using the SolrJ CloudSolrServer client. Updates are triggered hourly.
> * system is under continuous query load
> * autoCommit is set to 821 seconds
> * autoSoftCommit is set to 303 seconds

I would suspect some kind of performance problem that likely results in
the zkClientTimeout expiring.  I have a standard set of questions for
performance problems.

Questions about zookeeper:

How many ZK nodes?  Is zookeeper on separate hardware?  If it's on the
same hardware as Solr, is its database on the same disk spindles as the
Solr index, or separate spindles?  Is zookeeper standalone or embedded
in Solr?  If it's standalone, do you happen to know the java max heap
for the zookeeper processes?

Questions about Solr and the hardware:

How many total Solr servers?  How much RAM is installed on each one?
What is the max size of the Java heap?  Are you running more than one
Solr (JVM/container) instance per machine?

If you add up all the "index" directories on a server, how much disk
space does it take?  Is the amount of disk space used similar on all of
the servers?

Thanks,
Shawn

Re: Unexplained leader initiated recovery after updates

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

This is very similar to issues:

https://issues.apache.org/jira/browse/SOLR-6983
https://issues.apache.org/jira/browse/SOLR-6931

The problem is that we used to retry automatically in certain situations
where it wasn't really safe to retry so we removed that ability. This made
the connection reset issue more severe. Mark has fixed it in the 5x branch.
It's also been backported to 4.10 branch but hasn't been released yet.

On Fri, Feb 6, 2015 at 2:23 PM, adfel70 <ad...@gmail.com> wrote:

> any inputs on this?
> i'm facing the same problem..
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Unexplained-leader-initiated-recovery-after-updates-tp4178496p4184336.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

-- 
Regards,
Shalin Shekhar Mangar.

Re: Unexplained leader initiated recovery after updates

Posted by adfel70 <ad...@gmail.com>.

any inputs on this?
i'm facing the same problem..



--
View this message in context: http://lucene.472066.n3.nabble.com/Unexplained-leader-initiated-recovery-after-updates-tp4178496p4184336.html
Sent from the Solr - User mailing list archive at Nabble.com.