Posted to solr-user@lucene.apache.org by Sudhakar Maddineni <ma...@gmail.com> on 2012/12/06 01:21:37 UTC

SolrCloud - ClusterState says we are the leader, but locally ...

Hi,
We are uploading Solr documents to the index in batches using 30 threads,
backed by a ThreadPoolExecutor with a LinkedBlockingQueue capped at 10000.
In the code, we are using HttpSolrServer and its add(inputDoc) method to add
the docs.
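
For reference, a stripped-down sketch of what such an uploader looks like on our
side (the class name, field names, Solr URL, loop bound, and rejection policy
below are placeholders for illustration, not our actual code):

    // Sketch only: shared HttpSolrServer, 30 workers, bounded queue of 10000.
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchUploader {
        public static void main(String[] args) throws Exception {
            // One shared HttpSolrServer instance; it is thread-safe.
            final HttpSolrServer server = new HttpSolrServer("http://<vip>/solr/core1");

            // 30 worker threads, LinkedBlockingQueue with max limit 10000.
            ThreadPoolExecutor executor = new ThreadPoolExecutor(
                    30, 30, 0L, TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<Runnable>(10000),
                    new ThreadPoolExecutor.CallerRunsPolicy());

            for (int i = 0; i < 1000000; i++) {
                final int id = i;
                executor.execute(new Runnable() {
                    public void run() {
                        try {
                            SolrInputDocument inputDoc = new SolrInputDocument();
                            inputDoc.addField("id", "doc-" + id);
                            inputDoc.addField("title", "document " + id);
                            // No explicit commit; autoCommit/autoSoftCommit below handle it.
                            server.add(inputDoc);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            executor.shutdown();
            executor.awaitTermination(1, TimeUnit.HOURS);
            server.shutdown();
        }
    }
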
And, we have the following commit settings in solrconfig:

     <autoCommit>
       <maxTime>300000</maxTime>
       <maxDocs>10000</maxDocs>
       <openSearcher>false</openSearcher>
     </autoCommit>

       <autoSoftCommit>
         <maxTime>1000</maxTime>
       </autoSoftCommit>

Cluster Details:
----------------------------
Solr version - 4.0
ZooKeeper version - 3.4.3 [ZooKeeper ensemble with 3 nodes]
numShards=2
001, 002, 003 are the Solr nodes, and all three are behind the
load balancer <vip>
001, 003 assigned to shard1; 002 assigned to shard2


Logs: getting the errors in the sequence below after uploading some docs:
-----------------------------------------------------------------------------------------------------------
003
Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1
timeoutin=179999

001
Dec 4, 2012 12:12:59 PM
org.apache.solr.update.processor.DistributedUpdateProcessor
doDefensiveChecks
SEVERE: ClusterState says we are the leader, but locally we don't think so

003
Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying ...

001
Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at
<vip>/solr/core1. returned non ok status:503, message:Service Unavailable
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
001
Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
SEVERE: Error while trying to recover.
core=core1:org.apache.solr.common.SolrException: We are not the leader
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)

001
Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
SEVERE: Error uploading: org.apache.solr.client.solrj.SolrServerException:
IOException occured when talking to server at <vip>/solr/core1
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
... 5 lines omitted ...
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.SocketException: Connection reset


After some time, all three servers go down.

We'd appreciate it if someone could let us know what we are missing.

Thx, Sudhakar.

Re: SolrCloud - ClusterState says we are the leader, but locally ...

Posted by Sudhakar Maddineni <ma...@gmail.com>.
Using Solr version 4.0 final.

Thx, Sudhakar.

On Wed, Dec 5, 2012 at 5:26 PM, Mark Miller <ma...@gmail.com> wrote:

> What Solr version - beta, alpha, 4.0 final, 4X or 5X?
>
> - Mark
>
> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <ma...@gmail.com>
> wrote:
>
> > Hi,
> > We are uploading solr documents to the index in batches using 30 threads
> > and using ThreadPoolExecutor, LinkedBlockingQueue with max limit set to
> > 10000.
> > In the code, we are using HttpSolrServer and add(inputDoc) method to add
> > docx.
> > And, we have the following commit settings in solrconfig:
> >
> >     <autoCommit>
> >       <maxTime>300000</maxTime>
> >       <maxDocs>10000</maxDocs>
> >       <openSearcher>false</openSearcher>
> >     </autoCommit>
> >
> >       <autoSoftCommit>
> >         <maxTime>1000</maxTime>
> >       </autoSoftCommit>
> >
> > Cluster Details:
> > ----------------------------
> > solr version - 4.0
> > zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
> > numshards=2 ,
> > 001, 002, 003 are the solr nodes and these three are behind the
> > loadbalancer  <vip>
> > 001, 003 assigned to shard1; 002 assigned to shard2
> >
> >
> > Logs:Getting the errors in the below sequence after uploading some docx:
> >
> -----------------------------------------------------------------------------------------------------------
> > 003
> > Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> > waitForReplicasToComeUp
> > INFO: Waiting until we see more replicas up: total=2 found=1
> > timeoutin=179999
> >
> > 001
> > Dec 4, 2012 12:12:59 PM
> > org.apache.solr.update.processor.DistributedUpdateProcessor
> > doDefensiveChecks
> > SEVERE: ClusterState says we are the leader, but locally we don't think
> so
> >
> > 003
> > Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> > SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying ...
> >
> > 001
> > Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> > SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at
> > <vip>/solr/core1. returned non ok status:503, message:Service Unavailable
> > at
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
> > at
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> > 001
> > Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
> > SEVERE: Error while trying to recover.
> > core=core1:org.apache.solr.common.SolrException: We are not the leader
> > at
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >
> > 001
> > Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
> > SEVERE: Error uploading:
> org.apache.solr.client.solrj.SolrServerException:
> > IOException occured when talking to server at <vip>/solr/core1
> > at
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
> > at
> >
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> > at
> >
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> > at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
> > ... 5 lines omitted ...
> > at java.lang.Thread.run(Unknown Source)
> > Caused by: java.net.SocketException: Connection reset
> >
> >
> > After sometime, all the three servers are going down.
> >
> > Appreciate, if someone could let us know what we are missing.
> >
> > Thx,Sudhakar.
>
>

Re: SolrCloud - ClusterState says we are the leader, but locally ...

Posted by Mark Miller <ma...@gmail.com>.
What Solr version - beta, alpha, 4.0 final, 4X or 5X?

- Mark

On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <ma...@gmail.com> wrote:

> Hi,
> We are uploading solr documents to the index in batches using 30 threads
> and using ThreadPoolExecutor, LinkedBlockingQueue with max limit set to
> 10000.
> In the code, we are using HttpSolrServer and add(inputDoc) method to add
> docx.
> And, we have the following commit settings in solrconfig:
> 
>     <autoCommit>
>       <maxTime>300000</maxTime>
>       <maxDocs>10000</maxDocs>
>       <openSearcher>false</openSearcher>
>     </autoCommit>
> 
>       <autoSoftCommit>
>         <maxTime>1000</maxTime>
>       </autoSoftCommit>
> 
> Cluster Details:
> ----------------------------
> solr version - 4.0
> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
> numshards=2 ,
> 001, 002, 003 are the solr nodes and these three are behind the
> loadbalancer  <vip>
> 001, 003 assigned to shard1; 002 assigned to shard2
> 
> 
> Logs:Getting the errors in the below sequence after uploading some docx:
> -----------------------------------------------------------------------------------------------------------
> 003
> Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> waitForReplicasToComeUp
> INFO: Waiting until we see more replicas up: total=2 found=1
> timeoutin=179999
> 
> 001
> Dec 4, 2012 12:12:59 PM
> org.apache.solr.update.processor.DistributedUpdateProcessor
> doDefensiveChecks
> SEVERE: ClusterState says we are the leader, but locally we don't think so
> 
> 003
> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying ...
> 
> 001
> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at
> <vip>/solr/core1. returned non ok status:503, message:Service Unavailable
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> 001
> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
> SEVERE: Error while trying to recover.
> core=core1:org.apache.solr.common.SolrException: We are not the leader
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> 
> 001
> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
> SEVERE: Error uploading: org.apache.solr.client.solrj.SolrServerException:
> IOException occured when talking to server at <vip>/solr/core1
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
> at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> at
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
> ... 5 lines omitted ...
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.SocketException: Connection reset
> 
> 
> After sometime, all the three servers are going down.
> 
> Appreciate, if someone could let us know what we are missing.
> 
> Thx,Sudhakar.


Re: SolrCloud - ClusterState says we are the leader, but locally ...

Posted by Erick Erickson <er...@gmail.com>.
I've seen the "Waiting until we see..." message as well; for me it seems to
be an artifact of bouncing servers rapidly. It took a lot of patience to
wait until the timeoutin value got all the way to 0, but when it did the
system recovered.

As to your original problem, are you possibly getting page caching at the
servlet level?

Best
Erick


On Wed, Dec 5, 2012 at 9:41 PM, Sudhakar Maddineni
<ma...@gmail.com>wrote:

> Yep, after restarting, the cluster came back to a normal state. We will run
> a couple more tests and see if we can reproduce this issue.
>
> Btw, I am attaching the server logs from before that 'INFO: *Waiting until we
> see more replicas*' message. From the logs, we can see that the leader
> election process started on 003, which was initially the replica for 001.
> Does that mean leader 001 went down at that time?
>
> logs on 003:
> ========
> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> runLeaderProcess
>         INFO: Running the leader process.
> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> shouldIBeLeader
>         INFO: Checking if I should try and be the leader.
> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> shouldIBeLeader
>         INFO: My last published State was Active, it's okay to be the
> leader.
> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> runLeaderProcess
>         INFO: I may be the new leader - try and sync
> 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
>         WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core
> core=core1.
> 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
>         INFO: Sync replicas to http://<003>:8080/solr/core1/
> 12:11:16 PM org.apache.solr.update.PeerSync sync
>         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
> replicas=[<001>:8080/solr/core1/] nUpdates=100
> 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>         INFO: Updating live nodes -> this message is on 002
> 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
>         WARNING: PeerSync: core=core1 url=http://<003>:8080/solr
>  exception talking to <001>:8080/solr/core1/, failed
>         org.apache.solr.client.solrj.SolrServerException: Timeout occured
> while waiting response from server at: <001>:8080/solr/core1
>         at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
>         at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>         at
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>         at
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>         at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>         at java.util.concurrent.FutureTask.run(Unknown Source)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
> Source)
>         at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>         at java.util.concurrent.FutureTask.run(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
> Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
>         at java.lang.Thread.run(Unknown Source)
>         Caused by: java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(Unknown Source)
> 12:11:46 PM org.apache.solr.update.PeerSync sync
>         INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. sync
> failed
> 12:11:46 PM org.apache.solr.common.SolrException log
>         SEVERE: Sync Failed
> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> rejoinLeaderElection
>         INFO: There is a better leader candidate than us - going back into
> recovery
> 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
>         INFO: Running recovery - first canceling any ongoing recovery
> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
>         INFO: Starting recovery process.  core=core1
> recoveringAfterStartup=false
> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>         INFO: Attempting to PeerSync from <001>:8080/solr/core1/
> core=core1 - recoveringAfterStartup=false
> 12:11:46 PM org.apache.solr.update.PeerSync sync
>         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
> replicas=[<001>:8080/solr/core1/] nUpdates=100
> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> runLeaderProcess
>         INFO: Running the leader process.
> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> waitForReplicasToComeUp
>         INFO: *Waiting until we see more replicas up: total=2 found=1
> timeoutin=179999*
> 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
> waitForReplicasToComeUp
>         INFO: *Waiting until we see more replicas up: total=2 found=1
> timeoutin=179495*
> 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
> waitForReplicasToComeUp
>         INFO: *Waiting until we see more replicas up: total=2 found=1
> timeoutin=178985*
> ....
> ....
>
> Thanks for your help.
> Sudhakar.
>
> On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <ma...@gmail.com> wrote:
>
>> The waiting logging had to happen on restart unless it's some kind of bug.
>>
>> Beyond that, something is off, but I have no clue why - it seems your
>> clusterstate.json is not up to date at all.
>>
>> Have you tried restarting the cluster then? Does that help at all?
>>
>> Do you see any exceptions around zookeeper session timeouts?
>>
>> - Mark
>>
>> On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <ma...@gmail.com>
>> wrote:
>>
>> > Hey Mark,
>> >
>> > Yes, I am able to access all of the nodes under each shard from
>> solrcloud
>> > admin UI.
>> >
>> >
>> >   - *It kind of looks like the urls solrcloud is using are not
>> accessible.
>> >   When you go to the admin page and the cloud tab, can you access the
>> urls it
>> >   shows for each shard? That is, if you click on of the links or copy
>> and
>> >   paste the address into a web browser, does it work?*
>> >
>> > Actually, I got these errors when my document upload task/job was running,
>> > not during the cluster restart. Also, the job ran fine initially for the
>> > first hour and started throwing these errors after indexing some docs.
>> >
>> > Thx, Sudhakar.
>> >
>> >
>> >
>> >
>> > On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> >
>> >> It kind of looks like the urls solrcloud is using are not accessible. When
>> >> you go to the admin page and the cloud tab, can you access the urls it
>> >> shows for each shard? That is, if you click one of the links or copy and
>> >> paste the address into a web browser, does it work?
>> >>
>> >> You may have to explicitly set the host= in solr.xml if it's not auto
>> >> detecting the right one. Make sure the ports look right too.
>> >>
>> >>> waitForReplicasToComeUp
>> >>> INFO: Waiting until we see more replicas up: total=2 found=1
>> >>> timeoutin=179999
>> >>
>> >> That happens when you stop the cluster and try to start it again - before
>> >> a leader is chosen, it will wait for all known replicas for a shard to come
>> >> up so that everyone can sync up and have a chance to be the best leader. So
>> >> at this point it was only finding one of 2 known replicas and waiting for
>> >> the second to come up. After a couple of minutes (configurable) it will just
>> >> continue anyway without the missing replica (if it doesn't show up).
>> >>
>> >> - Mark
>> >>
>> >> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <maddinenidev@gmail.com
>> >
>> >> wrote:
>> >>
>> >>> Hi,
>> >>> We are uploading solr documents to the index in batches using 30
>> threads
>> >>> and using ThreadPoolExecutor, LinkedBlockingQueue with max limit set
>> to
>> >>> 10000.
>> >>> In the code, we are using HttpSolrServer and add(inputDoc) method to
>> add
>> >>> docx.
>> >>> And, we have the following commit settings in solrconfig:
>> >>>
>> >>>    <autoCommit>
>> >>>      <maxTime>300000</maxTime>
>> >>>      <maxDocs>10000</maxDocs>
>> >>>      <openSearcher>false</openSearcher>
>> >>>    </autoCommit>
>> >>>
>> >>>      <autoSoftCommit>
>> >>>        <maxTime>1000</maxTime>
>> >>>      </autoSoftCommit>
>> >>>
>> >>> Cluster Details:
>> >>> ----------------------------
>> >>> solr version - 4.0
>> >>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
>> >>> numshards=2 ,
>> >>> 001, 002, 003 are the solr nodes and these three are behind the
>> >>> loadbalancer  <vip>
>> >>> 001, 003 assigned to shard1; 002 assigned to shard2
>> >>>
>> >>>
>> >>> Logs:Getting the errors in the below sequence after uploading some
>> docx:
>> >>>
>> >>
>> -----------------------------------------------------------------------------------------------------------
>> >>> 003
>> >>> Dec 4, 2012 12:11:46 PM
>> org.apache.solr.cloud.ShardLeaderElectionContext
>> >>> waitForReplicasToComeUp
>> >>> INFO: Waiting until we see more replicas up: total=2 found=1
>> >>> timeoutin=179999
>> >>>
>> >>> 001
>> >>> Dec 4, 2012 12:12:59 PM
>> >>> org.apache.solr.update.processor.DistributedUpdateProcessor
>> >>> doDefensiveChecks
>> >>> SEVERE: ClusterState says we are the leader, but locally we don't
>> think
>> >> so
>> >>>
>> >>> 003
>> >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>> >>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying
>> ...
>> >>>
>> >>> 001
>> >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>> >>> SEVERE: Error uploading: org.apache.solr.common.SolrException: Server
>> at
>> >>> <vip>/solr/core1. returned non ok status:503, message:Service
>> Unavailable
>> >>> at
>> >>>
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>> >>> at
>> >>>
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >>> 001
>> >>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
>> >>> SEVERE: Error while trying to recover.
>> >>> core=core1:org.apache.solr.common.SolrException: We are not the leader
>> >>> at
>> >>>
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>> >>>
>> >>> 001
>> >>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
>> >>> SEVERE: Error uploading:
>> >> org.apache.solr.client.solrj.SolrServerException:
>> >>> IOException occured when talking to server at <vip>/solr/core1
>> >>> at
>> >>>
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>> >>> at
>> >>>
>> >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >>> at
>> >>>
>> >>
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>> >>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
>> >>> ... 5 lines omitted ...
>> >>> at java.lang.Thread.run(Unknown Source)
>> >>> Caused by: java.net.SocketException: Connection reset
>> >>>
>> >>>
>> >>> After sometime, all the three servers are going down.
>> >>>
>> >>> Appreciate, if someone could let us know what we are missing.
>> >>>
>> >>> Thx,Sudhakar.
>> >>
>> >>
>>
>>
>

Re: SolrCloud - ClusterState says we are the leader, but locally ...

Posted by Marcin Rzewucki <mr...@gmail.com>.
Hi,
I think I made two changes at the same time: I increased maxFormContentSize and
zkClientTimeout (from 15s to 30s). When I restarted the cluster there were no
"ClusterState" issues, but most probably due to the increased zkClientTimeout
and not maxFormContentSize. I did one more test with the default value
for maxFormContentSize (1MB) and there were no issues either.
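
(For reference, the zkClientTimeout change is just an attribute on the <cores>
element in solr.xml, roughly like the snippet below; the other attributes shown
are the stock example ones and may differ from my actual setup:)

    <cores adminPath="/admin/cores" defaultCoreName="collection1"
           host="${host:}" hostPort="${jetty.port:}" hostContext="${hostContext:}"
           zkClientTimeout="30000">
      <core name="collection1" instanceDir="collection1" />
    </cores>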

Regards.

On 3 February 2013 22:16, Marcin Rzewucki <mr...@gmail.com> wrote:

> Hi,
>
> I set this:
>     <Call name="setAttribute">
>       <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
>       <Arg>10485760</Arg>
>     </Call>
> multipartUploadLimitInKB is set to 2MB in my case. The funny is that I did
> change only in jetty.xml
> I'll change this value back to 1MB and repeat test to check if this is the
> reason.
> Regards.
>
>
> On 3 February 2013 21:26, Shawn Heisey <so...@elyograg.org> wrote:
>
>> On 2/3/2013 1:07 PM, Marcin Rzewucki wrote:
>>
>>> I'm loading in batches. 10 threads are reading json files and load to
>>> Solr
>>> by sending POST request (from couple of dozens to couple of hundreds docs
>>> in 1 request). I had 1MB post request size, but when I changed it to 10MB
>>> errors disappeared. I guess this could be the reason.
>>> Regards.
>>>
>>
>> I thought SOLR-4265 changed the whole way that Solr interacts with Jetty
>> and set the max form size to 2MB within Solr.  My reading says that you can
>> now control the max POST size within solrconfig.xml - look for
>> formdataUploadLimitInKB within the example solrconfig.
>>
>> I'm curious as to exactly how you changed the max size in jetty.xml. The
>> typical way that you set maxFormContentSize is broken, but jetty 8.1.9 will
>> fix that.  Jetty 8.1.8 is what is currently included with Solr 4.1.  See
>> the following Jetty bug:
>>
>> https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130
>>
>> Thanks,
>> Shawn
>>
>>
>

Re: SolrCloud - ClusterState says we are the leader, but locally ...

Posted by Marcin Rzewucki <mr...@gmail.com>.
Hi,

I set this:
    <Call name="setAttribute">
      <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
      <Arg>10485760</Arg>
    </Call>
multipartUploadLimitInKB is set to 2MB in my case. The funny thing is that I only
made the change in jetty.xml.
I'll change this value back to 1MB and repeat the test to check whether this is
the reason.
Regards.

On 3 February 2013 21:26, Shawn Heisey <so...@elyograg.org> wrote:

> On 2/3/2013 1:07 PM, Marcin Rzewucki wrote:
>
>> I'm loading in batches. 10 threads are reading json files and load to Solr
>> by sending POST request (from couple of dozens to couple of hundreds docs
>> in 1 request). I had 1MB post request size, but when I changed it to 10MB
>> errors disappeared. I guess this could be the reason.
>> Regards.
>>
>
> I thought SOLR-4265 changed the whole way that Solr interacts with Jetty
> and set the max form size to 2MB within Solr.  My reading says that you can
> now control the max POST size within solrconfig.xml - look for
> formdataUploadLimitInKB within the example solrconfig.
>
> I'm curious as to exactly how you changed the max size in jetty.xml. The
> typical way that you set maxFormContentSize is broken, but jetty 8.1.9 will
> fix that.  Jetty 8.1.8 is what is currently included with Solr 4.1.  See
> the following Jetty bug:
>
> https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130
>
> Thanks,
> Shawn
>
>

Re: SolrCloud - ClusterState says we are the leader, but locally ...

Posted by Shawn Heisey <so...@elyograg.org>.
On 2/3/2013 1:07 PM, Marcin Rzewucki wrote:
> I'm loading in batches. 10 threads are reading json files and load to Solr
> by sending POST request (from couple of dozens to couple of hundreds docs
> in 1 request). I had 1MB post request size, but when I changed it to 10MB
> errors disappeared. I guess this could be the reason.
> Regards.

I thought SOLR-4265 changed the whole way that Solr interacts with Jetty 
and set the max form size to 2MB within Solr.  My reading says that you 
can now control the max POST size within solrconfig.xml - look for 
formdataUploadLimitInKB within the example solrconfig.
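
If I remember the example config right, it is the requestParsers element inside
<requestDispatcher>, something along these lines (the values shown are the
example defaults, not recommendations):

    <requestParsers enableRemoteStreaming="true"
                    multipartUploadLimitInKB="2048000"
                    formdataUploadLimitInKB="2048"/>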

I'm curious as to exactly how you changed the max size in jetty.xml. The
typical way that you set maxFormContentSize is broken, but Jetty 8.1.9
will fix that. Jetty 8.1.8 is what is currently included with Solr 4.1.
See the following Jetty bug:

https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130

Thanks,
Shawn


Re: SolrCloud - ClusterState says we are the leader, but locally ...

Posted by Marcin Rzewucki <mr...@gmail.com>.
I'm loading in batches. 10 threads read JSON files and load them into Solr
by sending POST requests (from a couple of dozen to a couple of hundred docs
per request). I had a 1MB POST request size limit, but when I changed it to 10MB
the errors disappeared. I guess this could be the reason.
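
(Each POST body is just a JSON array of documents, roughly like the following;
the field names are made up for illustration:)

    [
      {"id": "doc-1", "title": "first document",  "category": "a"},
      {"id": "doc-2", "title": "second document", "category": "b"}
    ]
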
Regards.


On 3 February 2013 20:55, Mark Miller <ma...@gmail.com> wrote:

> What led you to trying that? I'm not connecting the dots in my head - the
> exception and the solution.
>
> - Mark
>
> On Feb 3, 2013, at 2:48 PM, Marcin Rzewucki <mr...@gmail.com> wrote:
>
> > Hi,
> >
> > I think the issue was not in zk client timeout, but POST request size.
> When
> > I increased the value for Request.maxFormContentSize in jetty.xml I don't
> > see this issue any more.
> >
> > Regards.
> >
> > On 3 February 2013 01:56, Mark Miller <ma...@gmail.com> wrote:
> >
> >> Do you see anything about session expiration in the logs? That is the
> >> likely culprit for something like this. You may need to raise the
> timeout:
> >> http://wiki.apache.org/solr/SolrCloud#FAQ
> >>
> >> If you see no session timeouts, I don't have a guess yet.
> >>
> >> - Mark
> >>
> >> On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki <mr...@gmail.com>
> wrote:
> >>
> >>> I'm experiencing the same problem in Solr 4.1 during bulk loading. After 50
> >>> minutes of indexing the following error starts to occur:
> >>>
> >>> INFO: [core] webapp=/solr path=/update params={} {} 0 4
> >>> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
> >>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
> >> the
> >>> leader, but locally we don't think so
> >>>       at
> >>>
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
> >>>       at
> >>>
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
> >>>       at
> >>>
> >>
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
> >>>       at
> >>>
> >>
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>       at
> >>>
> >>
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
> >>>       at
> >>>
> >>
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
> >>>       at
> >>>
> >>
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
> >>>       at
> >>> org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
> >>>       at
> >>>
> >>
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>       at
> >>>
> >>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>       at
> >>>
> >>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
> >>>       at
> >>>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
> >>>       at
> >>>
> >>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
> >>>       at
> >>>
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> >>>       at org.eclipse.jetty.server.Server.handle(Server.java:365)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
> >>>       at
> >> org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
> >>>       at
> >>> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> >>>       at
> >>>
> >>
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> >>>       at java.lang.Thread.run(Unknown Source)
> >>> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
> >>> Feb 02, 2013 11:36:31 PM
> org.apache.solr.cloud.ShardLeaderElectionContext
> >>> waitForReplicasToComeUp
> >>> INFO: Waiting until we see more replicas up: total=2 found=1
> >> timeoutin=50699
> >>>
> >>> Then the leader tries to sync with the replica, and after it finishes I can
> >>> continue loading.
> >>> None of the SolrCloud nodes was restarted during that time. I don't remember
> >>> such behaviour in Solr 4.0. Could it be related to the number of fields
> >>> indexed during loading? I have a collection with about 2400 fields. I
> >>> can't reproduce the same issue for other collections with far fewer fields
> >>> per record.
> >>> Regards.
> >>>
> >>> On 11 December 2012 19:50, Sudhakar Maddineni <maddinenidev@gmail.com
> >>> wrote:
> >>>
> >>>> Just an update on this issue:
> >>>>  We tried by increasing zookeeper client timeout settings to 30000ms
> in
> >>>> solr.xml (I think the default is 15000ms), and haven't seen any issues
> from
> >> our
> >>>> tests.
> >>>> <cores .........           zkClientTimeout="30000" >
> >>>>
> >>>> Thanks, Sudhakar.
> >>>>
> >>>> On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni
> >>>> <ma...@gmail.com>wrote:
> >>>>
> >>>>> We saw this error again today during our load test - basically, whenever
> >>>>> the session gets expired on the leader node, we are seeing the
> >>>>> error. After this happens, the leader (001) goes into 'recovery' mode and
> >>>>> all the index updates fail with a "503 - Service Unavailable" error
> >>>>> message. After some time (once recovery is successful), the roles are
> >>>>> swapped, i.e. 001 acting as the replica and 003 as leader.
> >>>>>
> >>>>> Btw, do you know why the connection to zookeeper [solr->zk] is getting
> >>>>> interrupted in the middle?
> >>>>> Is it because of the load (number of updates) we are putting on the
> >>>>> cluster?
> >>>>>
> >>>>> Here is the exception stack trace:
> >>>>>
> >>>>> *Dec* *7*, *2012* *2:28:03* *PM*
> >>>> *org.apache.solr.cloud.Overseer$ClusterStateUpdater* *amILeader*
> >>>>> *WARNING:*
> >>>> *org.apache.zookeeper.KeeperException$SessionExpiredException:*
> >>>> *KeeperErrorCode* *=* *Session* *expired* *for*
> */overseer_elect/leader*
> >>>>>     *at*
> >>>>
> >>
> *org.apache.zookeeper.KeeperException.create*(*KeeperException.java:118*)
> >>>>>     *at*
> >>>>
> *org.apache.zookeeper.KeeperException.create*(*KeeperException.java:42*)
> >>>>>     *at*
> *org.apache.zookeeper.ZooKeeper.getData*(*ZooKeeper.java:927*
> >>>>> )
> >>>>>     *at*
> >>>>
> >>
> *org.apache.solr.common.cloud.SolrZkClient$7.execute*(*SolrZkClient.java:244*)
> >>>>>     *at*
> >>>>
> >>
> *org.apache.solr.common.cloud.SolrZkClient$7.execute*(*SolrZkClient.java:241*)
> >>>>>     *at*
> >>>>
> >>
> *org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation*(*ZkCmdExecutor.java:63*)
> >>>>>     *at*
> >>>>
> >>
> *org.apache.solr.common.cloud.SolrZkClient.getData*(*SolrZkClient.java:241*)
> >>>>>     *at*
> >>>>
> >>
> *org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader*(*Overseer.java:195*)
> >>>>>     *at*
> >>>>
> >>
> *org.apache.solr.cloud.Overseer$ClusterStateUpdater.run*(*Overseer.java:119*)
> >>>>>     *at* *java.lang.Thread.run*(*Unknown* *Source*)
> >>>>>
> >>>>> Thx,Sudhakar.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <
> >>>> maddinenidev@gmail.com
> >>>>>> wrote:
> >>>>>
> >>>>>> Erick:
> >>>>>> Not seeing any page caching related issues...
> >>>>>>
> >>>>>> Mark:
> >>>>>> 1. Would this "waiting" on 003 (the replica) cause any inconsistencies in
> >>>>>> the zookeeper cluster state? I was also looking at the leader (001) logs
> >>>>>> at that time and seeing errors related to "*SEVERE: ClusterState says we
> >>>>>> are the leader, but locally we don't think so*".
> >>>>>> 2. Also, all of our servers in the cluster went down while the index
> >>>>>> updates were running in parallel with this issue. Do you see this as
> >>>>>> related to the session expiry on 001?
> >>>>>>
> >>>>>> Here are the logs on 001
> >>>>>> =========================
> >>>>>>
> >>>>>> Dec 4, 2012 12:12:29 PM
> >>>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> >>>>>> WARNING:
> >>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>> KeeperErrorCode = Session expired for /overseer_elect/leader
> >>>>>> at
> >>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
> >>>>>> at
> >> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> >>>>>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
> >>>>>> Dec 4, 2012 12:12:29 PM
> >>>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> >>>>>> INFO: According to ZK I
> >>>>>> (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a
> >>>> leader.
> >>>>>>
> >>>>>> Dec 4, 2012 12:12:29 PM
> >>>> org.apache.solr.cloud.OverseerCollectionProcessor
> >>>>>> run
> >>>>>> WARNING: Overseer cannot talk to ZK
> >>>>>>
> >>>>>> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
> >>>>>> SEVERE: There was a problem finding the leader in
> >>>>>> zk:org.apache.solr.common.SolrException: Could not get leader props
> >>>>>> at
> >>>>>>
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> >>>>>> at
> >>>>>>
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
> >>>>>> Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
> >>>>>> SEVERE: There was a problem finding the leader in
> >>>>>> zk:org.apache.solr.common.SolrException: Could not get leader props
> >>>>>> at
> >>>>>>
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> >>>>>> at
> >>>>>>
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
> >>>>>> Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
> >>>>>> SEVERE: There was a problem making a request to the
> >>>>>> leader:org.apache.solr.common.SolrException: I was asked to wait on
> >>>> state
> >>>>>> down for <001>:8080_solr but I still do not see the request state. I
> >> see
> >>>>>> state: active live:true
> >>>>>> at
> >>>>>>
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >>>>>> Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
> >>>>>> SEVERE: There was a problem making a request to the
> >>>>>> leader:org.apache.solr.common.SolrException: I was asked to wait on
> >>>> state
> >>>>>> down for <001>:8080_solr but I still do not see the request state. I
> >> see
> >>>>>> state: active live:true
> >>>>>> at
> >>>>>>
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >>>>>> ....
> >>>>>> ....
> >>>>>> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
> >>>>>> SEVERE: There was a problem finding the leader in
> >>>>>> zk:org.apache.solr.common.SolrException: Could not get leader props
> >>>>>> at
> >>>>>>
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> >>>>>> ....
> >>>>>> ....
> >>>>>> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
> >>>>>> SEVERE: :org.apache.solr.common.SolrException: There was a problem
> >>>>>> finding the leader in zk
> >>>>>> at
> >>>>>>
> >>>>
> >>
> org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
> >>>>>> at
> >>>>>>
> >>>>
> >>
> org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
> >>>>>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
> >>>>>> SEVERE: Error getting leader from zk
> >>>>>> org.apache.solr.common.SolrException: *There is conflicting
> >> information
> >>>>>> about the leader of shard: shard1 our state says:http://
> >>>> <001>:8080/solr/core1/
> >>>>>> but zookeeper says:http://<003>:8080/solr/core1/*
> >>>>>> * at
> >>>> org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)*
> >>>>>> * at
> >> org.apache.solr.cloud.ZkController.register(ZkController.java:577)*
> >>>>>> Dec 4, 2012 12:22:30 PM
> >>>>>> org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>> INFO: Running the leader process.
> >>>>>> ....
> >>>>>> ....
> >>>>>>
> >>>>>> Thanks for your inputs.
> >>>>>> Sudhakar.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmiller@gmail.com
> >>>>> wrote:
> >>>>>>
> >>>>>>> Yes - it means that 001 went down (or, more likely, had its connection
> >>>>>>> to ZooKeeper interrupted; that's what I mean about a session timeout - if
> >>>>>>> the solr->zk link is broken for longer than the session timeout, that will
> >>>>>>> trigger a leader election, and when the connection is reestablished the
> >>>>>>> node will have to recover). That waiting should stop as soon as 001 came
> >>>>>>> back up or reconnected to ZooKeeper.
> >>>>>>>
> >>>>>>> In fact, this waiting should not happen in this case - but only on
> >>>>>>> cluster restart. This is a bug that is fixed in 4.1 (hopefully coming
> >>>>>>> very soon!):
> >>>>>>>
> >>>>>>> * SOLR-3940: Rejoining the leader election incorrectly triggers the
> >>>> code
> >>>>>>> path
> >>>>>>> for a fresh cluster start rather than fail over. (Mark Miller)
> >>>>>>>
> >>>>>>> - Mark
> >>>>>>>
> >>>>>>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <
> >> maddinenidev@gmail.com
> >>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Yep, after restarting, cluster came back to normal state.We will
> run
> >>>>>>> couple of more tests and see if we could reproduce this issue.
> >>>>>>>>
> >>>>>>>> Btw, I am attaching the server logs before that 'INFO: Waiting
> until
> >>>>>>> we see more replicas'  message.From the logs, we can see that
> leader
> >>>>>>> election process started on 003 which was the replica for 001
> >>>>>>> initially.That means leader 001 went down at that time?
> >>>>>>>>
> >>>>>>>> logs on 003:
> >>>>>>>> ========
> >>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> runLeaderProcess
> >>>>>>>>       INFO: Running the leader process.
> >>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> shouldIBeLeader
> >>>>>>>>       INFO: Checking if I should try and be the leader.
> >>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> shouldIBeLeader
> >>>>>>>>       INFO: My last published State was Active, it's okay to be
> the
> >>>>>>> leader.
> >>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> runLeaderProcess
> >>>>>>>>       INFO: I may be the new leader - try and sync
> >>>>>>>> 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
> >>>>>>>>       WARNING: Stopping recovery for
> >>>> zkNodeName=<003>:8080_solr_core
> >>>>>>> core=core1.
> >>>>>>>> 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
> >>>>>>>>       INFO: Sync replicas to http://<003>:8080/solr/core1/
> >>>>>>>> 12:11:16 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>       INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
> >>>>>>> replicas=[<001>:8080/solr/core1/] nUpdates=100
> >>>>>>>> 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> >>>>>>>>       INFO: Updating live nodes -> this message is on 002
> >>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
> >>>>>>>>       WARNING: PeerSync: core=core1 url=http://<003>:8080/solr
> >>>>>>> exception talking to <001>:8080/solr/core1/, failed
> >>>>>>>>       org.apache.solr.client.solrj.SolrServerException: Timeout
> >>>>>>> occured while waiting response from server at:
> <001>:8080/solr/core1
> >>>>>>>>             at
> >>>>>>>
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
> >>>>>>>>             at
> >>>>>>>
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>             at
> >>>>>>>
> >>>>
> >>
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
> >>>>>>>>             at
> >>>>>>>
> >>>>
> >>
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
> >>>>>>>>             at
> >>>> java.util.concurrent.FutureTask$Sync.innerRun(Unknown
> >>>>>>> Source)
> >>>>>>>>             at java.util.concurrent.FutureTask.run(Unknown Source)
> >>>>>>>>             at
> >>>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> >>>>>>>>             at
> >>>> java.util.concurrent.FutureTask$Sync.innerRun(Unknown
> >>>>>>> Source)
> >>>>>>>>             at java.util.concurrent.FutureTask.run(Unknown Source)
> >>>>>>>>             at
> >>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
> >> Source)
> >>>>>>>>             at
> >>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> >>>>>>>>             at java.lang.Thread.run(Unknown Source)
> >>>>>>>>       Caused by: java.net.SocketTimeoutException: Read timed out
> >>>>>>>>             at java.net.SocketInputStream.socketRead0(Native
> >>>> Method)
> >>>>>>>>             at java.net.SocketInputStream.read(Unknown Source)
> >>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>       INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE.
> >>>>>>> sync failed
> >>>>>>>> 12:11:46 PM org.apache.solr.common.SolrException log
> >>>>>>>>       SEVERE: Sync Failed
> >>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> rejoinLeaderElection
> >>>>>>>>       INFO: There is a better leader candidate than us - going
> back
> >>>>>>> into recovery
> >>>>>>>> 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
> >>>>>>>>       INFO: Running recovery - first canceling any ongoing
> recovery
> >>>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
> >>>>>>>>       INFO: Starting recovery process.  core=core1
> >>>>>>> recoveringAfterStartup=false
> >>>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> >>>>>>>>       INFO: Attempting to PeerSync from <001>:8080/solr/core1/
> >>>>>>> core=core1 - recoveringAfterStartup=false
> >>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>       INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
> >>>>>>> replicas=[<001>:8080/solr/core1/] nUpdates=100
> >>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> runLeaderProcess
> >>>>>>>>       INFO: Running the leader process.
> >>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> waitForReplicasToComeUp
> >>>>>>>>       INFO: Waiting until we see more replicas up: total=2 found=1
> >>>>>>> timeoutin=179999
> >>>>>>>> 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> waitForReplicasToComeUp
> >>>>>>>>       INFO: Waiting until we see more replicas up: total=2 found=1
> >>>>>>> timeoutin=179495
> >>>>>>>> 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>> waitForReplicasToComeUp
> >>>>>>>>       INFO: Waiting until we see more replicas up: total=2 found=1
> >>>>>>> timeoutin=178985
> >>>>>>>> ....
> >>>>>>>> ....
> >>>>>>>>
> >>>>>>>> Thanks for your help.
> >>>>>>>> Sudhakar.
> >>>>>>>>
> >>>>>>>> On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <
> markrmiller@gmail.com>
> >>>>>>> wrote:
> >>>>>>>> The waiting logging had to happen on restart unless it's some kind
> >> of
> >>>>>>> bug.
> >>>>>>>>
> >>>>>>>> Beyond that, something is off, but I have no clue why - it seems
> >> your
> >>>>>>> clusterstate.json is not up to date at all.
> >>>>>>>>
> >>>>>>>> Have you tried restarting the cluster then? Does that help at all?
> >>>>>>>>
> >>>>>>>> Do you see any exceptions around zookeeper session timeouts?
> >>>>>>>>
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <
> >>>> maddinenidev@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Mark,
> >>>>>>>>>
> >>>>>>>>> Yes, I am able to access all of the nodes under each shard from
> >>>>>>> solrcloud
> >>>>>>>>> admin UI.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> - *It kind of looks like the urls solrcloud is using are not
> >>>>>>> accessible.
> >>>>>>>>> When you go to the admin page and the cloud tab, can you access
> >>>>>>> the urls it
> >>>>>>>>> shows for each shard? That is, if you click on of the links or
> >>>>>>> copy and
> >>>>>>>>> paste the address into a web browser, does it work?*
> >>>>>>>>>
> >>>>>>>>> Actually, I got these errors when my document upload task/job was
> >>>>>>> running,
> >>>>>>>>> not during the cluster restart. Also,job ran fine initially for
> the
> >>>>>>> first
> >>>>>>>>> one hour and started throwing these errors after indexing some
> >>>> docx.
> >>>>>>>>>
> >>>>>>>>> Thx, Sudhakar.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <
> markrmiller@gmail.com
> >>>>>
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> It kind of looks like the urls solrcloud is using are not
> >>>>>>> accessible. When
> >>>>>>>>>> you go to the admin page and the cloud tab, can you access the
> >>>> urls
> >>>>>>> it
> >>>>>>>>>> shows for each shard? That is, if you click on of the links or
> >>>> copy
> >>>>>>> and
> >>>>>>>>>> paste the address into a web browser, does it work?
> >>>>>>>>>>
> >>>>>>>>>> You may have to explicitly set the host= in solr.xml if it's not
> >>>>>>> auto
> >>>>>>>>>> detecting the right one. Make sure the ports like right too.
> >>>>>>>>>>
> >>>>>>>>>>> waitForReplicasToComeUp
> >>>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1
> >>>>>>>>>>> timeoutin=179999
> >>>>>>>>>>
> >>>>>>>>>> That happens when you stop the cluster and try to start it
> again -
> >>>>>>> before
> >>>>>>>>>> a leader is chosen, it will wait for all known replicas fora
> shard
> >>>>>>> to come
> >>>>>>>>>> up so that everyone can sync up and have a chance to be the best
> >>>>>>> leader. So
> >>>>>>>>>> at this point it was only finding one of 2 known replicas and
> >>>>>>> waiting for
> >>>>>>>>>> the second to come up. After a couple minutes (configurable) it
> >>>>>>> will just
> >>>>>>>>>> continue anyway without the missing replica (if it doesn't show
> >>>> up).
> >>>>>>>>>>
> >>>>>>>>>> - Mark
> >>>>>>>>>>
> >>>>>>>>>> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <
> >>>>>>> maddinenidev@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>> We are uploading solr documents to the index in batches using
> 30
> >>>>>>> threads
> >>>>>>>>>>> and using ThreadPoolExecutor, LinkedBlockingQueue with max
> limit
> >>>>>>> set to
> >>>>>>>>>>> 10000.
> >>>>>>>>>>> In the code, we are using HttpSolrServer and add(inputDoc)
> method
> >>>>>>> to add
> >>>>>>>>>>> docx.
> >>>>>>>>>>> And, we have the following commit settings in solrconfig:
> >>>>>>>>>>>
> >>>>>>>>>>>  <autoCommit>
> >>>>>>>>>>>    <maxTime>300000</maxTime>
> >>>>>>>>>>>    <maxDocs>10000</maxDocs>
> >>>>>>>>>>>    <openSearcher>false</openSearcher>
> >>>>>>>>>>>  </autoCommit>
> >>>>>>>>>>>
> >>>>>>>>>>>    <autoSoftCommit>
> >>>>>>>>>>>      <maxTime>1000</maxTime>
> >>>>>>>>>>>    </autoSoftCommit>
> >>>>>>>>>>>
> >>>>>>>>>>> Cluster Details:
> >>>>>>>>>>> ----------------------------
> >>>>>>>>>>> solr version - 4.0
> >>>>>>>>>>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
> >>>>>>>>>>> numshards=2 ,
> >>>>>>>>>>> 001, 002, 003 are the solr nodes and these three are behind the
> >>>>>>>>>>> loadbalancer  <vip>
> >>>>>>>>>>> 001, 003 assigned to shard1; 002 assigned to shard2
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Logs:Getting the errors in the below sequence after uploading
> >>>> some
> >>>>>>> docx:
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>
> >>
> -----------------------------------------------------------------------------------------------------------
> >>>>>>>>>>> 003
> >>>>>>>>>>> Dec 4, 2012 12:11:46 PM
> >>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>>>> waitForReplicasToComeUp
> >>>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1
> >>>>>>>>>>> timeoutin=179999
> >>>>>>>>>>>
> >>>>>>>>>>> 001
> >>>>>>>>>>> Dec 4, 2012 12:12:59 PM
> >>>>>>>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor
> >>>>>>>>>>> doDefensiveChecks
> >>>>>>>>>>> SEVERE: ClusterState says we are the leader, but locally we
> don't
> >>>>>>> think
> >>>>>>>>>> so
> >>>>>>>>>>>
> >>>>>>>>>>> 003
> >>>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException
> log
> >>>>>>>>>>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed -
> >>>>>>> retrying ...
> >>>>>>>>>>>
> >>>>>>>>>>> 001
> >>>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException
> log
> >>>>>>>>>>> SEVERE: Error uploading: org.apache.solr.common.SolrException:
> >>>>>>> Server at
> >>>>>>>>>>> <vip>/solr/core1. returned non ok status:503, message:Service
> >>>>>>> Unavailable
> >>>>>>>>>>> at
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
> >>>>>>>>>>> at
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>>>> 001
> >>>>>>>>>>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException
> log
> >>>>>>>>>>> SEVERE: Error while trying to recover.
> >>>>>>>>>>> core=core1:org.apache.solr.common.SolrException: We are not the
> >>>>>>> leader
> >>>>>>>>>>> at
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >>>>>>>>>>>
> >>>>>>>>>>> 001
> >>>>>>>>>>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException
> log
> >>>>>>>>>>> SEVERE: Error uploading:
> >>>>>>>>>> org.apache.solr.client.solrj.SolrServerException:
> >>>>>>>>>>> IOException occured when talking to server at <vip>/solr/core1
> >>>>>>>>>>> at
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
> >>>>>>>>>>> at
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>>>> at
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>
> >>>>
> >>
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >>>>>>>>>>> at
> >>>> org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
> >>>>>>>>>>> ... 5 lines omitted ...
> >>>>>>>>>>> at java.lang.Thread.run(Unknown Source)
> >>>>>>>>>>> Caused by: java.net.SocketException: Connection reset
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> After sometime, all the three servers are going down.
> >>>>>>>>>>>
> >>>>>>>>>>> Appreciate, if someone could let us know what we are missing.
> >>>>>>>>>>>
> >>>>>>>>>>> Thx,Sudhakar.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> <logs_error.txt>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> >>
>
>

Re: SolrCloud - ClusterState says we are the leader, but locally ...

Posted by Mark Miller <ma...@gmail.com>.
What led you to trying that? I'm not connecting the dots in my head - the exception and the solution.

- Mark

On Feb 3, 2013, at 2:48 PM, Marcin Rzewucki <mr...@gmail.com> wrote:

> Hi,
> 
> I think the issue was not in zk client timeout, but POST request size. When
> I increased the value for Request.maxFormContentSize in jetty.xml I don't
> see this issue any more.
> 
> Regards.
> 
> On 3 February 2013 01:56, Mark Miller <ma...@gmail.com> wrote:
> 
>> Do you see anything about session expiration in the logs? That is the
>> likely culprit for something like this. You may need to raise the timeout:
>> http://wiki.apache.org/solr/SolrCloud#FAQ
>> 
>> If you see no session timeouts, I don't have a guess yet.
>> 
>> - Mark
>> 
>> On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki <mr...@gmail.com> wrote:
>> 
>>> I'm experiencing same problem in Solr4.1 during bulk loading. After 50
>>> minutes of indexing the following error starts to occur:
>>> 
>>> INFO: [core] webapp=/solr path=/update params={} {} 0 4
>>> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
>> the
>>> leader, but locally we don't think so
>>>       at
>>> 
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
>>>       at
>>> 
>> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
>>>       at
>>> 
>> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
>>>       at
>>> 
>> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>       at
>>> 
>> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
>>>       at
>>> 
>> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
>>>       at
>>> 
>> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
>>>       at
>>> org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
>>>       at
>>> 
>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>       at
>>> 
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>       at
>>> 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
>>>       at
>>> 
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
>>>       at
>>> 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
>>>       at
>>> 
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>>>       at
>>> 
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>>>       at
>>> 
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>>>       at
>>> 
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>>>       at
>>> 
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>>>       at
>>> 
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
>>>       at
>>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
>>>       at
>>> 
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>>>       at
>>> 
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
>>>       at
>>> 
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>>>       at
>>> 
>> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>>>       at
>>> 
>> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>>>       at
>>> 
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>>>       at org.eclipse.jetty.server.Server.handle(Server.java:365)
>>>       at
>>> 
>> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
>>>       at
>>> 
>> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
>>>       at
>>> 
>> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
>>>       at
>>> 
>> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
>>>       at
>> org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
>>>       at
>>> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>>>       at
>>> 
>> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>>>       at
>>> 
>> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>>>       at
>>> 
>> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>>>       at
>>> 
>> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>>>       at java.lang.Thread.run(Unknown Source)
>>> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
>>> Feb 02, 2013 11:36:31 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> waitForReplicasToComeUp
>>> INFO: Waiting until we see more replicas up: total=2 found=1
>> timeoutin=50699
>>> 
>>> Then leader tries to sync with replica and after it finishes I can
>> continue
>>> loading.
>>> None of SolrCloud nodes was restarted during that time. I don't remember
>>> such behaviour in Solr4.0. Could it be related with the number of fields
>>> indexed during loading ? I have a collection with about 2400 fields. I
>>> can't reproduce same issue for other collections with much less fields
>> per
>>> record.
>>> Regards.
>>> 
>>> On 11 December 2012 19:50, Sudhakar Maddineni <maddinenidev@gmail.com
>>> wrote:
>>> 
>>>> Just an update on this issue:
>>>>  We tried by increasing zookeeper client timeout settings to 30000ms in
>>>> solr.xml (i think default is 15000ms), and haven't seen any issues from
>> our
>>>> tests.
>>>> <cores .........           zkClientTimeout="30000" >
>>>> 
>>>> Thanks, Sudhakar.
>>>> 
>>>> On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni
>>>> <ma...@gmail.com>wrote:
>>>> 
>>>>> We saw this error again today during our load test - basically,
>> whenever
>>>>> session is getting expired on the leader node, we are seeing the
>>>>> error.After this happens, leader(001) is going into 'recovery' mode and
>>>> all
>>>>> the index updates are failing with "503- service unavailable" error
>>>>> message.After some time(once recovery is successful), roles are swapped
>>>>> i.e. 001 acting as the replica and 003 as leader.
>>>>> 
>>>>> Btw, do you know why the connection to zookeeper[solr->zk] getting
>>>>> interrupted in the middle?
>>>>> is it because of the load(no of updates) we are putting on the cluster?
>>>>> 
>>>>> Here is the exception stack trace:
>>>>> 
>>>>> *Dec* *7*, *2012* *2:28:03* *PM*
>>>> *org.apache.solr.cloud.Overseer$ClusterStateUpdater* *amILeader*
>>>>> *WARNING:*
>>>> *org.apache.zookeeper.KeeperException$SessionExpiredException:*
>>>> *KeeperErrorCode* *=* *Session* *expired* *for* */overseer_elect/leader*
>>>>>     *at*
>>>> 
>> *org.apache.zookeeper.KeeperException.create*(*KeeperException.java:118*)
>>>>>     *at*
>>>> *org.apache.zookeeper.KeeperException.create*(*KeeperException.java:42*)
>>>>>     *at* *org.apache.zookeeper.ZooKeeper.getData*(*ZooKeeper.java:927*
>>>>> )
>>>>>     *at*
>>>> 
>> *org.apache.solr.common.cloud.SolrZkClient$7.execute*(*SolrZkClient.java:244*)
>>>>>     *at*
>>>> 
>> *org.apache.solr.common.cloud.SolrZkClient$7.execute*(*SolrZkClient.java:241*)
>>>>>     *at*
>>>> 
>> *org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation*(*ZkCmdExecutor.java:63*)
>>>>>     *at*
>>>> 
>> *org.apache.solr.common.cloud.SolrZkClient.getData*(*SolrZkClient.java:241*)
>>>>>     *at*
>>>> 
>> *org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader*(*Overseer.java:195*)
>>>>>     *at*
>>>> 
>> *org.apache.solr.cloud.Overseer$ClusterStateUpdater.run*(*Overseer.java:119*)
>>>>>     *at* *java.lang.Thread.run*(*Unknown* *Source*)
>>>>> 
>>>>> Thx,Sudhakar.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <
>>>> maddinenidev@gmail.com
>>>>>> wrote:
>>>>> 
>>>>>> Erick:
>>>>>> Not seeing any page caching related issues...
>>>>>> 
>>>>>> Mark:
>>>>>> 1.Would this "waiting" on 003(replica) cause any inconsistencies in
>>>> the
>>>>>> zookeeper cluster state? I was also looking at the leader(001) logs at
>>>> that
>>>>>> time and seeing errors related to "*SEVERE: ClusterState says we are
>> the
>>>>>> leader, but locally we don't think so*".
>>>>>> 2.Also, all of our servers in cluster were gone down when the index
>>>>>> updates were running in parallel along with this issue.Do you see this
>>>>>> related to the session expiry on 001?
>>>>>> 
>>>>>> Here are the logs on 001
>>>>>> =========================
>>>>>> 
>>>>>> Dec 4, 2012 12:12:29 PM
>>>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>>>>> WARNING:
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>> KeeperErrorCode = Session expired for /overseer_elect/leader
>>>>>> at
>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>>>>>> at
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>>>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>>>>>> Dec 4, 2012 12:12:29 PM
>>>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>>>>> INFO: According to ZK I
>>>>>> (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a
>>>> leader.
>>>>>> 
>>>>>> Dec 4, 2012 12:12:29 PM
>>>> org.apache.solr.cloud.OverseerCollectionProcessor
>>>>>> run
>>>>>> WARNING: Overseer cannot talk to ZK
>>>>>> 
>>>>>> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: There was a problem finding the leader in
>>>>>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>>>>> at
>>>>>> 
>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>>>> at
>>>>>> 
>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>>>>>> Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: There was a problem finding the leader in
>>>>>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>>>>> at
>>>>>> 
>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>>>> at
>>>>>> 
>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>>>>>> Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: There was a problem making a request to the
>>>>>> leader:org.apache.solr.common.SolrException: I was asked to wait on
>>>> state
>>>>>> down for <001>:8080_solr but I still do not see the request state. I
>> see
>>>>>> state: active live:true
>>>>>> at
>>>>>> 
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>>>> Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: There was a problem making a request to the
>>>>>> leader:org.apache.solr.common.SolrException: I was asked to wait on
>>>> state
>>>>>> down for <001>:8080_solr but I still do not see the request state. I
>> see
>>>>>> state: active live:true
>>>>>> at
>>>>>> 
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>>>> ....
>>>>>> ....
>>>>>> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: There was a problem finding the leader in
>>>>>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>>>>> at
>>>>>> 
>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>>>> ....
>>>>>> ....
>>>>>> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: :org.apache.solr.common.SolrException: There was a problem
>>>>>> finding the leader in zk
>>>>>> at
>>>>>> 
>>>> 
>> org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
>>>>>> at
>>>>>> 
>>>> 
>> org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
>>>>>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
>>>>>> SEVERE: Error getting leader from zk
>>>>>> org.apache.solr.common.SolrException: *There is conflicting
>> information
>>>>>> about the leader of shard: shard1 our state says:http://
>>>> <001>:8080/solr/core1/
>>>>>> but zookeeper says:http://<003>:8080/solr/core1/*
>>>>>> * at
>>>> org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)*
>>>>>> * at
>> org.apache.solr.cloud.ZkController.register(ZkController.java:577)*
>>>>>> Dec 4, 2012 12:22:30 PM
>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>> INFO: Running the leader process.
>>>>>> ....
>>>>>> ....
>>>>>> 
>>>>>> Thanks for your inputs.
>>>>>> Sudhakar.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmiller@gmail.com
>>>>> wrote:
>>>>>> 
> >>>>>>> Yes - it means that 001 went down (or more likely had its connection
>>>> to
>>>>>>> ZooKeeper interrupted? that's what I mean about a session timeout -
>> if
>>>> the
>>>>>>> solr->zk link is broken for longer than the session timeout that will
>>>>>>> trigger a leader election and when the connection is reestablished,
>> the
>>>>>>> node will have to recover). That waiting should stop as soon as 001
>>>> came
>>>>>>> back up or reconnected to ZooKeeper.
>>>>>>> 
>>>>>>> In fact, this waiting should not happen in this case - but only on
>>>>>>> cluster restart. This is a bug that is fixed in 4.1 (hopefully coming
>>>> very
>>>>>>> soon!):
>>>>>>> 
>>>>>>> * SOLR-3940: Rejoining the leader election incorrectly triggers the
>>>> code
>>>>>>> path
>>>>>>> for a fresh cluster start rather than fail over. (Mark Miller)
>>>>>>> 
>>>>>>> - Mark
>>>>>>> 
>>>>>>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <
>> maddinenidev@gmail.com
>>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Yep, after restarting, cluster came back to normal state.We will run
>>>>>>> couple of more tests and see if we could reproduce this issue.
>>>>>>>> 
>>>>>>>> Btw, I am attaching the server logs before that 'INFO: Waiting until
>>>>>>> we see more replicas'  message.From the logs, we can see that leader
>>>>>>> election process started on 003 which was the replica for 001
>>>>>>> initially.That means leader 001 went down at that time?
>>>>>>>> 
>>>>>>>> logs on 003:
>>>>>>>> ========
>>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> runLeaderProcess
>>>>>>>>       INFO: Running the leader process.
>>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> shouldIBeLeader
>>>>>>>>       INFO: Checking if I should try and be the leader.
>>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> shouldIBeLeader
>>>>>>>>       INFO: My last published State was Active, it's okay to be the
>>>>>>> leader.
>>>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> runLeaderProcess
>>>>>>>>       INFO: I may be the new leader - try and sync
>>>>>>>> 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
>>>>>>>>       WARNING: Stopping recovery for
>>>> zkNodeName=<003>:8080_solr_core
>>>>>>> core=core1.
>>>>>>>> 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
>>>>>>>>       INFO: Sync replicas to http://<003>:8080/solr/core1/
>>>>>>>> 12:11:16 PM org.apache.solr.update.PeerSync sync
>>>>>>>>       INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
>>>>>>> replicas=[<001>:8080/solr/core1/] nUpdates=100
>>>>>>>> 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>>>>>>>       INFO: Updating live nodes -> this message is on 002
>>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
>>>>>>>>       WARNING: PeerSync: core=core1 url=http://<003>:8080/solr
>>>>>>> exception talking to <001>:8080/solr/core1/, failed
>>>>>>>>       org.apache.solr.client.solrj.SolrServerException: Timeout
>>>>>>> occured while waiting response from server at: <001>:8080/solr/core1
>>>>>>>>             at
>>>>>>> 
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
>>>>>>>>             at
>>>>>>> 
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>             at
>>>>>>> 
>>>> 
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>>>>>>>>             at
>>>>>>> 
>>>> 
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>>>>>>>             at
>>>> java.util.concurrent.FutureTask$Sync.innerRun(Unknown
>>>>>>> Source)
>>>>>>>>             at java.util.concurrent.FutureTask.run(Unknown Source)
>>>>>>>>             at
>>>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>>>>>>>>             at
>>>> java.util.concurrent.FutureTask$Sync.innerRun(Unknown
>>>>>>> Source)
>>>>>>>>             at java.util.concurrent.FutureTask.run(Unknown Source)
>>>>>>>>             at
>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
>> Source)
>>>>>>>>             at
>>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>>>>>             at java.lang.Thread.run(Unknown Source)
>>>>>>>>       Caused by: java.net.SocketTimeoutException: Read timed out
>>>>>>>>             at java.net.SocketInputStream.socketRead0(Native
>>>> Method)
>>>>>>>>             at java.net.SocketInputStream.read(Unknown Source)
>>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
>>>>>>>>       INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE.
>>>>>>> sync failed
>>>>>>>> 12:11:46 PM org.apache.solr.common.SolrException log
>>>>>>>>       SEVERE: Sync Failed
>>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> rejoinLeaderElection
>>>>>>>>       INFO: There is a better leader candidate than us - going back
>>>>>>> into recovery
>>>>>>>> 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
>>>>>>>>       INFO: Running recovery - first canceling any ongoing recovery
>>>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
>>>>>>>>       INFO: Starting recovery process.  core=core1
>>>>>>> recoveringAfterStartup=false
>>>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>>>>>>>>       INFO: Attempting to PeerSync from <001>:8080/solr/core1/
>>>>>>> core=core1 - recoveringAfterStartup=false
>>>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
>>>>>>>>       INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
>>>>>>> replicas=[<001>:8080/solr/core1/] nUpdates=100
>>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> runLeaderProcess
>>>>>>>>       INFO: Running the leader process.
>>>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> waitForReplicasToComeUp
>>>>>>>>       INFO: Waiting until we see more replicas up: total=2 found=1
>>>>>>> timeoutin=179999
>>>>>>>> 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> waitForReplicasToComeUp
>>>>>>>>       INFO: Waiting until we see more replicas up: total=2 found=1
>>>>>>> timeoutin=179495
>>>>>>>> 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>> waitForReplicasToComeUp
>>>>>>>>       INFO: Waiting until we see more replicas up: total=2 found=1
>>>>>>> timeoutin=178985
>>>>>>>> ....
>>>>>>>> ....
>>>>>>>> 
>>>>>>>> Thanks for your help.
>>>>>>>> Sudhakar.
>>>>>>>> 
>>>>>>>> On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <ma...@gmail.com>
>>>>>>> wrote:
>>>>>>>> The waiting logging had to happen on restart unless it's some kind
>> of
>>>>>>> bug.
>>>>>>>> 
>>>>>>>> Beyond that, something is off, but I have no clue why - it seems
>> your
>>>>>>> clusterstate.json is not up to date at all.
>>>>>>>> 
>>>>>>>> Have you tried restarting the cluster then? Does that help at all?
>>>>>>>> 
>>>>>>>> Do you see any exceptions around zookeeper session timeouts?
>>>>>>>> 
>>>>>>>> - Mark
>>>>>>>> 
>>>>>>>> On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <
>>>> maddinenidev@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hey Mark,
>>>>>>>>> 
>>>>>>>>> Yes, I am able to access all of the nodes under each shard from
>>>>>>> solrcloud
>>>>>>>>> admin UI.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> - *It kind of looks like the urls solrcloud is using are not
>>>>>>> accessible.
>>>>>>>>> When you go to the admin page and the cloud tab, can you access
>>>>>>> the urls it
>>>>>>>>> shows for each shard? That is, if you click on of the links or
>>>>>>> copy and
>>>>>>>>> paste the address into a web browser, does it work?*
>>>>>>>>> 
>>>>>>>>> Actually, I got these errors when my document upload task/job was
>>>>>>> running,
>>>>>>>>> not during the cluster restart. Also,job ran fine initially for the
>>>>>>> first
>>>>>>>>> one hour and started throwing these errors after indexing some
>>>> docx.
>>>>>>>>> 
>>>>>>>>> Thx, Sudhakar.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmiller@gmail.com
>>>>> 
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> It kind of looks like the urls solrcloud is using are not
>>>>>>> accessible. When
>>>>>>>>>> you go to the admin page and the cloud tab, can you access the
>>>> urls
>>>>>>> it
>>>>>>>>>> shows for each shard? That is, if you click on of the links or
>>>> copy
>>>>>>> and
>>>>>>>>>> paste the address into a web browser, does it work?
>>>>>>>>>> 
>>>>>>>>>> You may have to explicitly set the host= in solr.xml if it's not
>>>>>>> auto
>>>>>>>>>> detecting the right one. Make sure the ports like right too.
>>>>>>>>>> 
>>>>>>>>>>> waitForReplicasToComeUp
>>>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1
>>>>>>>>>>> timeoutin=179999
>>>>>>>>>> 
>>>>>>>>>> That happens when you stop the cluster and try to start it again -
>>>>>>> before
>>>>>>>>>> a leader is chosen, it will wait for all known replicas fora shard
>>>>>>> to come
>>>>>>>>>> up so that everyone can sync up and have a chance to be the best
>>>>>>> leader. So
>>>>>>>>>> at this point it was only finding one of 2 known replicas and
>>>>>>> waiting for
>>>>>>>>>> the second to come up. After a couple minutes (configurable) it
>>>>>>> will just
>>>>>>>>>> continue anyway without the missing replica (if it doesn't show
>>>> up).
>>>>>>>>>> 
>>>>>>>>>> - Mark
>>>>>>>>>> 
>>>>>>>>>> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <
>>>>>>> maddinenidev@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> We are uploading solr documents to the index in batches using 30
>>>>>>> threads
>>>>>>>>>>> and using ThreadPoolExecutor, LinkedBlockingQueue with max limit
>>>>>>> set to
>>>>>>>>>>> 10000.
>>>>>>>>>>> In the code, we are using HttpSolrServer and add(inputDoc) method
>>>>>>> to add
>>>>>>>>>>> docx.
>>>>>>>>>>> And, we have the following commit settings in solrconfig:
>>>>>>>>>>> 
>>>>>>>>>>>  <autoCommit>
>>>>>>>>>>>    <maxTime>300000</maxTime>
>>>>>>>>>>>    <maxDocs>10000</maxDocs>
>>>>>>>>>>>    <openSearcher>false</openSearcher>
>>>>>>>>>>>  </autoCommit>
>>>>>>>>>>> 
>>>>>>>>>>>    <autoSoftCommit>
>>>>>>>>>>>      <maxTime>1000</maxTime>
>>>>>>>>>>>    </autoSoftCommit>
>>>>>>>>>>> 
>>>>>>>>>>> Cluster Details:
>>>>>>>>>>> ----------------------------
>>>>>>>>>>> solr version - 4.0
>>>>>>>>>>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
>>>>>>>>>>> numshards=2 ,
>>>>>>>>>>> 001, 002, 003 are the solr nodes and these three are behind the
>>>>>>>>>>> loadbalancer  <vip>
>>>>>>>>>>> 001, 003 assigned to shard1; 002 assigned to shard2
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Logs:Getting the errors in the below sequence after uploading
>>>> some
>>>>>>> docx:
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>> 
>> -----------------------------------------------------------------------------------------------------------
>>>>>>>>>>> 003
>>>>>>>>>>> Dec 4, 2012 12:11:46 PM
>>>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>>>> waitForReplicasToComeUp
>>>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1
>>>>>>>>>>> timeoutin=179999
>>>>>>>>>>> 
>>>>>>>>>>> 001
>>>>>>>>>>> Dec 4, 2012 12:12:59 PM
>>>>>>>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor
>>>>>>>>>>> doDefensiveChecks
>>>>>>>>>>> SEVERE: ClusterState says we are the leader, but locally we don't
>>>>>>> think
>>>>>>>>>> so
>>>>>>>>>>> 
>>>>>>>>>>> 003
>>>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>>>>>>>>>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed -
>>>>>>> retrying ...
>>>>>>>>>>> 
>>>>>>>>>>> 001
>>>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>>>>>>>>>> SEVERE: Error uploading: org.apache.solr.common.SolrException:
>>>>>>> Server at
>>>>>>>>>>> <vip>/solr/core1. returned non ok status:503, message:Service
>>>>>>> Unavailable
>>>>>>>>>>> at
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>>>>>>>>>>> at
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>> 001
>>>>>>>>>>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
>>>>>>>>>>> SEVERE: Error while trying to recover.
>>>>>>>>>>> core=core1:org.apache.solr.common.SolrException: We are not the
>>>>>>> leader
>>>>>>>>>>> at
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>>>>>>>>> 
>>>>>>>>>>> 001
>>>>>>>>>>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
>>>>>>>>>>> SEVERE: Error uploading:
>>>>>>>>>> org.apache.solr.client.solrj.SolrServerException:
>>>>>>>>>>> IOException occured when talking to server at <vip>/solr/core1
>>>>>>>>>>> at
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>>>>>>>>>>> at
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>>>> at
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>> 
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>>>>>>>>> at
>>>> org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
>>>>>>>>>>> ... 5 lines omitted ...
>>>>>>>>>>> at java.lang.Thread.run(Unknown Source)
>>>>>>>>>>> Caused by: java.net.SocketException: Connection reset
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> After sometime, all the three servers are going down.
>>>>>>>>>>> 
>>>>>>>>>>> Appreciate, if someone could let us know what we are missing.
>>>>>>>>>>> 
>>>>>>>>>>> Thx,Sudhakar.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> <logs_error.txt>
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>> 


Re: SolrCloud - ClusterState says we are the leader,but locally ...

Posted by Marcin Rzewucki <mr...@gmail.com>.
Hi,

I think the issue was not the zk client timeout but the POST request size.
After I increased the value of Request.maxFormContentSize in jetty.xml, I no
longer see this issue.
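
For reference, a minimal sketch of that change, assuming the Jetty 8 setup that
ships with Solr 4.x: the 10 MB value below is just an illustrative assumption,
so pick a limit that covers your largest batch. It goes inside the existing
<Configure id="Server" ...> element of etc/jetty.xml:

    <!-- Raise the limit on POST bodies Jetty will buffer;
         the stock default is small (around 200 KB). -->
    <Call name="setAttribute">
      <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
      <Arg type="int">10485760</Arg>
    </Call>

Bulk updates against a collection with a couple of thousand fields can easily
produce request bodies larger than that default, which may be why raising the
limit helped here.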

Regards.

On 3 February 2013 01:56, Mark Miller <ma...@gmail.com> wrote:

> Do you see anything about session expiration in the logs? That is the
> likely culprit for something like this. You may need to raise the timeout:
> http://wiki.apache.org/solr/SolrCloud#FAQ
>
> If you see no session timeouts, I don't have a guess yet.
>
> - Mark
>
> On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki <mr...@gmail.com> wrote:
>
> > I'm experiencing same problem in Solr4.1 during bulk loading. After 50
> > minutes of indexing the following error starts to occur:
> >
> > INFO: [core] webapp=/solr path=/update params={} {} 0 4
> > Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
> > SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
> the
> > leader, but locally we don't think so
> >        at
> >
> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
> >        at
> >
> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
> >        at
> >
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
> >        at
> >
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >        at
> >
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
> >        at
> >
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
> >        at
> >
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
> >        at
> > org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
> >        at
> >
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >        at
> >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >        at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
> >        at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
> >        at
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
> >        at
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
> >        at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> >        at
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
> >        at
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> >        at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
> >        at
> > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
> >        at
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> >        at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
> >        at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> >        at
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> >        at
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> >        at
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> >        at org.eclipse.jetty.server.Server.handle(Server.java:365)
> >        at
> >
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
> >        at
> >
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> >        at
> >
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
> >        at
> >
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
> >        at
> org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
> >        at
> > org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
> >        at
> >
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> >        at
> >
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> >        at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> >        at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> >        at java.lang.Thread.run(Unknown Source)
> > Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
> > Feb 02, 2013 11:36:31 PM org.apache.solr.cloud.ShardLeaderElectionContext
> > waitForReplicasToComeUp
> > INFO: Waiting until we see more replicas up: total=2 found=1
> timeoutin=50699
> >
> > Then leader tries to sync with replica and after it finishes I can
> continue
> > loading.
> > None of SolrCloud nodes was restarted during that time. I don't remember
> > such behaviour in Solr4.0. Could it be related with the number of fields
> > indexed during loading ? I have a collection with about 2400 fields. I
> > can't reproduce same issue for other collections with much less fields
> per
> > record.
> > Regards.
> >
> > On 11 December 2012 19:50, Sudhakar Maddineni <maddinenidev@gmail.com
> >wrote:
> >
> >> Just an update on this issue:
> >>   We tried by increasing zookeeper client timeout settings to 30000ms in
> >> solr.xml (i think default is 15000ms), and haven't seen any issues from
> our
> >> tests.
> >> <cores .........           zkClientTimeout="30000" >
> >>
> >> Thanks, Sudhakar.
> >>
> >> On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni
> >> <ma...@gmail.com>wrote:
> >>
> >>> We saw this error again today during our load test - basically,
> whenever
> >>> session is getting expired on the leader node, we are seeing the
> >>> error.After this happens, leader(001) is going into 'recovery' mode and
> >> all
> >>> the index updates are failing with "503- service unavailable" error
> >>> message.After some time(once recovery is successful), roles are swapped
> >>> i.e. 001 acting as the replica and 003 as leader.
> >>>
> >>> Btw, do you know why the connection to zookeeper[solr->zk] getting
> >>> interrupted in the middle?
> >>> is it because of the load(no of updates) we are putting on the cluster?
> >>>
> >>> Here is the exception stack trace:
> >>>
> >>> *Dec* *7*, *2012* *2:28:03* *PM*
> >> *org.apache.solr.cloud.Overseer$ClusterStateUpdater* *amILeader*
> >>> *WARNING:*
> >> *org.apache.zookeeper.KeeperException$SessionExpiredException:*
> >> *KeeperErrorCode* *=* *Session* *expired* *for* */overseer_elect/leader*
> >>>      *at*
> >>
> *org.apache.zookeeper.KeeperException.create*(*KeeperException.java:118*)
> >>>      *at*
> >> *org.apache.zookeeper.KeeperException.create*(*KeeperException.java:42*)
> >>>      *at* *org.apache.zookeeper.ZooKeeper.getData*(*ZooKeeper.java:927*
> >>> )
> >>>      *at*
> >>
> *org.apache.solr.common.cloud.SolrZkClient$7.execute*(*SolrZkClient.java:244*)
> >>>      *at*
> >>
> *org.apache.solr.common.cloud.SolrZkClient$7.execute*(*SolrZkClient.java:241*)
> >>>      *at*
> >>
> *org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation*(*ZkCmdExecutor.java:63*)
> >>>      *at*
> >>
> *org.apache.solr.common.cloud.SolrZkClient.getData*(*SolrZkClient.java:241*)
> >>>      *at*
> >>
> *org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader*(*Overseer.java:195*)
> >>>      *at*
> >>
> *org.apache.solr.cloud.Overseer$ClusterStateUpdater.run*(*Overseer.java:119*)
> >>>      *at* *java.lang.Thread.run*(*Unknown* *Source*)
> >>>
> >>> Thx,Sudhakar.
> >>>
> >>>
> >>>
> >>> On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <
> >> maddinenidev@gmail.com
> >>>> wrote:
> >>>
> >>>> Erick:
> >>>>  Not seeing any page caching related issues...
> >>>>
> >>>> Mark:
> >>>>  1.Would this "waiting" on 003(replica) cause any inconsistencies in
> >> the
> >>>> zookeeper cluster state? I was also looking at the leader(001) logs at
> >> that
> >>>> time and seeing errors related to "*SEVERE: ClusterState says we are
> the
> >>>> leader, but locally we don't think so*".
> >>>>  2.Also, all of our servers in cluster were gone down when the index
> >>>> updates were running in parallel along with this issue.Do you see this
> >>>> related to the session expiry on 001?
> >>>>
> >>>> Here are the logs on 001
> >>>> =========================
> >>>>
> >>>> Dec 4, 2012 12:12:29 PM
> >>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> >>>> WARNING:
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>> KeeperErrorCode = Session expired for /overseer_elect/leader
> >>>> at
> >> org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
> >>>> at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> >>>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
> >>>> Dec 4, 2012 12:12:29 PM
> >>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> >>>> INFO: According to ZK I
> >>>> (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a
> >> leader.
> >>>>
> >>>> Dec 4, 2012 12:12:29 PM
> >> org.apache.solr.cloud.OverseerCollectionProcessor
> >>>> run
> >>>> WARNING: Overseer cannot talk to ZK
> >>>>
> >>>> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
> >>>> SEVERE: There was a problem finding the leader in
> >>>> zk:org.apache.solr.common.SolrException: Could not get leader props
> >>>> at
> >>>>
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> >>>> at
> >>>>
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
> >>>> Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
> >>>> SEVERE: There was a problem finding the leader in
> >>>> zk:org.apache.solr.common.SolrException: Could not get leader props
> >>>> at
> >>>>
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> >>>> at
> >>>>
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
> >>>> Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
> >>>> SEVERE: There was a problem making a request to the
> >>>> leader:org.apache.solr.common.SolrException: I was asked to wait on
> >> state
> >>>> down for <001>:8080_solr but I still do not see the request state. I
> see
> >>>> state: active live:true
> >>>> at
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >>>> Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
> >>>> SEVERE: There was a problem making a request to the
> >>>> leader:org.apache.solr.common.SolrException: I was asked to wait on
> >> state
> >>>> down for <001>:8080_solr but I still do not see the request state. I
> see
> >>>> state: active live:true
> >>>> at
> >>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >>>> ....
> >>>> ....
> >>>> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
> >>>> SEVERE: There was a problem finding the leader in
> >>>> zk:org.apache.solr.common.SolrException: Could not get leader props
> >>>> at
> >>>>
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> >>>> ....
> >>>> ....
> >>>> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
> >>>> SEVERE: :org.apache.solr.common.SolrException: There was a problem
> >>>> finding the leader in zk
> >>>> at
> >>>>
> >>
> org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
> >>>> at
> >>>>
> >>
> org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
> >>>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
> >>>> SEVERE: Error getting leader from zk
> >>>> org.apache.solr.common.SolrException: *There is conflicting
> information
> >>>> about the leader of shard: shard1 our state says:http://
> >> <001>:8080/solr/core1/
> >>>> but zookeeper says:http://<003>:8080/solr/core1/*
> >>>> * at
> >> org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)*
> >>>> * at
> org.apache.solr.cloud.ZkController.register(ZkController.java:577)*
> >>>> Dec 4, 2012 12:22:30 PM
> >>>> org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>> INFO: Running the leader process.
> >>>> ....
> >>>> ....
> >>>>
> >>>> Thanks for your inputs.
> >>>> Sudhakar.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmiller@gmail.com
> >>> wrote:
> >>>>
> >>>>> Yes - it means that 001 went down (or more likely had its connection
> >> to
> >>>>> ZooKeeper interrupted? that's what I mean about a session timeout -
> if
> >> the
> >>>>> solr->zk link is broken for longer than the session timeout that will
> >>>>> trigger a leader election and when the connection is reestablished,
> the
> >>>>> node will have to recover). That waiting should stop as soon as 001
> >> came
> >>>>> back up or reconnected to ZooKeeper.
> >>>>>
> >>>>> In fact, this waiting should not happen in this case - but only on
> >>>>> cluster restart. This is a bug that is fixed in 4.1 (hopefully coming
> >> very
> >>>>> soon!):
> >>>>>
> >>>>> * SOLR-3940: Rejoining the leader election incorrectly triggers the
> >> code
> >>>>> path
> >>>>>  for a fresh cluster start rather than fail over. (Mark Miller)
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <
> maddinenidev@gmail.com
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>> Yep, after restarting, cluster came back to normal state.We will run
> >>>>> couple of more tests and see if we could reproduce this issue.
> >>>>>>
> >>>>>> Btw, I am attaching the server logs before that 'INFO: Waiting until
> >>>>> we see more replicas'  message.From the logs, we can see that leader
> >>>>> election process started on 003 which was the replica for 001
> >>>>> initially.That means leader 001 went down at that time?
> >>>>>>
> >>>>>> logs on 003:
> >>>>>> ========
> >>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>> runLeaderProcess
> >>>>>>        INFO: Running the leader process.
> >>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>> shouldIBeLeader
> >>>>>>        INFO: Checking if I should try and be the leader.
> >>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>> shouldIBeLeader
> >>>>>>        INFO: My last published State was Active, it's okay to be the
> >>>>> leader.
> >>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>> runLeaderProcess
> >>>>>>        INFO: I may be the new leader - try and sync
> >>>>>> 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
> >>>>>>        WARNING: Stopping recovery for
> >> zkNodeName=<003>:8080_solr_core
> >>>>> core=core1.
> >>>>>> 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
> >>>>>>        INFO: Sync replicas to http://<003>:8080/solr/core1/
> >>>>>> 12:11:16 PM org.apache.solr.update.PeerSync sync
> >>>>>>        INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
> >>>>> replicas=[<001>:8080/solr/core1/] nUpdates=100
> >>>>>> 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> >>>>>>        INFO: Updating live nodes -> this message is on 002
> >>>>>> 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
> >>>>>>        WARNING: PeerSync: core=core1 url=http://<003>:8080/solr
> >>>>> exception talking to <001>:8080/solr/core1/, failed
> >>>>>>        org.apache.solr.client.solrj.SolrServerException: Timeout
> >>>>> occured while waiting response from server at: <001>:8080/solr/core1
> >>>>>>              at
> >>>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
> >>>>>>              at
> >>>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>              at
> >>>>>
> >>
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
> >>>>>>              at
> >>>>>
> >>
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
> >>>>>>              at
> >> java.util.concurrent.FutureTask$Sync.innerRun(Unknown
> >>>>> Source)
> >>>>>>              at java.util.concurrent.FutureTask.run(Unknown Source)
> >>>>>>              at
> >>>>> java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> >>>>>>              at
> >> java.util.concurrent.FutureTask$Sync.innerRun(Unknown
> >>>>> Source)
> >>>>>>              at java.util.concurrent.FutureTask.run(Unknown Source)
> >>>>>>              at
> >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
> Source)
> >>>>>>              at
> >>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> >>>>>>              at java.lang.Thread.run(Unknown Source)
> >>>>>>        Caused by: java.net.SocketTimeoutException: Read timed out
> >>>>>>              at java.net.SocketInputStream.socketRead0(Native
> >> Method)
> >>>>>>              at java.net.SocketInputStream.read(Unknown Source)
> >>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
> >>>>>>        INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE.
> >>>>> sync failed
> >>>>>> 12:11:46 PM org.apache.solr.common.SolrException log
> >>>>>>        SEVERE: Sync Failed
> >>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>> rejoinLeaderElection
> >>>>>>        INFO: There is a better leader candidate than us - going back
> >>>>> into recovery
> >>>>>> 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
> >>>>>>        INFO: Running recovery - first canceling any ongoing recovery
> >>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
> >>>>>>        INFO: Starting recovery process.  core=core1
> >>>>> recoveringAfterStartup=false
> >>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> >>>>>>        INFO: Attempting to PeerSync from <001>:8080/solr/core1/
> >>>>> core=core1 - recoveringAfterStartup=false
> >>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
> >>>>>>        INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
> >>>>> replicas=[<001>:8080/solr/core1/] nUpdates=100
> >>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>> runLeaderProcess
> >>>>>>        INFO: Running the leader process.
> >>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>> waitForReplicasToComeUp
> >>>>>>        INFO: Waiting until we see more replicas up: total=2 found=1
> >>>>> timeoutin=179999
> >>>>>> 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>> waitForReplicasToComeUp
> >>>>>>        INFO: Waiting until we see more replicas up: total=2 found=1
> >>>>> timeoutin=179495
> >>>>>> 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>> waitForReplicasToComeUp
> >>>>>>        INFO: Waiting until we see more replicas up: total=2 found=1
> >>>>> timeoutin=178985
> >>>>>> ....
> >>>>>> ....
> >>>>>>
> >>>>>> Thanks for your help.
> >>>>>> Sudhakar.
> >>>>>>
> >>>>>> On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <ma...@gmail.com>
> >>>>> wrote:
> >>>>>> The waiting logging had to happen on restart unless it's some kind
> of
> >>>>> bug.
> >>>>>>
> >>>>>> Beyond that, something is off, but I have no clue why - it seems
> your
> >>>>> clusterstate.json is not up to date at all.
> >>>>>>
> >>>>>> Have you tried restarting the cluster then? Does that help at all?
> >>>>>>
> >>>>>> Do you see any exceptions around zookeeper session timeouts?
> >>>>>>
> >>>>>> - Mark
> >>>>>>
> >>>>>> On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <
> >> maddinenidev@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>>> Hey Mark,
> >>>>>>>
> >>>>>>> Yes, I am able to access all of the nodes under each shard from
> >>>>> solrcloud
> >>>>>>> admin UI.
> >>>>>>>
> >>>>>>>
> >>>>>>>  - *It kind of looks like the urls solrcloud is using are not
> >>>>> accessible.
> >>>>>>>  When you go to the admin page and the cloud tab, can you access
> >>>>> the urls it
> >>>>>>>  shows for each shard? That is, if you click one of the links or
> >>>>> copy and
> >>>>>>>  paste the address into a web browser, does it work?*
> >>>>>>>
> >>>>>>> Actually, I got these errors when my document upload task/job was
> >>>>> running,
> >>>>>>> not during the cluster restart. Also,job ran fine initially for the
> >>>>> first
> >>>>>>> one hour and started throwing these errors after indexing some
> >> docx.
> >>>>>>>
> >>>>>>> Thx, Sudhakar.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmiller@gmail.com
> >>>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> It kind of looks like the urls solrcloud is using are not
> >>>>> accessible. When
> >>>>>>>> you go to the admin page and the cloud tab, can you access the
> >> urls
> >>>>> it
> >>>>>>>> shows for each shard? That is, if you click one of the links or
> >> copy
> >>>>> and
> >>>>>>>> paste the address into a web browser, does it work?
> >>>>>>>>
> >>>>>>>> You may have to explicitly set the host= in solr.xml if it's not
> >>>>> auto
> >>>>>>>> detecting the right one. Make sure the ports look right too.
> >>>>>>>>
> >>>>>>>>> waitForReplicasToComeUp
> >>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1
> >>>>>>>>> timeoutin=179999
> >>>>>>>>
> >>>>>>>> That happens when you stop the cluster and try to start it again -
> >>>>> before
> >>>>>>>> a leader is chosen, it will wait for all known replicas for a shard
> >>>>> to come
> >>>>>>>> up so that everyone can sync up and have a chance to be the best
> >>>>> leader. So
> >>>>>>>> at this point it was only finding one of 2 known replicas and
> >>>>> waiting for
> >>>>>>>> the second to come up. After a couple minutes (configurable) it
> >>>>> will just
> >>>>>>>> continue anyway without the missing replica (if it doesn't show
> >> up).
> >>>>>>>>
> >>>>>>>> - Mark
> >>>>>>>>
> >>>>>>>> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <
> >>>>> maddinenidev@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>> We are uploading solr documents to the index in batches using 30
> >>>>> threads
> >>>>>>>>> and using ThreadPoolExecutor, LinkedBlockingQueue with max limit
> >>>>> set to
> >>>>>>>>> 10000.
> >>>>>>>>> In the code, we are using HttpSolrServer and add(inputDoc) method
> >>>>> to add
> >>>>>>>>> docx.
> >>>>>>>>> And, we have the following commit settings in solrconfig:
> >>>>>>>>>
> >>>>>>>>>   <autoCommit>
> >>>>>>>>>     <maxTime>300000</maxTime>
> >>>>>>>>>     <maxDocs>10000</maxDocs>
> >>>>>>>>>     <openSearcher>false</openSearcher>
> >>>>>>>>>   </autoCommit>
> >>>>>>>>>
> >>>>>>>>>     <autoSoftCommit>
> >>>>>>>>>       <maxTime>1000</maxTime>
> >>>>>>>>>     </autoSoftCommit>
> >>>>>>>>>
> >>>>>>>>> Cluster Details:
> >>>>>>>>> ----------------------------
> >>>>>>>>> solr version - 4.0
> >>>>>>>>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
> >>>>>>>>> numshards=2 ,
> >>>>>>>>> 001, 002, 003 are the solr nodes and these three are behind the
> >>>>>>>>> loadbalancer  <vip>
> >>>>>>>>> 001, 003 assigned to shard1; 002 assigned to shard2
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Logs:Getting the errors in the below sequence after uploading
> >> some
> >>>>> docx:
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> >>
> -----------------------------------------------------------------------------------------------------------
> >>>>>>>>> 003
> >>>>>>>>> Dec 4, 2012 12:11:46 PM
> >>>>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>>>>>>>> waitForReplicasToComeUp
> >>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1
> >>>>>>>>> timeoutin=179999
> >>>>>>>>>
> >>>>>>>>> 001
> >>>>>>>>> Dec 4, 2012 12:12:59 PM
> >>>>>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor
> >>>>>>>>> doDefensiveChecks
> >>>>>>>>> SEVERE: ClusterState says we are the leader, but locally we don't
> >>>>> think
> >>>>>>>> so
> >>>>>>>>>
> >>>>>>>>> 003
> >>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> >>>>>>>>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed -
> >>>>> retrying ...
> >>>>>>>>>
> >>>>>>>>> 001
> >>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> >>>>>>>>> SEVERE: Error uploading: org.apache.solr.common.SolrException:
> >>>>> Server at
> >>>>>>>>> <vip>/solr/core1. returned non ok status:503, message:Service
> >>>>> Unavailable
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>> 001
> >>>>>>>>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
> >>>>>>>>> SEVERE: Error while trying to recover.
> >>>>>>>>> core=core1:org.apache.solr.common.SolrException: We are not the
> >>>>> leader
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >>>>>>>>>
> >>>>>>>>> 001
> >>>>>>>>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
> >>>>>>>>> SEVERE: Error uploading:
> >>>>>>>> org.apache.solr.client.solrj.SolrServerException:
> >>>>>>>>> IOException occured when talking to server at <vip>/solr/core1
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>> at
> >>>>>>>>>
> >>>>>>>>
> >>>>>
> >>
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >>>>>>>>> at
> >> org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
> >>>>>>>>> ... 5 lines omitted ...
> >>>>>>>>> at java.lang.Thread.run(Unknown Source)
> >>>>>>>>> Caused by: java.net.SocketException: Connection reset
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> After some time, all three servers are going down.
> >>>>>>>>>
> >>>>>>>>> Appreciate, if someone could let us know what we are missing.
> >>>>>>>>>
> >>>>>>>>> Thx,Sudhakar.
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> <logs_error.txt>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: SolrCloud - ClusterState says we are the leader,but locally ...

Posted by Mark Miller <ma...@gmail.com>.
Do you see anything about session expiration in the logs? That is the likely culprit for something like this. You may need to raise the timeout: http://wiki.apache.org/solr/SolrCloud#FAQ
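
For reference, a minimal sketch of what raising the timeout can look like in a
4.x-style solr.xml; the 30000 ms value and the surrounding attributes are only
illustrative assumptions, and the client timeout still has to fit within the
ZooKeeper server's session limits:

    <cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:8080}"
           zkClientTimeout="${zkClientTimeout:30000}">
      <core name="core1" instanceDir="core1" />
    </cores>

If your solr.xml already carries the ${zkClientTimeout:15000} placeholder from
the stock example, starting Solr with -DzkClientTimeout=30000 should have the
same effect. Long GC pauses on the Solr side are a common reason the session
expires in the first place, so the GC logs are worth a look as well.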

If you see no session timeouts, I don't have a guess yet.

- Mark

On Feb 2, 2013, at 7:35 PM, Marcin Rzewucki <mr...@gmail.com> wrote:

> I'm experiencing same problem in Solr4.1 during bulk loading. After 50
> minutes of indexing the following error starts to occur:
> 
> INFO: [core] webapp=/solr path=/update params={} {} 0 4
> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the
> leader, but locally we don't think so
>        at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
>        at
> org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
>        at
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
>        at
> org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>        at
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
>        at
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
>        at
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
>        at
> org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
>        at
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>        at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>        at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
>        at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>        at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>        at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>        at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>        at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>        at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
>        at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
>        at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>        at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
>        at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>        at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>        at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>        at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>        at org.eclipse.jetty.server.Server.handle(Server.java:365)
>        at
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
>        at
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
>        at
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
>        at
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
>        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
>        at
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
>        at
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
>        at
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
>        at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
>        at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
>        at java.lang.Thread.run(Unknown Source)
> Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
> Feb 02, 2013 11:36:31 PM org.apache.solr.cloud.ShardLeaderElectionContext
> waitForReplicasToComeUp
> INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=50699
> 
> Then the leader tries to sync with the replica, and after it finishes I can
> continue loading.
> None of the SolrCloud nodes was restarted during that time. I don't remember
> such behaviour in Solr 4.0. Could it be related to the number of fields
> indexed during loading? I have a collection with about 2400 fields, and I
> can't reproduce the same issue for other collections with far fewer fields
> per record.
> Regards.
> 
> On 11 December 2012 19:50, Sudhakar Maddineni <ma...@gmail.com>wrote:
> 
>> Just an update on this issue:
>>   We tried increasing the ZooKeeper client timeout to 30000ms in
>> solr.xml (I think the default is 15000ms) and haven't seen any issues in
>> our tests since.
>> <cores .........           zkClientTimeout="30000" >
>> 
>> Thanks, Sudhakar.
>> 
>> On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni
>> <ma...@gmail.com>wrote:
>> 
>>> We saw this error again today during our load test - basically, whenever
>>> the session expires on the leader node, we see the error. After this
>>> happens, the leader (001) goes into 'recovery' mode and all the index
>>> updates fail with a "503 - Service Unavailable" error message. After some
>>> time (once recovery is successful), the roles are swapped, i.e. 001 acts
>>> as the replica and 003 as the leader.
>>>
>>> Btw, do you know why the connection to ZooKeeper [solr->zk] is getting
>>> interrupted in the middle?
>>> Is it because of the load (number of updates) we are putting on the
>>> cluster?
>>> 
>>> Here is the exception stack trace:
>>> 
>>> Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>> WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
>>>      at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>>>      at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>>>      at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
>>>      at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
>>>      at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
>>>      at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
>>>      at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
>>>      at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
>>>      at java.lang.Thread.run(Unknown Source)
>>> 
>>> Thx,Sudhakar.
>>> 
>>> 
>>> 
>>> On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <
>> maddinenidev@gmail.com
>>>> wrote:
>>> 
>>>> Erick:
>>>>  Not seeing any page caching related issues...
>>>> 
>>>> Mark:
>>>>  1.Would this "waiting" on 003(replica) cause any inconsistencies in
>> the
>>>> zookeeper cluster state? I was also looking at the leader(001) logs at
>> that
>>>> time and seeing errors related to "*SEVERE: ClusterState says we are the
>>>> leader, but locally we don't think so*".
>>>>  2.Also, all of our servers in cluster were gone down when the index
>>>> updates were running in parallel along with this issue.Do you see this
>>>> related to the session expiry on 001?
>>>> 
>>>> Here are the logs on 001
>>>> =========================
>>>> 
>>>> Dec 4, 2012 12:12:29 PM
>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>>> WARNING:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for /overseer_elect/leader
>>>> at
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>>>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>>>> Dec 4, 2012 12:12:29 PM
>>>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>>>> INFO: According to ZK I
>>>> (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a
>> leader.
>>>> 
>>>> Dec 4, 2012 12:12:29 PM
>> org.apache.solr.cloud.OverseerCollectionProcessor
>>>> run
>>>> WARNING: Overseer cannot talk to ZK
>>>> 
>>>> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem finding the leader in
>>>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>>> at
>>>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>> at
>>>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>>>> Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem finding the leader in
>>>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>>> at
>>>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>> at
>>>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>>>> Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem making a request to the
>>>> leader:org.apache.solr.common.SolrException: I was asked to wait on
>> state
>>>> down for <001>:8080_solr but I still do not see the request state. I see
>>>> state: active live:true
>>>> at
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>> Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem making a request to the
>>>> leader:org.apache.solr.common.SolrException: I was asked to wait on
>> state
>>>> down for <001>:8080_solr but I still do not see the request state. I see
>>>> state: active live:true
>>>> at
>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>> ....
>>>> ....
>>>> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
>>>> SEVERE: There was a problem finding the leader in
>>>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>>> at
>>>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>>>> ....
>>>> ....
>>>> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
>>>> SEVERE: :org.apache.solr.common.SolrException: There was a problem
>>>> finding the leader in zk
>>>> at
>>>> 
>> org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
>>>> at
>>>> 
>> org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
>>>> Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
>>>> SEVERE: Error getting leader from zk
>>>> org.apache.solr.common.SolrException: *There is conflicting information
>>>> about the leader of shard: shard1 our state says:http://
>> <001>:8080/solr/core1/
>>>> but zookeeper says:http://<003>:8080/solr/core1/*
>>>> * at
>> org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)*
>>>> * at org.apache.solr.cloud.ZkController.register(ZkController.java:577)*
>>>> Dec 4, 2012 12:22:30 PM
>>>> org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>> INFO: Running the leader process.
>>>> ....
>>>> ....
>>>> 
>>>> Thanks for your inputs.
>>>> Sudhakar.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmiller@gmail.com
>>> wrote:
>>>> 
>>>>> Yes - it means that 001 went down (or more likely had it's connection
>> to
>>>>> ZooKeeper interrupted? that's what I mean about a session timeout - if
>> the
>>>>> solr->zk link is broken for longer than the session timeout that will
>>>>> trigger a leader election and when the connection is reestablished, the
>>>>> node will have to recover). That waiting should stop as soon as 001
>> came
>>>>> back up or reconnected to ZooKeeper.
>>>>> 
>>>>> In fact, this waiting should not happen in this case - but only on
>>>>> cluster restart. This is a bug that is fixed in 4.1 (hopefully coming
>> very
>>>>> soon!):
>>>>> 
>>>>> * SOLR-3940: Rejoining the leader election incorrectly triggers the
>> code
>>>>> path
>>>>>  for a fresh cluster start rather than fail over. (Mark Miller)
>>>>> 
>>>>> - Mark
>>>>> 
>>>>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <maddinenidev@gmail.com
>>> 
>>>>> wrote:
>>>>> 
>>>>>> Yep, after restarting, cluster came back to normal state.We will run
>>>>> couple of more tests and see if we could reproduce this issue.
>>>>>> 
>>>>>> Btw, I am attaching the server logs before that 'INFO: Waiting until
>>>>> we see more replicas'  message.From the logs, we can see that leader
>>>>> election process started on 003 which was the replica for 001
>>>>> initially.That means leader 001 went down at that time?
>>>>>> 
>>>>>> logs on 003:
>>>>>> ========
>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>> runLeaderProcess
>>>>>>        INFO: Running the leader process.
>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>> shouldIBeLeader
>>>>>>        INFO: Checking if I should try and be the leader.
>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>> shouldIBeLeader
>>>>>>        INFO: My last published State was Active, it's okay to be the
>>>>> leader.
>>>>>> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>> runLeaderProcess
>>>>>>        INFO: I may be the new leader - try and sync
>>>>>> 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
>>>>>>        WARNING: Stopping recovery for
>> zkNodeName=<003>:8080_solr_core
>>>>> core=core1.
>>>>>> 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
>>>>>>        INFO: Sync replicas to http://<003>:8080/solr/core1/
>>>>>> 12:11:16 PM org.apache.solr.update.PeerSync sync
>>>>>>        INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
>>>>> replicas=[<001>:8080/solr/core1/] nUpdates=100
>>>>>> 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>>>>>        INFO: Updating live nodes -> this message is on 002
>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
>>>>>>        WARNING: PeerSync: core=core1 url=http://<003>:8080/solr
>>>>> exception talking to <001>:8080/solr/core1/, failed
>>>>>>        org.apache.solr.client.solrj.SolrServerException: Timeout
>>>>> occured while waiting response from server at: <001>:8080/solr/core1
>>>>>>              at
>>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
>>>>>>              at
>>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>              at
>>>>> 
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>>>>>>              at
>>>>> 
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>>>>>              at
>> java.util.concurrent.FutureTask$Sync.innerRun(Unknown
>>>>> Source)
>>>>>>              at java.util.concurrent.FutureTask.run(Unknown Source)
>>>>>>              at
>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>>>>>>              at
>> java.util.concurrent.FutureTask$Sync.innerRun(Unknown
>>>>> Source)
>>>>>>              at java.util.concurrent.FutureTask.run(Unknown Source)
>>>>>>              at
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>>>>>              at
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>>>>              at java.lang.Thread.run(Unknown Source)
>>>>>>        Caused by: java.net.SocketTimeoutException: Read timed out
>>>>>>              at java.net.SocketInputStream.socketRead0(Native
>> Method)
>>>>>>              at java.net.SocketInputStream.read(Unknown Source)
>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
>>>>>>        INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE.
>>>>> sync failed
>>>>>> 12:11:46 PM org.apache.solr.common.SolrException log
>>>>>>        SEVERE: Sync Failed
>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>> rejoinLeaderElection
>>>>>>        INFO: There is a better leader candidate than us - going back
>>>>> into recovery
>>>>>> 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
>>>>>>        INFO: Running recovery - first canceling any ongoing recovery
>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
>>>>>>        INFO: Starting recovery process.  core=core1
>>>>> recoveringAfterStartup=false
>>>>>> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>>>>>>        INFO: Attempting to PeerSync from <001>:8080/solr/core1/
>>>>> core=core1 - recoveringAfterStartup=false
>>>>>> 12:11:46 PM org.apache.solr.update.PeerSync sync
>>>>>>        INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
>>>>> replicas=[<001>:8080/solr/core1/] nUpdates=100
>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>> runLeaderProcess
>>>>>>        INFO: Running the leader process.
>>>>>> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>> waitForReplicasToComeUp
>>>>>>        INFO: Waiting until we see more replicas up: total=2 found=1
>>>>> timeoutin=179999
>>>>>> 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>> waitForReplicasToComeUp
>>>>>>        INFO: Waiting until we see more replicas up: total=2 found=1
>>>>> timeoutin=179495
>>>>>> 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>>>> waitForReplicasToComeUp
>>>>>>        INFO: Waiting until we see more replicas up: total=2 found=1
>>>>> timeoutin=178985
>>>>>> ....
>>>>>> ....
>>>>>> 
>>>>>> Thanks for your help.
>>>>>> Sudhakar.
>>>>>> 
>>>>>> On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <ma...@gmail.com>
>>>>> wrote:
>>>>>> The waiting logging had to happen on restart unless it's some kind of
>>>>> bug.
>>>>>> 
>>>>>> Beyond that, something is off, but I have no clue why - it seems your
>>>>> clusterstate.json is not up to date at all.
>>>>>> 
>>>>>> Have you tried restarting the cluster then? Does that help at all?
>>>>>> 
>>>>>> Do you see any exceptions around zookeeper session timeouts?
>>>>>> 
>>>>>> - Mark
>>>>>> 
>>>>>> On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <
>> maddinenidev@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>>> Hey Mark,
>>>>>>> 
>>>>>>> Yes, I am able to access all of the nodes under each shard from
>>>>> solrcloud
>>>>>>> admin UI.
>>>>>>> 
>>>>>>> 
>>>>>>>  - *It kind of looks like the urls solrcloud is using are not
>>>>> accessible.
>>>>>>>  When you go to the admin page and the cloud tab, can you access
>>>>> the urls it
>>>>>>>  shows for each shard? That is, if you click on of the links or
>>>>> copy and
>>>>>>>  paste the address into a web browser, does it work?*
>>>>>>> 
>>>>>>> Actually, I got these errors when my document upload task/job was
>>>>> running,
>>>>>>> not during the cluster restart. Also,job ran fine initially for the
>>>>> first
>>>>>>> one hour and started throwing these errors after indexing some
>> docx.
>>>>>>> 
>>>>>>> Thx, Sudhakar.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmiller@gmail.com
>>> 
>>>>> wrote:
>>>>>>> 
>>>>>>>> It kind of looks like the urls solrcloud is using are not
>>>>> accessible. When
>>>>>>>> you go to the admin page and the cloud tab, can you access the
>> urls
>>>>> it
>>>>>>>> shows for each shard? That is, if you click on of the links or
>> copy
>>>>> and
>>>>>>>> paste the address into a web browser, does it work?
>>>>>>>> 
>>>>>>>> You may have to explicitly set the host= in solr.xml if it's not
>>>>> auto
>>>>>>>> detecting the right one. Make sure the ports like right too.
>>>>>>>> 
>>>>>>>>> waitForReplicasToComeUp
>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1
>>>>>>>>> timeoutin=179999
>>>>>>>> 
>>>>>>>> That happens when you stop the cluster and try to start it again -
>>>>> before
>>>>>>>> a leader is chosen, it will wait for all known replicas fora shard
>>>>> to come
>>>>>>>> up so that everyone can sync up and have a chance to be the best
>>>>> leader. So
>>>>>>>> at this point it was only finding one of 2 known replicas and
>>>>> waiting for
>>>>>>>> the second to come up. After a couple minutes (configurable) it
>>>>> will just
>>>>>>>> continue anyway without the missing replica (if it doesn't show
>> up).
>>>>>>>> 
>>>>>>>> - Mark
>>>>>>>> 
>>>>>>>> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <
>>>>> maddinenidev@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> We are uploading solr documents to the index in batches using 30
>>>>> threads
>>>>>>>>> and using ThreadPoolExecutor, LinkedBlockingQueue with max limit
>>>>> set to
>>>>>>>>> 10000.
>>>>>>>>> In the code, we are using HttpSolrServer and add(inputDoc) method
>>>>> to add
>>>>>>>>> docx.
>>>>>>>>> And, we have the following commit settings in solrconfig:
>>>>>>>>> 
>>>>>>>>>   <autoCommit>
>>>>>>>>>     <maxTime>300000</maxTime>
>>>>>>>>>     <maxDocs>10000</maxDocs>
>>>>>>>>>     <openSearcher>false</openSearcher>
>>>>>>>>>   </autoCommit>
>>>>>>>>> 
>>>>>>>>>     <autoSoftCommit>
>>>>>>>>>       <maxTime>1000</maxTime>
>>>>>>>>>     </autoSoftCommit>
>>>>>>>>> 
>>>>>>>>> Cluster Details:
>>>>>>>>> ----------------------------
>>>>>>>>> solr version - 4.0
>>>>>>>>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
>>>>>>>>> numshards=2 ,
>>>>>>>>> 001, 002, 003 are the solr nodes and these three are behind the
>>>>>>>>> loadbalancer  <vip>
>>>>>>>>> 001, 003 assigned to shard1; 002 assigned to shard2
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Logs:Getting the errors in the below sequence after uploading
>> some
>>>>> docx:
>>>>>>>>> 
>>>>>>>> 
>>>>> 
>> -----------------------------------------------------------------------------------------------------------
>>>>>>>>> 003
>>>>>>>>> Dec 4, 2012 12:11:46 PM
>>>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>>>>>>>> waitForReplicasToComeUp
>>>>>>>>> INFO: Waiting until we see more replicas up: total=2 found=1
>>>>>>>>> timeoutin=179999
>>>>>>>>> 
>>>>>>>>> 001
>>>>>>>>> Dec 4, 2012 12:12:59 PM
>>>>>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor
>>>>>>>>> doDefensiveChecks
>>>>>>>>> SEVERE: ClusterState says we are the leader, but locally we don't
>>>>> think
>>>>>>>> so
>>>>>>>>> 
>>>>>>>>> 003
>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>>>>>>>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed -
>>>>> retrying ...
>>>>>>>>> 
>>>>>>>>> 001
>>>>>>>>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>>>>>>>> SEVERE: Error uploading: org.apache.solr.common.SolrException:
>>>>> Server at
>>>>>>>>> <vip>/solr/core1. returned non ok status:503, message:Service
>>>>> Unavailable
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>> 001
>>>>>>>>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
>>>>>>>>> SEVERE: Error while trying to recover.
>>>>>>>>> core=core1:org.apache.solr.common.SolrException: We are not the
>>>>> leader
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>>>>>>>> 
>>>>>>>>> 001
>>>>>>>>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
>>>>>>>>> SEVERE: Error uploading:
>>>>>>>> org.apache.solr.client.solrj.SolrServerException:
>>>>>>>>> IOException occured when talking to server at <vip>/solr/core1
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>>>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>>>>> at
>>>>>>>>> 
>>>>>>>> 
>>>>> 
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>>>>>>>> at
>> org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
>>>>>>>>> ... 5 lines omitted ...
>>>>>>>>> at java.lang.Thread.run(Unknown Source)
>>>>>>>>> Caused by: java.net.SocketException: Connection reset
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> After sometime, all the three servers are going down.
>>>>>>>>> 
>>>>>>>>> Appreciate, if someone could let us know what we are missing.
>>>>>>>>> 
>>>>>>>>> Thx,Sudhakar.
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> <logs_error.txt>
>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: SolrCloud - ClusterState says we are the leader,but locally ...

Posted by Marcin Rzewucki <mr...@gmail.com>.
I'm experiencing the same problem in Solr 4.1 during bulk loading. After 50
minutes of indexing, the following error starts to occur:

INFO: [core] webapp=/solr path=/update params={} {} 0 4
Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the
leader, but locally we don't think so
        at
org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:295)
        at
org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:230)
        at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:343)
        at
org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:387)
        at
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:112)
        at
org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:96)
        at
org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:60)
        at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
        at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
        at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
        at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
        at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
        at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
        at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
        at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
        at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
        at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
        at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
        at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
        at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
        at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
        at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
        at org.eclipse.jetty.server.Server.handle(Server.java:365)
        at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
        at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
        at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
        at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
        at
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
        at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
        at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
        at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Unknown Source)
Feb 02, 2013 11:36:15 PM org.apache.solr.common.SolrException log
Feb 02, 2013 11:36:31 PM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=50699

Then the leader tries to sync with the replica, and after it finishes I can
continue loading.
None of the SolrCloud nodes was restarted during that time. I don't remember
such behaviour in Solr 4.0. Could it be related to the number of fields
indexed during loading? I have a collection with about 2400 fields, and I
can't reproduce the same issue for other collections with far fewer fields
per record.
Regards.
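
Since the 503s clear once the leader finishes recovering, a client-side
workaround is to retry the failed add with a short backoff instead of
aborting the whole bulk load. The code below is only a minimal sketch: it
wraps the same SolrJ HttpSolrServer.add() call used elsewhere in this thread,
and the RetryingAdd class name, the addWithRetry method, and the
maxAttempts/baseBackoffMs parameters are made up for illustration, not taken
from anyone's actual loader:

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical helper, not from the original loader: retries one add() with a
// simple linear backoff so a transient failure during leader fail-over does
// not kill the whole bulk load.
public class RetryingAdd {

    public static void addWithRetry(HttpSolrServer server, SolrInputDocument doc,
                                    int maxAttempts, long baseBackoffMs)
            throws SolrServerException, IOException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                server.add(doc);   // same SolrJ call the upload threads use
                return;
            } catch (SolrException e) {
                // e.g. "returned non ok status:503" while the shard leader recovers
                if (attempt >= maxAttempts) throw e;
            } catch (SolrServerException e) {
                // e.g. connection reset / timeout talking to the VIP
                if (attempt >= maxAttempts) throw e;
            }
            Thread.sleep(baseBackoffMs * attempt);   // back off, then retry
        }
    }
}

With something like this in the loader threads, a transient leader flap costs
a few retries instead of a failed batch; it obviously does not address why
the ZooKeeper session expires in the first place.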

On 11 December 2012 19:50, Sudhakar Maddineni <ma...@gmail.com>wrote:

> Just an update on this issue:
>    We tried increasing the ZooKeeper client timeout to 30000ms in
> solr.xml (I think the default is 15000ms) and haven't seen any issues in
> our tests since.
> <cores .........           zkClientTimeout="30000" >
>
> Thanks, Sudhakar.
>
> On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni
> <ma...@gmail.com>wrote:
>
> > We saw this error again today during our load test - basically, whenever
> > the session expires on the leader node, we see the error. After this
> > happens, the leader (001) goes into 'recovery' mode and all the index
> > updates fail with a "503 - Service Unavailable" error message. After some
> > time (once recovery is successful), the roles are swapped, i.e. 001 acts
> > as the replica and 003 as the leader.
> >
> > Btw, do you know why the connection to ZooKeeper [solr->zk] is getting
> > interrupted in the middle?
> > Is it because of the load (number of updates) we are putting on the
> > cluster?
> >
> > Here is the exception stack trace:
> >
> > Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> > WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
> >       at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
> >       at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> >       at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
> >       at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
> >       at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
> >       at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
> >       at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
> >       at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
> >       at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
> >       at java.lang.Thread.run(Unknown Source)
> >
> > Thx,Sudhakar.
> >
> >
> >
> > On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <
> maddinenidev@gmail.com
> > > wrote:
> >
> >> Erick:
> >>   Not seeing any page caching related issues...
> >>
> >> Mark:
> >>   1.Would this "waiting" on 003(replica) cause any inconsistencies in
> the
> >> zookeeper cluster state? I was also looking at the leader(001) logs at
> that
> >> time and seeing errors related to "*SEVERE: ClusterState says we are the
> >> leader, but locally we don't think so*".
> >>   2.Also, all of our servers in cluster were gone down when the index
> >> updates were running in parallel along with this issue.Do you see this
> >> related to the session expiry on 001?
> >>
> >> Here are the logs on 001
> >> =========================
> >>
> >> Dec 4, 2012 12:12:29 PM
> >> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> >> WARNING:
> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> KeeperErrorCode = Session expired for /overseer_elect/leader
> >>  at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
> >> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> >>  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
> >> Dec 4, 2012 12:12:29 PM
> >> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> >> INFO: According to ZK I
> >> (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a
> leader.
> >>
> >> Dec 4, 2012 12:12:29 PM
> org.apache.solr.cloud.OverseerCollectionProcessor
> >> run
> >> WARNING: Overseer cannot talk to ZK
> >>
> >> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
> >> SEVERE: There was a problem finding the leader in
> >> zk:org.apache.solr.common.SolrException: Could not get leader props
> >>  at
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> >> at
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
> >>  Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
> >> SEVERE: There was a problem finding the leader in
> >> zk:org.apache.solr.common.SolrException: Could not get leader props
> >>  at
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> >> at
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
> >>  Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
> >> SEVERE: There was a problem making a request to the
> >> leader:org.apache.solr.common.SolrException: I was asked to wait on
> state
> >> down for <001>:8080_solr but I still do not see the request state. I see
> >> state: active live:true
> >>  at
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >>  Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
> >> SEVERE: There was a problem making a request to the
> >> leader:org.apache.solr.common.SolrException: I was asked to wait on
> state
> >> down for <001>:8080_solr but I still do not see the request state. I see
> >> state: active live:true
> >>  at
> >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >> ....
> >>  ....
> >> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
> >> SEVERE: There was a problem finding the leader in
> >> zk:org.apache.solr.common.SolrException: Could not get leader props
> >>  at
> >> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> >> ....
> >>  ....
> >> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
> >> SEVERE: :org.apache.solr.common.SolrException: There was a problem
> >> finding the leader in zk
> >>  at
> >>
> org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
> >> at
> >>
> org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
> >>  Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
> >> SEVERE: Error getting leader from zk
> >> org.apache.solr.common.SolrException: *There is conflicting information
> >> about the leader of shard: shard1 our state says:http://
> <001>:8080/solr/core1/
> >> but zookeeper says:http://<003>:8080/solr/core1/*
> >> * at
> org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)*
> >> * at org.apache.solr.cloud.ZkController.register(ZkController.java:577)*
> >>  Dec 4, 2012 12:22:30 PM
> >> org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >> INFO: Running the leader process.
> >>  ....
> >> ....
> >>
> >> Thanks for your inputs.
> >> Sudhakar.
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <markrmiller@gmail.com
> >wrote:
> >>
> >>> Yes - it means that 001 went down (or more likely had it's connection
> to
> >>> ZooKeeper interrupted? that's what I mean about a session timeout - if
> the
> >>> solr->zk link is broken for longer than the session timeout that will
> >>> trigger a leader election and when the connection is reestablished, the
> >>> node will have to recover). That waiting should stop as soon as 001
> came
> >>> back up or reconnected to ZooKeeper.
> >>>
> >>> In fact, this waiting should not happen in this case - but only on
> >>> cluster restart. This is a bug that is fixed in 4.1 (hopefully coming
> very
> >>> soon!):
> >>>
> >>> * SOLR-3940: Rejoining the leader election incorrectly triggers the
> code
> >>> path
> >>>   for a fresh cluster start rather than fail over. (Mark Miller)
> >>>
> >>> - Mark
> >>>
> >>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <maddinenidev@gmail.com
> >
> >>> wrote:
> >>>
> >>> > Yep, after restarting, cluster came back to normal state.We will run
> >>> couple of more tests and see if we could reproduce this issue.
> >>> >
> >>> > Btw, I am attaching the server logs before that 'INFO: Waiting until
> >>> we see more replicas'  message.From the logs, we can see that leader
> >>> election process started on 003 which was the replica for 001
> >>> initially.That means leader 001 went down at that time?
> >>> >
> >>> > logs on 003:
> >>> > ========
> >>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> runLeaderProcess
> >>> >         INFO: Running the leader process.
> >>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> shouldIBeLeader
> >>> >         INFO: Checking if I should try and be the leader.
> >>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> shouldIBeLeader
> >>> >         INFO: My last published State was Active, it's okay to be the
> >>> leader.
> >>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> runLeaderProcess
> >>> >         INFO: I may be the new leader - try and sync
> >>> > 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
> >>> >         WARNING: Stopping recovery for
> zkNodeName=<003>:8080_solr_core
> >>> core=core1.
> >>> > 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
> >>> >         INFO: Sync replicas to http://<003>:8080/solr/core1/
> >>> > 12:11:16 PM org.apache.solr.update.PeerSync sync
> >>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
> >>> replicas=[<001>:8080/solr/core1/] nUpdates=100
> >>> > 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> >>> >         INFO: Updating live nodes -> this message is on 002
> >>> > 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
> >>> >         WARNING: PeerSync: core=core1 url=http://<003>:8080/solr
> >>>  exception talking to <001>:8080/solr/core1/, failed
> >>> >         org.apache.solr.client.solrj.SolrServerException: Timeout
> >>> occured while waiting response from server at: <001>:8080/solr/core1
> >>> >               at
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
> >>> >               at
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>> >               at
> >>>
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
> >>> >               at
> >>>
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
> >>> >               at
> java.util.concurrent.FutureTask$Sync.innerRun(Unknown
> >>> Source)
> >>> >               at java.util.concurrent.FutureTask.run(Unknown Source)
> >>> >               at
> >>> java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> >>> >               at
> java.util.concurrent.FutureTask$Sync.innerRun(Unknown
> >>> Source)
> >>> >               at java.util.concurrent.FutureTask.run(Unknown Source)
> >>> >               at
> >>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
> >>> >               at
> >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> >>> >               at java.lang.Thread.run(Unknown Source)
> >>> >         Caused by: java.net.SocketTimeoutException: Read timed out
> >>> >               at java.net.SocketInputStream.socketRead0(Native
> Method)
> >>> >               at java.net.SocketInputStream.read(Unknown Source)
> >>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
> >>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE.
> >>> sync failed
> >>> > 12:11:46 PM org.apache.solr.common.SolrException log
> >>> >         SEVERE: Sync Failed
> >>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> rejoinLeaderElection
> >>> >         INFO: There is a better leader candidate than us - going back
> >>> into recovery
> >>> > 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
> >>> >         INFO: Running recovery - first canceling any ongoing recovery
> >>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
> >>> >         INFO: Starting recovery process.  core=core1
> >>> recoveringAfterStartup=false
> >>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> >>> >         INFO: Attempting to PeerSync from <001>:8080/solr/core1/
> >>> core=core1 - recoveringAfterStartup=false
> >>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
> >>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
> >>> replicas=[<001>:8080/solr/core1/] nUpdates=100
> >>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> runLeaderProcess
> >>> >         INFO: Running the leader process.
> >>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> waitForReplicasToComeUp
> >>> >         INFO: Waiting until we see more replicas up: total=2 found=1
> >>> timeoutin=179999
> >>> > 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> waitForReplicasToComeUp
> >>> >         INFO: Waiting until we see more replicas up: total=2 found=1
> >>> timeoutin=179495
> >>> > 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> waitForReplicasToComeUp
> >>> >         INFO: Waiting until we see more replicas up: total=2 found=1
> >>> timeoutin=178985
> >>> > ....
> >>> > ....
> >>> >
> >>> > Thanks for your help.
> >>> > Sudhakar.
> >>> >
> >>> > On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <ma...@gmail.com>
> >>> wrote:
> >>> > The waiting logging had to happen on restart unless it's some kind of
> >>> bug.
> >>> >
> >>> > Beyond that, something is off, but I have no clue why - it seems your
> >>> clusterstate.json is not up to date at all.
> >>> >
> >>> > Have you tried restarting the cluster then? Does that help at all?
> >>> >
> >>> > Do you see any exceptions around zookeeper session timeouts?
> >>> >
> >>> > - Mark
> >>> >
> >>> > On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <
> maddinenidev@gmail.com>
> >>> wrote:
> >>> >
> >>> > > Hey Mark,
> >>> > >
> >>> > > Yes, I am able to access all of the nodes under each shard from
> >>> solrcloud
> >>> > > admin UI.
> >>> > >
> >>> > >
> >>> > >   - *It kind of looks like the urls solrcloud is using are not
> >>> accessible.
> >>> > >   When you go to the admin page and the cloud tab, can you access
> >>> the urls it
> >>> > >   shows for each shard? That is, if you click on of the links or
> >>> copy and
> >>> > >   paste the address into a web browser, does it work?*
> >>> > >
> >>> > > Actually, I got these errors when my document upload task/job was
> >>> running,
> >>> > > not during the cluster restart. Also,job ran fine initially for the
> >>> first
> >>> > > one hour and started throwing these errors after indexing some
> docx.
> >>> > >
> >>> > > Thx, Sudhakar.
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <markrmiller@gmail.com
> >
> >>> wrote:
> >>> > >
> >>> > >> It kind of looks like the urls solrcloud is using are not
> >>> accessible. When
> >>> > >> you go to the admin page and the cloud tab, can you access the
> urls
> >>> it
> >>> > >> shows for each shard? That is, if you click on of the links or
> copy
> >>> and
> >>> > >> paste the address into a web browser, does it work?
> >>> > >>
> >>> > >> You may have to explicitly set the host= in solr.xml if it's not
> >>> auto
> >>> > >> detecting the right one. Make sure the ports like right too.
> >>> > >>
> >>> > >>> waitForReplicasToComeUp
> >>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
> >>> > >>> timeoutin=179999
> >>> > >>
> >>> > >> That happens when you stop the cluster and try to start it again -
> >>> before
> >>> > >> a leader is chosen, it will wait for all known replicas fora shard
> >>> to come
> >>> > >> up so that everyone can sync up and have a chance to be the best
> >>> leader. So
> >>> > >> at this point it was only finding one of 2 known replicas and
> >>> waiting for
> >>> > >> the second to come up. After a couple minutes (configurable) it
> >>> will just
> >>> > >> continue anyway without the missing replica (if it doesn't show
> up).
> >>> > >>
> >>> > >> - Mark
> >>> > >>
> >>> > >> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <
> >>> maddinenidev@gmail.com>
> >>> > >> wrote:
> >>> > >>
> >>> > >>> Hi,
> >>> > >>> We are uploading solr documents to the index in batches using 30
> >>> threads
> >>> > >>> and using ThreadPoolExecutor, LinkedBlockingQueue with max limit
> >>> set to
> >>> > >>> 10000.
> >>> > >>> In the code, we are using HttpSolrServer and add(inputDoc) method
> >>> to add
> >>> > >>> docx.
> >>> > >>> And, we have the following commit settings in solrconfig:
> >>> > >>>
> >>> > >>>    <autoCommit>
> >>> > >>>      <maxTime>300000</maxTime>
> >>> > >>>      <maxDocs>10000</maxDocs>
> >>> > >>>      <openSearcher>false</openSearcher>
> >>> > >>>    </autoCommit>
> >>> > >>>
> >>> > >>>      <autoSoftCommit>
> >>> > >>>        <maxTime>1000</maxTime>
> >>> > >>>      </autoSoftCommit>
> >>> > >>>
> >>> > >>> Cluster Details:
> >>> > >>> ----------------------------
> >>> > >>> solr version - 4.0
> >>> > >>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
> >>> > >>> numshards=2 ,
> >>> > >>> 001, 002, 003 are the solr nodes and these three are behind the
> >>> > >>> loadbalancer  <vip>
> >>> > >>> 001, 003 assigned to shard1; 002 assigned to shard2
> >>> > >>>
> >>> > >>>
> >>> > >>> Logs:Getting the errors in the below sequence after uploading
> some
> >>> docx:
> >>> > >>>
> >>> > >>
> >>>
> -----------------------------------------------------------------------------------------------------------
> >>> > >>> 003
> >>> > >>> Dec 4, 2012 12:11:46 PM
> >>> org.apache.solr.cloud.ShardLeaderElectionContext
> >>> > >>> waitForReplicasToComeUp
> >>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
> >>> > >>> timeoutin=179999
> >>> > >>>
> >>> > >>> 001
> >>> > >>> Dec 4, 2012 12:12:59 PM
> >>> > >>> org.apache.solr.update.processor.DistributedUpdateProcessor
> >>> > >>> doDefensiveChecks
> >>> > >>> SEVERE: ClusterState says we are the leader, but locally we don't
> >>> think
> >>> > >> so
> >>> > >>>
> >>> > >>> 003
> >>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> >>> > >>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed -
> >>> retrying ...
> >>> > >>>
> >>> > >>> 001
> >>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> >>> > >>> SEVERE: Error uploading: org.apache.solr.common.SolrException:
> >>> Server at
> >>> > >>> <vip>/solr/core1. returned non ok status:503, message:Service
> >>> Unavailable
> >>> > >>> at
> >>> > >>>
> >>> > >>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
> >>> > >>> at
> >>> > >>>
> >>> > >>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>> > >>> 001
> >>> > >>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
> >>> > >>> SEVERE: Error while trying to recover.
> >>> > >>> core=core1:org.apache.solr.common.SolrException: We are not the
> >>> leader
> >>> > >>> at
> >>> > >>>
> >>> > >>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >>> > >>>
> >>> > >>> 001
> >>> > >>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
> >>> > >>> SEVERE: Error uploading:
> >>> > >> org.apache.solr.client.solrj.SolrServerException:
> >>> > >>> IOException occured when talking to server at <vip>/solr/core1
> >>> > >>> at
> >>> > >>>
> >>> > >>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
> >>> > >>> at
> >>> > >>>
> >>> > >>
> >>>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>> > >>> at
> >>> > >>>
> >>> > >>
> >>>
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >>> > >>> at
> org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
> >>> > >>> ... 5 lines omitted ...
> >>> > >>> at java.lang.Thread.run(Unknown Source)
> >>> > >>> Caused by: java.net.SocketException: Connection reset
> >>> > >>>
> >>> > >>>
> >>> > >>> After sometime, all the three servers are going down.
> >>> > >>>
> >>> > >>> Appreciate, if someone could let us know what we are missing.
> >>> > >>>
> >>> > >>> Thx,Sudhakar.
> >>> > >>
> >>> > >>
> >>> >
> >>> >
> >>> > <logs_error.txt>
> >>>
> >>>
> >>
> >
>

Re: SolrCloud - ClusterState says we are the leader,but locally ...

Posted by Sudhakar Maddineni <ma...@gmail.com>.
Just an update on this issue:
   We tried increasing the ZooKeeper client timeout to 30000ms in
solr.xml (I think the default is 15000ms) and haven't seen any issues in our
tests since.
<cores .........           zkClientTimeout="30000" >
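
For anyone wanting the full context, a minimal solr.xml sketch with the
timeout in place might look like the following - the host, port, and core
names here are illustrative placeholders, not the actual values from the
cluster discussed above:

     <?xml version="1.0" encoding="UTF-8" ?>
     <solr persistent="true">
       <cores adminPath="/admin/cores" defaultCoreName="core1"
              host="${host:}" hostPort="8080" hostContext="solr"
              zkClientTimeout="30000">
         <core name="core1" instanceDir="core1" />
       </cores>
     </solr>

Only zkClientTimeout matters for this workaround; the other attributes are
the usual per-node host settings and can stay as they already are in your
solr.xml.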

Thanks, Sudhakar.

On Fri, Dec 7, 2012 at 4:55 PM, Sudhakar Maddineni
<ma...@gmail.com>wrote:

> We saw this error again today during our load test - basically, whenever
> the session expires on the leader node, we see the error. After this
> happens, the leader (001) goes into 'recovery' mode and all the index
> updates fail with a "503 - Service Unavailable" error message. After some
> time (once recovery is successful), the roles are swapped, i.e. 001 acts as
> the replica and 003 as the leader.
>
> Btw, do you know why the connection to ZooKeeper [solr->zk] is getting
> interrupted in the middle?
> Is it because of the load (number of updates) we are putting on the cluster?
>
> Here is the exception stack trace:
>
> Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
> WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> 	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
> 	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
> 	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
> 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
> 	at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
> 	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
> 	at java.lang.Thread.run(Unknown Source)
>
> Thx,Sudhakar.
>
>
>
> On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni <maddinenidev@gmail.com
> > wrote:
>
>> Erick:
>>   Not seeing any page caching related issues...
>>
>> Mark:
>>   1. Would this "waiting" on 003 (the replica) cause any inconsistencies in
>> the ZooKeeper cluster state? I was also looking at the leader (001) logs at
>> that time and seeing errors related to "*SEVERE: ClusterState says we are
>> the leader, but locally we don't think so*".
>>   2. Also, all of our servers in the cluster went down while the index
>> updates were running in parallel with this issue. Do you see this as
>> related to the session expiry on 001?
>>
>> Here are the logs on 001
>> =========================
>>
>> Dec 4, 2012 12:12:29 PM
>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>> WARNING:
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> KeeperErrorCode = Session expired for /overseer_elect/leader
>>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
>> Dec 4, 2012 12:12:29 PM
>> org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
>> INFO: According to ZK I
>> (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a leader.
>>
>> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor
>> run
>> WARNING: Overseer cannot talk to ZK
>>
>> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem finding the leader in
>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>  at
>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>> at
>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>>  Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem finding the leader in
>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>  at
>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>> at
>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>>  Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem making a request to the
>> leader:org.apache.solr.common.SolrException: I was asked to wait on state
>> down for <001>:8080_solr but I still do not see the request state. I see
>> state: active live:true
>>  at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>  Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem making a request to the
>> leader:org.apache.solr.common.SolrException: I was asked to wait on state
>> down for <001>:8080_solr but I still do not see the request state. I see
>> state: active live:true
>>  at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>> ....
>>  ....
>> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
>> SEVERE: There was a problem finding the leader in
>> zk:org.apache.solr.common.SolrException: Could not get leader props
>>  at
>> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
>> ....
>>  ....
>> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
>> SEVERE: :org.apache.solr.common.SolrException: There was a problem
>> finding the leader in zk
>>  at
>> org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
>> at
>> org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
>>  Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
>> SEVERE: Error getting leader from zk
>> org.apache.solr.common.SolrException: *There is conflicting information
>> about the leader of shard: shard1 our state says:http://<001>:8080/solr/core1/
>> but zookeeper says:http://<003>:8080/solr/core1/*
>> * at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)*
>> * at org.apache.solr.cloud.ZkController.register(ZkController.java:577)*
>>  Dec 4, 2012 12:22:30 PM
>> org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>> INFO: Running the leader process.
>>  ....
>> ....
>>
>> Thanks for your inputs.
>> Sudhakar.
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <ma...@gmail.com>wrote:
>>
>>> Yes - it means that 001 went down (or more likely had it's connection to
>>> ZooKeeper interrupted? that's what I mean about a session timeout - if the
>>> solr->zk link is broken for longer than the session timeout that will
>>> trigger a leader election and when the connection is reestablished, the
>>> node will have to recover). That waiting should stop as soon as 001 came
>>> back up or reconnected to ZooKeeper.
>>>
>>> In fact, this waiting should not happen in this case - but only on
>>> cluster restart. This is a bug that is fixed in 4.1 (hopefully coming very
>>> soon!):
>>>
>>> * SOLR-3940: Rejoining the leader election incorrectly triggers the code
>>> path
>>>   for a fresh cluster start rather than fail over. (Mark Miller)
>>>
>>> - Mark
>>>
>>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <ma...@gmail.com>
>>> wrote:
>>>
>>> > Yep, after restarting, cluster came back to normal state.We will run
>>> couple of more tests and see if we could reproduce this issue.
>>> >
>>> > Btw, I am attaching the server logs before that 'INFO: Waiting until
>>> we see more replicas'  message.From the logs, we can see that leader
>>> election process started on 003 which was the replica for 001
>>> initially.That means leader 001 went down at that time?
>>> >
>>> > logs on 003:
>>> > ========
>>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> runLeaderProcess
>>> >         INFO: Running the leader process.
>>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> shouldIBeLeader
>>> >         INFO: Checking if I should try and be the leader.
>>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> shouldIBeLeader
>>> >         INFO: My last published State was Active, it's okay to be the
>>> leader.
>>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> runLeaderProcess
>>> >         INFO: I may be the new leader - try and sync
>>> > 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
>>> >         WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core
>>> core=core1.
>>> > 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
>>> >         INFO: Sync replicas to http://<003>:8080/solr/core1/
>>> > 12:11:16 PM org.apache.solr.update.PeerSync sync
>>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
>>> replicas=[<001>:8080/solr/core1/] nUpdates=100
>>> > 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>> >         INFO: Updating live nodes -> this message is on 002
>>> > 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
>>> >         WARNING: PeerSync: core=core1 url=http://<003>:8080/solr
>>>  exception talking to <001>:8080/solr/core1/, failed
>>> >         org.apache.solr.client.solrj.SolrServerException: Timeout
>>> occured while waiting response from server at: <001>:8080/solr/core1
>>> >               at
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
>>> >               at
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> >               at
>>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>>> >               at
>>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>>> >               at java.util.concurrent.FutureTask$Sync.innerRun(Unknown
>>> Source)
>>> >               at java.util.concurrent.FutureTask.run(Unknown Source)
>>> >               at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>>> >               at java.util.concurrent.FutureTask$Sync.innerRun(Unknown
>>> Source)
>>> >               at java.util.concurrent.FutureTask.run(Unknown Source)
>>> >               at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>> >               at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>> >               at java.lang.Thread.run(Unknown Source)
>>> >         Caused by: java.net.SocketTimeoutException: Read timed out
>>> >               at java.net.SocketInputStream.socketRead0(Native Method)
>>> >               at java.net.SocketInputStream.read(Unknown Source)
>>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
>>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE.
>>> sync failed
>>> > 12:11:46 PM org.apache.solr.common.SolrException log
>>> >         SEVERE: Sync Failed
>>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> rejoinLeaderElection
>>> >         INFO: There is a better leader candidate than us - going back
>>> into recovery
>>> > 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
>>> >         INFO: Running recovery - first canceling any ongoing recovery
>>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
>>> >         INFO: Starting recovery process.  core=core1
>>> recoveringAfterStartup=false
>>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>>> >         INFO: Attempting to PeerSync from <001>:8080/solr/core1/
>>> core=core1 - recoveringAfterStartup=false
>>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
>>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
>>> replicas=[<001>:8080/solr/core1/] nUpdates=100
>>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> runLeaderProcess
>>> >         INFO: Running the leader process.
>>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> waitForReplicasToComeUp
>>> >         INFO: Waiting until we see more replicas up: total=2 found=1
>>> timeoutin=179999
>>> > 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> waitForReplicasToComeUp
>>> >         INFO: Waiting until we see more replicas up: total=2 found=1
>>> timeoutin=179495
>>> > 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
>>> waitForReplicasToComeUp
>>> >         INFO: Waiting until we see more replicas up: total=2 found=1
>>> timeoutin=178985
>>> > ....
>>> > ....
>>> >
>>> > Thanks for your help.
>>> > Sudhakar.
>>> >
>>> > On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>> > The waiting logging had to happen on restart unless it's some kind of
>>> bug.
>>> >
>>> > Beyond that, something is off, but I have no clue why - it seems your
>>> clusterstate.json is not up to date at all.
>>> >
>>> > Have you tried restarting the cluster then? Does that help at all?
>>> >
>>> > Do you see any exceptions around zookeeper session timeouts?
>>> >
>>> > - Mark
>>> >
>>> > On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <ma...@gmail.com>
>>> wrote:
>>> >
>>> > > Hey Mark,
>>> > >
>>> > > Yes, I am able to access all of the nodes under each shard from
>>> solrcloud
>>> > > admin UI.
>>> > >
>>> > >
>>> > >   - *It kind of looks like the urls solrcloud is using are not
>>> accessible.
>>> > >   When you go to the admin page and the cloud tab, can you access
>>> the urls it
>>> > >   shows for each shard? That is, if you click on of the links or
>>> copy and
>>> > >   paste the address into a web browser, does it work?*
>>> > >
>>> > > Actually, I got these errors when my document upload task/job was
>>> running,
>>> > > not during the cluster restart. Also,job ran fine initially for the
>>> first
>>> > > one hour and started throwing these errors after indexing some docx.
>>> > >
>>> > > Thx, Sudhakar.
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <ma...@gmail.com>
>>> wrote:
>>> > >
>>> > >> It kind of looks like the urls solrcloud is using are not
>>> accessible. When
>>> > >> you go to the admin page and the cloud tab, can you access the urls
>>> it
>>> > >> shows for each shard? That is, if you click on of the links or copy
>>> and
>>> > >> paste the address into a web browser, does it work?
>>> > >>
>>> > >> You may have to explicitly set the host= in solr.xml if it's not
>>> auto
>>> > >> detecting the right one. Make sure the ports like right too.
>>> > >>
>>> > >>> waitForReplicasToComeUp
>>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
>>> > >>> timeoutin=179999
>>> > >>
>>> > >> That happens when you stop the cluster and try to start it again -
>>> before
>>> > >> a leader is chosen, it will wait for all known replicas fora shard
>>> to come
>>> > >> up so that everyone can sync up and have a chance to be the best
>>> leader. So
>>> > >> at this point it was only finding one of 2 known replicas and
>>> waiting for
>>> > >> the second to come up. After a couple minutes (configurable) it
>>> will just
>>> > >> continue anyway without the missing replica (if it doesn't show up).
>>> > >>
>>> > >> - Mark
>>> > >>
>>> > >> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <
>>> maddinenidev@gmail.com>
>>> > >> wrote:
>>> > >>
>>> > >>> Hi,
>>> > >>> We are uploading solr documents to the index in batches using 30
>>> threads
>>> > >>> and using ThreadPoolExecutor, LinkedBlockingQueue with max limit
>>> set to
>>> > >>> 10000.
>>> > >>> In the code, we are using HttpSolrServer and add(inputDoc) method
>>> to add
>>> > >>> docx.
>>> > >>> And, we have the following commit settings in solrconfig:
>>> > >>>
>>> > >>>    <autoCommit>
>>> > >>>      <maxTime>300000</maxTime>
>>> > >>>      <maxDocs>10000</maxDocs>
>>> > >>>      <openSearcher>false</openSearcher>
>>> > >>>    </autoCommit>
>>> > >>>
>>> > >>>      <autoSoftCommit>
>>> > >>>        <maxTime>1000</maxTime>
>>> > >>>      </autoSoftCommit>
>>> > >>>
>>> > >>> Cluster Details:
>>> > >>> ----------------------------
>>> > >>> solr version - 4.0
>>> > >>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
>>> > >>> numshards=2 ,
>>> > >>> 001, 002, 003 are the solr nodes and these three are behind the
>>> > >>> loadbalancer  <vip>
>>> > >>> 001, 003 assigned to shard1; 002 assigned to shard2
>>> > >>>
>>> > >>>
>>> > >>> Logs:Getting the errors in the below sequence after uploading some
>>> docx:
>>> > >>>
>>> > >>
>>> -----------------------------------------------------------------------------------------------------------
>>> > >>> 003
>>> > >>> Dec 4, 2012 12:11:46 PM
>>> org.apache.solr.cloud.ShardLeaderElectionContext
>>> > >>> waitForReplicasToComeUp
>>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
>>> > >>> timeoutin=179999
>>> > >>>
>>> > >>> 001
>>> > >>> Dec 4, 2012 12:12:59 PM
>>> > >>> org.apache.solr.update.processor.DistributedUpdateProcessor
>>> > >>> doDefensiveChecks
>>> > >>> SEVERE: ClusterState says we are the leader, but locally we don't
>>> think
>>> > >> so
>>> > >>>
>>> > >>> 003
>>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>> > >>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed -
>>> retrying ...
>>> > >>>
>>> > >>> 001
>>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>>> > >>> SEVERE: Error uploading: org.apache.solr.common.SolrException:
>>> Server at
>>> > >>> <vip>/solr/core1. returned non ok status:503, message:Service
>>> Unavailable
>>> > >>> at
>>> > >>>
>>> > >>
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>>> > >>> at
>>> > >>>
>>> > >>
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> > >>> 001
>>> > >>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
>>> > >>> SEVERE: Error while trying to recover.
>>> > >>> core=core1:org.apache.solr.common.SolrException: We are not the
>>> leader
>>> > >>> at
>>> > >>>
>>> > >>
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>>> > >>>
>>> > >>> 001
>>> > >>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
>>> > >>> SEVERE: Error uploading:
>>> > >> org.apache.solr.client.solrj.SolrServerException:
>>> > >>> IOException occured when talking to server at <vip>/solr/core1
>>> > >>> at
>>> > >>>
>>> > >>
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>>> > >>> at
>>> > >>>
>>> > >>
>>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>> > >>> at
>>> > >>>
>>> > >>
>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>>> > >>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
>>> > >>> ... 5 lines omitted ...
>>> > >>> at java.lang.Thread.run(Unknown Source)
>>> > >>> Caused by: java.net.SocketException: Connection reset
>>> > >>>
>>> > >>>
>>> > >>> After sometime, all the three servers are going down.
>>> > >>>
>>> > >>> Appreciate, if someone could let us know what we are missing.
>>> > >>>
>>> > >>> Thx,Sudhakar.
>>> > >>
>>> > >>
>>> >
>>> >
>>> > <logs_error.txt>
>>>
>>>
>>
>

Re: SolrCloud - ClusterState says we are the leader,but locally ...

Posted by Sudhakar Maddineni <ma...@gmail.com>.
We saw this error again today during our load test - basically, whenever
the session expires on the leader node, we see the error. After this
happens, the leader (001) goes into 'recovery' mode and all the index
updates fail with a "503 - Service Unavailable" error message. After some
time (once recovery is successful), the roles are swapped, i.e. 001 acts
as the replica and 003 as the leader.

Btw, do you know why the connection to ZooKeeper [solr->zk] keeps getting
interrupted in the middle of the run?
Is it because of the load (number of updates) we are putting on the cluster?

Here is the exception stack trace:

Dec 7, 2012 2:28:03 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater amILeader
WARNING: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
	at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.amILeader(Overseer.java:195)
	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:119)
	at java.lang.Thread.run(Unknown Source)
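
One thing we are considering on our side (just a sketch, not something we
have in place yet) is moving the upload job off HttpSolrServer + the <vip>
and onto the ZooKeeper-aware SolrJ client, so the client picks target nodes
from the live cluster state in ZooKeeper instead of a load balancer that has
no idea a node is recovering. Assuming CloudSolrServer is available in our
SolrJ 4.0 jars, and with placeholder ZooKeeper hosts, collection and field
names, it would look roughly like this:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudIndexerSketch {
        public static void main(String[] args) throws Exception {
            // Same ZooKeeper ensemble the Solr nodes point at (placeholder hosts).
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("core1"); // assumption: our collection/core name

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "example-1");         // 'id' assumed to be the uniqueKey
            doc.addField("title_t", "example doc");  // hypothetical dynamic field

            server.add(doc);   // target chosen from the cluster state read from ZooKeeper
            server.shutdown();
        }
    }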

Thx,Sudhakar.



On Fri, Dec 7, 2012 at 3:16 PM, Sudhakar Maddineni
<ma...@gmail.com>wrote:

> Erick:
>   Not seeing any page caching related issues...
>
> Mark:
>   1.Would this "waiting" on 003(replica) cause any inconsistencies in the
> zookeeper cluster state? I was also looking at the leader(001) logs at that
> time and seeing errors related to "SEVERE: ClusterState says we are the
> leader, but locally we don't think so".
>   2.Also, all of our servers in cluster were gone down when the index
> updates were running in parallel along with this issue.Do you see this
> related to the session expiry on 001?
>
> Here are the logs on 001
> =========================
>
> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
> amILeader
> WARNING:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer_elect/leader
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
> amILeader
> INFO: According to ZK I
> (id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a leader.
>
> Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor
> run
> WARNING: Overseer cannot talk to ZK
>
> Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
> SEVERE: There was a problem finding the leader in
> zk:org.apache.solr.common.SolrException: Could not get leader props
>  at
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>  Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
> SEVERE: There was a problem finding the leader in
> zk:org.apache.solr.common.SolrException: Could not get leader props
>  at
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
>  Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
> SEVERE: There was a problem making a request to the
> leader:org.apache.solr.common.SolrException: I was asked to wait on state
> down for <001>:8080_solr but I still do not see the request state. I see
> state: active live:true
>  at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>  Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
> SEVERE: There was a problem making a request to the
> leader:org.apache.solr.common.SolrException: I was asked to wait on state
> down for <001>:8080_solr but I still do not see the request state. I see
> state: active live:true
>  at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> ....
>  ....
> Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
> SEVERE: There was a problem finding the leader in
> zk:org.apache.solr.common.SolrException: Could not get leader props
>  at
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
> ....
>  ....
> Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
> SEVERE: :org.apache.solr.common.SolrException: There was a problem finding
> the leader in zk
>  at
> org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
> at
> org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
>  Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
> SEVERE: Error getting leader from zk
> org.apache.solr.common.SolrException: There is conflicting information
> about the leader of shard: shard1 our state says:http://<001>:8080/solr/core1/
> but zookeeper says:http://<003>:8080/solr/core1/
>  at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)
>  at org.apache.solr.cloud.ZkController.register(ZkController.java:577)
>  Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ShardLeaderElectionContext
> runLeaderProcess
> INFO: Running the leader process.
>  ....
> ....
>
> Thanks for your inputs.
> Sudhakar.
>
>
>
>
>
>
>
> On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <ma...@gmail.com> wrote:
>
>> Yes - it means that 001 went down (or more likely had it's connection to
>> ZooKeeper interrupted? that's what I mean about a session timeout - if the
>> solr->zk link is broken for longer than the session timeout that will
>> trigger a leader election and when the connection is reestablished, the
>> node will have to recover). That waiting should stop as soon as 001 came
>> back up or reconnected to ZooKeeper.
>>
>> In fact, this waiting should not happen in this case - but only on
>> cluster restart. This is a bug that is fixed in 4.1 (hopefully coming very
>> soon!):
>>
>> * SOLR-3940: Rejoining the leader election incorrectly triggers the code
>> path
>>   for a fresh cluster start rather than fail over. (Mark Miller)
>>
>> - Mark
>>
>> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <ma...@gmail.com>
>> wrote:
>>
>> > Yep, after restarting, cluster came back to normal state.We will run
>> couple of more tests and see if we could reproduce this issue.
>> >
>> > Btw, I am attaching the server logs before that 'INFO: Waiting until we
>> see more replicas'  message.From the logs, we can see that leader election
>> process started on 003 which was the replica for 001 initially.That means
>> leader 001 went down at that time?
>> >
>> > logs on 003:
>> > ========
>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> runLeaderProcess
>> >         INFO: Running the leader process.
>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> shouldIBeLeader
>> >         INFO: Checking if I should try and be the leader.
>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> shouldIBeLeader
>> >         INFO: My last published State was Active, it's okay to be the
>> leader.
>> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> runLeaderProcess
>> >         INFO: I may be the new leader - try and sync
>> > 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
>> >         WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core
>> core=core1.
>> > 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
>> >         INFO: Sync replicas to http://<003>:8080/solr/core1/
>> > 12:11:16 PM org.apache.solr.update.PeerSync sync
>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
>> replicas=[<001>:8080/solr/core1/] nUpdates=100
>> > 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>> >         INFO: Updating live nodes -> this message is on 002
>> > 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
>> >         WARNING: PeerSync: core=core1 url=http://<003>:8080/solr
>>  exception talking to <001>:8080/solr/core1/, failed
>> >         org.apache.solr.client.solrj.SolrServerException: Timeout
>> occured while waiting response from server at: <001>:8080/solr/core1
>> >               at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
>> >               at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> >               at
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>> >               at
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>> >               at java.util.concurrent.FutureTask$Sync.innerRun(Unknown
>> Source)
>> >               at java.util.concurrent.FutureTask.run(Unknown Source)
>> >               at
>> java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>> >               at java.util.concurrent.FutureTask$Sync.innerRun(Unknown
>> Source)
>> >               at java.util.concurrent.FutureTask.run(Unknown Source)
>> >               at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>> >               at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> >               at java.lang.Thread.run(Unknown Source)
>> >         Caused by: java.net.SocketTimeoutException: Read timed out
>> >               at java.net.SocketInputStream.socketRead0(Native Method)
>> >               at java.net.SocketInputStream.read(Unknown Source)
>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE.
>> sync failed
>> > 12:11:46 PM org.apache.solr.common.SolrException log
>> >         SEVERE: Sync Failed
>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> rejoinLeaderElection
>> >         INFO: There is a better leader candidate than us - going back
>> into recovery
>> > 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
>> >         INFO: Running recovery - first canceling any ongoing recovery
>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
>> >         INFO: Starting recovery process.  core=core1
>> recoveringAfterStartup=false
>> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>> >         INFO: Attempting to PeerSync from <001>:8080/solr/core1/
>> core=core1 - recoveringAfterStartup=false
>> > 12:11:46 PM org.apache.solr.update.PeerSync sync
>> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
>> replicas=[<001>:8080/solr/core1/] nUpdates=100
>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> runLeaderProcess
>> >         INFO: Running the leader process.
>> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> waitForReplicasToComeUp
>> >         INFO: Waiting until we see more replicas up: total=2 found=1
>> timeoutin=179999
>> > 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> waitForReplicasToComeUp
>> >         INFO: Waiting until we see more replicas up: total=2 found=1
>> timeoutin=179495
>> > 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
>> waitForReplicasToComeUp
>> >         INFO: Waiting until we see more replicas up: total=2 found=1
>> timeoutin=178985
>> > ....
>> > ....
>> >
>> > Thanks for your help.
>> > Sudhakar.
>> >
>> > On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> > The waiting logging had to happen on restart unless it's some kind of
>> bug.
>> >
>> > Beyond that, something is off, but I have no clue why - it seems your
>> clusterstate.json is not up to date at all.
>> >
>> > Have you tried restarting the cluster then? Does that help at all?
>> >
>> > Do you see any exceptions around zookeeper session timeouts?
>> >
>> > - Mark
>> >
>> > On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <ma...@gmail.com>
>> wrote:
>> >
>> > > Hey Mark,
>> > >
>> > > Yes, I am able to access all of the nodes under each shard from
>> solrcloud
>> > > admin UI.
>> > >
>> > >
>> > >   - *It kind of looks like the urls solrcloud is using are not
>> accessible.
>> > >   When you go to the admin page and the cloud tab, can you access the
>> urls it
>> > >   shows for each shard? That is, if you click on of the links or copy
>> and
>> > >   paste the address into a web browser, does it work?*
>> > >
>> > > Actually, I got these errors when my document upload task/job was
>> running,
>> > > not during the cluster restart. Also,job ran fine initially for the
>> first
>> > > one hour and started throwing these errors after indexing some docx.
>> > >
>> > > Thx, Sudhakar.
>> > >
>> > >
>> > >
>> > >
>> > > On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <ma...@gmail.com>
>> wrote:
>> > >
>> > >> It kind of looks like the urls solrcloud is using are not
>> accessible. When
>> > >> you go to the admin page and the cloud tab, can you access the urls
>> it
>> > >> shows for each shard? That is, if you click on of the links or copy
>> and
>> > >> paste the address into a web browser, does it work?
>> > >>
>> > >> You may have to explicitly set the host= in solr.xml if it's not auto
>> > >> detecting the right one. Make sure the ports like right too.
>> > >>
>> > >>> waitForReplicasToComeUp
>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
>> > >>> timeoutin=179999
>> > >>
>> > >> That happens when you stop the cluster and try to start it again -
>> before
>> > >> a leader is chosen, it will wait for all known replicas fora shard
>> to come
>> > >> up so that everyone can sync up and have a chance to be the best
>> leader. So
>> > >> at this point it was only finding one of 2 known replicas and
>> waiting for
>> > >> the second to come up. After a couple minutes (configurable) it will
>> just
>> > >> continue anyway without the missing replica (if it doesn't show up).
>> > >>
>> > >> - Mark
>> > >>
>> > >> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <
>> maddinenidev@gmail.com>
>> > >> wrote:
>> > >>
>> > >>> Hi,
>> > >>> We are uploading solr documents to the index in batches using 30
>> threads
>> > >>> and using ThreadPoolExecutor, LinkedBlockingQueue with max limit
>> set to
>> > >>> 10000.
>> > >>> In the code, we are using HttpSolrServer and add(inputDoc) method
>> to add
>> > >>> docx.
>> > >>> And, we have the following commit settings in solrconfig:
>> > >>>
>> > >>>    <autoCommit>
>> > >>>      <maxTime>300000</maxTime>
>> > >>>      <maxDocs>10000</maxDocs>
>> > >>>      <openSearcher>false</openSearcher>
>> > >>>    </autoCommit>
>> > >>>
>> > >>>      <autoSoftCommit>
>> > >>>        <maxTime>1000</maxTime>
>> > >>>      </autoSoftCommit>
>> > >>>
>> > >>> Cluster Details:
>> > >>> ----------------------------
>> > >>> solr version - 4.0
>> > >>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
>> > >>> numshards=2 ,
>> > >>> 001, 002, 003 are the solr nodes and these three are behind the
>> > >>> loadbalancer  <vip>
>> > >>> 001, 003 assigned to shard1; 002 assigned to shard2
>> > >>>
>> > >>>
>> > >>> Logs:Getting the errors in the below sequence after uploading some
>> docx:
>> > >>>
>> > >>
>> -----------------------------------------------------------------------------------------------------------
>> > >>> 003
>> > >>> Dec 4, 2012 12:11:46 PM
>> org.apache.solr.cloud.ShardLeaderElectionContext
>> > >>> waitForReplicasToComeUp
>> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
>> > >>> timeoutin=179999
>> > >>>
>> > >>> 001
>> > >>> Dec 4, 2012 12:12:59 PM
>> > >>> org.apache.solr.update.processor.DistributedUpdateProcessor
>> > >>> doDefensiveChecks
>> > >>> SEVERE: ClusterState says we are the leader, but locally we don't
>> think
>> > >> so
>> > >>>
>> > >>> 003
>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>> > >>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed -
>> retrying ...
>> > >>>
>> > >>> 001
>> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
>> > >>> SEVERE: Error uploading: org.apache.solr.common.SolrException:
>> Server at
>> > >>> <vip>/solr/core1. returned non ok status:503, message:Service
>> Unavailable
>> > >>> at
>> > >>>
>> > >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>> > >>> at
>> > >>>
>> > >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> > >>> 001
>> > >>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
>> > >>> SEVERE: Error while trying to recover.
>> > >>> core=core1:org.apache.solr.common.SolrException: We are not the
>> leader
>> > >>> at
>> > >>>
>> > >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
>> > >>>
>> > >>> 001
>> > >>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
>> > >>> SEVERE: Error uploading:
>> > >> org.apache.solr.client.solrj.SolrServerException:
>> > >>> IOException occured when talking to server at <vip>/solr/core1
>> > >>> at
>> > >>>
>> > >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
>> > >>> at
>> > >>>
>> > >>
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> > >>> at
>> > >>>
>> > >>
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>> > >>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
>> > >>> ... 5 lines omitted ...
>> > >>> at java.lang.Thread.run(Unknown Source)
>> > >>> Caused by: java.net.SocketException: Connection reset
>> > >>>
>> > >>>
>> > >>> After sometime, all the three servers are going down.
>> > >>>
>> > >>> Appreciate, if someone could let us know what we are missing.
>> > >>>
>> > >>> Thx,Sudhakar.
>> > >>
>> > >>
>> >
>> >
>> > <logs_error.txt>
>>
>>
>

Re: SolrCloud - ClusterState says we are the leader,but locally ...

Posted by Sudhakar Maddineni <ma...@gmail.com>.
Erick:
  Not seeing any page-caching-related issues...

Mark:
  1. Would this "waiting" on 003 (the replica) cause any inconsistencies in the
ZooKeeper cluster state? I was also looking at the leader (001) logs at that
time and saw errors related to "SEVERE: ClusterState says we are the
leader, but locally we don't think so".
  2. Also, all of our servers in the cluster went down while the index
updates were running in parallel with this issue. Do you see this as
related to the session expiry on 001?

Here are the logs on 001
=========================

Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
amILeader
WARNING:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer_elect/leader
at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater
amILeader
INFO: According to ZK I
(id=232887758696546307-<001>:8080_solr-n_0000000005) am no longer a leader.

Dec 4, 2012 12:12:29 PM org.apache.solr.cloud.OverseerCollectionProcessor
run
WARNING: Overseer cannot talk to ZK

Dec 4, 2012 12:13:00 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in
zk:org.apache.solr.common.SolrException: Could not get leader props
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
Dec 4, 2012 12:13:32 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in
zk:org.apache.solr.common.SolrException: Could not get leader props
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
Dec 4, 2012 12:15:17 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem making a request to the
leader:org.apache.solr.common.SolrException: I was asked to wait on state
down for <001>:8080_solr but I still do not see the request state. I see
state: active live:true
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
Dec 4, 2012 12:15:50 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem making a request to the
leader:org.apache.solr.common.SolrException: I was asked to wait on state
down for <001>:8080_solr but I still do not see the request state. I see
state: active live:true
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
....
....
Dec 4, 2012 12:19:10 PM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in
zk:org.apache.solr.common.SolrException: Could not get leader props
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
....
....
Dec 4, 2012 12:21:24 PM org.apache.solr.common.SolrException log
SEVERE: :org.apache.solr.common.SolrException: There was a problem finding
the leader in zk
at
org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
at
org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ZkController getLeader
SEVERE: Error getting leader from zk
org.apache.solr.common.SolrException: There is conflicting information
about the leader of shard: shard1 our state says:http://<001>:8080/solr/core1/
but zookeeper says:http://<003>:8080/solr/core1/
 at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:647)
 at org.apache.solr.cloud.ZkController.register(ZkController.java:577)
Dec 4, 2012 12:22:30 PM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
INFO: Running the leader process.
....
....
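
On the client side, if a single failed add() aborts the upload job (as it
seems to in our runs), a retry-with-backoff wrapper might at least ride out
a leader change instead of failing the whole batch. Purely a sketch - the
helper name, retry count and sleep times below are made up, not what we run
today:

    import java.io.IOException;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.common.SolrException;
    import org.apache.solr.common.SolrInputDocument;

    public class RetryingAdd {
        // Retries add() so a transient failure during a leader change does not
        // kill the upload job. maxRetries and the backoff are illustrative only.
        public static void addWithRetry(SolrServer server, SolrInputDocument doc,
                                        int maxRetries)
                throws SolrServerException, IOException {
            for (int attempt = 1; ; attempt++) {
                try {
                    server.add(doc);
                    return;
                } catch (SolrServerException e) {   // e.g. connection reset / timeout
                    if (attempt > maxRetries) throw e;
                    backOff(attempt);
                } catch (SolrException e) {         // e.g. a 503 Service Unavailable response
                    if (attempt > maxRetries) throw e;
                    backOff(attempt);
                }
            }
        }

        private static void backOff(int attempt) {
            try {
                Thread.sleep(1000L * attempt);      // simple linear backoff
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }
    }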

Thanks for your inputs.
Sudhakar.







On Thu, Dec 6, 2012 at 5:35 PM, Mark Miller <ma...@gmail.com> wrote:

> Yes - it means that 001 went down (or more likely had it's connection to
> ZooKeeper interrupted? that's what I mean about a session timeout - if the
> solr->zk link is broken for longer than the session timeout that will
> trigger a leader election and when the connection is reestablished, the
> node will have to recover). That waiting should stop as soon as 001 came
> back up or reconnected to ZooKeeper.
>
> In fact, this waiting should not happen in this case - but only on cluster
> restart. This is a bug that is fixed in 4.1 (hopefully coming very soon!):
>
> * SOLR-3940: Rejoining the leader election incorrectly triggers the code
> path
>   for a fresh cluster start rather than fail over. (Mark Miller)
>
> - Mark
>
> On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <ma...@gmail.com>
> wrote:
>
> > Yep, after restarting, cluster came back to normal state.We will run
> couple of more tests and see if we could reproduce this issue.
> >
> > Btw, I am attaching the server logs before that 'INFO: Waiting until we
> see more replicas'  message.From the logs, we can see that leader election
> process started on 003 which was the replica for 001 initially.That means
> leader 001 went down at that time?
> >
> > logs on 003:
> > ========
> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> runLeaderProcess
> >         INFO: Running the leader process.
> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> shouldIBeLeader
> >         INFO: Checking if I should try and be the leader.
> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> shouldIBeLeader
> >         INFO: My last published State was Active, it's okay to be the
> leader.
> > 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
> runLeaderProcess
> >         INFO: I may be the new leader - try and sync
> > 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
> >         WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core
> core=core1.
> > 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
> >         INFO: Sync replicas to http://<003>:8080/solr/core1/
> > 12:11:16 PM org.apache.solr.update.PeerSync sync
> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
> replicas=[<001>:8080/solr/core1/] nUpdates=100
> > 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> >         INFO: Updating live nodes -> this message is on 002
> > 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
> >         WARNING: PeerSync: core=core1 url=http://<003>:8080/solr
>  exception talking to <001>:8080/solr/core1/, failed
> >         org.apache.solr.client.solrj.SolrServerException: Timeout
> occured while waiting response from server at: <001>:8080/solr/core1
> >               at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
> >               at
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >               at
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
> >               at
> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
> >               at java.util.concurrent.FutureTask$Sync.innerRun(Unknown
> Source)
> >               at java.util.concurrent.FutureTask.run(Unknown Source)
> >               at
> java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> >               at java.util.concurrent.FutureTask$Sync.innerRun(Unknown
> Source)
> >               at java.util.concurrent.FutureTask.run(Unknown Source)
> >               at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
> >               at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> >               at java.lang.Thread.run(Unknown Source)
> >         Caused by: java.net.SocketTimeoutException: Read timed out
> >               at java.net.SocketInputStream.socketRead0(Native Method)
> >               at java.net.SocketInputStream.read(Unknown Source)
> > 12:11:46 PM org.apache.solr.update.PeerSync sync
> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE.
> sync failed
> > 12:11:46 PM org.apache.solr.common.SolrException log
> >         SEVERE: Sync Failed
> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> rejoinLeaderElection
> >         INFO: There is a better leader candidate than us - going back
> into recovery
> > 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
> >         INFO: Running recovery - first canceling any ongoing recovery
> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
> >         INFO: Starting recovery process.  core=core1
> recoveringAfterStartup=false
> > 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
> >         INFO: Attempting to PeerSync from <001>:8080/solr/core1/
> core=core1 - recoveringAfterStartup=false
> > 12:11:46 PM org.apache.solr.update.PeerSync sync
> >         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
> replicas=[<001>:8080/solr/core1/] nUpdates=100
> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> runLeaderProcess
> >         INFO: Running the leader process.
> > 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> waitForReplicasToComeUp
> >         INFO: Waiting until we see more replicas up: total=2 found=1
> timeoutin=179999
> > 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
> waitForReplicasToComeUp
> >         INFO: Waiting until we see more replicas up: total=2 found=1
> timeoutin=179495
> > 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
> waitForReplicasToComeUp
> >         INFO: Waiting until we see more replicas up: total=2 found=1
> timeoutin=178985
> > ....
> > ....
> >
> > Thanks for your help.
> > Sudhakar.
> >
> > On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <ma...@gmail.com>
> wrote:
> > The waiting logging had to happen on restart unless it's some kind of
> bug.
> >
> > Beyond that, something is off, but I have no clue why - it seems your
> clusterstate.json is not up to date at all.
> >
> > Have you tried restarting the cluster then? Does that help at all?
> >
> > Do you see any exceptions around zookeeper session timeouts?
> >
> > - Mark
> >
> > On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <ma...@gmail.com>
> wrote:
> >
> > > Hey Mark,
> > >
> > > Yes, I am able to access all of the nodes under each shard from
> solrcloud
> > > admin UI.
> > >
> > >
> > >   - *It kind of looks like the urls solrcloud is using are not
> accessible.
> > >   When you go to the admin page and the cloud tab, can you access the
> urls it
> > >   shows for each shard? That is, if you click on of the links or copy
> and
> > >   paste the address into a web browser, does it work?*
> > >
> > > Actually, I got these errors when my document upload task/job was
> running,
> > > not during the cluster restart. Also,job ran fine initially for the
> first
> > > one hour and started throwing these errors after indexing some docx.
> > >
> > > Thx, Sudhakar.
> > >
> > >
> > >
> > >
> > > On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <ma...@gmail.com>
> wrote:
> > >
> > >> It kind of looks like the urls solrcloud is using are not accessible.
> When
> > >> you go to the admin page and the cloud tab, can you access the urls it
> > >> shows for each shard? That is, if you click on of the links or copy
> and
> > >> paste the address into a web browser, does it work?
> > >>
> > >> You may have to explicitly set the host= in solr.xml if it's not auto
> > >> detecting the right one. Make sure the ports like right too.
> > >>
> > >>> waitForReplicasToComeUp
> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
> > >>> timeoutin=179999
> > >>
> > >> That happens when you stop the cluster and try to start it again -
> before
> > >> a leader is chosen, it will wait for all known replicas fora shard to
> come
> > >> up so that everyone can sync up and have a chance to be the best
> leader. So
> > >> at this point it was only finding one of 2 known replicas and waiting
> for
> > >> the second to come up. After a couple minutes (configurable) it will
> just
> > >> continue anyway without the missing replica (if it doesn't show up).
> > >>
> > >> - Mark
> > >>
> > >> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <
> maddinenidev@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi,
> > >>> We are uploading solr documents to the index in batches using 30
> threads
> > >>> and using ThreadPoolExecutor, LinkedBlockingQueue with max limit set
> to
> > >>> 10000.
> > >>> In the code, we are using HttpSolrServer and add(inputDoc) method to
> add
> > >>> docx.
> > >>> And, we have the following commit settings in solrconfig:
> > >>>
> > >>>    <autoCommit>
> > >>>      <maxTime>300000</maxTime>
> > >>>      <maxDocs>10000</maxDocs>
> > >>>      <openSearcher>false</openSearcher>
> > >>>    </autoCommit>
> > >>>
> > >>>      <autoSoftCommit>
> > >>>        <maxTime>1000</maxTime>
> > >>>      </autoSoftCommit>
> > >>>
> > >>> Cluster Details:
> > >>> ----------------------------
> > >>> solr version - 4.0
> > >>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
> > >>> numshards=2 ,
> > >>> 001, 002, 003 are the solr nodes and these three are behind the
> > >>> loadbalancer  <vip>
> > >>> 001, 003 assigned to shard1; 002 assigned to shard2
> > >>>
> > >>>
> > >>> Logs:Getting the errors in the below sequence after uploading some
> docx:
> > >>>
> > >>
> -----------------------------------------------------------------------------------------------------------
> > >>> 003
> > >>> Dec 4, 2012 12:11:46 PM
> org.apache.solr.cloud.ShardLeaderElectionContext
> > >>> waitForReplicasToComeUp
> > >>> INFO: Waiting until we see more replicas up: total=2 found=1
> > >>> timeoutin=179999
> > >>>
> > >>> 001
> > >>> Dec 4, 2012 12:12:59 PM
> > >>> org.apache.solr.update.processor.DistributedUpdateProcessor
> > >>> doDefensiveChecks
> > >>> SEVERE: ClusterState says we are the leader, but locally we don't
> think
> > >> so
> > >>>
> > >>> 003
> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> > >>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed -
> retrying ...
> > >>>
> > >>> 001
> > >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> > >>> SEVERE: Error uploading: org.apache.solr.common.SolrException:
> Server at
> > >>> <vip>/solr/core1. returned non ok status:503, message:Service
> Unavailable
> > >>> at
> > >>>
> > >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
> > >>> at
> > >>>
> > >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> > >>> 001
> > >>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
> > >>> SEVERE: Error while trying to recover.
> > >>> core=core1:org.apache.solr.common.SolrException: We are not the
> leader
> > >>> at
> > >>>
> > >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> > >>>
> > >>> 001
> > >>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
> > >>> SEVERE: Error uploading:
> > >> org.apache.solr.client.solrj.SolrServerException:
> > >>> IOException occured when talking to server at <vip>/solr/core1
> > >>> at
> > >>>
> > >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
> > >>> at
> > >>>
> > >>
> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> > >>> at
> > >>>
> > >>
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> > >>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
> > >>> ... 5 lines omitted ...
> > >>> at java.lang.Thread.run(Unknown Source)
> > >>> Caused by: java.net.SocketException: Connection reset
> > >>>
> > >>>
> > >>> After sometime, all the three servers are going down.
> > >>>
> > >>> Appreciate, if someone could let us know what we are missing.
> > >>>
> > >>> Thx,Sudhakar.
> > >>
> > >>
> >
> >
> > <logs_error.txt>
>
>

Re: SolrCloud - ClusterState says we are the leader,but locally ...

Posted by Mark Miller <ma...@gmail.com>.
Yes - it means that 001 went down (or more likely had its connection to ZooKeeper interrupted? That's what I mean about a session timeout - if the solr->zk link is broken for longer than the session timeout, that will trigger a leader election, and when the connection is reestablished, the node will have to recover). That waiting should stop as soon as 001 came back up or reconnected to ZooKeeper.

In fact, this waiting should not happen in this case - but only on cluster restart. This is a bug that is fixed in 4.1 (hopefully coming very soon!):

* SOLR-3940: Rejoining the leader election incorrectly triggers the code path
  for a fresh cluster start rather than fail over. (Mark Miller)

- Mark

On Dec 5, 2012, at 9:41 PM, Sudhakar Maddineni <ma...@gmail.com> wrote:

> Yep, after restarting, cluster came back to normal state.We will run couple of more tests and see if we could reproduce this issue.
> 
> Btw, I am attaching the server logs before that 'INFO: Waiting until we see more replicas'  message.From the logs, we can see that leader election process started on 003 which was the replica for 001 initially.That means leader 001 went down at that time?
> 
> logs on 003:
> ========
> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>         INFO: Running the leader process.
> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>         INFO: Checking if I should try and be the leader.
> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>         INFO: My last published State was Active, it's okay to be the leader.
> 12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>         INFO: I may be the new leader - try and sync
> 12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
>         WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core core=core1.
> 12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
>         INFO: Sync replicas to http://<003>:8080/solr/core1/
> 12:11:16 PM org.apache.solr.update.PeerSync sync
>         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
> 12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>         INFO: Updating live nodes -> this message is on 002
> 12:11:46 PM org.apache.solr.update.PeerSync handleResponse
>         WARNING: PeerSync: core=core1 url=http://<003>:8080/solr  exception talking to <001>:8080/solr/core1/, failed
>         org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: <001>:8080/solr/core1
>         	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
>         	at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>         	at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>         	at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>         	at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>         	at java.util.concurrent.FutureTask.run(Unknown Source)
>         	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>         	at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>         	at java.util.concurrent.FutureTask.run(Unknown Source)
>         	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>         	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         	at java.lang.Thread.run(Unknown Source)
>         Caused by: java.net.SocketTimeoutException: Read timed out
>         	at java.net.SocketInputStream.socketRead0(Native Method)
>         	at java.net.SocketInputStream.read(Unknown Source)
> 12:11:46 PM org.apache.solr.update.PeerSync sync
>         INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. sync failed
> 12:11:46 PM org.apache.solr.common.SolrException log
>         SEVERE: Sync Failed
> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext rejoinLeaderElection
>         INFO: There is a better leader candidate than us - going back into recovery
> 12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
>         INFO: Running recovery - first canceling any ongoing recovery
> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
>         INFO: Starting recovery process.  core=core1 recoveringAfterStartup=false
> 12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>         INFO: Attempting to PeerSync from <001>:8080/solr/core1/ core=core1 - recoveringAfterStartup=false
> 12:11:46 PM org.apache.solr.update.PeerSync sync
>         INFO: PeerSync: core=core1 url=http://<003>:8080/solr START replicas=[<001>:8080/solr/core1/] nUpdates=100
> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>         INFO: Running the leader process.
> 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>         INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179999
> 12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>         INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=179495
> 12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext waitForReplicasToComeUp
>         INFO: Waiting until we see more replicas up: total=2 found=1 timeoutin=178985
> ....
> ....
> 
> Thanks for your help.
> Sudhakar.
> 
> On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <ma...@gmail.com> wrote:
> The waiting logging had to happen on restart unless it's some kind of bug.
> 
> Beyond that, something is off, but I have no clue why - it seems your clusterstate.json is not up to date at all.
> 
> Have you tried restarting the cluster then? Does that help at all?
> 
> Do you see any exceptions around zookeeper session timeouts?
> 
> - Mark
> 
> On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <ma...@gmail.com> wrote:
> 
> > Hey Mark,
> >
> > Yes, I am able to access all of the nodes under each shard from solrcloud
> > admin UI.
> >
> >
> >   - *It kind of looks like the urls solrcloud is using are not accessible.
> >   When you go to the admin page and the cloud tab, can you access the urls it
> >   shows for each shard? That is, if you click on of the links or copy and
> >   paste the address into a web browser, does it work?*
> >
> > Actually, I got these errors when my document upload task/job was running,
> > not during the cluster restart. Also,job ran fine initially for the first
> > one hour and started throwing these errors after indexing some docx.
> >
> > Thx, Sudhakar.
> >
> >
> >
> >
> > On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <ma...@gmail.com> wrote:
> >
> >> It kind of looks like the urls solrcloud is using are not accessible. When
> >> you go to the admin page and the cloud tab, can you access the urls it
> >> shows for each shard? That is, if you click on of the links or copy and
> >> paste the address into a web browser, does it work?
> >>
> >> You may have to explicitly set the host= in solr.xml if it's not auto
> >> detecting the right one. Make sure the ports like right too.
> >>
> >>> waitForReplicasToComeUp
> >>> INFO: Waiting until we see more replicas up: total=2 found=1
> >>> timeoutin=179999
> >>
> >> That happens when you stop the cluster and try to start it again - before
> >> a leader is chosen, it will wait for all known replicas fora shard to come
> >> up so that everyone can sync up and have a chance to be the best leader. So
> >> at this point it was only finding one of 2 known replicas and waiting for
> >> the second to come up. After a couple minutes (configurable) it will just
> >> continue anyway without the missing replica (if it doesn't show up).
> >>
> >> - Mark
> >>
> >> On Dec 5, 2012, at 4:21 PM, Sudhakar Maddineni <ma...@gmail.com>
> >> wrote:
> >>
> >>> Hi,
> >>> We are uploading solr documents to the index in batches using 30 threads
> >>> and using ThreadPoolExecutor, LinkedBlockingQueue with max limit set to
> >>> 10000.
> >>> In the code, we are using HttpSolrServer and add(inputDoc) method to add
> >>> docx.
> >>> And, we have the following commit settings in solrconfig:
> >>>
> >>>    <autoCommit>
> >>>      <maxTime>300000</maxTime>
> >>>      <maxDocs>10000</maxDocs>
> >>>      <openSearcher>false</openSearcher>
> >>>    </autoCommit>
> >>>
> >>>      <autoSoftCommit>
> >>>        <maxTime>1000</maxTime>
> >>>      </autoSoftCommit>
> >>>
> >>> Cluster Details:
> >>> ----------------------------
> >>> solr version - 4.0
> >>> zookeeper version - 3.4.3 [zookeeper ensemble with 3 nodes]
> >>> numshards=2 ,
> >>> 001, 002, 003 are the solr nodes and these three are behind the
> >>> loadbalancer  <vip>
> >>> 001, 003 assigned to shard1; 002 assigned to shard2
> >>>
> >>>
> >>> Logs:Getting the errors in the below sequence after uploading some docx:
> >>>
> >> -----------------------------------------------------------------------------------------------------------
> >>> 003
> >>> Dec 4, 2012 12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
> >>> waitForReplicasToComeUp
> >>> INFO: Waiting until we see more replicas up: total=2 found=1
> >>> timeoutin=179999
> >>>
> >>> 001
> >>> Dec 4, 2012 12:12:59 PM
> >>> org.apache.solr.update.processor.DistributedUpdateProcessor
> >>> doDefensiveChecks
> >>> SEVERE: ClusterState says we are the leader, but locally we don't think
> >> so
> >>>
> >>> 003
> >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> >>> SEVERE: forwarding update to <001>:8080/solr/core1/ failed - retrying ...
> >>>
> >>> 001
> >>> Dec 4, 2012 12:12:59 PM org.apache.solr.common.SolrException log
> >>> SEVERE: Error uploading: org.apache.solr.common.SolrException: Server at
> >>> <vip>/solr/core1. returned non ok status:503, message:Service Unavailable
> >>> at
> >>>
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
> >>> at
> >>>
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>> 001
> >>> Dec 4, 2012 12:25:45 PM org.apache.solr.common.SolrException log
> >>> SEVERE: Error while trying to recover.
> >>> core=core1:org.apache.solr.common.SolrException: We are not the leader
> >>> at
> >>>
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:401)
> >>>
> >>> 001
> >>> Dec 4, 2012 12:44:38 PM org.apache.solr.common.SolrException log
> >>> SEVERE: Error uploading:
> >> org.apache.solr.client.solrj.SolrServerException:
> >>> IOException occured when talking to server at <vip>/solr/core1
> >>> at
> >>>
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413)
> >>> at
> >>>
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>> at
> >>>
> >> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >>> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
> >>> ... 5 lines omitted ...
> >>> at java.lang.Thread.run(Unknown Source)
> >>> Caused by: java.net.SocketException: Connection reset
> >>>
> >>>
> >>> After sometime, all the three servers are going down.
> >>>
> >>> Appreciate, if someone could let us know what we are missing.
> >>>
> >>> Thx,Sudhakar.
> >>
> >>
> 
> 
> <logs_error.txt>


Re: SolrCloud - ClusterState says we are the leader,but locally ...

Posted by Sudhakar Maddineni <ma...@gmail.com>.
Yep, after restarting, the cluster came back to a normal state. We will run a
couple more tests and see if we can reproduce this issue.

Btw, I am attaching the server logs from before that 'INFO: Waiting until we
see more replicas' message. From the logs, we can see that the leader election
process started on 003, which was the replica for 001 initially. Does that mean
leader 001 went down at that time?

logs on 003:
========
12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
        INFO: Running the leader process.
12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
        INFO: Checking if I should try and be the leader.
12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
        INFO: My last published State was Active, it's okay to be the
leader.
12:11:16 PM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
        INFO: I may be the new leader - try and sync
12:11:16 PM org.apache.solr.cloud.RecoveryStrategy close
        WARNING: Stopping recovery for zkNodeName=<003>:8080_solr_core
core=core1.
12:11:16 PM org.apache.solr.cloud.SyncStrategy sync
        INFO: Sync replicas to http://<003>:8080/solr/core1/
12:11:16 PM org.apache.solr.update.PeerSync sync
        INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
replicas=[<001>:8080/solr/core1/] nUpdates=100
12:11:16 PM org.apache.solr.common.cloud.ZkStateReader$3 process
        INFO: Updating live nodes -> this message is on 002
12:11:46 PM org.apache.solr.update.PeerSync handleResponse
        WARNING: PeerSync: core=core1 url=http://<003>:8080/solr  exception
talking to <001>:8080/solr/core1/, failed
        org.apache.solr.client.solrj.SolrServerException: Timeout occured
while waiting response from server at: <001>:8080/solr/core1
        at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:409)
        at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
        at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
        at
org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
Source)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)
        at java.lang.Thread.run(Unknown Source)
        Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(Unknown Source)
12:11:46 PM org.apache.solr.update.PeerSync sync
        INFO: PeerSync: core=core1 url=http://<003>:8080/solr DONE. sync
failed
12:11:46 PM org.apache.solr.common.SolrException log
        SEVERE: Sync Failed
12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
rejoinLeaderElection
        INFO: There is a better leader candidate than us - going back into
recovery
12:11:46 PM org.apache.solr.update.DefaultSolrCoreState doRecovery
        INFO: Running recovery - first canceling any ongoing recovery
12:11:46 PM org.apache.solr.cloud.RecoveryStrategy run
        INFO: Starting recovery process.  core=core1
recoveringAfterStartup=false
12:11:46 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
        INFO: Attempting to PeerSync from <001>:8080/solr/core1/ core=core1
- recoveringAfterStartup=false
12:11:46 PM org.apache.solr.update.PeerSync sync
        INFO: PeerSync: core=core1 url=http://<003>:8080/solr START
replicas=[<001>:8080/solr/core1/] nUpdates=100
12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
runLeaderProcess
        INFO: Running the leader process.
12:11:46 PM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
        INFO: *Waiting until we see more replicas up: total=2 found=1
timeoutin=179999*
12:11:47 PM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
        INFO: *Waiting until we see more replicas up: total=2 found=1
timeoutin=179495*
12:11:48 PM org.apache.solr.cloud.ShardLeaderElectionContext
waitForReplicasToComeUp
        INFO: *Waiting until we see more replicas up: total=2 found=1
timeoutin=178985*
....
....
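
In case it helps narrow this down, this is roughly how we plan to double-check,
the next time this happens, whether 001 had dropped out of /live_nodes and which
node ZooKeeper had registered as the shard1 leader. It is only a minimal sketch:
the zkHost string and the collection name are placeholders for our real values,
and the /collections/<collection>/leaders/shard1 path assumes the default 4.x
znode layout.

import java.util.List;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkLeaderCheck {
  public static void main(String[] args) throws Exception {
    // Placeholders - point these at the real ensemble and collection.
    String zkHost = "zk1:2181,zk2:2181,zk3:2181";
    String collection = "collection1";

    ZooKeeper zk = new ZooKeeper(zkHost, 15000, new Watcher() {
      public void process(WatchedEvent event) { /* read-only, nothing to watch */ }
    });
    try {
      // Nodes ZooKeeper currently considers live.
      List<String> liveNodes = zk.getChildren("/live_nodes", false);
      System.out.println("live nodes: " + liveNodes);

      // Leader registration for shard1 (default 4.x layout assumed).
      byte[] leader = zk.getData("/collections/" + collection + "/leaders/shard1",
          false, null);
      System.out.println("shard1 leader: " + new String(leader, "UTF-8"));
    } finally {
      zk.close();
    }
  }
}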

Thanks for your help.
Sudhakar.

On Wed, Dec 5, 2012 at 6:19 PM, Mark Miller <ma...@gmail.com> wrote:

> The waiting logging had to happen on restart unless it's some kind of bug.
>
> Beyond that, something is off, but I have no clue why - it seems your
> clusterstate.json is not up to date at all.
>
> Have you tried restarting the cluster then? Does that help at all?
>
> Do you see any exceptions around zookeeper session timeouts?
>
> - Mark
>

Re: SolrCloud - ClusterState says we are the leader,but locally ...

Posted by Mark Miller <ma...@gmail.com>.
The waiting logging had to happen on restart unless it's some kind of bug.

Beyond that, something is off, but I have no clue why - it seems your clusterstate.json is not up to date at all.
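
If it helps, the rough sketch below will dump whatever ZooKeeper currently has
in /clusterstate.json, so you can compare it against what the cloud tab and the
individual nodes think. The zkHost string is just a placeholder for whatever
you pass to Solr; treat it as a sketch, not anything official.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DumpClusterState {
  public static void main(String[] args) throws Exception {
    // Placeholder - use the same zkHost string the Solr nodes use.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new Watcher() {
      public void process(WatchedEvent event) { /* no-op */ }
    });
    try {
      // clusterstate.json is where SolrCloud 4.x keeps its view of shards and leaders.
      byte[] data = zk.getData("/clusterstate.json", false, null);
      System.out.println(new String(data, "UTF-8"));
    } finally {
      zk.close();
    }
  }
}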

Have you tried restarting the cluster then? Does that help at all?

Do you see any exceptions around zookeeper session timeouts?

- Mark

On Dec 5, 2012, at 4:57 PM, Sudhakar Maddineni <ma...@gmail.com> wrote:

> Hey Mark,
> 
> Yes, I am able to access all of the nodes under each shard from solrcloud
> admin UI.
> 
> 
>   - *It kind of looks like the urls solrcloud is using are not accessible.
>   When you go to the admin page and the cloud tab, can you access the urls it
>   shows for each shard? That is, if you click one of the links or copy and
>   paste the address into a web browser, does it work?*
> 
> Actually, I got these errors when my document upload task/job was running,
> not during the cluster restart. Also, the job ran fine initially for the
> first hour and started throwing these errors after indexing some docx.
> 
> Thx, Sudhakar.


Re: SolrCloud - ClusterState says we are the leader,but locally ...

Posted by Sudhakar Maddineni <ma...@gmail.com>.
Hey Mark,

Yes, I am able to access all of the nodes under each shard from solrcloud
admin UI.


   - *It kind of looks like the urls solrcloud is using are not accessible.
   When you go to the admin page and the cloud tab, can you access the urls it
   shows for each shard? That is, if you click one of the links or copy and
   paste the address into a web browser, does it work?*

Actually, I got these errors when my document upload task/job was running,
not during the cluster restart. Also, the job ran fine initially for the
first hour and started throwing these errors after indexing some docx.
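
For what it's worth, below is a minimal sketch of the kind of guard we are
thinking about putting around the add() calls in the upload threads: retry a
couple of times with a short backoff when the VIP answers 503 during a leader
change, instead of failing the whole batch. The URL, timeouts, and field names
here are placeholders, not our actual code.

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class GuardedUpload {
  public static void main(String[] args) throws Exception {
    // Placeholder VIP URL and timeouts.
    HttpSolrServer server = new HttpSolrServer("http://vip:8080/solr/core1");
    server.setConnectionTimeout(5000);  // ms to establish the connection
    server.setSoTimeout(30000);         // ms to wait for a response

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");        // hypothetical fields, for illustration only
    doc.addField("title", "example");

    try {
      addWithRetry(server, doc, 3);
    } finally {
      server.shutdown();
    }
  }

  // Retry a few times when the server answers 503, e.g. while a leader changes.
  static void addWithRetry(HttpSolrServer server, SolrInputDocument doc, int attempts)
      throws Exception {
    for (int i = 1; i <= attempts; i++) {
      try {
        server.add(doc);
        return;
      } catch (SolrException e) {
        if (e.code() != 503 || i == attempts) throw e;  // only retry Service Unavailable
      } catch (SolrServerException e) {
        if (i == attempts) throw e;                     // connection reset, read timeout, ...
      }
      Thread.sleep(2000L * i);                          // simple backoff before the next try
    }
  }
}

If the 503s keep coming back after a few attempts we would still fail the job,
since at that point it looks like the leader problem itself rather than a
transient blip.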

Thx, Sudhakar.




On Wed, Dec 5, 2012 at 5:38 PM, Mark Miller <ma...@gmail.com> wrote:

> It kind of looks like the urls solrcloud is using are not accessible. When
> you go to the admin page and the cloud tab, can you access the urls it
> shows for each shard? That is, if you click one of the links or copy and
> paste the address into a web browser, does it work?
>
> You may have to explicitly set the host= in solr.xml if it's not auto
> detecting the right one. Make sure the ports look right too.
>
> > waitForReplicasToComeUp
> > INFO: Waiting until we see more replicas up: total=2 found=1
> > timeoutin=179999
>
> That happens when you stop the cluster and try to start it again - before
> a leader is chosen, it will wait for all known replicas for a shard to come
> up so that everyone can sync up and have a chance to be the best leader. So
> at this point it was only finding one of 2 known replicas and waiting for
> the second to come up. After a couple minutes (configurable) it will just
> continue anyway without the missing replica (if it doesn't show up).
>
> - Mark
>

Re: SolrCloud - ClusterState says we are the leader,but locally ...

Posted by Mark Miller <ma...@gmail.com>.
It kind of looks like the urls solrcloud is using are not accessible. When you go to the admin page and the cloud tab, can you access the urls it shows for each shard? That is, if you click one of the links or copy and paste the address into a web browser, does it work?
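
If clicking through the admin UI gets tedious, a quick sketch like the one below
will ping each core URL directly with SolrJ. The URLs are placeholders for
whatever the cloud tab actually shows, and it assumes the stock /admin/ping
handler is still configured in solrconfig.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.SolrPingResponse;

public class PingNodes {
  public static void main(String[] args) {
    // Replace with the exact URLs the cloud tab shows for each shard.
    String[] coreUrls = {
        "http://001:8080/solr/core1",
        "http://002:8080/solr/core1",
        "http://003:8080/solr/core1"
    };
    for (String url : coreUrls) {
      HttpSolrServer server = new HttpSolrServer(url);
      server.setConnectionTimeout(5000);
      server.setSoTimeout(10000);
      try {
        SolrPingResponse rsp = server.ping();
        System.out.println(url + " -> OK in " + rsp.getElapsedTime() + " ms");
      } catch (Exception e) {
        System.out.println(url + " -> FAILED: " + e);
      } finally {
        server.shutdown();
      }
    }
  }
}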

You may have to explicitly set the host= in solr.xml if it's not auto detecting the right one. Make sure the ports look right too.

> waitForReplicasToComeUp
> INFO: Waiting until we see more replicas up: total=2 found=1
> timeoutin=179999

That happens when you stop the cluster and try to start it again - before a leader is chosen, it will wait for all known replicas for a shard to come up so that everyone can sync up and have a chance to be the best leader. So at this point it was only finding one of 2 known replicas and waiting for the second to come up. After a couple minutes (configurable) it will just continue anyway without the missing replica (if it doesn't show up).

- Mark
