Posted to solr-user@lucene.apache.org by Li Ding <li...@bloomreach.com> on 2016/04/21 00:50:18 UTC

Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

Hi All,

We are using SolrCloud 4.6.1 and have recently observed the following
behavior: a Solr node in a SolrCloud cluster is up, but some of the cores
on that node are marked as down in ZooKeeper.  If those cores are part of
a multi-shard collection with a single replica, queries to that collection
fail.  However, when this happens, queries issued directly to the core
return 200 with correct results.  Once Solr gets into this state, the core
stays marked as down until we restart Solr.

Has anyone seen this behavior before?  Is there any way for Solr to get
out of this state on its own?

Thanks,

Li

Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

Posted by Don Bosco Durai <bo...@apache.org>.
Hi Li

I got into a very similar situation. GC was taking much longer than the configured ZooKeeper timeout. I had 3 nodes in the SolrCloud cluster, and very often my entire cluster would end up totally messed up. Increasing the ZooKeeper timeout eventually helped. Before that, I had a temporary workaround: running "rmr /solr/overseer/queue" in ZooKeeper (I don't recall whether I restarted Solr after that). I am not even sure this is the right thing to do, but it seemed to unblock me at the time. At least there were no negative effects.
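
In case it's useful, this is roughly what that workaround looks like from
the ZooKeeper CLI; the connect string and the /solr chroot are assumptions
for this kind of setup, and I'd treat it as a last resort:

  # connect to one of the ZooKeeper ensemble members
  ./zkCli.sh -server zk1.example.com:2181
  # inside the shell: recursively delete the overseer work queue
  rmr /solr/overseer/queue

The safer fix is raising the session timeout, e.g. starting Solr with
-DzkClientTimeout=30000 (the example solr.xml reads the zkClientTimeout
system property).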

Thanks

Bosco

Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

Posted by Erick Erickson <er...@gmail.com>.
Well, there have been lots of improvements since 4.6. You're right:
logically, once things come back up and are all reachable again, it
should be possible for the node to return to active on its own. There
have been situations where that doesn't happen, though, and fixes
have been implemented as those cases are identified....

You might try reloading the core from the core admin (that's
about the only thing you should try in SolrCloud from the
core admin screen)....
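
A minimal sketch of that via the CoreAdmin API, with placeholder host
and core names:

  curl "http://{internalIp}:8983/solr/admin/cores?action=RELOAD&core=product2_shard5_replica1"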

Best,
Erick

Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

Posted by Li Ding <li...@bloomreach.com>.
Hi Erick,

I don't have the GC log.  But after the GC finishes, shouldn't the ZK
ping succeed and the core go back to its normal state?  From the log I
posted, the sequence is:

1) Solr detects it can't connect to ZK and reconnects to ZK.
2) Solr marks all cores as down.
3) Solr recovers each core; some succeed, some fail.
4) After 30 minutes, the cores that failed are still marked as down.

So my question is: if GC were taking too long during that 30-minute
interval, all cores should have failed.  And GC can't have taken longer
than a minute, since all other requests were served successfully and the
next ZK ping should have brought the cores back to normal, right?  We
had an active monitor running at the same time, querying every core with
distrib=false, and every query succeeded.
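
For reference, the monitor's per-core check is essentially of this form
(host and core name are placeholders); distrib=false keeps the query on
the local core instead of fanning out across the collection:

  curl "http://{internalIp}:8983/solr/product2_shard5_replica1/select?q=*:*&rows=0&distrib=false"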

Thanks,

Li

Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

Posted by Erick Erickson <er...@gmail.com>.
One of the reasons this happens is if you have very
long GC cycles, longer than the Zookeeper "keep alive"
timeout. During a full GC pause, Solr is unresponsive and
if the ZK ping times out, ZK assumes the machine is
gone and you get into this recovery state.

So I'd collect GC logs and see if you have any
stop-the-world GC pauses that take longer than the ZK
timeout.
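
A sketch of the JDK 7-era flags for capturing those pauses (the log path
and start command are assumptions; adjust for your install):

  java -Xloggc:/var/log/solr/gc.log \
       -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
       -XX:+PrintGCApplicationStoppedTime \
       -jar start.jar

Then look for "Total time for which application threads were stopped"
entries that exceed your zkClientTimeout.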

See Mark Miller's primer on GC here:
https://lucidworks.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/

Best,
Erick

Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

Posted by Li Ding <li...@bloomreach.com>.
Thank you all for your help!

The ZooKeeper log has rolled over; this is from solr.log.

It looks like the connection between Solr and ZK was lost for some reason:

INFO  - 2016-04-21 12:37:57.536;
org.apache.solr.common.cloud.ConnectionManager; Watcher
org.apache.solr.common.cloud.ConnectionManager@19789a96
name:ZooKeeperConnection Watcher:{ZK HOSTS here} got event WatchedEvent
state:Disconnected type:None path:null path:null type:None

INFO  - 2016-04-21 12:37:57.536;
org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected

INFO  - 2016-04-21 12:38:24.248;
org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired
- starting a new one...

INFO  - 2016-04-21 12:38:24.262;
org.apache.solr.common.cloud.ConnectionManager; Waiting for client to
connect to ZooKeeper

INFO  - 2016-04-21 12:38:24.269;
org.apache.solr.common.cloud.ConnectionManager; Connected:true


Then Solr publishes all cores on the host as down.  I'll just list a few
of the cores here:

INFO  - 2016-04-21 12:38:24.269; org.apache.solr.cloud.ZkController;
publishing core=product1_shard1_replica1 state=down

INFO  - 2016-04-21 12:38:24.271; org.apache.solr.cloud.ZkController;
publishing core=collection1 state=down

INFO  - 2016-04-21 12:38:24.272; org.apache.solr.cloud.ZkController;
numShards not found on descriptor - reading it from system property

INFO  - 2016-04-21 12:38:24.289; org.apache.solr.cloud.ZkController;
publishing core=product2_shard5_replica1 state=down

INFO  - 2016-04-21 12:38:24.292; org.apache.solr.cloud.ZkController;
publishing core=product2_shard13_replica1 state=down


product1 has only one shard with one replica, and it manages to become
active successfully:

INFO  - 2016-04-21 12:38:26.383; org.apache.solr.cloud.ZkController;
Register replica - core:product1_shard1_replica1 address:http://
{internalIp}:8983/solr collection:product1 shard:shard1

WARN  - 2016-04-21 12:38:26.385; org.apache.solr.cloud.ElectionContext;
cancelElection did not find election node to remove

INFO  - 2016-04-21 12:38:26.393;
org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader
process for shard shard1

INFO  - 2016-04-21 12:38:26.399;
org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to
continue.

INFO  - 2016-04-21 12:38:26.399;
org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader -
try and sync

INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
replicas to http://{internalIp}:8983/solr/product1_shard1_replica1/

INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy; Sync
Success - now sync replicas to me

INFO  - 2016-04-21 12:38:26.399; org.apache.solr.cloud.SyncStrategy;
http://{internalIp}:8983/solr/product1_shard1_replica1/
has no replicas

INFO  - 2016-04-21 12:38:26.399;
org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader:
http://{internalIp}:8983/solr/product1_shard1_replica1/ shard1

INFO  - 2016-04-21 12:38:26.399; org.apache.solr.common.cloud.SolrZkClient;
makePath: /collections/product1/leaders/shard1

INFO  - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; We are
http://{internalIp}:8983/solr/product1_shard1_replica1/ and leader is
http://{internalIp}:8983/solr/product1_shard1_replica1/

INFO  - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; No
LogReplay needed for core=product1_replica1 baseURL=http://
{internalIp}:8983/solr

INFO  - 2016-04-21 12:38:26.412; org.apache.solr.cloud.ZkController; I am
the leader, no recovery necessary

INFO  - 2016-04-21 12:38:26.413; org.apache.solr.cloud.ZkController;
publishing core=product1_shard1_replica1 state=active


product2 has 15 shards with one replica each, but only two of those shards
live on this machine.  This is one of the failed shards; I never saw a
message marking the core product2_shard5_replica1 active:

INFO  - 2016-04-21 12:38:26.616; org.apache.solr.cloud.ZkController;
Register replica - product2_shard5_replica1 address:http://
{internalIp}:8983/solr collection:product2 shard:shard5

WARN  - 2016-04-21 12:38:26.618; org.apache.solr.cloud.ElectionContext;
cancelElection did not find election node to remove

INFO  - 2016-04-21 12:38:26.625;
org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader
process for shard shard5

INFO  - 2016-04-21 12:38:26.631;
org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to
continue.

INFO  - 2016-04-21 12:38:26.631;
org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader -
try and sync

INFO  - 2016-04-21 12:38:26.631; org.apache.solr.cloud.SyncStrategy; Sync
replicas to http://
{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/

INFO  - 2016-04-21 12:38:26.631; org.apache.solr.cloud.SyncStrategy; Sync
Success - now sync replicas to me

INFO  - 2016-04-21 12:38:26.632; org.apache.solr.cloud.SyncStrategy;
http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
has no replicas

INFO  - 2016-04-21 12:38:26.632;
org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader:
http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/
shard5

INFO  - 2016-04-21 12:38:26.632; org.apache.solr.common.cloud.SolrZkClient;
makePath: /collections/product2_shard5_replica1/leaders/shard5

INFO  - 2016-04-21 12:38:26.645; org.apache.solr.cloud.ZkController; We are
http://{internalIp}:8983/solr/product2_shard5_replica1_shard5_replica1/ and
leader is http://{internalIp}:8983/solr
product2_shard5_replica1_shard5_replica1/

INFO  - 2016-04-21 12:38:26.646;
org.apache.solr.common.cloud.ZkStateReader; Updating cloud state from
ZooKeeper...


Before I restarted this server, a bunch of queries failed for the
collection product2.  But I don't think that would affect the core status.


Do you have any idea why this particular core is never published as
active?  From the log, most steps complete except the very last one,
publishing the state to ZK.
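
If it helps to check what ZK has actually recorded: in 4.x the state of
every core lives in the single /clusterstate.json node, so something like
this from the ZooKeeper CLI should show the stuck replica with
"state":"down" (connect string and /solr chroot are assumptions here):

  ./zkCli.sh -server zk1.example.com:2181
  get /solr/clusterstate.json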


Thanks,


Li


Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

Posted by Rajesh Hazari <ra...@gmail.com>.
Hi Li,

Do you see timeouts like "CLUSTERSTATUS the collection time out:180s"?
If so, this may be related to
https://issues.apache.org/jira/browse/SOLR-7940,
and I would say either apply the patch or upgrade.
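
A quick way to check, assuming the log location of your install:

  grep -c "CLUSTERSTATUS the collection time out" /var/solr/logs/solr.log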


Thanks,
Rajesh

Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

Posted by YouPeng Yang <yy...@gmail.com>.
Hi,
   We have used Solr 4.6 for 2 years.  If you post more logs, maybe we
can fix it.

Re: Questions on SolrCloud core state, when will Solr recover a "DOWN" core to "ACTIVE" core.

Posted by danny teichthal <da...@gmail.com>.
Hi Li,
Some more info from your logs would help.  We also had a similar issue.
There were some SolrCloud-related bugs that were fixed in Solr 4.10.4 and
further in Solr 5.x.  I would suggest comparing your logs against the
defects in the 4.10.4 release notes to see if they match.
Also, send the relevant Solr/ZooKeeper parts of the logs to the mailing
list.

