Posted to users@solr.apache.org by Rohit Walecha <ro...@fnp.com> on 2023/01/18 09:13:13 UTC

Solr Restarting frequently.

Hi,

We have a 3-node *Solr (8.8.0)* cluster, deployed in multiple environments,
which is connected to a 3-node *ZooKeeper (3.6.2)* cluster. For the last few
months we have been facing frequent restarts of the SolrCloud nodes. We tried
to debug this, and while looking into the logs and other stats we see that
the node which restarted logs the following:

*1. *
2023-01-04 21:50:09.186 WARN (zkConnectionManagerCallback-15-thread-1) [ ]
o.a.s.c.c.ConnectionManager Watcher
org.apache.solr.common.cloud.ConnectionManager@731cf36d name:
ZooKeeperConnection
Watcher:apache-solrcloud-zookeeper-0.apache-solrcloud-zookeeper-headless.production.svc.cluster.local:2181,apache-solrcloud-zookeeper-1.apache-solrcloud-zookeeper-headless.production.svc.cluster.local:2181,apache-solrcloud-zookeeper-2.apache-solrcloud-zookeeper-headless.production.svc.cluster.local:2181/
got event WatchedEvent state:Disconnected type:None path:null path: null
type: None
which probably means the *event state is either disconnected or expired*. It
is followed by this warning:
WARN (zkConnectionManagerCallback-13-thread-1) [ ]
o.a.s.c.c.ConnectionManager zkClient has disconnected



*2.*
Client session timed out, have not heard from server in 30018ms for
sessionid 0x1000091fcbe0001
This is a session timeout from the ZkClient inside Solr.

*3.*
2023-01-04 21:50:10.685 INFO (ShutdownMonitor) [ ]
o.a.s.c.ZkController Publish this node as DOWN...
2023-01-04 21:50:10.685 INFO (ShutdownMonitor) [ ]
o.a.s.c.ZkController Publish
node=apache-solrcloud-0.apache-solrcloud-headless.production:8983_solr as
DOWN

Attached: *050120223-solr-cloud-0.log*



*Meanwhile, the ZooKeeper node logs the following around the time at which
the Solr node gets restarted:*

2023-01-15 07:11:44,349 [myid:2] - WARN
[NIOWorkerThread-2:ZooKeeperServer@1384] - Connection request from old
client /10.70.26.0:54584; will be dropped if server is in r-o mode
2023-01-15 07:11:44,350 [myid:2] - INFO
[CommitProcessor:2:LearnerSessionTracker@116] - Committing global
session 0x200042f19cf130f
2023-01-15 07:11:44,352 [myid:2] - INFO
[RequestThrottler:QuorumZooKeeperServer@159] - Submitting global
closeSession request for session 0x200042f19cf130f


We are now at a point where *we know what pushes the node down when a
Solr node gets restarted: as the log at [#2] shows*, the client session
times out, and that is the session established between the Solr node and
ZooKeeper.

While debugging this, we also went through a series of issues reported
against the *ZooKeeper* version we are using. In short, they describe
slow leader election, ZooKeeper nodes getting restarted, and the whole
ZooKeeper cluster going down when a leader becomes unhealthy, is
stopped, or is restarted. The new leader election then takes a long
time, and client sessions time out during that period.
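
A minimal sketch, in Python, of how the timeout message quoted at [#2] could be pulled out of a Solr log so that the silent interval and session ID can be lined up against ZooKeeper events. The function name `find_session_timeouts` is hypothetical; the pattern is taken only from the message text quoted above, so other log layouts may need a different expression.

```python
import re

# Pattern for the ZooKeeper client message quoted at [#2].
TIMEOUT_RE = re.compile(
    r"have not heard from server in (\d+)ms for sessionid (0x[0-9a-fA-F]+)"
)


def find_session_timeouts(log_text):
    """Return (sessionid, silent_ms) pairs for every timeout message found."""
    # Collapse whitespace so a message wrapped across lines still matches.
    flat = " ".join(log_text.split())
    return [(sid, int(ms)) for ms, sid in TIMEOUT_RE.findall(flat)]


# The message exactly as it appears in the post, wrapped over two lines.
sample = """Client session timed out, have not heard from server in 30018ms for
sessionid 0x1000091fcbe0001"""

print(find_session_timeouts(sample))
```

Running the same function over the ZooKeeper-side logs is not useful, since that message is written by the client; it is meant for the Solr logs only.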

We tried to replicate this in a local environment by setting up a Solr
and ZooKeeper cluster and forcefully restarting/stopping the leader
ZooKeeper node. The result is in *have-not-heard-back-local-cluster.log*,
and we could reproduce [#2].

We are seeking help to find out the possible reason for these frequent
restarts of the SolrCloud nodes.

Regards.

Re: Solr Restarting frequently.

Posted by Mikhail Khludnev <mk...@apache.org>.
It seems that ZooKeeper stopped for some reason. It's worth clarifying
where the particular ZK instances are running and checking that they are
standalone and not embedded in other processes such as Solr. Then it's worth
focusing on stabilizing the ZK ensemble first; after that, let the Solr nodes
connect to it one by one while watching CPU/RAM. Once Solr is running stably,
it makes sense to try turning Kafka on.
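
One way to check each ZK instance while stabilizing the ensemble is ZooKeeper's four-letter-word `srvr` command, which reports whether a server is currently a leader, a follower, or standalone. The sketch below is an assumption-laden illustration, not something taken from this thread: the helper names `four_letter_word` and `parse_mode` are hypothetical, the sample reply is shortened and illustrative rather than captured from this cluster, and other commands such as `mntr` would need to be enabled through `4lw.commands.whitelist` on the server.

```python
import socket


def four_letter_word(host, port, command="srvr", timeout=5.0):
    """Send a ZooKeeper four-letter-word command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        # The server closes the connection once the reply has been sent.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_reply):
    """Extract the 'Mode:' value (e.g. leader, follower, standalone) or None."""
    for line in srvr_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


# Illustrative, shortened reply; not captured from the cluster in this thread.
sample_reply = "Zookeeper version: 3.6.2\nConnections: 4\nMode: follower\nNode count: 120\n"
print(parse_mode(sample_reply))
```

Against a live ensemble this would be called as `parse_mode(four_letter_word(host, 2181))` for each ZK host. A healthy 3-node ensemble should show exactly one leader; a server that does not answer or reports no mode during a Solr restart window would point at the ensemble rather than at Solr.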

On Thu, Jan 19, 2023 at 12:15 AM Rohit Walecha <ro...@fnp.com> wrote:

> [image: Screenshot from 2023-01-18 19-06-33.png]
>
> Restart pattern is above.

-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!

Re: Solr Restarting frequently.

Posted by Rohit Walecha <ro...@fnp.com>.
We are also using the solr-operator in Kubernetes.

On Mon, Jan 23, 2023 at 12:28 PM Rohit Walecha <ro...@fnp.com> wrote:

> Yes, Vincenzo it is deployed in kubernetes...any suggestions?

Re: Solr Restarting frequently.

Posted by Rohit Walecha <ro...@fnp.com>.
Yes, Vincenzo, it is deployed in Kubernetes. Any suggestions?

On Mon, Jan 23, 2023 at 12:22 PM Rohit Walecha <ro...@fnp.com> wrote:

> Hi Shawn,
>
> Will try applying the changes..you have suggested..and get back on this.

Re: Solr Restarting frequently.

Posted by Rohit Walecha <ro...@fnp.com>.
Hi Shawn,

I will try applying the changes you have suggested and get back to you on this.

On Thu, Jan 19, 2023 at 4:08 PM Rohit Walecha <ro...@fnp.com> wrote:

> We have multiple collections inside our cluster(3 node), but we have some
> collections having replication factor 1 and some collections having
> replication factor 2..should this be impacting our nodes..and sending them
> in recovery state..and restart !!

Re: Solr Restarting frequently.

Posted by Rohit Walecha <ro...@fnp.com>.
We have multiple collections in our 3-node cluster. Some collections have
a replication factor of 1 and some have a replication factor of 2. Could
this be impacting our nodes, sending them into the recovery state and
causing the restarts?
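
To see which replicas are actually in recovery or down, the JSON returned by the Collections API `CLUSTERSTATUS` action can be walked collection by collection. This is a minimal sketch assuming the usual `cluster` / `collections` / `shards` / `replicas` layout of that response; the function name `inactive_replicas` and the collection, shard, and replica names in the sample are hypothetical.

```python
def inactive_replicas(cluster_status):
    """Return (collection, shard, replica, state) for replicas not 'active'."""
    found = []
    collections = cluster_status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state")
                if state != "active":
                    found.append((coll_name, shard_name, replica_name, state))
    return found


# Hypothetical, hand-written response fragment for illustration only.
sample_status = {
    "cluster": {
        "collections": {
            "products": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {"state": "active"},
                            "core_node2": {"state": "recovering"},
                        }
                    }
                }
            }
        }
    }
}

print(inactive_replicas(sample_status))
```

With a replication factor of 1, a shard whose only replica shows up in this list has no other copy to serve from while that node is restarting.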

On Wed, Jan 18, 2023 at 7:07 PM Rohit Walecha <ro...@fnp.com> wrote:

> [image: Screenshot from 2023-01-18 19-06-33.png]
>
> Restart pattern is above.

Re: Solr Restarting frequently.

Posted by Vincenzo D'Amore <v....@gmail.com>.
Images are automatically removed by the mailing list.

On Wed, 18 Jan 2023 at 22:16, Rohit Walecha <ro...@fnp.com> wrote:

> [image: Screenshot from 2023-01-18 19-06-33.png]
>
> Restart pattern is above.
>
--
Vincenzo D'Amore

Re: Solr Restarting frequently.

Posted by Rohit Walecha <ro...@fnp.com>.
[image: Screenshot from 2023-01-18 19-06-33.png]

Restart pattern is above.


Re: Solr Restarting frequently.

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/18/23 02:13, Rohit Walecha wrote:
> We have a 3 node *solr(8.8.0)* cluster deployed on multiple environments 
> which is connected to a 3 node *zookeeper(3.6.2)* cluster And, we have 
> been facing frequent restarts of solr cloud nodes since the last few 
> months..tried to debug this and while looking into the logs and other 
> stats we have been seeing that the node which has restarted says :

Out of the box, Solr does NOT have any built-in functionality that will 
restart it if it goes down.  That must have been added.

If a system is sized appropriately and doesn't have any issues with the 
hardware or the system software (including Java), then Solr should never 
crash.

Most of the time when Solr actually does go down, it is for one of two 
reasons:

1) The operating system's "out of memory killer" process was triggered 
because of system memory pressure, which found the largest memory 
consuming program and killed it.  Fixing that often requires adding 
memory to the server.

2) While running Solr, Java encountered an "OutOfMemoryError" exception. 
For 8.8.0, if you're running on a non-Windows platform, Solr includes 
functionality that makes it kill itself on OOME.  Starting in 9.2.0, 
which is not yet released, that functionality comes to Solr running on 
Windows too.  Note that there are several different resource depletion 
conditions that result in OOME, and not all of them are actually related 
to memory.  Very often when OOME is thrown, Solr will not actually log 
the exception.  In 9.2.0 the reason for the OOME will always be logged.
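
As a rough way to tell these two causes apart from a plain ZooKeeper 
session timeout, a small script could scan kernel log output and solr.log 
for tell-tale strings.  This is only a sketch: the function names are made 
up for illustration, and the patterns are assumptions about typical Linux 
kernel and JVM wording, plus the session timeout text quoted in this 
thread.  Since the OOME is often not logged on 8.8.0, a zero count for it 
does not rule that cause out.

```python
import re

# Hypothetical patterns; adjust them to what your own logs actually contain.
PATTERNS = {
    # Assumed wording of the Linux kernel's OOM killer messages.
    "os_oom_killer": re.compile(r"Out of memory: Kill(ed)? process|oom-kill", re.IGNORECASE),
    # Java's OutOfMemoryError, whatever the specific resource was.
    "java_oome": re.compile(r"java\.lang\.OutOfMemoryError"),
    # The ZooKeeper client message quoted in this thread.
    "zk_session_timeout": re.compile(r"Client session timed out"),
}


def classify_line(line):
    """Return the label of the first matching pattern, or None."""
    for label, pattern in PATTERNS.items():
        if pattern.search(line):
            return label
    return None


def summarize(lines):
    """Count how many lines fall under each label."""
    counts = {label: 0 for label in PATTERNS}
    for line in lines:
        label = classify_line(line)
        if label is not None:
            counts[label] += 1
    return counts
```

Something like `summarize(open("solr.log", errors="replace"))` would give 
per-cause counts for one file; the same call works on saved kernel log 
output.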

I don't really know what the zookeeper logs are saying, but Solr should 
never die if everything is sized appropriately.  That is probably an 
indication of a problem that needs correcting.

The changes coming in version 9.2.0 can be applied to your 8.8.0 version 
with the info in the patch on the following issue:

https://issues.apache.org/jira/browse/SOLR-8803

You would only need the changes to bin/solr or bin/solr.cmd.  The code 
changes are not necessary for the new functionality.  They are there so 
info about the error log is included in solr.log.

Thanks,
Shawn

Re: Solr Restarting frequently.

Posted by Vincenzo D'Amore <v....@gmail.com>.
Hi, is this solrcloud deployed in kubernetes?


-- 
Vincenzo D'Amore