Posted to solr-user@lucene.apache.org by Daniel Collins <da...@gmail.com> on 2013/05/24 10:07:03 UTC

zk disconnects and failure to retry?

Had a scenario on a dev system here that has me confused.

We have a simple Solr cloud (dev) system running 4.3, 4 shards, running on
2 machines (2 instances per machine), 2 ZKs (external) and no replicas (or
1 replica depending on your definition, we only have 1 instance of each
shard!)

Yes, we have no backups, and we only have 2 ZKs, which is bad, but it's a dev
system, so not mission critical.
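
For anyone wondering why 2 ZKs is bad: ZooKeeper needs a strict majority of
the ensemble to be up, so with 2 servers losing either one leaves no quorum.
A rough sketch of that arithmetic follows; the class and method names are
made up for illustration and are not part of ZooKeeper or Solr.

```java
// Illustrative only: majority-quorum arithmetic for a ZooKeeper ensemble.
// ZkQuorum, quorumSize and tolerableFailures are hypothetical names.
final class ZkQuorum {
    private ZkQuorum() {}

    /** Smallest number of servers that forms a strict majority. */
    static int quorumSize(int ensembleSize) {
        if (ensembleSize < 1) {
            throw new IllegalArgumentException("ensembleSize must be >= 1");
        }
        return ensembleSize / 2 + 1;
    }

    /** How many servers can be lost while a majority still remains. */
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }
}
```

So tolerableFailures(2) is 0, the same as a single ZK, while 3 servers would
survive the loss of one.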

What I saw last night was that various shards disconnected from ZK (still
trying to work out why that was in itself), and some reconnected, some
didn't.  The ones that failed eventually had this error:

2013-05-23 14:27:38,876 ERROR [main-EventThread] o.a.s.c.c.DefaultConnectionStrategy [SolrException.java:119] Reconnect to ZooKeeper failed:java.lang.RuntimeException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper xxx1:11600,xxx2:11600 within 30000 ms

2013-05-23 14:27:38,877 INFO [main-EventThread] o.a.s.c.c.DefaultConnectionStrategy [DefaultConnectionStrategy.java:51] Reconnect to ZooKeeper failed
2013-05-23 14:27:38,877 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:130] Connected:false
2013-05-23 14:27:38,877 INFO [main-EventThread] o.a.z.ClientCnxn [ClientCnxn.java:509] EventThread shut down

So my question is: why don't they keep retrying?  Yes, I could increase the
timeout, but that feels like the wrong action.  If the core has failed to
connect to ZK, shouldn't it keep trying to re-enter the cloud?  Why does it
"give up"?  From that point onwards, those cores just give errors during
updates:

2013-05-23 14:30:39,605 ERROR [qtp21465667-1439] o.a.s.c.SolrCore [SolrException.java:108] org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled.
    at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:999)

Now I understand the reason for the errors, but I'm surprised it didn't try to
fix itself.  I eventually bounced the core and it reconnected, but why does
it need a manual fix?
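
To be concrete about the behaviour I expected, here is a rough sketch: keep
retrying with a capped backoff instead of stopping after one timed-out
attempt. This is not Solr's actual reconnect code; RetryingReconnect, its
method, and the tryConnect callback are hypothetical names used only for
illustration.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of "retry until connected"; not Solr's real code.
final class RetryingReconnect {
    private RetryingReconnect() {}

    /**
     * Calls tryConnect until it returns true, sleeping between attempts.
     * The delay doubles after each failed attempt, capped at maxDelayMs.
     * Returns the number of attempts that were made.
     */
    static int reconnectUntilConnected(BooleanSupplier tryConnect,
                                       long initialDelayMs,
                                       long maxDelayMs) {
        if (initialDelayMs < 0 || maxDelayMs < initialDelayMs) {
            throw new IllegalArgumentException("need 0 <= initialDelayMs <= maxDelayMs");
        }
        long delay = initialDelayMs;
        int attempts = 0;
        while (true) {
            attempts++;
            if (tryConnect.getAsBoolean()) {
                return attempts; // connected, stop retrying
            }
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop; the caller asked us to quit.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while reconnecting", e);
            }
            delay = Math.min(delay * 2, maxDelayMs);
        }
    }
}
```

Called with something like reconnectUntilConnected(attempt, 1000, 30000),
where attempt wraps a single connection try, the node would only stay out of
the cloud for as long as ZK is actually unreachable.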

Re: zk disconnects and failure to retry?

Posted by Erick Erickson <er...@gmail.com>.

Oh yes, lots in the past 8 months; the JIRAs can give details.

Best,
Erick

On Thu, Jan 22, 2015 at 4:10 PM, deniz <de...@gmail.com> wrote:

> bumping an old entry... but are there any improvements on this issue?
>
>
>
> -----
> He's smart but doesn't work... If he worked, he'd do it...
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/zk-disconnects-and-failure-to-retry-tp4065877p4181370.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: zk disconnects and failure to retry?

Posted by deniz <de...@gmail.com>.

bumping an old entry... but are there any improvements on this issue?



-----
He's smart but doesn't work... If he worked, he'd do it...