Posted to solr-user@lucene.apache.org by Andre Bois-Crettez <an...@kelkoo.com> on 2013/10/15 14:37:49 UTC

Solr 4.4: with SolrCloud, a core sometimes goes down after reconnecting to ZooKeeper and never comes back alive

Hello all,


We hit this problem twice in 4 days, on only one of our 14 servers (2
shards, 7 replicas each) running Solr 4.4: after a successful
reconnection to ZooKeeper (triggered by "Connection expired - starting
a new one"), the core sometimes stays down and never comes back, and we
have to restart the Solr instance to bring it back alive.

Most of the time reconnecting to ZK causes no problem; sometimes a
LogReplay or recovery process runs and quickly brings the core back
alive.
But a core occasionally going down without any error in the logs (either
on the core itself or on the leader) is worrying.

From the logs for the problematic server and the leader
( http://pastebin.com/CvcEQtwe ), it looks like the core is happy to
publish itself as 'down' as soon as the connection to ZK is
reestablished, but then never tries to go back to 'alive'.
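
To spot a core stuck this way without reading the logs, the published
state can be checked in the cluster state. Below is a minimal sketch,
assuming the Solr 4.x clusterstate.json layout (collection -> "shards"
-> "replicas" -> "state") and assuming a healthy replica is published
as "active". The function name and the sample data are hypothetical,
and how the JSON is fetched from ZooKeeper is left out.

```python
import json


def non_active_replicas(clusterstate_json):
    """List replicas whose published state is not "active".

    Assumes the layout collection -> "shards" -> "replicas" -> "state";
    this layout is an assumption, not something confirmed in this thread.
    """
    cluster = json.loads(clusterstate_json)
    stuck = []
    for collection, cdata in cluster.items():
        for shard, sdata in cdata.get("shards", {}).items():
            for replica, rdata in sdata.get("replicas", {}).items():
                if rdata.get("state") != "active":
                    stuck.append((collection, shard, replica,
                                  rdata.get("node_name"), rdata.get("state")))
    return stuck


# Hypothetical sample: one healthy replica and one published as 'down'.
sample = json.dumps({
    "collection1": {"shards": {"shard1": {"replicas": {
        "core_node1": {"state": "active", "node_name": "solr-01:8983_solr"},
        "core_node2": {"state": "down", "node_name": "solr-16:8983_solr"},
    }}}}
})

for entry in non_active_replicas(sample):
    print(entry)
```

Run periodically, a check like this would at least tell us when a core
has stayed 'down' long after the reconnection.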

Has anyone already seen this kind of behaviour?
The only vaguely related post I found was this one:
http://lucene.472066.n3.nabble.com/SolrCloud-looses-connection-to-Zookeeper-but-stays-down-td4093083.html
But it seems to have a different cause, as we have no WARN in the
logs when the core goes down.

I have not seen any related bugfix in the 4.5 release either.

What is surprising is that only this problematic server has more than 30
"zkClient has disconnected" lines in its logs for the past 6 days (0 for
all the other servers). We did not find any difference between this
server (solr-16) and the others; the network interface does not show any
errors, for example.
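
For anyone wanting to make the same comparison, here is a minimal
sketch of counting these lines per server. It assumes the log lines of
each server are already available as an iterable of strings; the
function name, the server names other than solr-16, and the sample log
lines are hypothetical. Only the "zkClient has disconnected" text comes
from our logs.

```python
from collections import Counter

# Message text quoted above; the rest of the log line format is not assumed.
DISCONNECT_MARKER = "zkClient has disconnected"


def count_disconnects(log_lines_by_server):
    """Map each server name to its number of disconnect log lines.

    log_lines_by_server: dict of server name -> iterable of log lines.
    """
    counts = Counter()
    for server, lines in log_lines_by_server.items():
        counts[server] = sum(1 for line in lines if DISCONNECT_MARKER in line)
    return counts


# Hypothetical sample data, for illustration only.
sample_logs = {
    "solr-16": [
        "INFO  zkClient has disconnected",
        "INFO  some unrelated line",
        "INFO  zkClient has disconnected",
    ],
    "solr-01": ["INFO  some unrelated line"],
}

for server, count in sorted(count_disconnects(sample_logs).items()):
    print(server, count)
```

In practice the lines would be read from each server's log files for
the period of interest.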


1) Could we increase zkClientTimeout from 15000 to 60000 to
avoid having too many ZK disconnects?
2) Are there other ways to trace and resolve why only one server is
affected by expiration of its ZooKeeper connection?
3) It seems there is a bug in some cases of the reconnection process
that prevents the core from coming back alive; can anyone confirm?

André

--
André Bois-Crettez

Software Architect
Search Developer
http://www.kelkoo.com/

