Posted to dev@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2013/08/06 21:56:32 UTC

Interesting failure scenario, SolrCloud and ZK nodes on different times

I've become aware of a situation I thought I'd pass along. A SolrCloud
installation had several ZK nodes that had very significantly offset times.
They were being hit with the "ClusterState says we are the leader, but
locally we don't think we are" error when nodes were recovering. Whether
this problem is now taken care of in recent Solr releases I don't know
(I haven't seen it go by on the users' list for quite a while).

When the times were coordinated, many of the problems with recovery went
away. We're trying to reconstruct the scenario from memory, but it prompted
me to pass the incident along in case it sparked any thoughts. Specifically,
I wonder if there's anything that comes to mind if the ZK nodes are
significantly out of sync with each other time-wise.

FWIW,
Erick

Re: Interesting failure scenario, SolrCloud and ZK nodes on different times

Posted by Grant Ingersoll <gs...@apache.org>.

I seem to recall seeing this on my cluster when we didn't have clocks in sync, but perhaps my memory is fuzzy as well.

-Grant

On Aug 7, 2013, at 7:41 AM, Erick Erickson <er...@gmail.com> wrote:

> Well, we're reconstructing a chain of _possibilities_ post-mortem,
> so there's not much I can say for sure. Mostly just throwing this 
> out there in case it sparks some "aha" moments. Not knowing
> ZK well, anything I say is speculation.
> 
> But I speculate that this isn't really the root of the problem given
> that we haven't been seeing the "ClusterState says we are the leader..."
> error go by the user lists for a while. It may well be a coincidence. The
> place that this happened reported that the problem "seemed to 
> be better" after adjusting the ZK nodes' times. I know when I
> reconstruct events like this I'm never sure about cause and
> effect since I'm usually doing several things at once.
> 
> Erick
> 
> 
> On Tue, Aug 6, 2013 at 5:51 PM, Chris Hostetter <ho...@fucit.org> wrote:
> 
> : > When the times were coordinated, many of the problems with recovery went
> : > away. We're trying to reconstruct the scenario from memory, but it
> : > prompted me to pass the incident in case it sparked any thoughts.
> : > Specifically, I wonder if there's anything that comes to mind if the ZK
> : > nodes are significantly out of synch with each other time-wise.
> :
> : Does this mean that ntp or other strict time synchronization is important for
> : SolrCloud?  I strive for this anyway, just to ensure that when I'm researching
> : log files between two machines that I can match things up properly.
> 
> I don't know if/how Solr/ZK is affected by having machines with clocks out
> of sync, but I do remember seeing discussions a while back about weird
> things happening to ZK client apps *while* time adjustments are taking
> place to get back in sync.
> 
> IIRC: as the local clock starts accelerating and jumping ahead in
> increments to "correct" itself with ntp, those jumps can confuse the
> ZK code into thinking it's been waiting a lot longer than it really
> has for the zk heartbeat (or whatever it's called), and that can trigger a
> timeout situation.
> 
> 
> -Hoss
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
> 
> 

--------------------------------------------
Grant Ingersoll | @gsingers
http://www.lucidworks.com

Re: Interesting failure scenario, SolrCloud and ZK nodes on different times

Posted by Erick Erickson <er...@gmail.com>.

Well, we're reconstructing a chain of _possibilities_ post-mortem,
so there's not much I can say for sure. Mostly just throwing this
out there in case it sparks some "aha" moments. Not knowing
ZK well, anything I say is speculation.

But I speculate that this isn't really the root of the problem given
that we haven't been seeing the "ClusterState says we are the leader..."
error go by the user lists for a while. It may well be a coincidence. The
place that this happened reported that the problem "seemed to
be better" after adjusting the ZK nodes' times. I know when I
reconstruct events like this I'm never sure about cause and
effect since I'm usually doing several things at once.

Erick

On Tue, Aug 6, 2013 at 5:51 PM, Chris Hostetter <ho...@fucit.org> wrote:

>
> : > When the times were coordinated, many of the problems with recovery went
> : > away. We're trying to reconstruct the scenario from memory, but it
> : > prompted me to pass the incident in case it sparked any thoughts.
> : > Specifically, I wonder if there's anything that comes to mind if the ZK
> : > nodes are significantly out of synch with each other time-wise.
> :
> : Does this mean that ntp or other strict time synchronization is important for
> : SolrCloud?  I strive for this anyway, just to ensure that when I'm researching
> : log files between two machines that I can match things up properly.
>
> I don't know if/how Solr/ZK is affected by having machines with clocks out
> of sync, but I do remember seeing discussions a while back about weird
> things happening to ZK client apps *while* time adjustments are taking
> place to get back in sync.
>
> IIRC: as the local clock starts accelerating and jumping ahead in
> increments to "correct" itself with ntp, those jumps can confuse the
> ZK code into thinking it's been waiting a lot longer than it really
> has for the zk heartbeat (or whatever it's called), and that can trigger a
> timeout situation.
>
>
> -Hoss
>

Re: Interesting failure scenario, SolrCloud and ZK nodes on different times

Posted by Chris Hostetter <ho...@fucit.org>.

: > When the times were coordinated, many of the problems with recovery went
: > away. We're trying to reconstruct the scenario from memory, but it
: > prompted me to pass the incident in case it sparked any thoughts.
: > Specifically, I wonder if there's anything that comes to mind if the ZK
: > nodes are significantly out of synch with each other time-wise.
: 
: Does this mean that ntp or other strict time synchronization is important for
: SolrCloud?  I strive for this anyway, just to ensure that when I'm researching
: log files between two machines that I can match things up properly.

I don't know if/how Solr/ZK is affected by having machines with clocks out
of sync, but I do remember seeing discussions a while back about weird
things happening to ZK client apps *while* time adjustments are taking
place to get back in sync.

IIRC: as the local clock starts accelerating and jumping ahead in
increments to "correct" itself with ntp, those jumps can confuse the
ZK code into thinking it's been waiting a lot longer than it really
has for the zk heartbeat (or whatever it's called), and that can trigger a
timeout situation.
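
To make that mechanism concrete, here is a minimal sketch. It is not
ZooKeeper's code; HeartbeatMonitor, FakeClocks, the 10 second timeout and
the 30 second clock step are all made up for illustration. It only shows
that an elapsed-time check built on the wall clock (the kind of value
System.currentTimeMillis() returns) can report a timeout after the clock is
stepped forward, while the same check against a monotonic source (the kind
of value System.nanoTime() is meant for) does not.

```java
import java.util.function.LongSupplier;

public class ClockJumpDemo {

    /** Hypothetical heartbeat monitor; NOT ZooKeeper's implementation. */
    static final class HeartbeatMonitor {
        private final LongSupplier clockMillis; // injected time source
        private final long timeoutMillis;
        private long lastHeartbeat;

        HeartbeatMonitor(LongSupplier clockMillis, long timeoutMillis) {
            this.clockMillis = clockMillis;
            this.timeoutMillis = timeoutMillis;
            this.lastHeartbeat = clockMillis.getAsLong();
        }

        /** Record that a heartbeat arrived "now". */
        void heartbeat() {
            lastHeartbeat = clockMillis.getAsLong();
        }

        /** True if more than timeoutMillis appear to have elapsed. */
        boolean timedOut() {
            return clockMillis.getAsLong() - lastHeartbeat > timeoutMillis;
        }
    }

    /** Simulated clocks, so the example is deterministic. */
    static final class FakeClocks {
        private long monotonic = 0;          // only ever moves with real time
        private long wallOffset = 1_000_000; // wall clock = monotonic + offset

        void advance(long ms) { monotonic += ms; }        // real time passes
        void stepWallClock(long ms) { wallOffset += ms; } // time sync steps the clock
        long wall() { return monotonic + wallOffset; }
        long mono() { return monotonic; }
    }

    /** Returns { wallBasedTimedOut, monotonicTimedOut }. */
    static boolean[] simulate() {
        FakeClocks clocks = new FakeClocks();
        HeartbeatMonitor wallBased = new HeartbeatMonitor(clocks::wall, 10_000);
        HeartbeatMonitor monoBased = new HeartbeatMonitor(clocks::mono, 10_000);

        clocks.advance(2_000);        // only 2 s of real time pass...
        clocks.stepWallClock(30_000); // ...but the wall clock is stepped ahead 30 s

        return new boolean[] { wallBased.timedOut(), monoBased.timedOut() };
    }

    public static void main(String[] args) {
        boolean[] result = simulate();
        System.out.println("wall-clock monitor timed out: " + result[0]);
        System.out.println("monotonic monitor timed out:  " + result[1]);
    }
}
```

In this simulation only 2 seconds really pass, yet the wall-clock monitor
sees 32 seconds and reports a timeout; the monotonic one does not. Whether
any given ZK or Solr release measures its timeouts this way is not
something this sketch claims.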


-Hoss


Re: Interesting failure scenario, SolrCloud and ZK nodes on different times

Posted by Shawn Heisey <so...@elyograg.org>.

On 8/6/2013 1:56 PM, Erick Erickson wrote:
> I've become aware of a situation I thought I'd pass along. A SolrCloud
> installation had several ZK nodes that had very significantly offset
> times. They were being hit with the "ClusterState says we are the
> leader, but locally we don't think we are" error when nodes were
> recovering. Of course whether this problem is now taken care of with
> recent Solr releases (I haven't seen this go by the users' list for
> quite a while) I don't quite know.
>
> When the times were coordinated, many of the problems with recovery went
> away. We're trying to reconstruct the scenario from memory, but it
> prompted me to pass the incident in case it sparked any thoughts.
> Specifically, I wonder if there's anything that comes to mind if the ZK
> nodes are significantly out of synch with each other time-wise.

Does this mean that ntp or other strict time synchronization is 
important for SolrCloud?  I strive for this anyway, just to ensure that 
when I'm researching log files between two machines that I can match 
things up properly.

Thanks,
Shawn

