You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@qpid.apache.org by Igor Natanzon <ig...@gmail.com> on 2019/10/15 13:59:22 UTC
Clock delta error / Broker J 7.1.3

We have been getting a weird virtual host / cluster error in a BDB-HA setup.
2019-10-10 00:48:40,330 ERROR [Group-Change-Learner:FIX_VHNG:STO_1_FIX_VHN]
(o.a.q.s.u.ServerScopedRuntimeException) - Exception on master check
com.sleepycat.je.EnvironmentFailureException: (JE 7.4.5) Environment must
be closed, caused by: com.sleepycat.je.EnvironmentFailureException:
Environment invalid because of previous exception: (JE 7.4.5)
STO_1_FIX_VHN(2):/data/qpid-work1/STO_1_FIX_VHN/conf
ig Clock delta: 29125 ms. between Feeder: RCO_1_FIX_VHN and this Replica
exceeds max permissible delta: 2000 ms. HANDSHAKE_ERROR: Error during the
handshake between two nodes. Some validity or compatibility check failed,
preventing further communication be
tween the nodes. Environment is invalid and must be closed. Originally
thrown by HA thread: UNKNOWN STO_1_FIX_VHN(2) Originally thrown by HA
thread: UNKNOWN STO_1_FIX_VHN(2)
        at
com.sleepycat.je.EnvironmentFailureException.wrapSelf(EnvironmentFailureException.java:228)
        at
com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1766)
        at
com.sleepycat.je.dbi.EnvironmentImpl.checkOpen(EnvironmentImpl.java:1775)
        at com.sleepycat.je.Environment.checkOpen(Environment.java:2473)
        at com.sleepycat.je.DbInternal.checkOpen(DbInternal.java:105)
        at
com.sleepycat.je.rep.ReplicatedEnvironment.checkOpen(ReplicatedEnvironment.java:1052)
        at
com.sleepycat.je.rep.ReplicatedEnvironment.getState(ReplicatedEnvironment.java:764)
        at
org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade$RemoteNodeStateLearner.executeDatabasePingerOnNodeChangesIfMaster(ReplicatedEnvironmentFacade.java:2281)
        at
org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade$RemoteNodeStateLearner.call(ReplicatedEnvironmentFacade.java:2047)
        at
org.apache.qpid.server.store.berkeleydb.replication.ReplicatedEnvironmentFacade$RemoteNodeStateLearner.call(ReplicatedEnvironmentFacade.java:2017)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.sleepycat.je.EnvironmentFailureException: Environment
invalid because of previous exception: (JE 7.4.5)
STO_1_FIX_VHN(2):/data/qpid-work1/STO_1_FIX_VHN/config Clock delta: 29125
ms. between Feeder: RCO_1_FIX_VHN and this Replica exceeds max
permissible delta: 2000 ms. HANDSHAKE_ERROR: Error during the handshake
between two nodes. Some validity or compatibility check failed, preventing
further communication between the nodes. Environment is invalid and must be
closed. Originally thrown by HA t
hread: UNKNOWN STO_1_FIX_VHN(2) Originally thrown by HA thread: UNKNOWN
STO_1_FIX_VHN(2)
        at
com.sleepycat.je.rep.stream.ReplicaFeederHandshake.checkClockSkew(ReplicaFeederHandshake.java:424)
        at
com.sleepycat.je.rep.stream.ReplicaFeederHandshake.execute(ReplicaFeederHandshake.java:261)
        at
com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:691)
        at
com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:474)
        at
com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:409)
        at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1873)

I am not sure if the issue is, but there is no network delay between the
nodes, and the clocks match as well (ping latency between hosts is about
2ms on average).

The error was continuously repeating for many hours (though the Clock delta
of 29125 was always the same).

  STO_1_FIX_VHN  was 1st backup, and   RCO_1_FIX_VHN  was master at the
time.

Eventually replicate broke and I had to restart each of the 3 nodes one at
a time in order to recover the cluster. Is there anything I can look at to
find the issue?
Thanks!