Posted to dev@pegasus.apache.org by GitBox <gi...@apache.org> on 2022/08/04 07:59:30 UTC

[GitHub] [incubator-pegasus] empiredan opened a new issue, #1101: Add a gauge for the duration since the meta server has received the last beacon

empiredan opened a new issue, #1101:
URL: https://github.com/apache/incubator-pegasus/issues/1101

   ## Background
   
   Recently, the primary meta server of a production cluster was found to be frequently disconnecting replica servers, because the time elapsed since it had received the last beacon from each replica server often exceeded the grace period (70+ seconds vs. 22 seconds).
   
   Network latency is typically several hundred microseconds, so something else had to be wrong in this cluster. Troubleshooting showed that two different NTP servers, A and B, were present in the configuration, and that A was more than one minute behind B. For example, the meta server received a beacon at 12:05:25 according to A; the clock then jumped suddenly to 12:06:35; the meta server concluded that far more than the grace period had elapsed and disconnected the corresponding replica server.
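   
   The arithmetic of that example can be checked with a minimal sketch. The helper names `beacon_expired` and `time_of_day` are hypothetical and are not taken from the Pegasus code base; the sketch only assumes that the check compares the elapsed time against the grace period.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not Pegasus code): seconds since midnight for a
// clock reading of h:m:s.
constexpr std::chrono::seconds time_of_day(int h, int m, int s)
{
    return std::chrono::hours(h) + std::chrono::minutes(m) + std::chrono::seconds(s);
}

// Hypothetical helper (not Pegasus code): true when the time elapsed between
// the last beacon and "now" is strictly greater than the grace period.
inline bool beacon_expired(std::chrono::seconds last_beacon,
                           std::chrono::seconds now,
                           std::chrono::seconds grace_period)
{
    return (now - last_beacon) > grace_period;
}
```

   With the values from the example, `time_of_day(12, 6, 35) - time_of_day(12, 5, 25)` is 70 seconds, so `beacon_expired` returns true for a 22-second grace period even though no beacon was actually late.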
   
   ## Implementation
   
   The duration since the meta server received the last beacon from each replica server can be exposed as a gauge, so that anomalies like this one can be detected faster.
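   
   One possible shape for such a gauge is sketched below. The class `beacon_age_tracker` and its methods are hypothetical and do not use the Pegasus metrics API. Timestamps are passed in by the caller, on the assumption that the gauge should report the same duration the meta server itself would compute from its own clock.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch (not the Pegasus metrics API): remembers when the last
// beacon from each replica server arrived and reports its age on demand.
class beacon_age_tracker
{
public:
    // Record that a beacon from `node` was received at `now_ms`
    // (milliseconds, as read from the caller's clock).
    void on_beacon(const std::string &node, int64_t now_ms) { _last_beacon_ms[node] = now_ms; }

    // Gauge value: milliseconds elapsed since the last beacon from `node`.
    // Returns false and leaves `age_ms` untouched if no beacon has been seen.
    bool get_age_ms(const std::string &node, int64_t now_ms, int64_t &age_ms) const
    {
        auto it = _last_beacon_ms.find(node);
        if (it == _last_beacon_ms.end()) {
            return false;
        }
        age_ms = now_ms - it->second;
        return true;
    }

private:
    std::map<std::string, int64_t> _last_beacon_ms;
};
```

   In this sketch, a beacon recorded at 1000 ms and sampled at 71000 ms yields a gauge value of 70000 ms, which would stand out against a 22-second grace period.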
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@pegasus.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

