Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2019/05/27 05:32:06 UTC

[GitHub] [hadoop] supratimdeka opened a new pull request #852: HDDS-1454. GC other system pause events can trigger pipeline destroy for all the nodes in the cluster. Contributed by Supratim Deka

supratimdeka opened a new pull request #852: HDDS-1454. GC other system pause events can trigger pipeline destroy for all the nodes in the cluster. Contributed by Supratim Deka
URL: https://github.com/apache/hadoop/pull/852

https://issues.apache.org/jira/browse/HDDS-1454

Problem:
In a MiniOzoneChaosCluster run, it was observed that a GC pause, or any other pause in SCM, can cause SCM to mark all the datanodes as stale. This triggers the destruction of multiple pipelines and renders the system unusable.

Solution:
Added a timestamp check in NodeStateManager. If the heartbeat task detects a long scheduling delay since the last time it ran, then the task skips doing health checks and node state transitions in the current iteration.
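
For illustration only, the check described above could look roughly like the sketch below. This is not the code from the pull request: the class name PauseAwareHealthCheck, its fields, and the rule "skip when the gap since the last run exceeds the interval plus a tolerated delay" are all assumptions made for this example. Time is passed in explicitly so the behavior is deterministic; a real task would more likely read a monotonic clock.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a health check that skips an iteration when it
 * detects that it was scheduled much later than expected (for example after
 * a GC pause). Names and thresholds are assumptions, not NodeStateManager.
 */
final class PauseAwareHealthCheck {
  public enum State { HEALTHY, STALE }

  private final long intervalMs;        // expected gap between two runs
  private final long maxDelayMs;        // tolerated extra scheduling delay (assumed)
  private final long staleThresholdMs;  // heartbeat age after which a node is stale
  private long lastRunMs;               // time of the previous run
  private long skippedRuns;             // iterations skipped due to a detected pause
  private final Map<String, Long> lastHeartbeatMs = new HashMap<>();
  private final Map<String, State> states = new HashMap<>();

  PauseAwareHealthCheck(long intervalMs, long maxDelayMs,
                        long staleThresholdMs, long startMs) {
    this.intervalMs = intervalMs;
    this.maxDelayMs = maxDelayMs;
    this.staleThresholdMs = staleThresholdMs;
    this.lastRunMs = startMs;
  }

  /** Records a heartbeat from a node at the given time. */
  void recordHeartbeat(String node, long nowMs) {
    lastHeartbeatMs.put(node, nowMs);
    states.putIfAbsent(node, State.HEALTHY);
  }

  /** One iteration of the periodic task. */
  void runHealthCheck(long nowMs) {
    long sinceLastRun = nowMs - lastRunMs;
    lastRunMs = nowMs;
    if (sinceLastRun > intervalMs + maxDelayMs) {
      // The task itself was delayed, so heartbeat ages are not trustworthy.
      // Skip all state transitions in this iteration.
      skippedRuns++;
      return;
    }
    for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
      long age = nowMs - e.getValue();
      states.put(e.getKey(),
          age > staleThresholdMs ? State.STALE : State.HEALTHY);
    }
  }

  State stateOf(String node) {
    return states.get(node);
  }

  long skippedRuns() {
    return skippedRuns;
  }
}
```

With an interval of 1000 ms, a tolerated delay of 500 ms, and a stale threshold of 3000 ms, a run that arrives 59 seconds late is skipped and the node stays HEALTHY. Only a later, on-time run that sees a heartbeat older than the threshold moves it to STALE.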

Test:
The unit test simulates a JVM pause by pausing the iterations of the health check task. Once the health check task is "unpaused", the system is in a condition similar to the one that follows a JVM pause. The test asserts that no node that is still sending heartbeats transitions to Stale or Dead after such a long scheduling delay.
