Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/06/21 11:55:31 UTC

[GitHub] [ozone] adoroszlai opened a new pull request #2352: HDDS-5363. Datanode shutdown due to too many bad volumes in CI

adoroszlai opened a new pull request #2352:
URL: https://github.com/apache/ozone/pull/2352


   ## What changes were proposed in this pull request?
   
   After HDDS-5268, the disk check times out even though all volumes are healthy:
   
   ```
   2021-06-21 09:35:49,202 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
   2021-06-21 09:35:49,204 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
   2021-06-21 09:35:49,205 [DataNode DiskChecker thread 1] DEBUG volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
   2021-06-21 09:36:19,169 [Periodic HDDS volume checker] WARN volume.StorageVolumeChecker: checkAllVolumes timed out after 30000 ms
   2021-06-21 09:36:19,169 [Periodic HDDS volume checker] WARN volume.MutableVolumeSet: checkAllVolumes got 1 failed volumes - [/data/hdds/hdds]
   2021-06-21 09:36:19,172 [Periodic HDDS volume checker] INFO volume.MutableVolumeSet: Moving Volume : /data/hdds/hdds to failed Volumes
   2021-06-21 09:36:19,172 [Periodic HDDS volume checker] ERROR statemachine.DatanodeStateMachine: DatanodeStateMachine Shutdown due to too many bad volumes, check hdds.datanode.failed.data.volumes.tolerated and hdds.datanode.failed.metadata.volumes.tolerated
   ```
   
   This can be reproduced simply by starting up the cluster and waiting long enough for the disk checker to kick in and then time out.
   
   The problem is a deadlock caused by locking (a minimal sketch of the pattern follows the list):
    * the `Periodic HDDS volume checker` thread holds the lock on `StorageVolumeChecker` (acquired in `checkAllVolumeSets()`) and calls `latch.await()`
    * `DataNode DiskChecker thread 1` waits for the same lock in `ResultHandler.markHealthy()`, so it never reaches the code that would `countDown()` the latch
    * the `Periodic HDDS volume checker` thread therefore only proceeds after the timeout, at which point the volume is treated as failed
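   
   For illustration, here is a minimal, self-contained Java sketch of the pattern (class and method names are simplified stand-ins, not the actual Ozone sources): one thread holds the object monitor while awaiting a latch, and the worker thread that would count the latch down must first enter a `synchronized` method on the same object, so the wait can only end by timing out.
   
   ```java
   import java.util.concurrent.CountDownLatch;
   import java.util.concurrent.TimeUnit;
   
   // Simplified illustration of the locking problem described above.
   public class CheckerDeadlockSketch {
   
     private final CountDownLatch latch = new CountDownLatch(1);
   
     // "Periodic HDDS volume checker" analogue: holds the object lock for the
     // whole check, including the wait on the latch.
     public synchronized boolean checkAllVolumeSets() throws InterruptedException {
       new Thread(this::markHealthy, "DiskChecker").start();
       // The worker cannot enter markHealthy() until this method returns,
       // so this await can only end by timing out.
       return latch.await(3, TimeUnit.SECONDS); // 30 s in the real checker
     }
   
     // "ResultHandler.markHealthy" analogue: needs the same object lock
     // before it can record the result and release the latch.
     public synchronized void markHealthy() {
       latch.countDown();
     }
   
     public static void main(String[] args) throws InterruptedException {
       boolean completed = new CheckerDeadlockSketch().checkAllVolumeSets();
       System.out.println(completed
           ? "check completed in time"
           : "check timed out; volume would be counted as failed");
     }
   }
   ```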
   
   Since each `checkAllVolumes()` call uses its own `healthyVolumes` and `failedVolumes` sets, these don't need to be guarded by the lock on `StorageVolumeChecker`, which avoids the deadlock (see the sketch below).
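   
   A minimal sketch of that direction (again with simplified names and a plain worker thread per volume; this is not the actual patch): worker threads record results in concurrent sets that are local to one call, without taking the checker's monitor, so the latch is always counted down.
   
   ```java
   import java.util.Set;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.CountDownLatch;
   import java.util.concurrent.TimeUnit;
   
   // Sketch of the fixed pattern: results go into concurrent sets that are
   // local to one checkAllVolumes() call, so no shared monitor is needed.
   public class CheckerFixSketch {
   
     public boolean checkAllVolumes(Set<String> volumes) throws InterruptedException {
       Set<String> healthyVolumes = ConcurrentHashMap.newKeySet();
       Set<String> failedVolumes = ConcurrentHashMap.newKeySet();
       CountDownLatch latch = new CountDownLatch(volumes.size());
   
       for (String volume : volumes) {
         // Each check completes on a worker thread; no lock is needed to
         // record its result, so the latch is always counted down.
         new Thread(() -> {
           healthyVolumes.add(volume); // or failedVolumes.add(volume) on error
           latch.countDown();
         }, "DiskChecker-" + volume).start();
       }
   
       boolean completed = latch.await(30, TimeUnit.SECONDS);
       System.out.println("healthy=" + healthyVolumes + ", failed=" + failedVolumes);
       return completed;
     }
   
     public static void main(String[] args) throws InterruptedException {
       new CheckerFixSketch().checkAllVolumes(Set.of("/data/hdds/hdds"));
     }
   }
   ```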
   
   https://issues.apache.org/jira/browse/HDDS-5363
   
   ## How was this patch tested?
   
   Tested locally on a Docker Compose cluster with short disk checker intervals:
   
   ```
   OZONE-SITE.XML_hdds.datanode.disk.check.min.gap=1m
   OZONE-SITE.XML_hdds.datanode.disk.check.timeout=30s
   OZONE-SITE.XML_hdds.datanode.periodic.disk.check.interval.minutes=2
   ```
   
   The disk checker did not time out and did not cause a datanode shutdown:
   
   ```
   datanode_2  | 2021-06-21 10:15:00,071 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
   datanode_2  | 2021-06-21 10:15:00,073 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
   datanode_2  | 2021-06-21 10:15:00,074 [Periodic HDDS volume checker] DEBUG volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
   datanode_2  | 2021-06-21 10:15:00,076 [Periodic HDDS volume checker] DEBUG volume.StorageVolumeChecker: Volumes to be checked: 0, invoking callback: true
   datanode_2  | 2021-06-21 10:15:00,076 [Periodic HDDS volume checker] DEBUG volume.MutableVolumeSet: checkAllVolumes encountered no failures
   datanode_2  | 2021-06-21 10:15:00,077 [Periodic HDDS volume checker] TRACE volume.StorageVolumeChecker: Skipped checking all volumes, time since last check 7 is less than the minimum gap between checks (60000 ms).
   datanode_2  | 2021-06-21 10:15:00,077 [Periodic HDDS volume checker] DEBUG volume.MutableVolumeSet: checkAllVolumes encountered no failures
   datanode_3  | 2021-06-21 10:15:00,277 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
   datanode_3  | 2021-06-21 10:15:00,278 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
   datanode_3  | 2021-06-21 10:15:00,280 [DataNode DiskChecker thread 1] DEBUG volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
   datanode_3  | 2021-06-21 10:15:00,280 [DataNode DiskChecker thread 1] DEBUG volume.StorageVolumeChecker: Volumes to be checked: 0, invoking callback: true
   datanode_3  | 2021-06-21 10:15:00,281 [Periodic HDDS volume checker] DEBUG volume.MutableVolumeSet: checkAllVolumes encountered no failures
   datanode_3  | 2021-06-21 10:15:00,282 [Periodic HDDS volume checker] TRACE volume.StorageVolumeChecker: Skipped checking all volumes, time since last check 6 is less than the minimum gap between checks (60000 ms).
   datanode_3  | 2021-06-21 10:15:00,282 [Periodic HDDS volume checker] DEBUG volume.MutableVolumeSet: checkAllVolumes encountered no failures
   datanode_1  | 2021-06-21 10:15:00,432 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
   datanode_1  | 2021-06-21 10:15:00,435 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
   datanode_1  | 2021-06-21 10:15:00,436 [Periodic HDDS volume checker] DEBUG volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
   datanode_1  | 2021-06-21 10:15:00,436 [Periodic HDDS volume checker] DEBUG volume.StorageVolumeChecker: Volumes to be checked: 0, invoking callback: true
   datanode_1  | 2021-06-21 10:15:00,436 [Periodic HDDS volume checker] DEBUG volume.MutableVolumeSet: checkAllVolumes encountered no failures
   datanode_1  | 2021-06-21 10:15:00,438 [Periodic HDDS volume checker] TRACE volume.StorageVolumeChecker: Skipped checking all volumes, time since last check 6 is less than the minimum gap between checks (60000 ms).
   datanode_1  | 2021-06-21 10:15:00,438 [Periodic HDDS volume checker] DEBUG volume.MutableVolumeSet: checkAllVolumes encountered no failures
   datanode_2  | 2021-06-21 10:16:59,939 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
   datanode_2  | 2021-06-21 10:16:59,941 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
   datanode_2  | 2021-06-21 10:16:59,942 [DataNode DiskChecker thread 2] DEBUG volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
   datanode_2  | 2021-06-21 10:16:59,942 [DataNode DiskChecker thread 2] DEBUG volume.StorageVolumeChecker: Volumes to be checked: 0, invoking callback: true
   datanode_2  | 2021-06-21 10:16:59,943 [Periodic HDDS volume checker] DEBUG volume.MutableVolumeSet: checkAllVolumes encountered no failures
   datanode_2  | 2021-06-21 10:16:59,943 [Periodic HDDS volume checker] TRACE volume.StorageVolumeChecker: Skipped checking all volumes, time since last check 4 is less than the minimum gap between checks (60000 ms).
   datanode_2  | 2021-06-21 10:16:59,943 [Periodic HDDS volume checker] DEBUG volume.MutableVolumeSet: checkAllVolumes encountered no failures
   datanode_3  | 2021-06-21 10:17:00,145 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
   datanode_3  | 2021-06-21 10:17:00,151 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
   datanode_3  | 2021-06-21 10:17:00,152 [Periodic HDDS volume checker] DEBUG volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
   datanode_3  | 2021-06-21 10:17:00,152 [Periodic HDDS volume checker] DEBUG volume.StorageVolumeChecker: Volumes to be checked: 0, invoking callback: true
   datanode_3  | 2021-06-21 10:17:00,153 [Periodic HDDS volume checker] DEBUG volume.MutableVolumeSet: checkAllVolumes encountered no failures
   datanode_3  | 2021-06-21 10:17:00,153 [Periodic HDDS volume checker] TRACE volume.StorageVolumeChecker: Skipped checking all volumes, time since last check 8 is less than the minimum gap between checks (60000 ms).
   datanode_3  | 2021-06-21 10:17:00,154 [Periodic HDDS volume checker] DEBUG volume.MutableVolumeSet: checkAllVolumes encountered no failures
   datanode_1  | 2021-06-21 10:17:00,300 [Periodic HDDS volume checker] INFO volume.ThrottledAsyncChecker: Scheduling a check for /data/hdds/hdds
   datanode_1  | 2021-06-21 10:17:00,305 [Periodic HDDS volume checker] INFO volume.StorageVolumeChecker: Scheduled health check for volume /data/hdds/hdds
   datanode_1  | 2021-06-21 10:17:00,307 [Periodic HDDS volume checker] DEBUG volume.StorageVolumeChecker: Volume /data/hdds/hdds is HEALTHY.
   datanode_1  | 2021-06-21 10:17:00,308 [Periodic HDDS volume checker] DEBUG volume.StorageVolumeChecker: Volumes to be checked: 0, invoking callback: true
   datanode_1  | 2021-06-21 10:17:00,310 [Periodic HDDS volume checker] DEBUG volume.MutableVolumeSet: checkAllVolumes encountered no failures
   datanode_1  | 2021-06-21 10:17:00,315 [Periodic HDDS volume checker] TRACE volume.StorageVolumeChecker: Skipped checking all volumes, time since last check 15 is less than the minimum gap between checks (60000 ms).
   datanode_1  | 2021-06-21 10:17:00,315 [Periodic HDDS volume checker] DEBUG volume.MutableVolumeSet: checkAllVolumes encountered no failures
   ```
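   
   As an aside, the `Skipped checking all volumes` TRACE lines above reflect the minimum-gap throttle (`hdds.datanode.disk.check.min.gap`, set to 1m above): a further `checkAllVolumes()` call within the configured gap is skipped. A simplified sketch of that condition (not the actual Ozone code):
   
   ```java
   import java.time.Duration;
   
   // Simplified sketch of the throttle behind the TRACE lines above: a check is
   // skipped when less time than the configured minimum gap has passed since
   // the previous check (hdds.datanode.disk.check.min.gap, 1m in this test).
   public class MinGapThrottleSketch {
   
     static boolean shouldSkip(long nowMs, long lastCheckMs, Duration minGap) {
       return (nowMs - lastCheckMs) < minGap.toMillis();
     }
   
     public static void main(String[] args) {
       long last = System.currentTimeMillis() - 7; // "time since last check 7"
       boolean skip = shouldSkip(System.currentTimeMillis(), last, Duration.ofMinutes(1));
       System.out.println(skip); // true: the repeated call within the gap is skipped
     }
   }
   ```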



[GitHub] [ozone] adoroszlai commented on pull request #2352: HDDS-5363. Datanode shutdown due to too many bad volumes in CI

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on pull request #2352:
URL: https://github.com/apache/ozone/pull/2352#issuecomment-865184793


   Thanks @guihecheng, @mukul1987 and @bshashikant for the review.



[GitHub] [ozone] guihecheng edited a comment on pull request #2352: HDDS-5363. Datanode shutdown due to too many bad volumes in CI

Posted by GitBox <gi...@apache.org>.
guihecheng edited a comment on pull request #2352:
URL: https://github.com/apache/ozone/pull/2352#issuecomment-865063204


   Ah, nice catch! Thanks very much @adoroszlai, and sorry, this was my careless mistake.
   In HDDS-5268, I tried to move the periodic volume checker into `StorageVolumeChecker` so that all volume sets could register themselves for periodic checks, and I used `synchronized` to protect register and check. But I didn't realize that the object lock was already used in the `ResultHandler` to protect `healthyVolumes` and `failedVolumes` on another thread, so there was a deadlock.
   After your commit, `healthyVolumes` and `failedVolumes` are concurrent sets, so the locks are separate.
   LGTM.



[GitHub] [ozone] bshashikant merged pull request #2352: HDDS-5363. Datanode shutdown due to too many bad volumes in CI

Posted by GitBox <gi...@apache.org>.
bshashikant merged pull request #2352:
URL: https://github.com/apache/ozone/pull/2352


   


[GitHub] [ozone] adoroszlai commented on pull request #2352: HDDS-5363. Datanode shutdown due to too many bad volumes in CI

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on pull request #2352:
URL: https://github.com/apache/ozone/pull/2352#issuecomment-865018736


   @guihecheng please review

