Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/08/05 03:21:57 UTC

[GitHub] [ozone] Xushaohong opened a new pull request, #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Xushaohong opened a new pull request, #3657:
URL: https://github.com/apache/ozone/pull/3657

   
   ## What changes were proposed in this pull request?
   
   **Background:**
   The async write path is still not robust enough: when the cluster load is too high, some containers can become unrecoverable (no healthy replicas).
   
   Currently, such an unrecoverable ratis container will go through the following process.
   
   - DN will mark the container as unhealthy and report it to the SCM.
   
   - SCM then tries to close the container, and the container state becomes Closing.
   
   - DN won't close an unhealthy replica.
   
   - SCM RM will not send close cmd to those unhealthy containers. 
   
   Hence, the unrecoverable container is stuck in the Closing state.
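
The sequence above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Ozone code: the `State` names, `Replica`, `Container`, `dn_handle_close`, and `scm_round` are all hypothetical stand-ins for the DN and SCM behaviour listed above.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    OPEN = "OPEN"
    CLOSING = "CLOSING"
    CLOSED = "CLOSED"
    UNHEALTHY = "UNHEALTHY"


@dataclass
class Replica:
    state: State = State.OPEN


@dataclass
class Container:
    state: State = State.OPEN
    replicas: list = field(default_factory=list)


def dn_handle_close(replica: Replica) -> None:
    """Hypothetical DN side: an unhealthy replica is never closed."""
    if replica.state is State.UNHEALTHY:
        return
    replica.state = State.CLOSED


def scm_round(container: Container) -> None:
    """Hypothetical SCM side: one pass over a single container."""
    # An unhealthy replica report makes SCM start closing the container.
    if container.state is State.OPEN and any(
        r.state is State.UNHEALTHY for r in container.replicas
    ):
        container.state = State.CLOSING
    if container.state is State.CLOSING:
        for r in container.replicas:
            # The RM sends no close command to unhealthy replicas.
            if r.state is not State.UNHEALTHY:
                dn_handle_close(r)
        # The container only leaves CLOSING once some replica has closed.
        if any(r.state is State.CLOSED for r in container.replicas):
            container.state = State.CLOSED


# All replicas unhealthy: no replica can ever close, so the state never moves on.
stuck = Container(replicas=[Replica(State.UNHEALTHY) for _ in range(3)])
for _ in range(10):
    scm_round(stuck)
print(stuck.state)
```

In this model a container with at least one healthy replica reaches the closed state in one pass, while a container whose replicas are all unhealthy remains in Closing no matter how many passes run.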
   
   After the admin recovers whatever data is still available in such containers, or simply abandons them, the containers should be closed deliberately.
   
   Under such circumstances, we should provide a configurable way to clean up these closed containers.
   Once the unhealthy container has been closed, an unrecoverable container with only unhealthy replicas can be deleted.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-7099
   
   
   ## How was this patch tested?
   
   Unit tests, and in a production environment.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] Xushaohong closed pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Posted by GitBox <gi...@apache.org>.
Xushaohong closed pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container
URL: https://github.com/apache/ozone/pull/3657


[GitHub] [ozone] kerneltime commented on pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Posted by GitBox <gi...@apache.org>.
kerneltime commented on PR #3657:
URL: https://github.com/apache/ozone/pull/3657#issuecomment-1210342590

   What is the expected behavior if there is only one replica left and it is unhealthy? Unhealthy does not imply that there are no customer readable keys that have data in that replica. Deletion, in general, is an unsafe option, and we need to be sure we do not introduce a data loss scenario. 
   


[GitHub] [ozone] Xushaohong commented on pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Posted by GitBox <gi...@apache.org>.
Xushaohong commented on PR #3657:
URL: https://github.com/apache/ozone/pull/3657#issuecomment-1231091793

   > Adding to @kerneltime's response based on other discussions around this issue, it seems the desired solution would be to provide a path to remove containers from the system who have all replicas unhealthy or missing and **no keys mapped to them**, and that **the system should do this automatically without extra configuration**. My current understanding of this patch is that it is not doing the two parts in bold. Handling this is going to be a bit involved and may require a design document. I will try to write up some ideas to share out soon.
   
   Thanks @errose28 for the reply; automatic detection and cleanup is what we need. Currently, neither OM nor SCM on its own holds the **map of keys to containers**. If the service runs on SCM, it might need an extra query to OM to check whether the container still has keys mapped to it. 
   One concern is that such a map is only available through the Recon API, which is not clear enough and not commonly used. 
   


[GitHub] [ozone] Xushaohong commented on pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Posted by GitBox <gi...@apache.org>.
Xushaohong commented on PR #3657:
URL: https://github.com/apache/ozone/pull/3657#issuecomment-1290005935

   We need a more complete patch, so I am closing this one for now.


[GitHub] [ozone] errose28 commented on pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Posted by GitBox <gi...@apache.org>.
errose28 commented on PR #3657:
URL: https://github.com/apache/ozone/pull/3657#issuecomment-1215982234

   Hi @Xushaohong, I don't think we want to be deleting containers with all replicas unhealthy automatically, because this will cause divergence between the OM's metadata and the corresponding block storage. If all container replicas are unhealthy, there is no way to recover the containers, and the admin would like the containers removed, it would be better for the admin to delete the keys with data in those containers. The keys can be found using [Recon's REST API](https://ozone.apache.org/docs/current/interface/reconapi.html). Here you can query an index mapping container ID to keys with blocks in the container from the `/api/v1/containers/:id/keys` endpoint. We should double check that this code path works though. I am not sure what happens if delete block commands get queued for unhealthy containers. If there are bugs in this area we should fix them so that deleting all keys with data in an unhealthy container causes the unhealthy container to eventually be deleted.
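
As a rough illustration of the lookup described above, the sketch below builds the URL for the `/api/v1/containers/:id/keys` endpoint and pulls key paths out of a response body. Several things here are assumptions rather than facts from this thread: the host and port `recon.example.com:9888`, the helper names, and the response field names (`keys`, `Volume`, `Bucket`, `Key`), which should be checked against the Recon version in use.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def container_keys_url(recon_base: str, container_id: int) -> str:
    """Build the URL for Recon's /api/v1/containers/:id/keys endpoint."""
    base = recon_base.rstrip("/")
    return f"{base}/api/v1/containers/{quote(str(container_id))}/keys"


def key_paths(payload: dict) -> list:
    """Extract volume/bucket/key paths from a decoded response body.

    The field names used here are an assumption about the response shape.
    """
    return [
        f"{k['Volume']}/{k['Bucket']}/{k['Key']}"
        for k in payload.get("keys", [])
    ]


def fetch_key_paths(recon_base: str, container_id: int) -> list:
    """Query Recon over HTTP and return the keys with blocks in the container."""
    with urlopen(container_keys_url(recon_base, container_id)) as resp:
        return key_paths(json.load(resp))


# Offline example with a hand-written payload (not real Recon output).
sample = {"keys": [{"Volume": "vol1", "Bucket": "bucket1", "Key": "a/b.txt"}]}
print(container_keys_url("http://recon.example.com:9888", 42))
print(key_paths(sample))
```

An admin could feed the resulting paths into the usual key deletion tooling; `fetch_key_paths` is the only function that touches the network.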


[GitHub] [ozone] neils-dev commented on pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Posted by GitBox <gi...@apache.org>.
neils-dev commented on PR #3657:
URL: https://github.com/apache/ozone/pull/3657#issuecomment-1230802460

   @errose28, @kerneltime, I've also seen the unrecoverable container condition described in this PR, where SCM always reports the container as 'Closing' while Recon reports it as 'Missing'. I brought this up with @errose28 offline. 
   In this case, the datanode goes down, and on refresh Recon updates the state of the containers to 'Unhealthy' with an associated Missing Container. SCM, however, always reports this unrecoverable container in the state Closing, which never changes and is misleading to the admin. It would be helpful if the PR also handled this case and cleaned up such unrecoverable containers. See the attached images:
   
   ![Recon_missing_container](https://user-images.githubusercontent.com/81126310/187288190-7254bf19-5899-4537-8a75-8457beeb3002.png)
   
   ![container_report_closing_container](https://user-images.githubusercontent.com/81126310/187288199-0d95c76e-ee2d-4a95-ae46-23d8cf9f3886.png)
   


[GitHub] [ozone] kerneltime commented on pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Posted by GitBox <gi...@apache.org>.
kerneltime commented on PR #3657:
URL: https://github.com/apache/ozone/pull/3657#issuecomment-1208319996

   @duongnguyen0 can you take a look?
   


[GitHub] [ozone] kerneltime commented on pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Posted by GitBox <gi...@apache.org>.
kerneltime commented on PR #3657:
URL: https://github.com/apache/ozone/pull/3657#issuecomment-1222597333

   @Xushaohong we discussed this in the weekly open source meeting, and we plan to dive a bit deeper into the overall deletion logic and how this should be operationalized. 


[GitHub] [ozone] kerneltime commented on pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Posted by GitBox <gi...@apache.org>.
kerneltime commented on PR #3657:
URL: https://github.com/apache/ozone/pull/3657#issuecomment-1230546870

   cc @GeorgeJahad 


[GitHub] [ozone] errose28 commented on pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Posted by GitBox <gi...@apache.org>.
errose28 commented on PR #3657:
URL: https://github.com/apache/ozone/pull/3657#issuecomment-1222709518

   Thanks for raising the issue @Xushaohong. Handling containers where all replicas are in a degenerate state is definitely something the system should improve on. Adding to @kerneltime's response based on other discussions around this issue, it seems the desired solution would be to provide a path to remove containers from the system who have all replicas unhealthy or missing and **no keys mapped to them**, and that **the system should do this automatically without extra configuration**. My current understanding of this patch is that it is not doing the two parts in bold. Handling this is going to be a bit involved and may require a design document. I will try to write up some ideas to share out soon. 
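
The removal condition described above can be written as a small predicate. This is a minimal sketch, not a proposed implementation: `is_removable`, the `"UNHEALTHY"` state string, and the idea of passing in a pre-computed key count are all hypothetical, and a missing replica is modelled simply as an absent entry.

```python
from typing import Iterable


def is_removable(replica_states: Iterable[str], mapped_key_count: int) -> bool:
    """Return True only if every known replica is unhealthy (or all replicas
    are missing, i.e. the list is empty) AND no keys map to the container."""
    states = list(replica_states)
    # all() over an empty list is True, which covers the "all missing" case.
    all_degenerate = all(s == "UNHEALTHY" for s in states)
    return all_degenerate and mapped_key_count == 0


print(is_removable(["UNHEALTHY", "UNHEALTHY"], 0))
print(is_removable(["UNHEALTHY", "UNHEALTHY"], 3))
```

The second call is the case the reviewers are worried about: every replica is unhealthy, but keys still reference the container, so removing it would leave OM metadata pointing at deleted data.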


[GitHub] [ozone] Xushaohong commented on pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Posted by GitBox <gi...@apache.org>.
Xushaohong commented on PR #3657:
URL: https://github.com/apache/ozone/pull/3657#issuecomment-1210353117

   > What is the expected behavior if there is only one replica left and it is unhealthy? 
   
   By default, such containers will be left alone and Ozone will do nothing to them.
   
   >Unhealthy does not imply that there are no customer readable keys that have data in that replica. Deletion, in general, is an unsafe option, and we need to be sure we do not introduce a data loss scenario.
   
   Yes, this is for the case where the administrator is fully aware of these containers, may have already restored some of the readable keys, and then needs to delete the unrecoverable containers instead of resetting the whole cluster. Once the corresponding config is enabled, SCM will send delete commands to the DNs.


[GitHub] [ozone] Xushaohong commented on pull request #3657: HDDS-7099. Provide a configurable way to cleanup closed unrecoverable container

Posted by GitBox <gi...@apache.org>.
Xushaohong commented on PR #3657:
URL: https://github.com/apache/ozone/pull/3657#issuecomment-1216233224

   > Hi @Xushaohong, I don't think we want to be deleting containers with all replicas unhealthy automatically, because this will cause divergence between the OM's metadata and the corresponding block storage. If all container replicas are unhealthy, there is no way to recover the containers, and the admin would like the containers removed, it would be better for the admin to delete the keys with data in those containers. The keys can be found using [Recon's REST API](https://ozone.apache.org/docs/current/interface/reconapi.html). Here you can query an index mapping container ID to keys with blocks in the container from the `/api/v1/containers/:id/keys` endpoint. We should double check that this code path works though. I am not sure what happens if delete block commands get queued for unhealthy containers. If there are bugs in this area we should fix them so that deleting all keys with data in an unhealthy container causes the unhealthy container to eventually be deleted.
   
   @errose28 
   Hi, Ethan. Thanks for the reply. Deleting containers from the SCM RM side does not seem like a very sound idea. Currently, the logic in `isDeletionAllowed` only permits block deletion for closed containers; if the container is unhealthy, the DN will not process the delete commands and hence reports them back to SCM. Can we add a check condition to support deletion for unhealthy containers?
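
To make the question concrete, here is a sketch of the kind of check being asked about. It is not the real `isDeletionAllowed` method, which lives in Ozone's Java code and may check more than the container state; the function name, the state strings, and the `allow_unhealthy` flag are hypothetical and only mirror the behaviour described in this comment.

```python
def is_deletion_allowed(container_state: str, allow_unhealthy: bool = False) -> bool:
    """Decide whether block deletion may be processed for a container.

    With allow_unhealthy=False this models the behaviour described above:
    only closed containers accept block deletion.
    """
    if container_state == "CLOSED":
        return True
    # Hypothetical extension: optionally accept unhealthy containers too.
    return allow_unhealthy and container_state == "UNHEALTHY"


print(is_deletion_allowed("UNHEALTHY"))
print(is_deletion_allowed("UNHEALTHY", allow_unhealthy=True))
```

Keeping the extra condition behind an explicit flag leaves the default behaviour unchanged, so unhealthy containers are only touched when an admin opts in.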

