Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2021/07/08 07:52:06 UTC

[GitHub] [ozone] guihecheng opened a new pull request #2395: HDDS-5413. Limit num of containers to process per round for ReplicationManager.

guihecheng opened a new pull request #2395:
URL: https://github.com/apache/ozone/pull/2395


   ## What changes were proposed in this pull request?
   
Limit the number of containers to process per round for ReplicationManager.
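   
   A minimal sketch of the per-round cap (class, names and defaults here are illustrative, not the actual patch):
   
   ```java
   // Hypothetical sketch of a per-round cap on the ReplicationManager loop;
   // RoundLimitSketch, containerLimitPerRound and process() are made up.
   import java.util.List;
   
   public class RoundLimitSketch {
     private final int containerLimitPerRound = 10_000;  // assumed default
     private final long intervalMillis = 5 * 60 * 1000L; // RM sleep interval
   
     void run(List<Long> containerIds) throws InterruptedException {
       int cursor = 0;
       while (true) {
         // Process at most containerLimitPerRound containers this round;
         // the rest are picked up in later rounds.
         int end = Math.min(cursor + containerLimitPerRound, containerIds.size());
         for (long id : containerIds.subList(cursor, end)) {
           process(id);
         }
         cursor = (end == containerIds.size()) ? 0 : end;
         Thread.sleep(intervalMillis);
       }
     }
   
     private void process(long containerId) {
       // check replication state; queue replicate/delete commands as needed
     }
   }
   ```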
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-5413
   
   ## How was this patch tested?
   
New unit tests added.
   




[GitHub] [ozone] sodonnel commented on pull request #2395: HDDS-5413. Limit num of containers to process per round for ReplicationManager.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on pull request #2395:
URL: https://github.com/apache/ozone/pull/2395#issuecomment-876600258


   > This is just a limit on the number of containers to be processed; note that ReplicationManager counts each container as processed no matter whether it is under-replicated, over-replicated, or healthy.
   
   I'm not sure how well this will work in practice. Let's say we have 1M containers and the default limit here is 10,000. The default RM sleep interval is 5 minutes. It will take 100 iterations to process all the containers, so 500 minutes. If we lose one node on the cluster, some percentage of these containers will be under-replicated. Let's say we have a 100-node cluster, so:
   
   1M containers * 3 replicas = 3M replicas / 100 DNs = 30k replicas per DN.
   
   In theory we could process them all in 3 iterations, but this change will not do that.
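   
   Spelling out the arithmetic (same assumed defaults as above):
   
   ```java
   // Back-of-envelope check of the numbers above (all values assumed).
   public class RmMath {
     public static void main(String[] args) {
       long containers = 1_000_000L;
       int limitPerRound = 10_000; // proposed default cap
       int intervalMin = 5;        // default RM sleep interval
       System.out.println(containers / limitPerRound * intervalMin
           + " minutes for a full pass");              // 500 minutes
   
       int replicas = 3, dataNodes = 100;
       long perDn = containers * replicas / dataNodes; // 30,000 per DN
       System.out.println(perDn / limitPerRound
           + " rounds would be enough in theory");     // 3 rounds
     }
   }
   ```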
   
   We also have to consider decommission. In the example I gave above, it might take up to 500 minutes for all the containers on the decommissioning node to get processed. Or worse, for maintenance mode, there may be only a handful of containers that need to be replicated for the node to go to maintenance, but it will potentially get delayed by 500 minutes due to the batch size.
   
   RM has problems for sure - I don't like the way it calculates all the replication work for all containers and then sends all the work down to the DNs. The DNs don't really give feedback; the attempts time out on the RM (by default after 10 minutes) and it schedules more. I feel that we need some way to hold back the work in SCM and feed it to the DNs as they have capacity to accept it. That might mean the DNs feeding back their pending replication tasks via their heartbeat, and the RM not scheduling more commands until the DN has free capacity.
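   
   Roughly the kind of thing I mean (a hypothetical sketch; none of these names or the budget exist in the code today):
   
   ```java
   // Hypothetical sketch of heartbeat-driven backpressure. A DN would report
   // its queued replication tasks in the heartbeat; SCM holds work back until
   // the DN has room. Names and the budget are made up for illustration.
   import java.util.Map;
   
   public class BackpressureSketch {
     private final int maxInFlightPerDn = 20; // assumed per-DN task budget
   
     // dnQueueDepth: latest "queued replication tasks" from each DN heartbeat.
     boolean canScheduleOn(String dnId, Map<String, Integer> dnQueueDepth) {
       return dnQueueDepth.getOrDefault(dnId, 0) < maxInFlightPerDn;
     }
   }
   ```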
   
   Some DNs on the cluster may work faster than others for some reason (newer hardware, under less load, etc.). Ideally they would clear their replication work faster than others and hence be able to receive more, but as things stand the work is all handed out randomly, so we cannot take advantage of that.
   
   Another thing we might want to consider is to trigger RM on a dead-node or decommission/maintenance event, rather than waiting for the thread sleep interval.
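   
   For example (illustrative only; not the actual SCM event wiring):
   
   ```java
   // Illustrative sketch: wake the RM thread on a node event instead of
   // always waiting out the full sleep interval. Not the actual HDDS API.
   public class EventTriggeredRm {
     private final Object wakeup = new Object();
   
     // Called from a dead-node / decommission / maintenance event handler.
     void onNodeEvent() {
       synchronized (wakeup) {
         wakeup.notifyAll(); // cut the current sleep short
       }
     }
   
     void sleepBetweenRounds(long intervalMillis) throws InterruptedException {
       synchronized (wakeup) {
         wakeup.wait(intervalMillis); // returns early if notified
       }
     }
   }
   ```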
   
   I believe we need to have a good think about how RM works, and what we could do to improve it generally.




[GitHub] [ozone] guihecheng commented on pull request #2395: HDDS-5413. Limit num of containers to process per round for ReplicationManager.

Posted by GitBox <gi...@apache.org>.
guihecheng commented on pull request #2395:
URL: https://github.com/apache/ozone/pull/2395#issuecomment-876865625


   Thanks @sodonnel @bshashikant for your comments. We are planning to enhance the ReplicationManager these days, and I have started to work on some metrics and basic throttling, but there are more problems to handle, as @sodonnel notes below.
   I think I should put up a simple Google doc on this so we could discuss and work together to handle the problems.
   Some replies inline below; thanks again for the suggestions, @sodonnel.
   
   > This is just a limit on the number of containers to be processed; note that ReplicationManager counts each container as processed no matter whether it is under-replicated, over-replicated, or healthy.
   
   Yes, we should handle the under- and over-replicated cases later.
   
   > I'm not sure how well this will work in practice. Let's say we have 1M containers and the default limit here is 10,000. The default RM sleep interval is 5 minutes. It will take 100 iterations to process all the containers, so 500 minutes. If we lose one node on the cluster, some percentage of these containers will be under-replicated. Let's say we have a 100-node cluster, so:
   > 
   > 1M containers * 3 replicas = 3M replicas / 100 DNs = 30k replicas per DN.
   > 
   > In theory we could process them all in 3 iterations, but this change will not do that.
   
   Yes, the default value was borrowed directly from HDFS and could be tuned for large clusters based on some numbers (num of DNs, etc.). But I agree that we could take a different approach to counting, such as only counting non-healthy containers as processed.
   The idea is just to let the DNs start processing replicas early (only potentially, since for now we count all processed containers), rather than waiting for ReplicationManager to make a complete decision on all containers. Within one ReplicationManager interval (300s by default), a good many replication tasks could complete, leaving only a few tasks queued on the DNs. Without a rough throttle, a large number of tasks would pile up in the DN queues.
   And we could do better in later patches, for example by putting a limit on the commands (replication/delete/close) sent to each DN.
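   
   A hypothetical sketch of such a per-DN cap (all names made up for illustration, not existing Ozone code):
   
   ```java
   // Hypothetical per-DN command budget for one RM round.
   import java.util.HashMap;
   import java.util.Map;
   
   public class PerDnCommandLimiter {
     private final int maxCommandsPerDn;
     private final Map<String, Integer> sentThisRound = new HashMap<>();
   
     PerDnCommandLimiter(int maxCommandsPerDn) {
       this.maxCommandsPerDn = maxCommandsPerDn;
     }
   
     // True if one more replicate/delete/close command may go to this DN.
     boolean trySend(String dnId) {
       int sent = sentThisRound.getOrDefault(dnId, 0);
       if (sent >= maxCommandsPerDn) {
         return false; // defer to a later round
       }
       sentThisRound.put(dnId, sent + 1);
       return true;
     }
   
     void resetForNewRound() {
       sentThisRound.clear();
     }
   }
   ```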
   
   > We also have to consider decommission. In the example I gave above, it might take up to 500 minutes for all the containers on the decommissioning node to get processed. Or worse, for maintenance mode, there may be only a handful of containers that need to be replicated for the node to go to maintenance, but it will potentially get delayed by 500 minutes due to the batch size.
   
   Oh, this is really a problem for decommission, but we could assign different priorities to containers, and decommission/maintenance-related containers could have higher priority to be processed.
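   
   For instance, roughly (an illustrative sketch only):
   
   ```java
   // Illustrative sketch: decommission/maintenance containers drain first.
   import java.util.Comparator;
   import java.util.PriorityQueue;
   
   public class PrioritySketch {
     enum Priority { DECOMMISSION, MAINTENANCE, NORMAL } // lower ordinal first
   
     static class Pending {
       final long containerId;
       final Priority priority;
   
       Pending(long containerId, Priority priority) {
         this.containerId = containerId;
         this.priority = priority;
       }
     }
   
     private final PriorityQueue<Pending> queue =
         new PriorityQueue<>(Comparator.comparing((Pending p) -> p.priority));
   
     void add(Pending p) { queue.add(p); }
   
     Pending next() { return queue.poll(); } // highest-priority container first
   }
   ```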
   
   > RM has problems for sure - I don't like the way it calculates all the replication work for all containers and then sends all the work down to the DNs. The DNs don't really give feedback; the attempts time out on the RM (by default after 10 minutes) and it schedules more. I feel that we need some way to hold back the work in SCM and feed it to the DNs as they have capacity to accept it. That might mean the DNs feeding back their pending replication tasks via their heartbeat, and the RM not scheduling more commands until the DN has free capacity.
   
   I like the idea of DN feedback about its replication state, such as the number of queued tasks, to show how busy the DN is.
   Holding back the work in SCM is good, but we don't have to worry about the capacity problem, since the ContainerPlacementPolicies (based on node reports) will filter out DNs that don't have enough space to hold more container replicas; see
   https://github.com/apache/ozone/pull/2246.
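   
   Conceptually that filtering looks something like this (a simplified sketch with made-up names, not the actual policy code):
   
   ```java
   // Simplified sketch of space-based filtering from node reports.
   import java.util.List;
   import java.util.stream.Collectors;
   
   public class SpaceFilterSketch {
     static class NodeReport {
       final String dnId;
       final long remainingBytes;
   
       NodeReport(String dnId, long remainingBytes) {
         this.dnId = dnId;
         this.remainingBytes = remainingBytes;
       }
     }
   
     // Keep only DNs with room for one more container replica.
     List<NodeReport> filterBySpace(List<NodeReport> nodes, long requiredBytes) {
       return nodes.stream()
           .filter(n -> n.remainingBytes >= requiredBytes)
           .collect(Collectors.toList());
     }
   }
   ```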
   
   > Some DNs on the cluster may work faster than others for some reason (newer hardware, under less load, etc.). Ideally they would clear their replication work faster than others and hence be able to receive more, but as things stand the work is all handed out randomly, so we cannot take advantage of that.
   
   Ah, this may imply implementing a new ContainerPlacementPolicy that takes hardware into consideration. But in most of our deployments we have uniform hardware across all nodes, so we could live with the existing policies. As for load differences, we could rely on the feedback in node reports discussed above.
   
   > Another thing we might want to consider is to trigger RM on a dead-node or decommission/maintenance event, rather than waiting for the thread sleep interval.
   
   Yes, that solves the problem you raised above, similar to the idea of giving higher priority to decommission/maintenance-related containers.
   
   > I believe we need to have a good think about how RM works, and what we could do to improve it generally.
   
   Sure, I will put up a Google doc soon, thanks~
   





[GitHub] [ozone] sodonnel commented on pull request #2395: HDDS-5413. Limit num of containers to process per round for ReplicationManager.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on pull request #2395:
URL: https://github.com/apache/ozone/pull/2395#issuecomment-877302968


   Thanks for adding the doc. Let's think about this more next week and consider how we would like to see Replication Manager change, and we can take it from there.




[GitHub] [ozone] guihecheng closed pull request #2395: HDDS-5413. Limit num of containers to process per round for ReplicationManager.

Posted by GitBox <gi...@apache.org>.
guihecheng closed pull request #2395:
URL: https://github.com/apache/ozone/pull/2395


   




[GitHub] [ozone] guihecheng commented on pull request #2395: HDDS-5413. Limit num of containers to process per round for ReplicationManager.

Posted by GitBox <gi...@apache.org>.
guihecheng commented on pull request #2395:
URL: https://github.com/apache/ozone/pull/2395#issuecomment-877066389


   A simple doc has been put up as discussed; you can find it above.



