Posted to dev@uniffle.apache.org by GitBox <gi...@apache.org> on 2022/07/27 04:12:01 UTC

[GitHub] [incubator-uniffle] smallzhongfeng opened a new issue, #89: [Improvement] Add a load policy based on disk performance

smallzhongfeng opened a new issue, #89:
URL: https://github.com/apache/incubator-uniffle/issues/89

   The current load-balancing strategy only considers available memory. Should we add another strategy based on disk performance?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
smallzhongfeng commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196532415

   OK.


[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
smallzhongfeng commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196322424

   The current idea: when the parameter "rss.server.health.check.enable" is turned on, sort by the IO status of the available disks, as reported in heartbeats, regardless of the number of partitions and memory.
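
A minimal sketch of the ordering described above, written in Python for brevity. The `ServerHeartbeat` fields and the `disk_io_util` metric are assumptions made for illustration; the thread does not say what the heartbeat would carry. When the health check is off, the sketch falls back to ordering by available memory, which is the current behavior as described in the issue.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerHeartbeat:
    # Hypothetical heartbeat payload; these field names are not Uniffle's.
    server_id: str
    healthy: bool
    disk_io_util: float       # assumed metric: 0.0 (idle) to 1.0 (saturated)
    assigned_partitions: int  # deliberately ignored by the disk-based policy
    available_memory: int


def pick_servers(heartbeats: List[ServerHeartbeat], count: int,
                 health_check_enabled: bool = True) -> List[str]:
    """Return up to `count` server ids in assignment order."""
    if health_check_enabled:
        # Proposed policy: healthy servers only, least-busy disks first,
        # regardless of partition count and memory.
        candidates = [hb for hb in heartbeats if hb.healthy]
        candidates.sort(key=lambda hb: hb.disk_io_util)
    else:
        # Existing policy as described in the issue: most available memory first.
        candidates = sorted(heartbeats,
                            key=lambda hb: hb.available_memory,
                            reverse=True)
    return [hb.server_id for hb in candidates[:count]]
```

With the check enabled, a server with idle disks is preferred even if it has little free memory, which is exactly the trade-off the later comments question.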


[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
smallzhongfeng commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196369819

   1. I think this scheme can only support MEMORY_LOCALFILE.
   2. Since this HealthCheck collects the information of the local disk, we can use this feature. This health check now defaults to true, which hides some danger because RSS is sensitive to disk IO.  
   3. And if there are any problems during the test phase, I will report back promptly.
   4. I will use the configured disk capacity of each server as an indicator of the policy.


[GitHub] [incubator-uniffle] smallzhongfeng closed issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
smallzhongfeng closed issue #89: [Improvement] Add a load policy based on disk performance
URL: https://github.com/apache/incubator-uniffle/issues/89


[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
smallzhongfeng commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196406719

   @colinmjj I think you are right, but is it possible that memory is allocated normally, but disk IO has problems?


[GitHub] [incubator-uniffle] jerqi commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196251071

   It's OK to add an extra strategy. But we need a good way to tell whether the strategy is good enough.


[GitHub] [incubator-uniffle] jerqi commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196467989

   I think we should verify the `HealthCheck` function in a production environment first, before we turn it on. But there are few broken disks in our production environment. For now, I prefer to use `false` as the default value.


[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
smallzhongfeng commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196452995

   Do we need to turn the HealthCheck parameter on by default? That would allow better screening of healthy machines. @colinmjj @jerqi


[GitHub] [incubator-uniffle] jerqi commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196332327

   The idea is acceptable, but there are some questions:
   1. Why do we need HealthCheck? HealthCheck is an experimental feature now, and it may not be stable enough. That means you will need to do some extra work to improve the feature.
   2. Should we add some extra metrics to measure the effect of the strategy? We usually use server memory to evaluate the current strategy; we hope every server uses a similar amount of memory.
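
One possible way to quantify "every server uses a similar amount of memory" is the coefficient of variation of per-server memory usage. This is only a sketch; the function name and the choice of metric are assumptions and are not proposed anywhere in the thread.

```python
import statistics
from typing import Sequence


def memory_imbalance(used_memory: Sequence[float]) -> float:
    """Coefficient of variation of per-server memory usage.

    0.0 means every server uses the same amount of memory; larger values
    mean the assignment strategy is spreading load less evenly.
    """
    if not used_memory:
        return 0.0
    mean = statistics.mean(used_memory)
    if mean == 0:
        return 0.0
    # Population standard deviation: the servers are the whole population.
    return statistics.pstdev(used_memory) / mean
```

Comparing this value before and after enabling a new strategy on the same workload would give a rough indication of whether the strategy balances load better.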


[GitHub] [incubator-uniffle] jerqi commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196522252

   > Maybe you are right, but I think we should open it up so that we can verify this situation in more production environments.
   
   You can turn it on when you deploy the shuffle server rather than using `true` as the default value.


[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
smallzhongfeng commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196480313

   But I think we should open it up so that we can verify this situation in more production environments.


[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
smallzhongfeng commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196481405

   Maybe you are right, but I think we should open it up so that we can verify this situation in more production environments.


[GitHub] [incubator-uniffle] jerqi commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196378640

   > 1. I think this scheme can only support MEMORY_LOCALFILE.
   > 2. Since this HealthCheck collects the information of the local disk, we can use this feature. This health check now defaults to true, which hides some danger because RSS is sensitive to disk IO.
   > 3. And if there are any problems during the test phase, I will report back promptly.
   > 4. I will use the configured disk capacity of each server as an indicator of the policy.
   
   1. Why does the strategy only support MEMORY_LOCALFILE? We would like to use MEMORY_LOCALFILE_HDFS in our production environment. Our MEMORY_LOCALFILE_HDFS can use multiple HDFS.
   2. The HealthCheck's default value is false now.


[GitHub] [incubator-uniffle] colinmjj commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
colinmjj commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196400738

   @smallzhongfeng The workload of a Shuffle Server depends on a lot of things, e.g., memory, disk IO, network IO, etc. To simplify the assignment strategy, memory is chosen as the most important metric, because any problem in a shuffle server will cause high memory usage. In your case, if there is a problem with disk IO, data won't be flushed as expected, and more and more data will be stored in memory.
   Uniffle is a kind of producer-consumer model with memory as the cache, so I think we can assess the workload by memory usage and do the assignment accordingly.
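
The argument above can be illustrated with a toy producer-consumer model: writers fill an in-memory buffer and the disk drains it, so a slow disk shows up as growing memory usage. This is a sketch under simplified assumptions (fixed rates, discrete ticks); the function names and numbers are hypothetical and do not reflect Uniffle's actual buffer management.

```python
from typing import Dict, List, Tuple


def buffered_after(incoming_per_tick: int, flush_per_tick: int,
                   ticks: int, capacity: int) -> int:
    """Memory still buffered after `ticks` rounds of write-then-flush."""
    used = 0
    for _ in range(ticks):
        used = min(capacity, used + incoming_per_tick)  # producer fills memory
        used = max(0, used - flush_per_tick)            # disk drains memory
    return used


def rank_by_available_memory(servers: Dict[str, Tuple[int, int]]) -> List[str]:
    """Order servers by free memory, most free first.

    `servers` maps a server name to (capacity, used).
    """
    return sorted(servers,
                  key=lambda name: servers[name][0] - servers[name][1],
                  reverse=True)
```

In this model a server whose disk flushes slower than data arrives accumulates buffered data every tick, so a memory-based ranking pushes it down the list without looking at disk metrics directly. What the model does not capture is the case raised elsewhere in the thread, where a slow server catches up and is then handed a full share of partitions again.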


[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
smallzhongfeng commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196405275

   I mean that the shuffle server's isHealthy property returns true by default, not that the HealthCheck's default value is true.


[GitHub] [incubator-uniffle] jerqi commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1198914310

   > @smallzhongfeng The workload of a Shuffle Server depends on a lot of things, e.g., memory, disk IO, network IO, etc. To simplify the assignment strategy, memory is chosen as the most important metric, because any problem in a shuffle server will cause high memory usage. In your case, if there is a problem with disk IO, data won't be flushed as expected, and more and more data will be stored in memory. Uniffle is a kind of producer-consumer model with memory as the cache, so I think we can assess the workload by memory usage and do the assignment accordingly.
   
   @colinmjj
   One question:
   If server A and server B have equal memory but different numbers of disks, should we allocate them the same number of shuffle partitions?
   The server with fewer disks will be slower. Even if we don't allocate extra shuffle partitions to it, once it has processed its partitions the coordinator will again allocate too many partitions to it, and it will slow down again. I think it's meaningful to consider disk performance.
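
As an illustration of what "considering disk performance" could mean for the A/B case above, the sketch below splits partitions in proportion to each server's disk count. This weighting is an assumption chosen for the example, not a design agreed on in the thread, and `split_partitions_by_disks` is a hypothetical helper.

```python
from typing import Dict


def split_partitions_by_disks(disk_counts: Dict[str, int],
                              total_partitions: int) -> Dict[str, int]:
    """Split partitions across servers in proportion to their disk counts."""
    total_disks = sum(disk_counts.values())
    if total_disks <= 0:
        raise ValueError("at least one server must have a disk")
    # Ideal (fractional) share for each server.
    shares = {s: total_partitions * d / total_disks
              for s, d in disk_counts.items()}
    # Round down first, then hand leftovers to the largest remainders.
    alloc = {s: int(share) for s, share in shares.items()}
    leftover = total_partitions - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda s: shares[s] - alloc[s],
                          reverse=True)
    for server in by_remainder[:leftover]:
        alloc[server] += 1
    return alloc
```

With equal memory but 4 disks on A and 12 on B, 100 partitions would be split 25 to 75 instead of evenly.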


[GitHub] [incubator-uniffle] colinmjj commented on issue #89: [Improvement] Add a load policy based on disk performance

Posted by GitBox <gi...@apache.org>.
colinmjj commented on issue #89:
URL: https://github.com/apache/incubator-uniffle/issues/89#issuecomment-1196418301

   > @colinmjj I think you are right, but is it possible that memory is allocated normally, but disk IO has problems?
   
   I think you're worried about a shuffle server with an abnormal disk being assigned to a job.
   Currently, LocalStorageChecker is responsible for this check and will exclude abnormal disks.
   But if all disks are broken, we can improve HealthCheck to handle that case and exclude the shuffle server during assignment.
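
The two levels of exclusion described above can be sketched as follows. The data shape (a per-server map from disk path to a health flag) and both function names are assumptions for illustration; the real LocalStorageChecker logic is not shown in this thread.

```python
from typing import Dict, List


def usable_disks(disk_ok: Dict[str, bool]) -> List[str]:
    """Disk-level exclusion: keep only the disks that passed the check."""
    return sorted(path for path, ok in disk_ok.items() if ok)


def assignable_servers(servers: Dict[str, Dict[str, bool]]) -> List[str]:
    """Server-level exclusion: drop servers with no usable disk left."""
    return sorted(name for name, disks in servers.items()
                  if usable_disks(disks))
```

A server with one bad disk stays assignable and simply writes to its remaining disks; only a server whose disks are all broken is removed from assignment.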

