Posted to dev@uniffle.apache.org by GitBox <gi...@apache.org> on 2022/08/24 04:21:02 UTC

[GitHub] [incubator-uniffle] smallzhongfeng opened a new issue, #186: [Feature] Select remoteStoragePath based on the length of files and the remaining space from namespace

smallzhongfeng opened a new issue, #186:
URL: https://github.com/apache/incubator-uniffle/issues/186

   At present, when multiple HDFS paths are configured, the selection strategy is based only on the number of apps assigned to each path, which is relatively simple. My idea is to compare the paths by the ratio of the total size of all files under each `remoteStoragePath` to the remaining space of the namespace corresponding to that `remoteStoragePath`.
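
As a rough illustration of the comparison proposed above, the following Java sketch ranks candidate paths by the ratio of file size to remaining space. All class and method names are hypothetical and are not part of Uniffle. Picking the smallest ratio is an assumed reading of the proposal. In a real deployment the two numbers could probably be obtained from Hadoop's `FileSystem.getContentSummary(path).getLength()` and `FileSystem.getStatus().getRemaining()`; here they are passed in directly so the sketch stays self-contained.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for the two measurements of one remoteStoragePath.
class RemoteStorageUsage {
    final String path;            // e.g. "hdfs://ns1/rss" (illustrative value)
    final long fileBytes;         // total size of all files under the path
    final long remainingBytes;    // remaining space of the path's namespace

    RemoteStorageUsage(String path, long fileBytes, long remainingBytes) {
        this.path = path;
        this.fileBytes = fileBytes;
        this.remainingBytes = remainingBytes;
    }

    // Ratio of file size to remaining space; a full namespace ranks last.
    double ratio() {
        if (remainingBytes <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return (double) fileBytes / remainingBytes;
    }
}

class RatioRanking {
    // Returns the paths ordered from smallest to largest ratio
    // (assumption: a smaller ratio means a better candidate).
    static List<String> rank(List<RemoteStorageUsage> candidates) {
        List<RemoteStorageUsage> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(RemoteStorageUsage::ratio));
        List<String> paths = new ArrayList<>();
        for (RemoteStorageUsage usage : sorted) {
            paths.add(usage.path);
        }
        return paths;
    }
}
```

With this ordering, the first element of the returned list would be the path handed to the next application.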


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #186: [Feature] Select remoteStoragePath based on the length of files and the remaining space from namespace

Posted by GitBox <gi...@apache.org>.
smallzhongfeng commented on issue #186:
URL: https://github.com/apache/incubator-uniffle/issues/186#issuecomment-1225325073

   Yes, we will verify it in the production environment. And thank you again for your advice.




[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #186: [Feature] Select remoteStoragePath based on the length of files and the remaining space from namespace

Posted by GitBox <gi...@apache.org>.
smallzhongfeng commented on issue #186:
URL: https://github.com/apache/incubator-uniffle/issues/186#issuecomment-1225172262

   WDYT? @jerqi 




[GitHub] [incubator-uniffle] jerqi commented on issue #186: [Feature] Select remoteStoragePath based on the length of files and the remaining space from namespace

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #186:
URL: https://github.com/apache/incubator-uniffle/issues/186#issuecomment-1225260090

   There are many things to consider about HDFS allocation.
   First, the scale of the HDFS cluster. The more DataNodes there are, the more IO capacity the cluster can provide.
   Second, the remaining space of the HDFS cluster. If a shuffle will use a lot of space, we should give it an HDFS cluster with enough space, but we should note that shuffle data is temporary and is deleted after use. Shuffle data usually doesn't require as much space as input and output data.
   Third, if the HDFS cluster is shared with other users, we also need to care about its stability. If an HDFS cluster has too many retries, we should allocate fewer applications to it.
   Fourth, we can't forecast how big a shuffle will be when we allocate an HDFS cluster to it. So we can only assume that an application with a big shuffle is the same as one with a small shuffle, which is absolutely wrong in a production cluster. But I don't have any ideas about it.
   Finally, it's OK for me to add a new strategy. But we should separate the mechanism from the strategy, and we need some data from the production environment to improve the effectiveness of the strategy.
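
One way to read "separate the mechanism from strategy" is sketched below in Java. Every name here (`PathStats`, `SelectionStrategy`, `LeastAppsStrategy`, `StorageAssigner`) is hypothetical and is not Uniffle's actual API. The assigner (mechanism) only tracks state and records assignments, while the decision itself sits behind a small interface, so a new strategy could be added without touching the bookkeeping.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical per-path state that strategies are allowed to look at.
class PathStats {
    final String path;
    int appCount;

    PathStats(String path) {
        this.path = path;
    }
}

// Strategy: decides which path to use; knows nothing about bookkeeping.
interface SelectionStrategy {
    Optional<PathStats> select(Collection<PathStats> stats);
}

// The app-count rule described in the issue, expressed as one strategy.
class LeastAppsStrategy implements SelectionStrategy {
    @Override
    public Optional<PathStats> select(Collection<PathStats> stats) {
        return stats.stream().min(Comparator.comparingInt((PathStats s) -> s.appCount));
    }
}

// Mechanism: keeps the state and applies whatever strategy it was given.
class StorageAssigner {
    private final Map<String, PathStats> statsByPath = new LinkedHashMap<>();
    private final Map<String, String> pathByApp = new HashMap<>();
    private final SelectionStrategy strategy;

    StorageAssigner(List<String> paths, SelectionStrategy strategy) {
        for (String path : paths) {
            statsByPath.put(path, new PathStats(path));
        }
        this.strategy = strategy;
    }

    String assign(String appId) {
        String existing = pathByApp.get(appId);
        if (existing != null) {
            return existing; // an app keeps the path it was first given
        }
        PathStats chosen = strategy.select(statsByPath.values())
            .orElseThrow(() -> new IllegalStateException("no remote storage path available"));
        chosen.appCount++;
        pathByApp.put(appId, chosen.path);
        return chosen.path;
    }
}
```

Under this split, a ratio-based or retry-aware rule would be another `SelectionStrategy` implementation, and production data could be used to compare strategies without changing `StorageAssigner`.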
   




[GitHub] [incubator-uniffle] jerqi commented on issue #186: [Feature] Select remoteStoragePath based on the length of files and the remaining space from namespace

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #186:
URL: https://github.com/apache/incubator-uniffle/issues/186#issuecomment-1225306182

   > Thank you for your detailed explanation. I have no doubts about the first two points. On the third point, I would like to know what you mean by the retry of the HDFS cluster. On the fourth point, we really have no way to know the amount of shuffle data in advance, so the current idea is that if there are no files at the beginning, we can only compare the remaining capacity of the namespaces. If there are already shuffle files, then we can compare the ratio of the size of all the shuffle files to the remaining capacity of the namespace under the different HDFS paths.
   
   Third point, sorry... I should give more explanation. If a shuffle server fails to write data to HDFS because of high HDFS load, the write will be retried. If there are many HDFS retries, it means that the HDFS cluster is in a bad state and we should avoid using it.
   Fourth, your solution is not a perfect solution; it must rely on data from the production environment to prove its effectiveness.
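
The retry point above could be sketched roughly as follows in Java. `HdfsRetryTracker`, its methods, and the `maxRetries` threshold are all assumptions made for illustration; nothing in this thread says how retries are counted or reported in Uniffle. The idea is only that paths whose retry count exceeds a threshold are skipped when candidates are chosen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical tracker of write retries per remote storage path.
class HdfsRetryTracker {
    private final Map<String, AtomicLong> retriesByPath = new ConcurrentHashMap<>();
    private final long maxRetries; // assumed threshold, would need tuning

    HdfsRetryTracker(long maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Called whenever a write to the given path had to be retried.
    void recordRetry(String path) {
        retriesByPath.computeIfAbsent(path, p -> new AtomicLong()).incrementAndGet();
    }

    // Keeps only paths at or below the threshold. If every path is over
    // the threshold, all candidates are returned so selection can proceed.
    List<String> healthyPaths(List<String> candidates) {
        List<String> healthy = new ArrayList<>();
        for (String path : candidates) {
            AtomicLong count = retriesByPath.get(path);
            if (count == null || count.get() <= maxRetries) {
                healthy.add(path);
            }
        }
        return healthy.isEmpty() ? new ArrayList<>(candidates) : healthy;
    }
}
```

A production version would likely count retries over a time window or let them decay, since a cluster that was overloaded an hour ago may be healthy now.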




[GitHub] [incubator-uniffle] jerqi closed issue #186: [Feature] A better strategy to select remoteStoragePath

Posted by GitBox <gi...@apache.org>.
jerqi closed issue #186: [Feature] A better strategy to select remoteStoragePath
URL: https://github.com/apache/incubator-uniffle/issues/186




[GitHub] [incubator-uniffle] smallzhongfeng commented on issue #186: [Feature] Select remoteStoragePath based on the length of files and the remaining space from namespace

Posted by GitBox <gi...@apache.org>.
smallzhongfeng commented on issue #186:
URL: https://github.com/apache/incubator-uniffle/issues/186#issuecomment-1225291201

   Thank you for your detailed explanation. I have no doubts about the first two points. On the third point, I would like to know what you mean by the retry of the HDFS cluster. On the fourth point, we really have no way to know the amount of shuffle data in advance, so the current idea is that if there are no files at the beginning, we can only compare the remaining capacity of the namespaces. If there are already shuffle files, then we can compare the ratio of the size of all the shuffle files to the remaining capacity of the namespace under the different HDFS paths.
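
The two cases described here could look something like the Java sketch below. The names are hypothetical, and two readings are assumed: "no files at the beginning" is taken to mean that none of the candidate paths holds shuffle files yet, and a smaller size-to-remaining ratio is taken to be better.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical snapshot of one HDFS path and its namespace.
class NamespaceSnapshot {
    final String path;
    final long shuffleBytes;    // size of all shuffle files under the path
    final long remainingBytes;  // remaining capacity of the namespace

    NamespaceSnapshot(String path, long shuffleBytes, long remainingBytes) {
        this.path = path;
        this.shuffleBytes = shuffleBytes;
        this.remainingBytes = remainingBytes;
    }
}

class TwoPhaseSelector {
    static double ratio(NamespaceSnapshot s) {
        if (s.remainingBytes <= 0) {
            return Double.POSITIVE_INFINITY; // a full namespace ranks last
        }
        return (double) s.shuffleBytes / s.remainingBytes;
    }

    static String select(List<NamespaceSnapshot> snapshots) {
        if (snapshots.isEmpty()) {
            throw new IllegalArgumentException("no candidate paths");
        }
        boolean noFilesYet = snapshots.stream().allMatch(s -> s.shuffleBytes == 0);
        Comparator<NamespaceSnapshot> order;
        if (noFilesYet) {
            // Cold start: only remaining capacity can be compared, largest first.
            order = Comparator.comparingLong((NamespaceSnapshot s) -> s.remainingBytes).reversed();
        } else {
            // Shuffle files exist: compare size-to-remaining ratios, smallest first.
            order = Comparator.comparingDouble(TwoPhaseSelector::ratio);
        }
        return snapshots.stream().min(order).get().path;
    }
}
```

Whether these rules actually balance load better than counting apps is exactly what the production data mentioned in this thread would have to show.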




[GitHub] [incubator-uniffle] jerqi commented on issue #186: [Feature] A better strategy to select remoteStoragePath

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #186:
URL: https://github.com/apache/incubator-uniffle/issues/186#issuecomment-1239457073

   resolved by #192 

