Posted to dev@uniffle.apache.org by GitBox <gi...@apache.org> on 2022/08/24 06:32:13 UTC

[GitHub] [incubator-uniffle] jerqi commented on issue #186: [Feature] Select remoteStoragePath based on the length of files and the remaining space from namespace

jerqi commented on issue #186:
URL: https://github.com/apache/incubator-uniffle/issues/186#issuecomment-1225260090

   There are many things to consider about HDFS allocation. 
   First, the scale of the HDFS cluster. The more DataNodes it has, the more IO capability the cluster can provide.
   Second, the remaining space of the HDFS cluster. If a shuffle will use a lot of space, we should assign it an HDFS cluster with enough space, but note that shuffle data is temporary; we delete it after use. Shuffle data usually doesn't require as much space as input and output data.
   Third, if the HDFS cluster is shared with other users, we also need to care about its stability. If an HDFS cluster has too many retries, we should allocate fewer applications to it.
   Fourth, we can't forecast how big a shuffle will be when we allocate an HDFS cluster to it. So we can only treat a shuffle with a large amount of data the same as one with a small amount, which is absolutely wrong in a production cluster. But I don't have any ideas about how to handle it.
   Finally, it's OK for me to add a new strategy. But we should separate the mechanism from the strategy and collect some data from the production environment to improve the effectiveness of the strategy (a rough sketch follows below).
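   As a minimal sketch of what such a strategy could look like: the class and field names below are hypothetical, not Uniffle's actual API. It only illustrates ranking candidate remote storage paths by remaining space and the number of applications already assigned, which corresponds to the second and third points above.

   import java.util.Comparator;
   import java.util.List;

   // Hypothetical sketch only: ranks candidate remote storage paths by a score
   // that combines remaining space with the number of applications already
   // assigned to each namespace. Names and the weighting are illustrative.
   public class RemoteStorageScoringSelector {

     // Minimal view of one candidate HDFS namespace.
     public static class Candidate {
       final String path;          // e.g. "hdfs://ns1/rss"
       final long remainingBytes;  // remaining space reported by the namespace
       final int assignedApps;     // applications currently assigned to it

       public Candidate(String path, long remainingBytes, int assignedApps) {
         this.path = path;
         this.remainingBytes = remainingBytes;
         this.assignedApps = assignedApps;
       }
     }

     // Higher score is better: plenty of free space, few assigned applications.
     private static double score(Candidate c) {
       double remainingGb = c.remainingBytes / (1024.0 * 1024.0 * 1024.0);
       return remainingGb / (1.0 + c.assignedApps);
     }

     // Pick the highest-scoring candidate, or null if the list is empty.
     public static Candidate select(List<Candidate> candidates) {
       return candidates.stream()
           .max(Comparator.comparingDouble(RemoteStorageScoringSelector::score))
           .orElse(null);
     }
   }

   A real strategy would also need to feed back retry/failure metrics (the stability concern above) and re-evaluate assignments over time instead of scoring once, and the mechanism for collecting those metrics should stay separate from the scoring policy itself.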
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org