Posted to dev@uniffle.apache.org by GitBox <gi...@apache.org> on 2022/11/29 06:21:24 UTC

[GitHub] [incubator-uniffle] zuston opened a new issue, #372: [Improvement] Mark app data lost when encountering events dropped

zuston opened a new issue, #372:
URL: https://github.com/apache/incubator-uniffle/issues/372

   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### What would you like to be improved?
   
   When the shuffle-server encounters disk problems and triggers the event-dropping mechanism, we should mark the app's data as lost.
   
   After that, we could reject write/read requests from these apps so that the job fails fast. This mechanism is also compatible with multiple replicas.
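
A minimal sketch of the idea described above, assuming the server keeps a per-app "data lost" flag that is set when a flush event is dropped and checked before serving reads or writes. The class and method names (`AppDataLostTracker`, `onEventDropped`, `checkAccess`, `onAppRemoved`) are hypothetical and are not taken from the Uniffle codebase.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: tracks apps whose shuffle data was lost on this server.
// None of these names are part of the Uniffle API.
final class AppDataLostTracker {
    // Thread-safe set of app IDs that had at least one event dropped.
    private final Set<String> lostApps = ConcurrentHashMap.newKeySet();

    // Would be called from the path that drops a flush event (assumed hook).
    void onEventDropped(String appId) {
        lostApps.add(appId);
    }

    boolean isDataLost(String appId) {
        return lostApps.contains(appId);
    }

    // Guard for read/write handlers: throws so the caller can fail fast.
    void checkAccess(String appId) {
        if (isDataLost(appId)) {
            throw new IllegalStateException("Shuffle data lost for app " + appId);
        }
    }

    // Forget the app once it is unregistered or expires.
    void onAppRemoved(String appId) {
        lostApps.remove(appId);
    }
}
```

A request handler would call `checkAccess(appId)` first and turn the exception into an error response, so the client stops retrying against data that can no longer be complete.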
   
   ### How should we improve?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-uniffle] jerqi commented on issue #372: [Improvement] Mark app data lost when encountering events dropped

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #372:
URL: https://github.com/apache/incubator-uniffle/issues/372#issuecomment-1331699694

   We should be careful about introducing the replica concept into the server; in particular, it would require modifying the RPC proto. I think this will be an important concept, so we need to think about how it should evolve in the future.
   It's better to ask @xianjingfeng for advice, because he has more requirements from production environments.
   


[GitHub] [incubator-uniffle] jerqi commented on issue #372: [Improvement] Mark app data lost when encountering events dropped

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #372:
URL: https://github.com/apache/incubator-uniffle/issues/372#issuecomment-1330152113

   > > > @xianjingfeng @jerqi PTAL. I think this is an improvement.
   > > 
   > > 
   > > With multiple replicas, this won't be suitable, because a single replica doesn't need to store the complete data.
   > 
   > I missed this point. So maybe we should make the job fail fast only when a single replica is used?
   
   The server doesn't have a replica concept.


[GitHub] [incubator-uniffle] jerqi commented on issue #372: [Improvement] Mark app data lost when encountering events dropped

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #372:
URL: https://github.com/apache/incubator-uniffle/issues/372#issuecomment-1331705508

   > Feel free to discuss more. We also currently suffer from single partitions whose data is too large, and we need this so that jobs fail fast, which improves stability.
   
   For huge shuffles, we really recommend using MEMORY_LOCALFILE_HDFS. You can set `rss.server.flush.cold.storage.threshold.size` to a large value to reduce the amount of shuffle data written to HDFS.


[GitHub] [incubator-uniffle] zuston commented on issue #372: [Improvement] Mark app data lost when encountering events dropped

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #372:
URL: https://github.com/apache/incubator-uniffle/issues/372#issuecomment-1331701533

   Feel free to discuss more. We also currently suffer from single partitions whose data is too large, and we need this so that jobs fail fast, which improves stability.


[GitHub] [incubator-uniffle] zuston commented on issue #372: [Improvement] Mark app data lost when encountering events dropped

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #372:
URL: https://github.com/apache/incubator-uniffle/issues/372#issuecomment-1330146133

   > > @xianjingfeng @jerqi PTAL. I think this is an improvement.
   > 
   > With multiple replicas, this won't be suitable, because a single replica doesn't need to store the complete data.
   
   I missed this point. So maybe we should make the job fail fast only when a single replica is used?


[GitHub] [incubator-uniffle] jerqi commented on issue #372: [Improvement] Mark app data lost when encountering events dropped

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #372:
URL: https://github.com/apache/incubator-uniffle/issues/372#issuecomment-1330144710

   > @xianjingfeng @jerqi PTAL. I think this is an improvement.
    
   With multiple replicas, this won't be suitable, because a single replica doesn't need to store the complete data.


[GitHub] [incubator-uniffle] jerqi commented on issue #372: [Improvement] Mark app data lost when encountering events dropped

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #372:
URL: https://github.com/apache/incubator-uniffle/issues/372#issuecomment-1331708640

   > Anyway, this issue is still reasonable.
   
   Yes, you're right.
   


[GitHub] [incubator-uniffle] zuston commented on issue #372: [Improvement] Mark app data lost when encountering events dropped

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #372:
URL: https://github.com/apache/incubator-uniffle/issues/372#issuecomment-1331689053

   We could carry this replica concept in ShuffleTaskInfo, set when the client registers with the shuffle-server.
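
As a rough illustration of this suggestion, the sketch below records a replica count at registration time and marks an app as lost on a dropped event only when it runs with a single replica. `ReplicaAwareFailFast` and its nested `AppInfo` are hypothetical stand-ins; they do not reflect the real `ShuffleTaskInfo` class or the registration RPC, and the `replicaCount` parameter is an assumption about what the client would send.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: per-app state that remembers the replica count
// supplied at registration. Not the actual Uniffle ShuffleTaskInfo.
final class ReplicaAwareFailFast {
    static final class AppInfo {
        final int replicaCount;     // assumed to be sent by the client
        volatile boolean dataLost;  // set when an event is dropped

        AppInfo(int replicaCount) {
            this.replicaCount = replicaCount;
        }
    }

    private final Map<String, AppInfo> apps = new ConcurrentHashMap<>();

    // Called when a client registers a shuffle for the app.
    void register(String appId, int replicaCount) {
        if (replicaCount < 1) {
            throw new IllegalArgumentException("replicaCount must be >= 1");
        }
        apps.putIfAbsent(appId, new AppInfo(replicaCount));
    }

    // Only single-replica apps are marked lost; with several replicas
    // another server may still hold the data.
    void onEventDropped(String appId) {
        AppInfo info = apps.get(appId);
        if (info != null && info.replicaCount == 1) {
            info.dataLost = true;
        }
    }

    // True if read/write requests for this app should be rejected.
    boolean shouldReject(String appId) {
        AppInfo info = apps.get(appId);
        return info != null && info.dataLost;
    }
}
```

Keeping the replica count on the server side is exactly the design question raised elsewhere in the thread, so this is one possible shape rather than an agreed approach.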


[GitHub] [incubator-uniffle] zuston commented on issue #372: [Improvement] Mark app data lost when encountering events dropped

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #372:
URL: https://github.com/apache/incubator-uniffle/issues/372#issuecomment-1331706515

   > > Feel free to discuss more. We also currently suffer from single partitions whose data is too large, and we need this so that jobs fail fast, which improves stability.
   > 
   > For huge shuffles, we really recommend using MEMORY_LOCALFILE_HDFS. You can set `rss.server.flush.cold.storage.threshold.size` to a large value to reduce the amount of shuffle data written to HDFS.
   
   Yes, I need to evaluate the `MEMORY_LOCALFILE_HDFS` proposal.


[GitHub] [incubator-uniffle] xianjingfeng commented on issue #372: [Improvement] Mark app data lost when encountering events dropped

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on issue #372:
URL: https://github.com/apache/incubator-uniffle/issues/372#issuecomment-1331708711

   I think it is possible that neither the local disk nor HDFS can be written to, perhaps because of network problems or GC. I am OK with this if it applies only to the single-replica case.


[GitHub] [incubator-uniffle] zuston commented on issue #372: [Improvement] Mark app data lost when encountering events dropped

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #372:
URL: https://github.com/apache/incubator-uniffle/issues/372#issuecomment-1330143565

   @xianjingfeng @jerqi PTAL. I think this is an improvement.

