Posted to dev@uniffle.apache.org by GitBox <gi...@apache.org> on 2022/11/03 13:02:03 UTC

[GitHub] [incubator-uniffle] zuston opened a new issue, #297: [Bug] Possible data lost when local storage meets high-watermark

zuston opened a new issue, #297:
URL: https://github.com/apache/incubator-uniffle/issues/297

   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### Describe the bug
   
   When the MEMORY_LOCALFILE storage type is enabled in a Uniffle shuffle server with 4 disks, the first event of (appId:x,shuffleId:x,partition:1) is flushed from memory to a local file.
   When `LocalStorageManager` selects the storage, the disk returned by `localStorages.get(ShuffleStorageUtils.getStorageIndex(localStorages.size(),event.getAppId(),event.getShuffleId(),event.getStartPartition()))` may be marked as corrupted, possibly because it has reached the high-watermark (suppose disk0 is corrupted), so the flush falls back to disk1.
   
   By the time the second event of (appId:x,shuffleId:x,partition:1) is flushed, disk0 has been repaired, so the second event's data is flushed to disk0.
   
   The reading client then fetches the data on disk0 directly and ignores the data on disk1, so the application loses some data.
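
The scenario above can be reproduced in isolation with a minimal sketch. Everything in it is an assumption made for illustration: `StaticSelectionDemo`, `storageIndex`, and `select` are hypothetical names, the hash is a stand-in for `ShuffleStorageUtils.getStorageIndex`, and the "try the next disk" fallback is a guess at the behaviour, not the actual `LocalStorageManager` code.

```java
public class StaticSelectionDemo {
    // Hypothetical stand-in for the hash-based index described in the issue.
    static int storageIndex(int storageCount, String appId, int shuffleId, int partition) {
        String key = appId + "_" + shuffleId + "_" + partition;
        return (key.hashCode() & Integer.MAX_VALUE) % storageCount;
    }

    // Pick the hashed disk; if it is unusable, fall back to the next usable one.
    // The fallback order is an assumption made for this sketch.
    static int select(boolean[] usable, String appId, int shuffleId, int partition) {
        int start = storageIndex(usable.length, appId, shuffleId, partition);
        for (int i = 0; i < usable.length; i++) {
            int candidate = (start + i) % usable.length;
            if (usable[candidate]) {
                return candidate;
            }
        }
        throw new IllegalStateException("no usable storage");
    }

    // Simulates two flush events of the same partition on a 4-disk server.
    // Returns {preferred disk, disk of first flush, disk of second flush}.
    static int[] twoFlushes() {
        boolean[] usable = {true, true, true, true};
        int preferred = storageIndex(usable.length, "app", 0, 1);

        usable[preferred] = false;                 // disk is marked corrupted
        int first = select(usable, "app", 0, 1);   // first event falls back

        usable[preferred] = true;                  // disk recovers
        int second = select(usable, "app", 0, 1);  // second event uses the hashed disk

        return new int[] {preferred, first, second};
    }

    public static void main(String[] args) {
        int[] r = twoFlushes();
        System.out.println("preferred=" + r[0] + " first=" + r[1] + " second=" + r[2]);
    }
}
```

In this sketch the two flushes of one partition end up on different disks, which is the split that a reader consulting only the hashed disk would miss.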
   
   ### Affects Version(s)
   
   master
   
   ### Uniffle Server Log Output
   
   _No response_
   
   ### Uniffle Engine Log Output
   
   _No response_
   
   ### Uniffle Server Configurations
   
   _No response_
   
   ### Uniffle Engine Configurations
   
   _No response_
   
   ### Additional context
   
   Currently, the storage that an event's data is flushed to is determined by the hash of appId, shuffleId, and partitionId together with the number of local storages; this is a static strategy. To solve the corrupted-storage problem, we should record which storage each partition has been flushed to.
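
One way to picture the proposed state tracking is the sketch below. It is only an illustration under stated assumptions: `StickySelectionDemo` and its methods are hypothetical names, the hash is a stand-in for the real index function, and this is not the change that was eventually merged. The idea is that the first flush of a partition records the chosen disk, and later flushes of the same partition reuse it.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StickySelectionDemo {
    private final boolean[] usable;
    // Remembers which disk each (appId, shuffleId, partition) was first flushed to.
    private final Map<String, Integer> assigned = new ConcurrentHashMap<>();

    public StickySelectionDemo(int storageCount) {
        usable = new boolean[storageCount];
        Arrays.fill(usable, true);
    }

    public void setUsable(int index, boolean value) {
        usable[index] = value;
    }

    // Hypothetical stand-in for the static hash-based index.
    public int hashedIndex(String key) {
        return (key.hashCode() & Integer.MAX_VALUE) % usable.length;
    }

    // The first call for a partition picks a usable disk and records it;
    // later calls return the recorded disk even if the hashed disk has recovered.
    public int select(String appId, int shuffleId, int partition) {
        String key = appId + "_" + shuffleId + "_" + partition;
        return assigned.computeIfAbsent(key, k -> {
            int start = hashedIndex(k);
            for (int i = 0; i < usable.length; i++) {
                int candidate = (start + i) % usable.length;
                if (usable[candidate]) {
                    return candidate;
                }
            }
            throw new IllegalStateException("no usable storage");
        });
    }
}
```

The sketch leaves open what should happen when the recorded disk itself becomes unusable later, and how the reader learns which disk was recorded; both would need to be decided in a real design.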
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@uniffle.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-uniffle] zuston commented on issue #297: [Bug] Possible data lost when local storage meets high-watermark

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #297:
URL: https://github.com/apache/incubator-uniffle/issues/297#issuecomment-1302102975

   > Maybe we should merge data in disk0 and disk1?
   
   That is OK, but first we need to know which other storages hold partial data, right? This should be considered in detail.
   
   By the way, I found this bug while reviewing your PR #281 and reading through this part of the code.




[GitHub] [incubator-uniffle] zuston commented on issue #297: [Bug] Potential data lost when local storage meets high-watermark

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #297:
URL: https://github.com/apache/incubator-uniffle/issues/297#issuecomment-1303614932

   > How do we determine that disk is broken?
   
   I think you are right. A broken disk is marked as corrupted based on the read and write check, so the scenario I described above won't happen.
   
   But I think it's necessary to add some test cases to guard this logic, and I will do this.




[GitHub] [incubator-uniffle] jerqi commented on issue #297: [Bug] Potential data lost when local storage meets high-watermark

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #297:
URL: https://github.com/apache/incubator-uniffle/issues/297#issuecomment-1304419280

   Closed by #298.




[GitHub] [incubator-uniffle] zuston commented on issue #297: [Bug] Possible data lost when local storage meets high-watermark

Posted by GitBox <gi...@apache.org>.
zuston commented on issue #297:
URL: https://github.com/apache/incubator-uniffle/issues/297#issuecomment-1302076638

   It would be better to introduce a dynamic storage selection mechanism to solve this.
   
   @jerqi PTAL. I found this problem just by reading the code; if I'm wrong, please tell me. Thanks.




[GitHub] [incubator-uniffle] jerqi closed issue #297: [Bug] Potential data lost when local storage meets high-watermark

Posted by GitBox <gi...@apache.org>.
jerqi closed issue #297: [Bug] Potential data lost when local storage meets high-watermark
URL: https://github.com/apache/incubator-uniffle/issues/297




[GitHub] [incubator-uniffle] jerqi commented on issue #297: [Bug] Potential data lost when local storage meets high-watermark

Posted by GitBox <gi...@apache.org>.
jerqi commented on issue #297:
URL: https://github.com/apache/incubator-uniffle/issues/297#issuecomment-1303045963

   How do we determine that disk is broken?
   I think it means the disk can't be read or written, so I don't think this is a bug.
   In that case we should let the application fail fast. But I already have multiple replicas, so failing fast is unnecessary.




[GitHub] [incubator-uniffle] xianjingfeng commented on issue #297: [Bug] Possible data lost when local storage meets high-watermark

Posted by GitBox <gi...@apache.org>.
xianjingfeng commented on issue #297:
URL: https://github.com/apache/incubator-uniffle/issues/297#issuecomment-1302092752

   Maybe we should merge data in disk0 and disk1?

