Posted to issues@ozone.apache.org by "Shashikant Banerjee (Jira)" <ji...@apache.org> on 2021/08/23 06:37:00 UTC

[jira] [Updated] (HDDS-5619) Ozone data corruption issue on Datanodes

     [ https://issues.apache.org/jira/browse/HDDS-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Banerjee updated HDDS-5619:
--------------------------------------
    Summary: Ozone data corruption issue on Datanodes  (was: Ozone data corruption issue on follower node)

> Ozone data corruption issue on Datanodes
> ----------------------------------------
>
>                 Key: HDDS-5619
>                 URL: https://issues.apache.org/jira/browse/HDDS-5619
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>            Reporter: Aravindan Vijayan
>            Assignee: Shashikant Banerjee
>            Priority: Blocker
>              Labels: pull-request-available
>         Attachments: repro.patch
>
>
> A data corruption issue was recently observed in one of the clusters, where container replicas were found to be corrupted. The primary cause was a race condition between the readStateMachine and writeStateMachine threads, which were reading and writing the chunks concurrently. The following logs confirm this:
> {code:java}
> INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56510
> 2021-08-11 2028,524 [ChunkWriter-1-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56513
> 2021-08-11 2028,524 [ChunkWriter-1-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-1-0 file 107544261427200162_chunk_2 Address 0.0.0.0:56507
> 2021-08-11 2028,542 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
> 2021-08-11 2028,543 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
> 2021-08-11 2028,544 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56510
> 2021-08-11 2028,545 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56513
> 2021-08-11 2028,549 [ChunkWriter-0-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
> 2021-08-11 2028,550 [ChunkWriter-0-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
> 2021-08-11 2028,551 [ChunkWriter-0-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:readStateMachineData(598)) - Thread for Read ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56510
> 2021-08-11 2028,553 [ChunkWriter-0-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-0-0 file 107544261427200162_chunk_4 Address 0.0.0.0:56513
> 2021-08-11 2028,648 [ChunkWriter-3-0] INFO  ratis.ContainerStateMachine (ContainerStateMachine.java:lambda$handleWriteChunk$2(467)) - Thread for Write ChunkWriter-3-0 file 107544261427200162_chunk_3 Address 0.0.0.0:56507
> {code}
> Until now, the assumption was that readStateMachine and writeStateMachine operations are executed serially on a single-thread executor selected by a hash function on the BlockId, but this does not seem to hold in practice.
> When a file channel is written and read by concurrent threads, the result can be sparse files, reads that return all 0's, and so on; the outcome becomes unpredictable and the data ends up corrupted.
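
The serial-execution assumption described above can be illustrated with a minimal sketch. This is not Ozone's ContainerStateMachine code: the class `ChunkOpSerializer`, its methods, and the use of the chunk file name as the key are all hypothetical, chosen only to show the idea of routing every read and write for one key onto the same single-thread executor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical helper (not Ozone's actual code): every operation submitted
// with the same key runs on the same single-thread executor, in submission order.
final class ChunkOpSerializer implements AutoCloseable {
  private final ExecutorService[] executors;

  ChunkOpSerializer(int numExecutors) {
    executors = new ExecutorService[numExecutors];
    for (int i = 0; i < numExecutors; i++) {
      executors[i] = Executors.newSingleThreadExecutor();
    }
  }

  // floorMod keeps the index non-negative even when hashCode() is negative.
  int indexFor(String key) {
    return Math.floorMod(key.hashCode(), executors.length);
  }

  // Reads and writes must both pass through here with the same key.
  // Routing reads through a different key or executor reintroduces the race.
  <T> CompletableFuture<T> submit(String key, Supplier<T> op) {
    return CompletableFuture.supplyAsync(op, executors[indexFor(key)]);
  }

  @Override
  public void close() {
    for (ExecutorService e : executors) {
      e.shutdown();
    }
  }

  public static void main(String[] args) {
    String key = "107544261427200162_chunk_2";
    // Only ever touched by one executor thread, then read after join().
    List<String> order = new ArrayList<>();
    try (ChunkOpSerializer s = new ChunkOpSerializer(4)) {
      CompletableFuture<Boolean> w = s.submit(key, () -> order.add("write"));
      CompletableFuture<Boolean> r = s.submit(key, () -> order.add("read"));
      w.join();
      r.join();
    }
    System.out.println(order);
  }
}
```

In this sketch the write is always applied before the read because both tasks land on the same executor. The guarantee disappears as soon as the two paths pick executors differently, which is the kind of mismatch the logs above suggest.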



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
