Posted to hdfs-dev@hadoop.apache.org by "Wei-Chiu Chuang (JIRA)" <ji...@apache.org> on 2016/11/20 06:59:58 UTC

[jira] [Created] (HDFS-11160) VolumeScanner incorrectly reports good replicas as corrupt due to race condition

Wei-Chiu Chuang created HDFS-11160:
--------------------------------------

             Summary: VolumeScanner incorrectly reports good replicas as corrupt due to race condition
                 Key: HDFS-11160
                 URL: https://issues.apache.org/jira/browse/HDFS-11160
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
         Environment: CDH5.7.4
            Reporter: Wei-Chiu Chuang
            Assignee: Wei-Chiu Chuang


Due to a race condition initially reported in HDFS-6804, VolumeScanner may erroneously detect good replicas as corrupt. This is serious because it can lead to data loss if all replicas of a block are declared corrupt.

We are investigating an incident that caused a very high block corruption rate in a relatively small cluster. Initially, we thought HDFS-11056 was to blame. However, after applying HDFS-11056, we are still seeing VolumeScanner report corrupt replicas.

It turns out that if a replica is being appended to while VolumeScanner is scanning it, VolumeScanner may use the new checksum to verify the old data, causing a checksum mismatch.
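A minimal, self-contained sketch of that interleaving (hypothetical simplified model, not the actual Hadoop classes): the scanner snapshots the last chunk's data, an append then updates both the data and the stored checksum, and the scanner finally verifies the old data against the new checksum.

```java
import java.util.zip.CRC32;

// Hypothetical model of the race: a "replica" holds chunk data plus a stored
// checksum of its last chunk. The scanner reads the data first, the appender
// then rewrites data + checksum, and the scanner finally reads the (newer)
// checksum -- so it ends up verifying old data against a new checksum.
public class ScanAppendRace {
    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    public static void main(String[] args) {
        byte[] oldData = "chunk-v1".getBytes();
        long storedChecksum = crc(oldData);       // on-disk checksum for v1

        // 1. Scanner snapshots the data of the last chunk.
        byte[] scannerData = oldData.clone();

        // 2. An append lands in between: data and checksum both change.
        byte[] newData = "chunk-v1+appended".getBytes();
        storedChecksum = crc(newData);

        // 3. Scanner now reads the checksum and compares it to the old data:
        //    mismatch, even though the replica on disk is perfectly good.
        boolean mismatch = crc(scannerData) != storedChecksum;
        System.out.println("false corruption reported: " + mismatch);
    }
}
```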

I have a unit test to reproduce the error. Will attach later.

To fix it, I propose that FinalizedReplica should also carry a lastChecksum field, like ReplicaBeingWritten does, and that BlockSender should use the in-memory lastChecksum to verify the partial data in the last chunk on disk. Filing this jira to discuss a good fix for this issue.
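A hedged sketch of that direction, with hypothetical class and method names (the real FinalizedReplica and BlockSender have different shapes): the finalized replica captures the checksum of its last, possibly partial, chunk in memory at finalization time, and the sender verifies against that in-memory value instead of whatever is currently in the meta file, which a concurrent append may have overwritten.

```java
import java.util.zip.CRC32;

// Hypothetical sketch: the finalized replica remembers the checksum of its
// last partial chunk, so verification does not depend on re-reading a meta
// file that an append may have updated underneath the scanner.
public class FinalizedReplicaSketch {
    private final long lastChecksum;   // checksum of the last partial chunk,
                                       // captured when the replica finalized

    FinalizedReplicaSketch(byte[] lastChunkAtFinalization) {
        this.lastChecksum = crc(lastChunkAtFinalization);
    }

    // BlockSender-style check of the last chunk: compare the data against the
    // in-memory checksum rather than the current contents of the meta file.
    boolean verifyLastChunk(byte[] dataReadFromDisk) {
        return crc(dataReadFromDisk) == lastChecksum;
    }

    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    public static void main(String[] args) {
        FinalizedReplicaSketch r =
            new FinalizedReplicaSketch("partial-chunk".getBytes());
        System.out.println("verified: " + r.verifyLastChunk("partial-chunk".getBytes()));
    }
}
```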



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org