Posted to hdfs-dev@hadoop.apache.org by "Wei-Chiu Chuang (JIRA)" <ji...@apache.org> on 2016/11/20 06:59:58 UTC

[jira] [Created] (HDFS-11160) VolumeScanner incorrectly reports good replicas as corrupt due to race condition

Wei-Chiu Chuang created HDFS-11160:
--------------------------------------

             Summary: VolumeScanner incorrectly reports good replicas as corrupt due to race condition
                 Key: HDFS-11160
                 URL: https://issues.apache.org/jira/browse/HDFS-11160
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
         Environment: CDH5.7.4
            Reporter: Wei-Chiu Chuang
            Assignee: Wei-Chiu Chuang


Due to a race condition initially reported in HDFS-6804, VolumeScanner may erroneously detect good replicas as corrupt. This is serious because it can lead to data loss if all replicas of a block are declared corrupt.

We are investigating an incident that caused a very high block corruption rate in a relatively small cluster. Initially, we thought HDFS-11056 was to blame. However, after applying HDFS-11056, we are still seeing VolumeScanner report corrupt replicas.

It turns out that if a replica is being appended to while VolumeScanner is scanning it, VolumeScanner may use the new checksum to verify the old data, causing a checksum mismatch.
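A minimal, self-contained sketch of that interleaving (hypothetical simplified model, not the actual Hadoop classes): the scanner snapshots the last chunk's data, an append then updates both the data and the stored checksum, and the scanner finally verifies the old data against the new checksum.

```java
import java.util.zip.CRC32;

// Hypothetical model of the race: a "replica" holds chunk data plus a stored
// checksum of its last chunk. The scanner reads the data first, the appender
// then rewrites data + checksum, and the scanner finally reads the (newer)
// checksum -- so it ends up verifying old data against a new checksum.
public class ScanAppendRace {
    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    public static void main(String[] args) {
        byte[] oldData = "chunk-v1".getBytes();
        long storedChecksum = crc(oldData);       // on-disk checksum for v1

        // 1. Scanner snapshots the data of the last chunk.
        byte[] scannerData = oldData.clone();

        // 2. An append lands in between: data and checksum both change.
        byte[] newData = "chunk-v1+appended".getBytes();
        storedChecksum = crc(newData);

        // 3. Scanner now reads the checksum and compares it to the old data:
        //    mismatch, even though the replica on disk is perfectly good.
        boolean mismatch = crc(scannerData) != storedChecksum;
        System.out.println("false corruption reported: " + mismatch);
    }
}
```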

I have a unit test to reproduce the error. Will attach later.

To fix it, I propose that FinalizedReplica should also carry a lastChecksum field, like ReplicaBeingWritten does, and that BlockSender should use the in-memory lastChecksum to verify the partial data in the last chunk on disk. Filing this jira to discuss a good fix for this issue.
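A hedged sketch of that direction, with hypothetical class and method names (the real FinalizedReplica and BlockSender have different shapes): the finalized replica captures the checksum of its last, possibly partial, chunk in memory at finalization time, and the sender verifies against that in-memory value instead of whatever is currently in the meta file, which a concurrent append may have overwritten.

```java
import java.util.zip.CRC32;

// Hypothetical sketch: the finalized replica remembers the checksum of its
// last partial chunk, so verification does not depend on re-reading a meta
// file that an append may have updated underneath the scanner.
public class FinalizedReplicaSketch {
    private final long lastChecksum;   // checksum of the last partial chunk,
                                       // captured when the replica finalized

    FinalizedReplicaSketch(byte[] lastChunkAtFinalization) {
        this.lastChecksum = crc(lastChunkAtFinalization);
    }

    // BlockSender-style check of the last chunk: compare the data against the
    // in-memory checksum rather than the current contents of the meta file.
    boolean verifyLastChunk(byte[] dataReadFromDisk) {
        return crc(dataReadFromDisk) == lastChecksum;
    }

    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data);
        return c.getValue();
    }

    public static void main(String[] args) {
        FinalizedReplicaSketch r =
            new FinalizedReplicaSketch("partial-chunk".getBytes());
        System.out.println("verified: " + r.verifyLastChunk("partial-chunk".getBytes()));
    }
}
```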



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org