Posted to hdfs-dev@hadoop.apache.org by "PengZhang (JIRA)" <ji...@apache.org> on 2013/04/02 08:53:15 UTC

[jira] [Created] (HDFS-4660) Duplicated checksum on DN in a recovered pipeline

PengZhang created HDFS-4660:
-------------------------------

             Summary: Duplicated checksum on DN in a recovered pipeline
                 Key: HDFS-4660
                 URL: https://issues.apache.org/jira/browse/HDFS-4660
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.0.3-alpha
            Reporter: PengZhang
            Priority: Critical


Pipeline: DN1 -> DN2 -> DN3
Stop DN2.

A new node, DN4, is added in the 2nd position of the pipeline (the client settings that trigger this are sketched below):
DN1 -> DN4 -> DN3
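
For context, DN4 is added by the client's replace-datanode-on-failure feature. A minimal sketch of the client settings involved, assuming the stock Hadoop 2.x configuration keys (the class name is mine, and the values shown are the defaults, not a fix):

import org.apache.hadoop.conf.Configuration;

public class ReplaceDatanodeConf {
  public static void main(String[] args) {
    // Sketch only: these standard Hadoop 2.x client keys control whether a
    // replacement DN (DN4 here) is requested when a pipeline node fails.
    Configuration conf = new Configuration();
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
  }
}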

Recover RBW.

DN4 after RBW recovery:
2013-04-01 21:02:31,570 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
2013-04-01 21:02:31,570 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
  getNumBytes() = 134144
  getBytesOnDisk() = 134144
  getVisibleLength()= 134144
The replica ends exactly on a chunk boundary, at chunk 262 (134144/512 = 262).

DN3 after RBW recovery:
2013-04-01 21:02:31,575 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004
2013-04-01 21:02:31,575 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW
  getNumBytes() = 134028
  getBytesOnDisk() = 134028
  getVisibleLength()= 134028
This replica ends mid-chunk, inside chunk 262 (134028 = 261*512 + 396).

The client sends a packet after the pipeline recovery:
offset=133632  len=1008   (the packet covers bytes [133632, 134640))
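
To see why only DN4 is affected, here is the chunk arithmetic for this packet against each replica, as a plain-Java sketch of the numbers in the logs above (not HDFS code; the class name is mine):

public class OverlapMath {
  public static void main(String[] args) {
    long bytesPerChecksum = 512;                  // default io.bytes.per.checksum
    long packetOffset = 133632;                   // packet covers [133632, 134640)

    long dn4OnDisk = 134144;                      // DN4 replica length after recovery
    long dn3OnDisk = 134028;                      // DN3 replica length after recovery

    // Bytes of this packet that are already on disk at each DN:
    System.out.println(dn4OnDisk - packetOffset); // 512 = exactly one full chunk
    System.out.println(dn3OnDisk - packetOffset); // 396 = a partial chunk

    // DN4 ends on a chunk boundary, so chunk 262's checksum is already
    // complete in its meta file; DN3 ends mid-chunk, which the existing
    // partial-chunk handling (adjustCrcFilePosition) covers.
    System.out.println(dn4OnDisk % bytesPerChecksum); // 0   -> chunk-aligned
    System.out.println(dn3OnDisk % bytesPerChecksum); // 396 -> not aligned
  }
}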

DN4 after flush:
2013-04-01 21:02:31,779 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file offset:134640; meta offset:1063
// The meta end position should be ceil(134640/512)*4 + 7 == 263*4 + 7 == 1059, but it is 1063.
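
The expected value follows from the meta file layout: a 7-byte header (BlockMetadataHeader) followed by one 4-byte CRC32 per 512-byte chunk, partial chunk included. A quick check of the numbers, as a sketch (class name is mine):

public class MetaOffsetCheck {
  public static void main(String[] args) {
    long HEADER = 7;   // BlockMetadataHeader size in bytes
    long CSUM = 4;     // CRC32 checksum size in bytes
    long BPC = 512;    // bytes per checksum chunk

    long dataOffset = 134640;                        // file offset at flush
    long numChunks = (dataOffset + BPC - 1) / BPC;   // ceil -> 263 (262 full + 1 partial)
    long expected = HEADER + numChunks * CSUM;       // 1059, which DN3 shows below
    System.out.println(expected);                    // 1059

    long observedOnDN4 = 1063;                       // from the DN4 log line above
    System.out.println(observedOnDN4 - expected);    // 4 = exactly one extra checksum
  }
}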

DN3 after flush
2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005, type=LAST_IN_PIPELINE, downstreams=0:[]: enqueue Packet(seqno=219, lastPacketInBlock=false, offsetInBlock=134640, ackEnqueueNanoTime=8817026136871545)
2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Changing meta file offset of block BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1005 from 1055 to 1051
2013-04-01 21:02:31,782 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: FlushOrsync, file offset:134640; meta offset:1059
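
DN3's "from 1055 to 1051" change is the existing partial-chunk handling working correctly: the checksum of the incomplete last chunk is dropped so it can be recomputed as the chunk grows. The arithmetic, as a sketch (class name is mine):

public class CrcRewindMath {
  public static void main(String[] args) {
    long HEADER = 7, CSUM = 4, BPC = 512;

    long onDisk = 134028;                                 // DN3 replica length
    long fullChunks = onDisk / BPC;                       // 261 complete chunks
    long withPartial = HEADER + (fullChunks + 1) * CSUM;  // 1055 (partial chunk included)
    long rewound = HEADER + fullChunks * CSUM;            // 1051 (partial checksum dropped)
    System.out.println(withPartial + " -> " + rewound);   // matches "from 1055 to 1051"
  }
}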

After checking the meta file on DN4, I found that the checksum of chunk 262 is duplicated, but the data is not.
Later, after the block was finalized, DN4's block scanner detected the corrupt block and reported it to the NN. The NN then sent a command to delete this block and re-replicate it from another DN in the pipeline to restore the replication factor.

I think this is because BlockReceiver skips data bytes that have already been written, but does not skip the corresponding checksum bytes. And the function adjustCrcFilePosition only handles the last non-complete chunk, not this situation, where the already-written portion of the packet ends exactly on a chunk boundary.
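
A hypothetical sketch of the kind of adjustment I mean; all names and structure are illustrative, not the actual BlockReceiver code:

import java.nio.ByteBuffer;

// Sketch only: when a packet overlaps data already on disk after a pipeline
// recovery, the checksums of chunks that are already complete on disk must
// be skipped along with the data bytes.
public class SkipSketch {
  static final int BYTES_PER_CHECKSUM = 512;
  static final int CHECKSUM_SIZE = 4; // CRC32

  static void skipAlreadyWritten(ByteBuffer dataBuf, ByteBuffer checksumBuf,
                                 long onDiskLen, long firstByteInBlock) {
    if (onDiskLen <= firstByteInBlock) {
      return; // no overlap with data already on disk
    }
    long overlap = onDiskLen - firstByteInBlock;        // 512 in the DN4 logs above

    // Existing behavior: skip data bytes that are already on disk.
    dataBuf.position(dataBuf.position() + (int) overlap);

    // Missing behavior: also skip checksums of chunks already complete on
    // disk, otherwise they are appended to the meta file a second time.
    long completeChunks = overlap / BYTES_PER_CHECKSUM; // 1 (chunk 262)
    checksumBuf.position(checksumBuf.position()
        + (int) (completeChunks * CHECKSUM_SIZE));      // skip 4 bytes
  }
}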


