Posted to mapreduce-user@hadoop.apache.org by Dave Latham <la...@davelink.net> on 2013/07/02 00:00:42 UTC

data loss after cluster wide power loss

We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On Saturday the
data center we were in had a total power failure and the cluster went down
hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.  We
recovered the data in question from our secondary datacenter, but I'm
trying to understand what happened and whether this is a bug in HDFS that
should be fixed.

From what I can tell the file was created and closed by the dfs client
(hbase).  Then HBase renamed it into a new directory and deleted some other
files containing the same data.  Then the cluster lost power.  After the
cluster was restarted, the datanodes reported into the namenode but the
blocks for this file appeared as "blocks being written" - the namenode
rejected them and the datanodes deleted the blocks.  At this point there
were no replicas for the blocks and the files were marked CORRUPT.  The
underlying file systems are ext3.  Some questions that I would love to get
answers for, if anyone with a deeper understanding of HDFS can chime in:

 - Is this a known scenario where data loss is expected?  (I found
HDFS-1539 but that seems different)
 - When are blocks moved from blocksBeingWritten to current?  Does that
happen before a file close operation is acknowledged to an HDFS client?
 - Could it be that the DataNodes actually moved the blocks to current but
after the restart ext3 rewound state somehow (forgive my ignorance of
underlying file system behavior)?
 - Is there any other explanation for how this can happen?

Here is a sequence of selected relevant log lines from the RS (HBase Region
Server), NN (NameNode), and DN (DataNode - one example of the 3 in question).  It
includes everything that mentions the block in question in the NameNode and
one DataNode log.  Please let me know if there is more information that
would be helpful.

RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
Creating
file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
with permission=rwxrwxrwx
NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.allocateBlock:
/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
blk_1395839728632046111_357084589
DN 2013-06-29 11:16:06,832 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
10.0.5.237:50010
NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to
blk_1395839728632046111_357084589 size 25418340
NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to
blk_1395839728632046111_357084589 size 25418340
NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to
blk_1395839728632046111_357084589 size 25418340
DN 2013-06-29 11:16:11,385 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327
DN 2013-06-29 11:16:11,385 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
block blk_1395839728632046111_357084589 terminating
NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
Removing lease on  file
/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
from client DFSClient_hb_rs_hs745,60020,1372470111932
NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.completeFile: file
/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
is closed by DFSClient_hb_rs_hs745,60020,1372470111932
RS 2013-06-29 11:16:11,393 INFO org.apache.hadoop.hbase.regionserver.Store:
Renaming compacted file at
hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
to
hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
RS 2013-06-29 11:16:11,505 INFO org.apache.hadoop.hbase.regionserver.Store:
Completed major compaction of 7 file(s) in n of
users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into
6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m

-------  CRASH, RESTART ---------

NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: addStoredBlock request received for
blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was
rejected: Reported as block being written but is a block of closed file.
NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet
of 10.0.6.1:50010
NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: addStoredBlock request received for
blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but was
rejected: Reported as block being written but is a block of closed file.
NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet
of 10.0.5.237:50010
NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addStoredBlock: addStoredBlock request received for
blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but was
rejected: Reported as block being written but is a block of closed file.
NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet
of 10.0.6.24:50010
(note the clock on the server running DN is wrong after restart.  I believe
timestamps are off by 6 hours:)
DN 2013-06-29 06:07:22,877 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
blk_1395839728632046111_357084589 file
/data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for
deletion
DN 2013-06-29 06:07:24,952 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
blk_1395839728632046111_357084589 at file
/data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111


Thanks,
Dave

Re: data loss after cluster wide power loss

Posted by Dave Latham <la...@davelink.net>.
Thanks for the detailed information, Michael, Colin, Suresh, Kihwal.

It looks like we're on CentOS 5.7 (kernel 2.6.18-274.el5) so if the fix was
included in 5.4 then it sounds like we should have it.  I don't believe
we're using LVM.

It sounds like HDFS could improve handling of this scenario by fsyncing
directories after finalizing blocks or perhaps accepting blocksBeingWritten
that are reported later.  At least as an option for clients, like HBase,
that need to rely on durability.

Dave
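
A minimal sketch, for illustration, of the "fsync the containing directory
after the rename" idea raised here and explained further in Kihwal's reply
below.  It uses plain Java NIO; the class name and paths are hypothetical,
and this is not HDFS code:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;
    import java.nio.file.StandardOpenOption;

    public class DurableRename {
      // Rename src to dst, then fsync the destination's parent directory so
      // that the rename itself is on disk before success is reported.
      static void renameDurably(Path src, Path dst) throws IOException {
        Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
        // Opening the directory read-only and calling force() issues an
        // fsync() on the directory on Linux; some JDK/OS combinations may
        // reject this, so callers may need to tolerate an IOException here.
        try (FileChannel dir = FileChannel.open(dst.getParent(),
                                                StandardOpenOption.READ)) {
          dir.force(true);
        }
      }

      public static void main(String[] args) throws IOException {
        renameDurably(
            Paths.get("/data/hadoop/dfs/data/blocksBeingWritten/blk_example"),
            Paths.get("/data/hadoop/dfs/data/current/blk_example"));
      }
    }

This is just the POSIX open-the-directory-and-fsync idiom expressed in Java;
it makes the directory entry recording the rename durable, not the file's
data itself.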



On Wed, Jul 3, 2013 at 11:57 AM, Kihwal Lee <ki...@yahoo-inc.com> wrote:

> For the ext3 bug Colin mentioned, see
> https://bugzilla.redhat.com/show_bug.cgi?id=592961. This was fixed in
> 2.6.32 and backported to RHEL 5.4 (or CentOS). This has more to do with
> file data and affects the NN more. Since the NN preallocates blocks for
> edits, almost all data writes are done without modifying the file length.
> If fsync() loses the race to pdflush, the edit log content won't be
> flushed. I am glad this bug was fixed.
>
> Regardless of this bug, renames will be journaled but not necessarily
> committed before returning. If power fails before the transaction gets
> committed, the transaction can be lost. Journaling makes sure the file
> system comes back in a consistent state, but not all changes will be
> there. To make a rename durable, the directory can be opened and
> fsync()'ed. I think both source and destination are recorded in one
> transaction, so fsyncing the destination directory should be enough.
>
> When RWR blocks are reported, the NN can probably accept and finalize them
> if the block size is as expected.
>
> Kihwal
>
>
>
>
>
>
> On 7/3/13 10:12 AM, "Colin McCabe" <cm...@alumni.cmu.edu> wrote:
>
> >On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com>
> >wrote:
> >> Dave,
> >>
> >> Thanks for the detailed email. Sorry I did not read all the details you
> >> had sent earlier completely (on my phone). As you said, this is not
> >> related to the data loss issue around the HBase log and hsync. I think
> >> you are right; the rename operation itself might not have hit the disk.
> >> I think we should either ensure the metadata operation is synced on the
> >> datanode or handle it being reported as blockBeingWritten. Let me spend
> >> some time debugging this issue.
> >
> >In theory, ext3 is journaled, so all metadata operations should be
> >durable in the case of a power outage.  It is only data operations
> >that should be possible to lose.  It is the same for ext4.  (Assuming
> >you are not using nonstandard mount options.)
> >
> >In practice, it is possible that your hard disks didn't actually
> >persist the data that they said they did.  Rumor has it that some
> >drives ignore the SATA FLUSH CACHE command in some cases, since it
> >makes them look bad in benchmarks.  In that case, there is nothing the
> >filesystem or any other software can do.
> >
> >There was also a bug in older Linux kernels where the kernel would not
> >actually send FLUSH CACHE.  Since the poster is using ext3 and
> >hadoop-1, it's possible he's using an antique kernel as well.  I
> >know for sure this affected LVM -- it used to ignore barriers until
> >fairly recently.
> >
> >In Ceph, we used to recommend disabling the hard drive write cache if
> >your kernel was older than 2.6.33.  You can read the recommendation
> >for yourself here:
> >
> http://ceph.com/docs/master/rados/configuration/filesystem-recommendations
> >/
> > This will have an impact on performance, however.
> >
> >An uninterruptible power supply is not a bad idea.
> >
> >I am curious:
> >what kernel version are you using?
> >are you using LVM?
> >
> >Colin
> >
> >
> >>
> >> One surprising thing is, all the replicas were reported as
> >> blockBeingWritten.
> >>
> >> Regards,
> >> Suresh
> >>
> >>
> >> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net>
> wrote:
> >>>
> >>> (Removing hbase list and adding hdfs-dev list as this is pretty
> >>>internal
> >>> stuff).
> >>>
> >>> Reading through the code a bit:
> >>>
> >>> FSDataOutputStream.close calls
> >>> DFSOutputStream.close calls
> >>> DFSOutputStream.closeInternal
> >>>  - sets currentPacket.lastPacketInBlock = true
> >>>  - then calls
> >>> DFSOutputStream.flushInternal
> >>>  - enqueues current packet
> >>>  - waits for ack
> >>>
> >>> BlockReceiver.run
> >>>  - if (lastPacketInBlock && !receiver.finalized) calls
> >>> FSDataset.finalizeBlock calls
> >>> FSDataset.finalizeBlockInternal calls
> >>> FSVolume.addBlock calls
> >>> FSDir.addBlock calls
> >>> FSDir.addBlock
> >>>  - renames block from "blocksBeingWritten" tmp dir to "current" dest
> >>>dir
> >>>
> >>> This looks to me like a synchronous chain from a DFS client closing a
> >>> file to the block files being moved from blocksBeingWritten to the
> >>> current dir, so that once the file is closed the block files should be
> >>> in the proper directory - even if the contents of the file are still in
> >>> the OS buffer rather than synced to disk.  It's only after this moving
> >>> of blocks that NameNode.completeFile is called.  There are several
> >>> conditions and loops in there, so without a greater understanding of
> >>> the code I'm not certain this chain is fully reliable in all cases.
> >>>
> >>> Could it be the case that the rename operation itself is not synced and
> >>> that ext3 lost the fact that the block files were moved?
> >>> Or is there a bug in the close file logic that for some reason the
> >>>block
> >>> files are not always moved into place when a file is closed?
> >>>
> >>> Thanks for your patience,
> >>> Dave
> >>>
> >>>
> >>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net>
> >>>wrote:
> >>>>
> >>>> Thanks for the response, Suresh.
> >>>>
> >>>> I'm not sure that I understand the details properly.  From my reading
> >>>> of HDFS-744, the hsync API would allow a client to make sure that at
> >>>> any point in time its writes so far have hit the disk.  For example,
> >>>> HBase could apply an fsync after adding some edits to its WAL to
> >>>> ensure those edits are fully durable for a file which is still open.
> >>>>
> >>>> However, in this case the dfs file was closed and even renamed.  Is it
> >>>> the case that even after a dfs file is closed and renamed, the data
> >>>> blocks might still not be synced and would still be stored by the
> >>>> datanode in "blocksBeingWritten" rather than in "current"?  If that is
> >>>> the case, would it be better for the NameNode not to reject replicas
> >>>> that are in blocksBeingWritten, especially if it doesn't have any
> >>>> other replicas available?
> >>>>
> >>>> Dave
> >>>>
> >>>>
> >>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas
> >>>><su...@hortonworks.com>
> >>>> wrote:
> >>>>>
> >>>>> Yes this is a known issue.
> >>>>>
> >>>>> The HDFS part of this was addressed in
> >>>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and
> >>>>> is not available in the 1.x release. I think HBase does not use this
> >>>>> API yet.
> >>>>>
> >>>>>
> >>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
> >>>>>wrote:
> >>>>>
> >>>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On
> >>>>>Saturday
> >>>>> > the
> >>>>> > data center we were in had a total power failure and the cluster
> >>>>>went
> >>>>> > down
> >>>>> > hard.  When we brought it back up, HDFS reported 4 files as
> >>>>>CORRUPT.
> >>>>> > We
> >>>>> > recovered the data in question from our secondary datacenter, but
> >>>>>I'm
> >>>>> > trying to understand what happened and whether this is a bug in
> >>>>>HDFS
> >>>>> > that
> >>>>> > should be fixed.
> >>>>> >
> >>>>> > From what I can tell the file was created and closed by the dfs
> >>>>>client
> >>>>> > (hbase).  Then HBase renamed it into a new directory and deleted
> >>>>>some
> >>>>> > other
> >>>>> > files containing the same data.  Then the cluster lost power.
> >>>>>After
> >>>>> > the
> >>>>> > cluster was restarted, the datanodes reported into the namenode but
> >>>>> > the
> >>>>> > blocks for this file appeared as "blocks being written" - the
> >>>>>namenode
> >>>>> > rejected them and the datanodes deleted the blocks.  At this point
> >>>>> > there
> >>>>> > were no replicas for the blocks and the files were marked CORRUPT.
> >>>>> > The
> >>>>> > underlying file systems are ext3.  Some questions that I would love
> >>>>> > get
> >>>>> > answers for if anyone with deeper understanding of HDFS can chime
> >>>>>in:
> >>>>> >
> >>>>> >  - Is this a known scenario where data loss is expected?  (I found
> >>>>> > HDFS-1539 but that seems different)
> >>>>> >  - When are blocks moved from blocksBeingWritten to current?  Does
> >>>>> > that
> >>>>> > happen before a file close operation is acknowledged to a hdfs
> >>>>>client?
> >>>>> >  - Could it be that the DataNodes actually moved the blocks to
> >>>>>current
> >>>>> > but
> >>>>> > after the restart ext3 rewound state somehow (forgive my ignorance
> >>>>>of
> >>>>> > underlying file system behavior)?
> >>>>> >  - Is there any other explanation for how this can happen?
> >>>>> >
> >>>>> > Here is a sequence of selected relevant log lines from the RS
> >>>>>(HBase
> >>>>> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
> >>>>> > question).  It includes everything that mentions the block in
> >>>>>question
> >>>>> > in
> >>>>> > the NameNode and one DataNode log.  Please let me know if this more
> >>>>> > information that would be helpful.
> >>>>> >
> >>>>> > RS 2013-06-29 11:16:06,812 DEBUG
> >>>>>org.apache.hadoop.hbase.util.FSUtils:
> >>>>> > Creating
> >>>>> >
> >>>>> >
> >>>>>file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.t
> >>>>>mp/6e0cc30af6e64e56ba5a539fdf159c4c
> >>>>> > with permission=rwxrwxrwx
> >>>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.allocateBlock:
> >>>>> >
> >>>>> >
> >>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b
> >>>>>a5a539fdf159c4c.
> >>>>> > blk_1395839728632046111_357084589
> >>>>> > DN 2013-06-29 11:16:06,832 INFO
> >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> >>>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
> >>>>> > 10.0.5.237:50010
> >>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is
> >>>>>added
> >>>>> > to
> >>>>> > blk_1395839728632046111_357084589 size 25418340
> >>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
> >>>>>added
> >>>>> > to
> >>>>> > blk_1395839728632046111_357084589 size 25418340
> >>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
> >>>>>added
> >>>>> > to
> >>>>> > blk_1395839728632046111_357084589 size 25418340
> >>>>> > DN 2013-06-29 11:16:11,385 INFO
> >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> >>>>> > blk_1395839728632046111_357084589 of size 25418340 from
> >>>>> > /10.0.5.237:14327
> >>>>>
> >>>>> > DN 2013-06-29 11:16:11,385 INFO
> >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2
> >>>>>for
> >>>>> > block blk_1395839728632046111_357084589 terminating
> >>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > Removing lease on  file
> >>>>> >
> >>>>> >
> >>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b
> >>>>>a5a539fdf159c4c
> >>>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
> >>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > DIR*
> >>>>> > NameSystem.completeFile: file
> >>>>> >
> >>>>> >
> >>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b
> >>>>>a5a539fdf159c4c
> >>>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> >>>>> > RS 2013-06-29 11:16:11,393 INFO
> >>>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted
> >>>>>file at
> >>>>> >
> >>>>> >
> >>>>>hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e
> >>>>>0cc30af6e64e56ba5a539fdf159c4c
> >>>>> > to
> >>>>> >
> >>>>> >
> >>>>>hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc
> >>>>>30af6e64e56ba5a539fdf159c4c
> >>>>> > RS 2013-06-29 11:16:11,505 INFO
> >>>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
> >>>>>compaction
> >>>>> > of 7
> >>>>> > file(s) in n of
> >>>>> >
> >>>>>users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
> >>>>> > into
> >>>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store
> >>>>>is
> >>>>> > 24.2m
> >>>>> >
> >>>>> > -------  CRASH, RESTART ---------
> >>>>> >
> >>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
> >>>>>but
> >>>>> > was
> >>>>> > rejected: Reported as block being written but is a block of closed
> >>>>> > file.
> >>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>>> > invalidSet
> >>>>> > of 10.0.6.1:50010
> >>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size
> 16971264
> >>>>> > but
> >>>>> > was rejected: Reported as block being written but is a block of
> >>>>>closed
> >>>>> > file.
> >>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>>> > invalidSet
> >>>>> > of 10.0.5.237:50010
> >>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088
> >>>>>but
> >>>>> > was rejected: Reported as block being written but is a block of
> >>>>>closed
> >>>>> > file.
> >>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>>> > invalidSet
> >>>>> > of 10.0.6.24:50010
> >>>>> > (note the clock on the server running DN is wrong after restart.  I
> >>>>> > believe timestamps are off by 6 hours:)
> >>>>> > DN 2013-06-29 06:07:22,877 INFO
> >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> >>>>> > blk_1395839728632046111_357084589 file
> >>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>>>for
> >>>>> > deletion
> >>>>> > DN 2013-06-29 06:07:24,952 INFO
> >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> >>>>> > blk_1395839728632046111_357084589 at file
> >>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>>> >
> >>>>> >
> >>>>> > Thanks,
> >>>>> > Dave
> >>>>> >
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> http://hortonworks.com/download/
> >>>>
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> http://hortonworks.com/download/
>
>
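
For completeness, a minimal sketch of using the hsync API mentioned in the
quoted thread above (HDFS-744, available from 2.0.2-alpha onwards).  The
path is hypothetical and error handling is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HsyncExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path wal = new Path("/example/wal-000001");  // hypothetical path

        FSDataOutputStream out = fs.create(wal);
        out.writeBytes("edit-1\n");
        // hflush() only pushes data to the DataNodes' memory; hsync() also
        // asks each DataNode to sync the block data to disk before the call
        // returns.
        out.hsync();
        out.close();
      }
    }

Note that hsync() protects data in a file that is still open; it would not
by itself have made the finalize/rename step discussed in this thread
durable.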

> >>>> available?
> >>>>
> >>>> Dave
> >>>>
> >>>>
> >>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas
> >>>><su...@hortonworks.com>
> >>>> wrote:
> >>>>>
> >>>>> Yes this is a known issue.
> >>>>>
> >>>>> The HDFS part of this was addressed in
> >>>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and
> is
> >>>>> not
> >>>>> available in 1.x  release. I think HBase does not use this API yet.
> >>>>>
> >>>>>
> >>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
> >>>>>wrote:
> >>>>>
> >>>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On
> >>>>>Saturday
> >>>>> > the
> >>>>> > data center we were in had a total power failure and the cluster
> >>>>>went
> >>>>> > down
> >>>>> > hard.  When we brought it back up, HDFS reported 4 files as
> >>>>>CORRUPT.
> >>>>> > We
> >>>>> > recovered the data in question from our secondary datacenter, but
> >>>>>I'm
> >>>>> > trying to understand what happened and whether this is a bug in
> >>>>>HDFS
> >>>>> > that
> >>>>> > should be fixed.
> >>>>> >
> >>>>> > From what I can tell the file was created and closed by the dfs
> >>>>>client
> >>>>> > (hbase).  Then HBase renamed it into a new directory and deleted
> >>>>>some
> >>>>> > other
> >>>>> > files containing the same data.  Then the cluster lost power.
> >>>>>After
> >>>>> > the
> >>>>> > cluster was restarted, the datanodes reported into the namenode but
> >>>>> > the
> >>>>> > blocks for this file appeared as "blocks being written" - the
> >>>>>namenode
> >>>>> > rejected them and the datanodes deleted the blocks.  At this point
> >>>>> > there
> >>>>> > were no replicas for the blocks and the files were marked CORRUPT.
> >>>>> > The
> >>>>> > underlying file systems are ext3.  Some questions that I would love
> >>>>> > get
> >>>>> > answers for if anyone with deeper understanding of HDFS can chime
> >>>>>in:
> >>>>> >
> >>>>> >  - Is this a known scenario where data loss is expected?  (I found
> >>>>> > HDFS-1539 but that seems different)
> >>>>> >  - When are blocks moved from blocksBeingWritten to current?  Does
> >>>>> > that
> >>>>> > happen before a file close operation is acknowledged to a hdfs
> >>>>>client?
> >>>>> >  - Could it be that the DataNodes actually moved the blocks to
> >>>>>current
> >>>>> > but
> >>>>> > after the restart ext3 rewound state somehow (forgive my ignorance
> >>>>>of
> >>>>> > underlying file system behavior)?
> >>>>> >  - Is there any other explanation for how this can happen?
> >>>>> >
> >>>>> > Here is a sequence of selected relevant log lines from the RS
> >>>>>(HBase
> >>>>> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
> >>>>> > question).  It includes everything that mentions the block in
> >>>>>question
> >>>>> > in
> >>>>> > the NameNode and one DataNode log.  Please let me know if this more
> >>>>> > information that would be helpful.
> >>>>> >
> >>>>> > RS 2013-06-29 11:16:06,812 DEBUG
> >>>>>org.apache.hadoop.hbase.util.FSUtils:
> >>>>> > Creating
> >>>>> >
> >>>>> >
> >>>>>file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.t
> >>>>>mp/6e0cc30af6e64e56ba5a539fdf159c4c
> >>>>> > with permission=rwxrwxrwx
> >>>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.allocateBlock:
> >>>>> >
> >>>>> >
> >>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b
> >>>>>a5a539fdf159c4c.
> >>>>> > blk_1395839728632046111_357084589
> >>>>> > DN 2013-06-29 11:16:06,832 INFO
> >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> >>>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
> >>>>> > 10.0.5.237:50010
> >>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is
> >>>>>added
> >>>>> > to
> >>>>> > blk_1395839728632046111_357084589 size 25418340
> >>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
> >>>>>added
> >>>>> > to
> >>>>> > blk_1395839728632046111_357084589 size 25418340
> >>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
> >>>>>added
> >>>>> > to
> >>>>> > blk_1395839728632046111_357084589 size 25418340
> >>>>> > DN 2013-06-29 11:16:11,385 INFO
> >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> >>>>> > blk_1395839728632046111_357084589 of size 25418340 from
> >>>>> > /10.0.5.237:14327
> >>>>>
> >>>>> > DN 2013-06-29 11:16:11,385 INFO
> >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2
> >>>>>for
> >>>>> > block blk_1395839728632046111_357084589 terminating
> >>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > Removing lease on  file
> >>>>> >
> >>>>> >
> >>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b
> >>>>>a5a539fdf159c4c
> >>>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
> >>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > DIR*
> >>>>> > NameSystem.completeFile: file
> >>>>> >
> >>>>> >
> >>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b
> >>>>>a5a539fdf159c4c
> >>>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> >>>>> > RS 2013-06-29 11:16:11,393 INFO
> >>>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted
> >>>>>file at
> >>>>> >
> >>>>> >
> >>>>>hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e
> >>>>>0cc30af6e64e56ba5a539fdf159c4c
> >>>>> > to
> >>>>> >
> >>>>> >
> >>>>>hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc
> >>>>>30af6e64e56ba5a539fdf159c4c
> >>>>> > RS 2013-06-29 11:16:11,505 INFO
> >>>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
> >>>>>compaction
> >>>>> > of 7
> >>>>> > file(s) in n of
> >>>>> >
> >>>>>users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
> >>>>> > into
> >>>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store
> >>>>>is
> >>>>> > 24.2m
> >>>>> >
> >>>>> > -------  CRASH, RESTART ---------
> >>>>> >
> >>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
> >>>>>but
> >>>>> > was
> >>>>> > rejected: Reported as block being written but is a block of closed
> >>>>> > file.
> >>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>>> > invalidSet
> >>>>> > of 10.0.6.1:50010
> >>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size
> 16971264
> >>>>> > but
> >>>>> > was rejected: Reported as block being written but is a block of
> >>>>>closed
> >>>>> > file.
> >>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>>> > invalidSet
> >>>>> > of 10.0.5.237:50010
> >>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088
> >>>>>but
> >>>>> > was rejected: Reported as block being written but is a block of
> >>>>>closed
> >>>>> > file.
> >>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>>> > BLOCK*
> >>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>>> > invalidSet
> >>>>> > of 10.0.6.24:50010
> >>>>> > (note the clock on the server running DN is wrong after restart.  I
> >>>>> > believe timestamps are off by 6 hours:)
> >>>>> > DN 2013-06-29 06:07:22,877 INFO
> >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> >>>>> > blk_1395839728632046111_357084589 file
> >>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>>>for
> >>>>> > deletion
> >>>>> > DN 2013-06-29 06:07:24,952 INFO
> >>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> >>>>> > blk_1395839728632046111_357084589 at file
> >>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>>> >
> >>>>> >
> >>>>> > Thanks,
> >>>>> > Dave
> >>>>> >
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> http://hortonworks.com/download/
> >>>>
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> http://hortonworks.com/download/
>
>

Re: data loss after cluster wide power loss

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
Thanks.  Suresh and Kihwal are right-- renames are journalled, but not
necessarily durable (stored to disk).  I was getting mixed up with
HDFS semantics, in which we actually do make the journal durable
before returning success to the client.

It might be a good idea for HDFS to fsync the file descriptor of the
directories involved in the rename operation, before assuming that the
operation is durable.

If you're using ext{2,3,4}, a quick fix would be to use mount -o
dirsync.  I haven't tested it out, but it's supposed to make these
operations synchronous.

From the man page:
       dirsync
              All directory updates within the filesystem should be done
              synchronously.  This affects the following system calls: creat,
              link, unlink, symlink, mkdir, rmdir, mknod and rename.
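
The same option is exposed programmatically as MS_DIRSYNC in mount(2).  A
rough C sketch with a made-up device and mount point, just to show where
the flag lives (in practice you'd put "dirsync" in the fstab options
instead):

#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    /* Hypothetical device and mount point; needs root. */
    if (mount("/dev/sdb1", "/data/hadoop/dfs", "ext3",
              MS_DIRSYNC, NULL) != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}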

Colin


On Wed, Jul 3, 2013 at 10:19 AM, Suresh Srinivas <su...@hortonworks.com> wrote:
> On Wed, Jul 3, 2013 at 8:12 AM, Colin McCabe <cm...@alumni.cmu.edu> wrote:
>
>> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com>
>> wrote:
>> > Dave,
>> >
>> > Thanks for the detailed email. Sorry I did not read all the details you
>> had
>> > sent earlier completely (on my phone). As you said, this is not related
>> to
>> > data loss related to HBase log and hsync. I think you are right; the
>> rename
>> > operation itself might not have hit the disk. I think we should either
>> > ensure metadata operation is synced on the datanode or handle it being
>> > reported as blockBeingWritten. Let me spend sometime to debug this issue.
>>
>> In theory, ext3 is journaled, so all metadata operations should be
>> durable in the case of a power outage.  It is only data operations
>> that should be possible to lose.  It is the same for ext4.  (Assuming
>> you are not using nonstandard mount options.)
>>
>
> ext3 journal may not hit the disk right. From what I read, if you do not
> specifically
> call sync, even the metadata operations do not hit disk.
>
> See - https://www.kernel.org/doc/Documentation/filesystems/ext3.txt
>
> commit=nrsec    (*)     Ext3 can be told to sync all its data and metadata
>                         every 'nrsec' seconds. The default value is 5 seconds.
>                         This means that if you lose your power, you will lose
>                         as much as the latest 5 seconds of work (your
>                         filesystem will not be damaged though, thanks to the
>                         journaling).  This default value (or any low value)
>                         will hurt performance, but it's good for data-safety.
>                         Setting it to 0 will have the same effect as leaving
>                         it at the default (5 seconds).
>                         Setting it to very large values will improve
>
> performance.

Re: data loss after cluster wide power loss

Posted by Suresh Srinivas <su...@hortonworks.com>.
On Wed, Jul 3, 2013 at 8:12 AM, Colin McCabe <cm...@alumni.cmu.edu> wrote:

> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com>
> wrote:
> > Dave,
> >
> > Thanks for the detailed email. Sorry I did not read all the details you
> had
> > sent earlier completely (on my phone). As you said, this is not related
> to
> > data loss related to HBase log and hsync. I think you are right; the
> rename
> > operation itself might not have hit the disk. I think we should either
> > ensure metadata operation is synced on the datanode or handle it being
> > reported as blockBeingWritten. Let me spend sometime to debug this issue.
>
> In theory, ext3 is journaled, so all metadata operations should be
> durable in the case of a power outage.  It is only data operations
> that should be possible to lose.  It is the same for ext4.  (Assuming
> you are not using nonstandard mount options.)
>

The ext3 journal may not hit the disk right away. From what I read, if you
do not specifically call sync, even the metadata operations do not hit the
disk.

See - https://www.kernel.org/doc/Documentation/filesystems/ext3.txt

commit=nrsec	(*)	Ext3 can be told to sync all its data and metadata
			every 'nrsec' seconds. The default value is 5 seconds.
			This means that if you lose your power, you will lose
			as much as the latest 5 seconds of work (your
			filesystem will not be damaged though, thanks to the
			journaling).  This default value (or any low value)
			will hurt performance, but it's good for data-safety.
			Setting it to 0 will have the same effect as leaving
			it at the default (5 seconds).
			Setting it to very large values will improve
			performance.
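
So to get data safety across a power cut, the writer has to sync explicitly
rather than rely on the periodic commit.  A minimal C sketch with a
hypothetical block file path:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Hypothetical block file written by a DataNode. */
    int fd = open("/data/dfs/current/blk_123", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* ... write the block data here ... */

    /* Force the data to stable storage now instead of waiting up to
       'commit' seconds for the next journal commit. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }
    close(fd);
    return 0;
}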

Re: data loss after cluster wide power loss

Posted by Kihwal Lee <ki...@yahoo-inc.com>.
For the ext3 bug Colin mentioned, see
https://bugzilla.redhat.com/show_bug.cgi?id=592961. This was fixed in
2.6.32 and backported in RHEL 5.4 (or CentOS). This has more to do with
file data and affects the NN more. Since the NN preallocates blocks for
edits, almost all data writes are done without modifying the file length.
If fsync() loses the race to pdflush, the edit log content won't be
flushed. I am glad this bug was fixed.

Regardless of this bug, renames will be journaled but not necessarily
committed before returning. If power fails before the transaction gets
committed, the transaction can be lost. Journaling will make sure the file
system comes back in a consistent state, but not all changes will be
there. To make a rename durable, the directory can be opened and
fsync()'ed. I think both source and destination are recorded in one
transaction, so fsyncing the destination directory should be enough.

When RWR blocks are reported, the NN can probably accept and finalize them
if the block size is as expected.

Kihwal
 





On 7/3/13 10:12 AM, "Colin McCabe" <cm...@alumni.cmu.edu> wrote:

>On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com>
>wrote:
>> Dave,
>>
>> Thanks for the detailed email. Sorry I did not read all the details you
>>had
>> sent earlier completely (on my phone). As you said, this is not related
>>to
>> data loss related to HBase log and hsync. I think you are right; the
>>rename
>> operation itself might not have hit the disk. I think we should either
>> ensure metadata operation is synced on the datanode or handle it being
>> reported as blockBeingWritten. Let me spend sometime to debug this
>>issue.
>
>In theory, ext3 is journaled, so all metadata operations should be
>durable in the case of a power outage.  It is only data operations
>that should be possible to lose.  It is the same for ext4.  (Assuming
>you are not using nonstandard mount options.)
>
>In practice, it is possible that your hard disks didn't actually
>persist the data that they said they did.  Rumor has it that some
>drives ignore the SATA FLUSH CACHE command in some cases, since it
>makes them look bad in benchmarks.  In that case, there is nothing the
>filesystem or any other software can do.
>
>There was also a bug in older linux kernels where the kernel would not
>actually send FLUSH CACHE.  Since the poster is using ext3 and
>hadoop-1, it's possible he's also using an antique kernel as well.  I
>know for sure this affected LVM-- it used to ignore barriers until
>fairly recently.
>
>In Ceph, we used to recommend disabling the hard drive write cache if
>your kernel was older than 2.6.33.  You can read the recommendation
>for yourself here:
>http://ceph.com/docs/master/rados/configuration/filesystem-recommendations
>/
> This will have an impact on performance, however.
>
>An uninterruptable power supply is not a bad idea.
>
>I am curious:
>what kernel version you are using?
>are you using LVM?
>
>Colin
>
>
>>
>> One surprising thing is, all the replicas were reported as
>> blockBeingWritten.
>>
>> Regards,
>> Suresh
>>
>>
>> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
>>>
>>> (Removing hbase list and adding hdfs-dev list as this is pretty
>>>internal
>>> stuff).
>>>
>>> Reading through the code a bit:
>>>
>>> FSDataOutputStream.close calls
>>> DFSOutputStream.close calls
>>> DFSOutputStream.closeInternal
>>>  - sets currentPacket.lastPacketInBlock = true
>>>  - then calls
>>> DFSOutputStream.flushInternal
>>>  - enqueues current packet
>>>  - waits for ack
>>>
>>> BlockReceiver.run
>>>  - if (lastPacketInBlock && !receiver.finalized) calls
>>> FSDataset.finalizeBlock calls
>>> FSDataset.finalizeBlockInternal calls
>>> FSVolume.addBlock calls
>>> FSDir.addBlock calls
>>> FSDir.addBlock
>>>  - renames block from "blocksBeingWritten" tmp dir to "current" dest
>>>dir
>>>
>>> This looks to me as I would expect a synchronous chain from a DFS
>>>client
>>> to moving the file from blocksBeingWritten to the current dir so that
>>>once
>>> the file is closed that it the block files would be in the proper
>>>directory
>>> - even if the contents of the file are still in the OS buffer rather
>>>than
>>> synced to disk.  It's only after this moving of blocks that
>>> NameNode.complete file is called.  There are several conditions and
>>>loops in
>>> there that I'm not certain this chain is fully reliable in all cases
>>>without
>>> a greater understanding of the code.
>>>
>>> Could it be the case that the rename operation itself is not synced and
>>> that ext3 lost the fact that the block files were moved?
>>> Or is there a bug in the close file logic that for some reason the
>>>block
>>> files are not always moved into place when a file is closed?
>>>
>>> Thanks for your patience,
>>> Dave
>>>
>>>
>>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net>
>>>wrote:
>>>>
>>>> Thanks for the response, Suresh.
>>>>
>>>> I'm not sure that I understand the details properly.  From my reading
>>>>of
>>>> HDFS-744 the hsync API would allow a client to make sure that at any
>>>>point
>>>> in time its writes so far hit the disk.  For example, for HBase it
>>>>could
>>>> apply an fsync after adding some edits to its WAL to ensure those
>>>>edits are
>>>> fully durable for a file which is still open.
>>>>
>>>> However, in this case the dfs file was closed and even renamed.  Is it
>>>> the case that even after a dfs file is closed and renamed, the
>>>>data
>>>> blocks would still not be synced and would still be stored by the
>>>>datanode
>>>> in "blocksBeingWritten" rather than in "current"?  If that is case,
>>>>would it
>>>> be better for the NameNode not to reject replicas that are in
>>>> blocksBeingWritten, especially if it doesn't have any other replicas
>>>> available?
>>>>
>>>> Dave
>>>>
>>>>
>>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas
>>>><su...@hortonworks.com>
>>>> wrote:
>>>>>
>>>>> Yes this is a known issue.
>>>>>
>>>>> The HDFS part of this was addressed in
>>>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is
>>>>> not available in a 1.x release. I think HBase does not use this API yet.
>>>>>
>>>>>
>>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
>>>>>wrote:
>>>>>
>>>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On
>>>>>Saturday
>>>>> > the
>>>>> > data center we were in had a total power failure and the cluster
>>>>>went
>>>>> > down
>>>>> > hard.  When we brought it back up, HDFS reported 4 files as
>>>>>CORRUPT.
>>>>> > We
>>>>> > recovered the data in question from our secondary datacenter, but
>>>>>I'm
>>>>> > trying to understand what happened and whether this is a bug in
>>>>>HDFS
>>>>> > that
>>>>> > should be fixed.
>>>>> >
>>>>> > From what I can tell the file was created and closed by the dfs
>>>>>client
>>>>> > (hbase).  Then HBase renamed it into a new directory and deleted
>>>>>some
>>>>> > other
>>>>> > files containing the same data.  Then the cluster lost power.
>>>>>After
>>>>> > the
>>>>> > cluster was restarted, the datanodes reported into the namenode but
>>>>> > the
>>>>> > blocks for this file appeared as "blocks being written" - the
>>>>>namenode
>>>>> > rejected them and the datanodes deleted the blocks.  At this point
>>>>> > there
>>>>> > were no replicas for the blocks and the files were marked CORRUPT.
>>>>> > The
>>>>> > underlying file systems are ext3.  Some questions that I would love
>>>>> > get
>>>>> > answers for if anyone with deeper understanding of HDFS can chime
>>>>>in:
>>>>> >
>>>>> >  - Is this a known scenario where data loss is expected?  (I found
>>>>> > HDFS-1539 but that seems different)
>>>>> >  - When are blocks moved from blocksBeingWritten to current?  Does
>>>>> > that
>>>>> > happen before a file close operation is acknowledged to a hdfs
>>>>>client?
>>>>> >  - Could it be that the DataNodes actually moved the blocks to
>>>>>current
>>>>> > but
>>>>> > after the restart ext3 rewound state somehow (forgive my ignorance
>>>>>of
>>>>> > underlying file system behavior)?
>>>>> >  - Is there any other explanation for how this can happen?
>>>>> >
>>>>> > Here is a sequence of selected relevant log lines from the RS
>>>>>(HBase
>>>>> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
>>>>> > question).  It includes everything that mentions the block in
>>>>>question
>>>>> > in
>>>>> > the NameNode and one DataNode log.  Please let me know if this more
>>>>> > information that would be helpful.
>>>>> >
>>>>> > RS 2013-06-29 11:16:06,812 DEBUG
>>>>>org.apache.hadoop.hbase.util.FSUtils:
>>>>> > Creating
>>>>> >
>>>>> > 
>>>>>file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.t
>>>>>mp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>>> > with permission=rwxrwxrwx
>>>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.allocateBlock:
>>>>> >
>>>>> > 
>>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b
>>>>>a5a539fdf159c4c.
>>>>> > blk_1395839728632046111_357084589
>>>>> > DN 2013-06-29 11:16:06,832 INFO
>>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
>>>>> > 10.0.5.237:50010
>>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is
>>>>>added
>>>>> > to
>>>>> > blk_1395839728632046111_357084589 size 25418340
>>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
>>>>>added
>>>>> > to
>>>>> > blk_1395839728632046111_357084589 size 25418340
>>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
>>>>>added
>>>>> > to
>>>>> > blk_1395839728632046111_357084589 size 25418340
>>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>>>>> > blk_1395839728632046111_357084589 of size 25418340 from
>>>>> > /10.0.5.237:14327
>>>>>
>>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2
>>>>>for
>>>>> > block blk_1395839728632046111_357084589 terminating
>>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > Removing lease on  file
>>>>> >
>>>>> > 
>>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b
>>>>>a5a539fdf159c4c
>>>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
>>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > DIR*
>>>>> > NameSystem.completeFile: file
>>>>> >
>>>>> > 
>>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b
>>>>>a5a539fdf159c4c
>>>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>>>>> > RS 2013-06-29 11:16:11,393 INFO
>>>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted
>>>>>file at
>>>>> >
>>>>> > 
>>>>>hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e
>>>>>0cc30af6e64e56ba5a539fdf159c4c
>>>>> > to
>>>>> >
>>>>> > 
>>>>>hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc
>>>>>30af6e64e56ba5a539fdf159c4c
>>>>> > RS 2013-06-29 11:16:11,505 INFO
>>>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
>>>>>compaction
>>>>> > of 7
>>>>> > file(s) in n of
>>>>> > 
>>>>>users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>>>>> > into
>>>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store
>>>>>is
>>>>> > 24.2m
>>>>> >
>>>>> > -------  CRASH, RESTART ---------
>>>>> >
>>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
>>>>>but
>>>>> > was
>>>>> > rejected: Reported as block being written but is a block of closed
>>>>> > file.
>>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>>> > invalidSet
>>>>> > of 10.0.6.1:50010
>>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264
>>>>> > but
>>>>> > was rejected: Reported as block being written but is a block of
>>>>>closed
>>>>> > file.
>>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>>> > invalidSet
>>>>> > of 10.0.5.237:50010
>>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088
>>>>>but
>>>>> > was rejected: Reported as block being written but is a block of
>>>>>closed
>>>>> > file.
>>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>>> > invalidSet
>>>>> > of 10.0.6.24:50010
>>>>> > (note the clock on the server running DN is wrong after restart.  I
>>>>> > believe timestamps are off by 6 hours:)
>>>>> > DN 2013-06-29 06:07:22,877 INFO
>>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>>>>> > blk_1395839728632046111_357084589 file
>>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>>>for
>>>>> > deletion
>>>>> > DN 2013-06-29 06:07:24,952 INFO
>>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>>>>> > blk_1395839728632046111_357084589 at file
>>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>>> >
>>>>> >
>>>>> > Thanks,
>>>>> > Dave
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> http://hortonworks.com/download/
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> http://hortonworks.com/download/


Re: data loss after cluster wide power loss

Posted by Kihwal Lee <ki...@yahoo-inc.com>.
For the ext3 bug Colin mentioned, see
https://bugzilla.redhat.com/show_bug.cgi?id=592961. This was fixed in
2.6.32 and backported to RHEL 5.4 (or CentOS). This has more to do with
file data and affects the NN more. Since the NN preallocates blocks for
edits, almost all data writes are done without modifying the file length.
If fsync() loses the race to pdflush, the edit log content won't be flushed.
I am glad this bug was fixed.

Regardless of this bug, renames will be journaled but not necessarily
committed before returning. If power fails before the transaction gets
committed, the transaction can be lost. Journaling will make sure the file
system will come back in a consistent state, but not all changes will be
there. To make a rename durable, the directory can be opened and
fsync()'ed. I think both source and destination are recorded in one
transaction, so fsyncing the destination directory should be enough.
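
To make that concrete, here is a minimal Java sketch of a rename made
durable by fsyncing the destination directory. This is illustration only,
not the actual DataNode code; the paths are made up, and it assumes a
platform (e.g. Linux) where opening a directory read-only and calling
force() on the channel issues an fsync on that directory.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class DurableRename {
  // Rename src to dst, then fsync the destination directory so the new
  // directory entry itself survives a power loss, not just the file data.
  static void rename(Path src, Path dst) throws IOException {
    Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
    try (FileChannel dir = FileChannel.open(dst.getParent(),
                                            StandardOpenOption.READ)) {
      dir.force(true);  // maps to fsync(2) on the directory on Linux
    }
  }

  public static void main(String[] args) throws IOException {
    // Hypothetical block file paths, loosely modeled on the DN layout above.
    rename(Paths.get("/data/hadoop/dfs/data/blocksBeingWritten/blk_123"),
           Paths.get("/data/hadoop/dfs/data/current/blk_123"));
  }
}

As far as I can tell the 1.x DataNode finalizes a block with a plain
renameTo() and no directory sync afterwards, which fits the theory that
only the rename was lost.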

When RWR (replica waiting to be recovered) blocks are reported, the NN can
probably accept and finalize them if the block size is as expected.
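
As a rough illustration of that idea (made-up identifiers, not the real
block report handling code):

public class RwrReportPolicy {
  // Decide whether a replica reported as "being written" for an
  // already-closed file should be accepted and finalized instead of
  // invalidated.
  static boolean shouldAcceptRwrReplica(long reportedLength,
                                        long finalizedLength) {
    // Keep the replica only if the datanode has the full finalized length
    // on disk; a short replica (data never reached the disk) is still
    // rejected, as happens today.
    return reportedLength == finalizedLength;
  }
}

In this particular incident it would not have helped, since the reported
lengths (21978112, 16971264 and 21913088) were all short of the finalized
25418340 bytes, but it would save replicas where the data reached the disk
and only the rename was lost.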

Kihwal
 





On 7/3/13 10:12 AM, "Colin McCabe" <cm...@alumni.cmu.edu> wrote:

>On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com>
>wrote:
>> Dave,
>>
>> Thanks for the detailed email. Sorry I did not read all the details you
>>had
>> sent earlier completely (on my phone). As you said, this is not related
>>to
>> data loss related to HBase log and hsync. I think you are right; the
>>rename
>> operation itself might not have hit the disk. I think we should either
>> ensure the metadata operation is synced on the datanode or handle it being
>> reported as blockBeingWritten. Let me spend some time debugging this
>> issue.
>
>In theory, ext3 is journaled, so all metadata operations should be
>durable in the case of a power outage.  It is only data operations
>that should be possible to lose.  It is the same for ext4.  (Assuming
>you are not using nonstandard mount options.)
>
>In practice, it is possible that your hard disks didn't actually
>persist the data that they said they did.  Rumor has it that some
>drives ignore the SATA FLUSH CACHE command in some cases, since it
>makes them look bad in benchmarks.  In that case, there is nothing the
>filesystem or any other software can do.
>
>There was also a bug in older linux kernels where the kernel would not
>actually send FLUSH CACHE.  Since the poster is using ext3 and
>hadoop-1, it's possible he's also using an antique kernel.  I
>know for sure this affected LVM-- it used to ignore barriers until
>fairly recently.
>
>In Ceph, we used to recommend disabling the hard drive write cache if
>your kernel was older than 2.6.33.  You can read the recommendation
>for yourself here:
>http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
> This will have an impact on performance, however.
>
>An uninterruptible power supply is not a bad idea.
>
>I am curious:
>what kernel version are you using?
>are you using LVM?
>
>Colin
>
>
>>
>> One surprising thing is, all the replicas were reported as
>> blockBeingWritten.
>>
>> Regards,
>> Suresh
>>
>>
>> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
>>>
>>> (Removing hbase list and adding hdfs-dev list as this is pretty
>>>internal
>>> stuff).
>>>
>>> Reading through the code a bit:
>>>
>>> FSDataOutputStream.close calls
>>> DFSOutputStream.close calls
>>> DFSOutputStream.closeInternal
>>>  - sets currentPacket.lastPacketInBlock = true
>>>  - then calls
>>> DFSOutputStream.flushInternal
>>>  - enqueues current packet
>>>  - waits for ack
>>>
>>> BlockReceiver.run
>>>  - if (lastPacketInBlock && !receiver.finalized) calls
>>> FSDataset.finalizeBlock calls
>>> FSDataset.finalizeBlockInternal calls
>>> FSVolume.addBlock calls
>>> FSDir.addBlock calls
>>> FSDir.addBlock
>>>  - renames block from "blocksBeingWritten" tmp dir to "current" dest
>>>dir
>>>
>>> This looks to me like the synchronous chain I would expect, from a DFS
>>> client closing a file to the block files being moved from
>>> blocksBeingWritten to the current dir, so that once the file is closed
>>> the block files are in the proper directory
>>> - even if the contents of the file are still in the OS buffer rather
>>>than
>>> synced to disk.  It's only after this moving of blocks that
>>> NameNode.complete is called to complete the file.  There are several
>>> conditions and loops in there, so without a greater understanding of the
>>> code I'm not certain this chain is fully reliable in all cases.
>>>
>>> Could it be the case that the rename operation itself is not synced and
>>> that ext3 lost the fact that the block files were moved?
>>> Or is there a bug in the close file logic that for some reason the
>>>block
>>> files are not always moved into place when a file is closed?
>>>
>>> Thanks for your patience,
>>> Dave
>>>
>>>
>>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net>
>>>wrote:
>>>>
>>>> Thanks for the response, Suresh.
>>>>
>>>> I'm not sure that I understand the details properly.  From my reading
>>>>of
>>>> HDFS-744 the hsync API would allow a client to make sure that at any
>>>>point
>>>> in time it's writes so far hit the disk.  For example, for HBase it
>>>>could
>>>> apply a fsync after adding some edits to its WAL to ensure those
>>>>edits are
>>>> fully durable for a file which is still open.
>>>>
>>>> However, in this case the dfs file was closed and even renamed.  Is it
>>>> the case that even after a dfs file is closed and renamed that the
>>>>data
>>>> blocks would still not be synced and would still be stored by the
>>>>datanode
>>>> in "blocksBeingWritten" rather than in "current"?  If that is case,
>>>>would it
>>>> be better for the NameNode not to reject replicas that are in
>>>> blocksBeingWritten, especially if it doesn't have any other replicas
>>>> available?
>>>>
>>>> Dave
>>>>
>>>>
>>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas
>>>><su...@hortonworks.com>
>>>> wrote:
>>>>>
>>>>> Yes this is a known issue.
>>>>>
>>>>> The HDFS part of this was addressed in
>>>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is
>>>>> not
>>>>> available in 1.x  release. I think HBase does not use this API yet.
>>>>>
>>>>>
>>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
>>>>>wrote:
>>>>>
>>>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On
>>>>>Saturday
>>>>> > the
>>>>> > data center we were in had a total power failure and the cluster
>>>>>went
>>>>> > down
>>>>> > hard.  When we brought it back up, HDFS reported 4 files as
>>>>>CORRUPT.
>>>>> > We
>>>>> > recovered the data in question from our secondary datacenter, but
>>>>>I'm
>>>>> > trying to understand what happened and whether this is a bug in
>>>>>HDFS
>>>>> > that
>>>>> > should be fixed.
>>>>> >
>>>>> > From what I can tell the file was created and closed by the dfs
>>>>>client
>>>>> > (hbase).  Then HBase renamed it into a new directory and deleted
>>>>>some
>>>>> > other
>>>>> > files containing the same data.  Then the cluster lost power.
>>>>>After
>>>>> > the
>>>>> > cluster was restarted, the datanodes reported into the namenode but
>>>>> > the
>>>>> > blocks for this file appeared as "blocks being written" - the
>>>>>namenode
>>>>> > rejected them and the datanodes deleted the blocks.  At this point
>>>>> > there
>>>>> > were no replicas for the blocks and the files were marked CORRUPT.
>>>>> > The
>>>>> > underlying file systems are ext3.  Some questions that I would love
>>>>> > get
>>>>> > answers for if anyone with deeper understanding of HDFS can chime
>>>>>in:
>>>>> >
>>>>> >  - Is this a known scenario where data loss is expected?  (I found
>>>>> > HDFS-1539 but that seems different)
>>>>> >  - When are blocks moved from blocksBeingWritten to current?  Does
>>>>> > that
>>>>> > happen before a file close operation is acknowledged to a hdfs
>>>>>client?
>>>>> >  - Could it be that the DataNodes actually moved the blocks to
>>>>>current
>>>>> > but
>>>>> > after the restart ext3 rewound state somehow (forgive my ignorance
>>>>>of
>>>>> > underlying file system behavior)?
>>>>> >  - Is there any other explanation for how this can happen?
>>>>> >
>>>>> > Here is a sequence of selected relevant log lines from the RS
>>>>>(HBase
>>>>> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
>>>>> > question).  It includes everything that mentions the block in
>>>>>question
>>>>> > in
>>>>> > the NameNode and one DataNode log.  Please let me know if this more
>>>>> > information that would be helpful.
>>>>> >
>>>>> > RS 2013-06-29 11:16:06,812 DEBUG
>>>>>org.apache.hadoop.hbase.util.FSUtils:
>>>>> > Creating
>>>>> >
>>>>> > 
>>>>>file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.t
>>>>>mp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>>> > with permission=rwxrwxrwx
>>>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.allocateBlock:
>>>>> >
>>>>> > 
>>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b
>>>>>a5a539fdf159c4c.
>>>>> > blk_1395839728632046111_357084589
>>>>> > DN 2013-06-29 11:16:06,832 INFO
>>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
>>>>> > 10.0.5.237:50010
>>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is
>>>>>added
>>>>> > to
>>>>> > blk_1395839728632046111_357084589 size 25418340
>>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
>>>>>added
>>>>> > to
>>>>> > blk_1395839728632046111_357084589 size 25418340
>>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
>>>>>added
>>>>> > to
>>>>> > blk_1395839728632046111_357084589 size 25418340
>>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>>>>> > blk_1395839728632046111_357084589 of size 25418340 from
>>>>> > /10.0.5.237:14327
>>>>>
>>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2
>>>>>for
>>>>> > block blk_1395839728632046111_357084589 terminating
>>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > Removing lease on  file
>>>>> >
>>>>> > 
>>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b
>>>>>a5a539fdf159c4c
>>>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
>>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > DIR*
>>>>> > NameSystem.completeFile: file
>>>>> >
>>>>> > 
>>>>>/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56b
>>>>>a5a539fdf159c4c
>>>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>>>>> > RS 2013-06-29 11:16:11,393 INFO
>>>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted
>>>>>file at
>>>>> >
>>>>> > 
>>>>>hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e
>>>>>0cc30af6e64e56ba5a539fdf159c4c
>>>>> > to
>>>>> >
>>>>> > 
>>>>>hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc
>>>>>30af6e64e56ba5a539fdf159c4c
>>>>> > RS 2013-06-29 11:16:11,505 INFO
>>>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
>>>>>compaction
>>>>> > of 7
>>>>> > file(s) in n of
>>>>> > 
>>>>>users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>>>>> > into
>>>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store
>>>>>is
>>>>> > 24.2m
>>>>> >
>>>>> > -------  CRASH, RESTART ---------
>>>>> >
>>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
>>>>>but
>>>>> > was
>>>>> > rejected: Reported as block being written but is a block of closed
>>>>> > file.
>>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>>> > invalidSet
>>>>> > of 10.0.6.1:50010
>>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264
>>>>> > but
>>>>> > was rejected: Reported as block being written but is a block of
>>>>>closed
>>>>> > file.
>>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>>> > invalidSet
>>>>> > of 10.0.5.237:50010
>>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088
>>>>>but
>>>>> > was rejected: Reported as block being written but is a block of
>>>>>closed
>>>>> > file.
>>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>>> > BLOCK*
>>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>>> > invalidSet
>>>>> > of 10.0.6.24:50010
>>>>> > (note the clock on the server running DN is wrong after restart.  I
>>>>> > believe timestamps are off by 6 hours:)
>>>>> > DN 2013-06-29 06:07:22,877 INFO
>>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>>>>> > blk_1395839728632046111_357084589 file
>>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>>>for
>>>>> > deletion
>>>>> > DN 2013-06-29 06:07:24,952 INFO
>>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>>>>> > blk_1395839728632046111_357084589 at file
>>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>>> >
>>>>> >
>>>>> > Thanks,
>>>>> > Dave
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> http://hortonworks.com/download/
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> http://hortonworks.com/download/


Re: data loss after cluster wide power loss

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com> wrote:
> Dave,
>
> Thanks for the detailed email. Sorry I did not read all the details you had
> sent earlier completely (on my phone). As you said, this is not related to
> data loss related to HBase log and hsync. I think you are right; the rename
> operation itself might not have hit the disk. I think we should either
> ensure metadata operation is synced on the datanode or handle it being
> reported as blockBeingWritten. Let me spend sometime to debug this issue.

In theory, ext3 is journaled, so all metadata operations should be
durable in the case of a power outage.  It is only data operations
that should be possible to lose.  It is the same for ext4.  (Assuming
you are not using nonstandard mount options.)

In practice, it is possible that your hard disks didn't actually
persist the data that they said they did.  Rumor has it that some
drives ignore the SATA FLUSH CACHE command in some cases, since it
makes them look bad in benchmarks.  In that case, there is nothing the
filesystem or any other software can do.

There was also a bug in older Linux kernels where the kernel would not
actually send FLUSH CACHE.  Since the poster is using ext3 and
hadoop-1, it's possible he's using an antique kernel as well.  I know
for sure this affected LVM: it used to ignore barriers until fairly
recently.

In Ceph, we used to recommend disabling the hard drive write cache if
your kernel was older than 2.6.33.  You can read the recommendation
for yourself here:
http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
 This will have an impact on performance, however.

An uninterruptible power supply is not a bad idea.

I am curious:
What kernel version are you using?
Are you using LVM?

Colin
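
For reference, the hsync() API from HDFS-744 that comes up in the quoted
thread below looks roughly like this from a Hadoop 2.x client. This is a
hedged sketch only: the 1.0.2 release discussed here does not have the call,
and the path and class name are invented for the example.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HsyncSketch {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          // Hypothetical WAL-style file, for illustration only.
          try (FSDataOutputStream out = fs.create(new Path("/tmp/wal-demo"))) {
              out.write("edit 1\n".getBytes("UTF-8"));
              // hsync() asks the datanodes in the pipeline to fsync what has
              // been written so far; hflush() only pushes it out of client
              // buffers into datanode memory.
              out.hsync();
          }
      }
  }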


>
> One surprising thing is, all the replicas were reported as
> blockBeingWritten.
>
> Regards,
> Suresh
>
>
> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
>>
>> (Removing hbase list and adding hdfs-dev list as this is pretty internal
>> stuff).
>>
>> Reading through the code a bit:
>>
>> FSDataOutputStream.close calls
>> DFSOutputStream.close calls
>> DFSOutputStream.closeInternal
>>  - sets currentPacket.lastPacketInBlock = true
>>  - then calls
>> DFSOutputStream.flushInternal
>>  - enqueues current packet
>>  - waits for ack
>>
>> BlockReceiver.run
>>  - if (lastPacketInBlock && !receiver.finalized) calls
>> FSDataset.finalizeBlock calls
>> FSDataset.finalizeBlockInternal calls
>> FSVolume.addBlock calls
>> FSDir.addBlock calls
>> FSDir.addBlock
>>  - renames block from "blocksBeingWritten" tmp dir to "current" dest dir
>>
>> This looks to me as I would expect a synchronous chain from a DFS client
>> to moving the file from blocksBeingWritten to the current dir so that once
>> the file is closed that it the block files would be in the proper directory
>> - even if the contents of the file are still in the OS buffer rather than
>> synced to disk.  It's only after this moving of blocks that
>> NameNode.complete file is called.  There are several conditions and loops in
>> there that I'm not certain this chain is fully reliable in all cases without
>> a greater understanding of the code.
>>
>> Could it be the case that the rename operation itself is not synced and
>> that ext3 lost the fact that the block files were moved?
>> Or is there a bug in the close file logic that for some reason the block
>> files are not always moved into place when a file is closed?
>>
>> Thanks for your patience,
>> Dave
>>
>>
>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net> wrote:
>>>
>>> Thanks for the response, Suresh.
>>>
>>> I'm not sure that I understand the details properly.  From my reading of
>>> HDFS-744 the hsync API would allow a client to make sure that at any point
>>> in time it's writes so far hit the disk.  For example, for HBase it could
>>> apply a fsync after adding some edits to its WAL to ensure those edits are
>>> fully durable for a file which is still open.
>>>
>>> However, in this case the dfs file was closed and even renamed.  Is it
>>> the case that even after a dfs file is closed and renamed that the data
>>> blocks would still not be synced and would still be stored by the datanode
>>> in "blocksBeingWritten" rather than in "current"?  If that is case, would it
>>> be better for the NameNode not to reject replicas that are in
>>> blocksBeingWritten, especially if it doesn't have any other replicas
>>> available?
>>>
>>> Dave
>>>
>>>
>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>
>>> wrote:
>>>>
>>>> Yes this is a known issue.
>>>>
>>>> The HDFS part of this was addressed in
>>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is
>>>> not
>>>> available in 1.x  release. I think HBase does not use this API yet.
>>>>
>>>>
>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net> wrote:
>>>>
>>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On Saturday
>>>> > the
>>>> > data center we were in had a total power failure and the cluster went
>>>> > down
>>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
>>>> > We
>>>> > recovered the data in question from our secondary datacenter, but I'm
>>>> > trying to understand what happened and whether this is a bug in HDFS
>>>> > that
>>>> > should be fixed.
>>>> >
>>>> > From what I can tell the file was created and closed by the dfs client
>>>> > (hbase).  Then HBase renamed it into a new directory and deleted some
>>>> > other
>>>> > files containing the same data.  Then the cluster lost power.  After
>>>> > the
>>>> > cluster was restarted, the datanodes reported into the namenode but
>>>> > the
>>>> > blocks for this file appeared as "blocks being written" - the namenode
>>>> > rejected them and the datanodes deleted the blocks.  At this point
>>>> > there
>>>> > were no replicas for the blocks and the files were marked CORRUPT.
>>>> > The
>>>> > underlying file systems are ext3.  Some questions that I would love
>>>> > get
>>>> > answers for if anyone with deeper understanding of HDFS can chime in:
>>>> >
>>>> >  - Is this a known scenario where data loss is expected?  (I found
>>>> > HDFS-1539 but that seems different)
>>>> >  - When are blocks moved from blocksBeingWritten to current?  Does
>>>> > that
>>>> > happen before a file close operation is acknowledged to a hdfs client?
>>>> >  - Could it be that the DataNodes actually moved the blocks to current
>>>> > but
>>>> > after the restart ext3 rewound state somehow (forgive my ignorance of
>>>> > underlying file system behavior)?
>>>> >  - Is there any other explanation for how this can happen?
>>>> >
>>>> > Here is a sequence of selected relevant log lines from the RS (HBase
>>>> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
>>>> > question).  It includes everything that mentions the block in question
>>>> > in
>>>> > the NameNode and one DataNode log.  Please let me know if this more
>>>> > information that would be helpful.
>>>> >
>>>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
>>>> > Creating
>>>> >
>>>> > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > with permission=rwxrwxrwx
>>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.allocateBlock:
>>>> >
>>>> > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
>>>> > blk_1395839728632046111_357084589
>>>> > DN 2013-06-29 11:16:06,832 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
>>>> > 10.0.5.237:50010
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added
>>>> > to
>>>> > blk_1395839728632046111_357084589 size 25418340
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added
>>>> > to
>>>> > blk_1395839728632046111_357084589 size 25418340
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added
>>>> > to
>>>> > blk_1395839728632046111_357084589 size 25418340
>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>>>> > blk_1395839728632046111_357084589 of size 25418340 from
>>>> > /10.0.5.237:14327
>>>>
>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
>>>> > block blk_1395839728632046111_357084589 terminating
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > Removing lease on  file
>>>> >
>>>> > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > DIR*
>>>> > NameSystem.completeFile: file
>>>> >
>>>> > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > RS 2013-06-29 11:16:11,393 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at
>>>> >
>>>> > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > to
>>>> >
>>>> > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > RS 2013-06-29 11:16:11,505 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major compaction
>>>> > of 7
>>>> > file(s) in n of
>>>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>>>> > into
>>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is
>>>> > 24.2m
>>>> >
>>>> > -------  CRASH, RESTART ---------
>>>> >
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but
>>>> > was
>>>> > rejected: Reported as block being written but is a block of closed
>>>> > file.
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> > invalidSet
>>>> > of 10.0.6.1:50010
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264
>>>> > but
>>>> > was rejected: Reported as block being written but is a block of closed
>>>> > file.
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> > invalidSet
>>>> > of 10.0.5.237:50010
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but
>>>> > was rejected: Reported as block being written but is a block of closed
>>>> > file.
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> > invalidSet
>>>> > of 10.0.6.24:50010
>>>> > (note the clock on the server running DN is wrong after restart.  I
>>>> > believe timestamps are off by 6 hours:)
>>>> > DN 2013-06-29 06:07:22,877 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>>>> > blk_1395839728632046111_357084589 file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for
>>>> > deletion
>>>> > DN 2013-06-29 06:07:24,952 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>>>> > blk_1395839728632046111_357084589 at file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>> >
>>>> >
>>>> > Thanks,
>>>> > Dave
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> http://hortonworks.com/download/
>>>
>>>
>>
>
>
>
> --
> http://hortonworks.com/download/


Re: data loss after cluster wide power loss

Posted by Azuryy Yu <az...@gmail.com>.
Hi Uma,

I think there is minimal performance degradation if you set
dfs.datanode.synconclose to true.
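
For anyone who wants to try it: dfs.datanode.synconclose is a datanode-side
property in hdfs-site.xml. As I understand it, setting it to true makes the
datanode force a finished block replica to disk before the close is
acknowledged, roughly like the standalone Java sketch below (illustrative
only, not the actual DataNode code; the class and method names here are made
up):

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class SyncOnCloseSketch {
    // Write a block replica and, if syncOnClose is true, fsync it before
    // returning.  The extra force() per finalized block is where the
    // performance cost being discussed comes from.
    static void writeReplica(File blockFile, byte[] data, boolean syncOnClose)
            throws IOException {
        FileOutputStream out = new FileOutputStream(blockFile);
        try {
            out.write(data);
            if (syncOnClose) {
                // Push the data (and file metadata) out of the OS page cache
                // to stable storage before the block is reported as finalized.
                out.getChannel().force(true);
            }
        } finally {
            out.close();
        }
    }
}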


On Tue, Jul 2, 2013 at 3:31 PM, Uma Maheswara Rao G <ma...@huawei.com>wrote:

> Hi Dave,
>
>  Looks like your analysis is correct. I have faced a similar issue some time
> back.
> See the discussion link:
> http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results
> On sudden restarts, the OS filesystem can lose edits. A similar thing
> happened in our case, i.e., after the restart, blocks were moved back to the
> blocksBeingWritten directory even though they had been finalized.
> After the restart they were marked as corrupt. You could set
> dfs.datanode.synconclose to true to avoid this sort of thing, but that
> will degrade performance.
>
> Regards,
> Uma
>
> -----Original Message-----
> From: ddlatham@gmail.com [mailto:ddlatham@gmail.com] On Behalf Of Dave
> Latham
> Sent: 01 July 2013 16:08
> To: hdfs-user@hadoop.apache.org
> Cc: hdfs-dev@hadoop.apache.org
> Subject: Re: data loss after cluster wide power loss
>
> Much appreciated, Suresh.  Let me know if I can provide any more
> information or if you'd like me to open a JIRA.
>
> Dave
>
>
> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <suresh@hortonworks.com
> >wrote:
>
> > Dave,
> >
> > Thanks for the detailed email. Sorry I did not read all the details
> > you had sent earlier completely (on my phone). As you said, this is
> > not related to data loss related to HBase log and hsync. I think you
> > are right; the rename operation itself might not have hit the disk. I
> > think we should either ensure metadata operation is synced on the
> > datanode or handle it being reported as blockBeingWritten. Let me
> > spend some time to debug this issue.
> >
> > One surprising thing is, all the replicas were reported as
> > blockBeingWritten.
> >
> > Regards,
> > Suresh
> >
> >
> > On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
> >
> >> (Removing hbase list and adding hdfs-dev list as this is pretty
> >> internal stuff).
> >>
> >> Reading through the code a bit:
> >>
> >> FSDataOutputStream.close calls
> >> DFSOutputStream.close calls
> >> DFSOutputStream.closeInternal
> >>  - sets currentPacket.lastPacketInBlock = true
> >>  - then calls
> >> DFSOutputStream.flushInternal
> >>  - enqueues current packet
> >>  - waits for ack
> >>
> >> BlockReceiver.run
> >>  - if (lastPacketInBlock && !receiver.finalized) calls
> >> FSDataset.finalizeBlock calls FSDataset.finalizeBlockInternal calls
> >> FSVolume.addBlock calls FSDir.addBlock calls FSDir.addBlock
> >>  - renames block from "blocksBeingWritten" tmp dir to "current" dest
> >> dir
> >>
> >> This looks to me as I would expect a synchronous chain from a DFS
> >> client to moving the file from blocksBeingWritten to the current dir
> >> so that once the file is closed, the block files would be in
> >> the proper directory
> >> - even if the contents of the file are still in the OS buffer rather
> >> than synced to disk.  It's only after this moving of blocks that
> >> NameNode.complete file is called.  There are several conditions and
> >> loops in there that I'm not certain this chain is fully reliable in
> >> all cases without a greater understanding of the code.
> >>
> >> Could it be the case that the rename operation itself is not synced
> >> and that ext3 lost the fact that the block files were moved?
> >> Or is there a bug in the close file logic that for some reason the
> >> block files are not always moved into place when a file is closed?
> >>
> >> Thanks for your patience,
> >> Dave
> >>
> >>
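
To make the failure mode Dave traces above concrete: the finalize step comes
down to renaming the block file from one directory to another, and a rename is
a metadata-only operation whose durability depends on the local filesystem
getting its journal onto disk. A minimal sketch in plain Java (not the actual
FSDataset code) of that step:

import java.io.File;
import java.io.IOException;

public class FinalizeSketch {
    // Move a finished block file from the blocksBeingWritten directory to the
    // current directory, mirroring the FSDir.addBlock step in the chain above.
    // Nothing here forces the directory entries to disk, so if power is lost
    // before ext3 flushes its journal, the rename can effectively be rolled
    // back and the replica reappears under blocksBeingWritten after reboot.
    static File finalizeBlock(File bbwDir, File currentDir, String blockName)
            throws IOException {
        File src = new File(bbwDir, blockName);
        File dst = new File(currentDir, blockName);
        if (!src.renameTo(dst)) {
            throw new IOException("Failed to rename " + src + " to " + dst);
        }
        return dst;
    }
}
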
> >> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net>
> wrote:
> >>
> >>> Thanks for the response, Suresh.
> >>>
> >>> I'm not sure that I understand the details properly.  From my
> >>> reading of
> >>> HDFS-744 the hsync API would allow a client to make sure that at any
> >>> point in time its writes so far hit the disk.  For example, for
> >>> HBase it could apply a fsync after adding some edits to its WAL to
> >>> ensure those edits are fully durable for a file which is still open.
> >>>
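
For reference, client-side use of that API looks roughly like the sketch
below, assuming a 2.x-style client where FSDataOutputStream exposes hsync()
(the path and edit payload are made up; HBase's real WAL code is of course
more involved):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HsyncSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path("/tmp/wal-sketch"));
        try {
            out.write("edit-1\n".getBytes("UTF-8"));
            // hflush() would only push the edit to the datanodes' memory;
            // hsync() additionally asks them to sync it to disk, which is the
            // durability guarantee discussed for the WAL.
            out.hsync();
        } finally {
            out.close();
        }
    }
}
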
> >>> However, in this case the dfs file was closed and even renamed.  Is
> >>> it the case that even after a dfs file is closed and renamed that
> >>> the data blocks would still not be synced and would still be stored
> >>> by the datanode in "blocksBeingWritten" rather than in "current"?
> >>> If that is the case, would it be better for the NameNode not to reject
> >>> replicas that are in blocksBeingWritten, especially if it doesn't
> >>> have any other replicas available?
> >>>
> >>> Dave
> >>>
> >>>
> >>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <
> suresh@hortonworks.com>wrote:
> >>>
> >>>> Yes this is a known issue.
> >>>>
> >>>> The HDFS part of this was addressed in
> >>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and
> >>>> is not available in 1.x  release. I think HBase does not use this
> >>>> API yet.
> >>>>
> >>>>
> >>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
> >>>> wrote:
> >>>>
> >>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On
> >>>> > Saturday
> >>>> the
> >>>> > data center we were in had a total power failure and the cluster
> >>>> > went
> >>>> down
> >>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
> >>>>  We
> >>>> > recovered the data in question from our secondary datacenter, but
> >>>> > I'm trying to understand what happened and whether this is a bug
> >>>> > in HDFS
> >>>> that
> >>>> > should be fixed.
> >>>> >
> >>>> > From what I can tell the file was created and closed by the dfs
> >>>> > client (hbase).  Then HBase renamed it into a new directory and
> >>>> > deleted some
> >>>> other
> >>>> > files containing the same data.  Then the cluster lost power.
> >>>> > After
> >>>> the
> >>>> > cluster was restarted, the datanodes reported into the namenode
> >>>> > but
> >>>> the
> >>>> > blocks for this file appeared as "blocks being written" - the
> >>>> > namenode rejected them and the datanodes deleted the blocks.  At
> >>>> > this point
> >>>> there
> >>>> > were no replicas for the blocks and the files were marked CORRUPT.
> >>>>  The
> >>>> > underlying file systems are ext3.  Some questions that I would
> >>>> > love
> >>>> get
> >>>> > answers for if anyone with deeper understanding of HDFS can chime
> in:
> >>>> >
> >>>> >  - Is this a known scenario where data loss is expected?  (I
> >>>> > found
> >>>> > HDFS-1539 but that seems different)
> >>>> >  - When are blocks moved from blocksBeingWritten to current?
> >>>> > Does
> >>>> that
> >>>> > happen before a file close operation is acknowledged to a hdfs
> client?
> >>>> >  - Could it be that the DataNodes actually moved the blocks to
> >>>> current but
> >>>> > after the restart ext3 rewound state somehow (forgive my
> >>>> > ignorance of underlying file system behavior)?
> >>>> >  - Is there any other explanation for how this can happen?
> >>>> >
> >>>> > Here is a sequence of selected relevant log lines from the RS
> >>>> > (HBase Region Server) NN (NameNode) and DN (DataNode - 1 example
> >>>> > of 3 in question).  It includes everything that mentions the
> >>>> > block in
> >>>> question in
> >>>> > the NameNode and one DataNode log.  Please let me know if this
> >>>> > more information that would be helpful.
> >>>> >
> >>>> > RS 2013-06-29 11:16:06,812 DEBUG
> org.apache.hadoop.hbase.util.FSUtils:
> >>>> > Creating
> >>>> >
> >>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e
> >>>> /.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > with permission=rwxrwxrwx
> >>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.allocateBlock:
> >>>> >
> >>>>
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
> >>>> > blk_1395839728632046111_357084589 DN 2013-06-29 11:16:06,832 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> >>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
> >>>> > 10.0.5.237:50010
> >>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is
> >>>> > added
> >>>> to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 DN 2013-06-29
> >>>> > 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> >>>> > blk_1395839728632046111_357084589 of size 25418340 from /
> >>>> 10.0.5.237:14327
> >>>>
> >>>> > DN 2013-06-29 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> >>>> > 2 for block blk_1395839728632046111_357084589 terminating NN
> >>>> > 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> > Removing lease on  file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> DIR*
> >>>> > NameSystem.completeFile: file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > RS 2013-06-29 11:16:11,393 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted
> >>>> > file at
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp
> >>>> /6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > to
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e
> >>>> 0cc30af6e64e56ba5a539fdf159c4c
> >>>> > RS 2013-06-29 11:16:11,505 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
> >>>> compaction of 7
> >>>> > file(s) in n of
> >>>> >
> users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
> >>>> into
> >>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for
> >>>> > store is
> >>>> 24.2m
> >>>> >
> >>>> > -------  CRASH, RESTART ---------
> >>>> >
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
> >>>> but was
> >>>> > rejected: Reported as block being written but is a block of
> >>>> > closed
> >>>> file.
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.1:50010
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size
> >>>> > 16971264
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.5.237:50010
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size
> >>>> > 21913088
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.24:50010
> >>>> > (note the clock on the server running DN is wrong after restart.
> >>>> > I believe timestamps are off by 6 hours:) DN 2013-06-29
> >>>> > 06:07:22,877 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> >>>> > blk_1395839728632046111_357084589 file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> > for deletion DN 2013-06-29 06:07:24,952 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> >>>> > blk_1395839728632046111_357084589 at file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> >
> >>>> >
> >>>> > Thanks,
> >>>> > Dave
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> http://hortonworks.com/download/
> >>>>
> >>>
> >>>
> >>
> >
> >
> > --
> > http://hortonworks.com/download/
> >
>

Re: data loss after cluster wide power loss

Posted by Dave Latham <la...@davelink.net>.
On Wed, Jul 3, 2013 at 7:03 AM, Michael Segel <mi...@hotmail.com>wrote:

> How did you lose power to the entire cluster?
>
> I realize that this question goes beyond HBase, but is an Ops question. Do
> you have redundant power sources and redundant power supplies to the racks
> and machines in the cluster?
>

The data center that the cluster is in suffered a complete power failure during
some maintenance they were running.  I don't know all the details, though
apparently they made some mistakes in their process.  Our namenode does
have two power supplies, but they are all dependent on the datacenter.

Re: data loss after cluster wide power loss

Posted by Michael Segel <mi...@hotmail.com>.
Dave, 

How did you lose power to the entire cluster? 

I realize that this question goes beyond HBase, but is an Ops question. Do you have redundant power sources and redundant power supplies to the racks and machines in the cluster? 


On Jul 2, 2013, at 7:42 AM, Dave Latham <la...@davelink.net> wrote:

> Hi Uma,
> 
> Thanks for the pointer.  Your case sounds very similar.  The main differences that I see are that in my case it happened on all 3 replicas and the power failure occurred merely seconds after the blocks were finalized.  So I guess the question is whether HDFS can do anything to better recover from such situations.  I'm also curious whether ext4 would be less susceptible than ext3.
> 
> I will definitely look at enabling dfs.datanode.synconclose once we upgrade to a version of hdfs that has it.  I would love to see some performance numbers if anyone has run them.  It also appears that HBase is considering enabling it by default (cf. comments on HBASE-5954).
> 
> Dave
> 
> 
> On Tue, Jul 2, 2013 at 12:31 AM, Uma Maheswara Rao G <ma...@huawei.com> wrote:
> Hi Dave,
> 
>  Looks like your analysis is correct. I have faced a similar issue some time back.
> See the discussion link: http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results
> On sudden restarts, the OS filesystem can lose edits. A similar thing happened in our case, i.e., after the restart, blocks were moved back to the blocksBeingWritten directory even though they had been finalized.
> After the restart they were marked as corrupt. You could set dfs.datanode.synconclose to true to avoid this sort of thing, but that will degrade performance.
> 
> Regards,
> Uma
> 
> -----Original Message-----
> From: ddlatham@gmail.com [mailto:ddlatham@gmail.com] On Behalf Of Dave Latham
> Sent: 01 July 2013 16:08
> To: hdfs-user@hadoop.apache.org
> Cc: hdfs-dev@hadoop.apache.org
> Subject: Re: data loss after cluster wide power loss
> 
> Much appreciated, Suresh.  Let me know if I can provide any more information or if you'd like me to open a JIRA.
> 
> Dave
> 
> 
> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com>wrote:
> 
> > Dave,
> >
> > Thanks for the detailed email. Sorry I did not read all the details
> > you had sent earlier completely (on my phone). As you said, this is
> > not related to data loss related to HBase log and hsync. I think you
> > are right; the rename operation itself might not have hit the disk. I
> > think we should either ensure metadata operation is synced on the
> > datanode or handle it being reported as blockBeingWritten. Let me
> > spend some time to debug this issue.
> >
> > One surprising thing is, all the replicas were reported as
> > blockBeingWritten.
> >
> > Regards,
> > Suresh
> >
> >
> > On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
> >
> >> (Removing hbase list and adding hdfs-dev list as this is pretty
> >> internal stuff).
> >>
> >> Reading through the code a bit:
> >>
> >> FSDataOutputStream.close calls
> >> DFSOutputStream.close calls
> >> DFSOutputStream.closeInternal
> >>  - sets currentPacket.lastPacketInBlock = true
> >>  - then calls
> >> DFSOutputStream.flushInternal
> >>  - enqueues current packet
> >>  - waits for ack
> >>
> >> BlockReceiver.run
> >>  - if (lastPacketInBlock && !receiver.finalized) calls
> >> FSDataset.finalizeBlock calls FSDataset.finalizeBlockInternal calls
> >> FSVolume.addBlock calls FSDir.addBlock calls FSDir.addBlock
> >>  - renames block from "blocksBeingWritten" tmp dir to "current" dest
> >> dir
> >>
> >> This looks to me as I would expect a synchronous chain from a DFS
> >> client to moving the file from blocksBeingWritten to the current dir
> >> so that once the file is closed, the block files would be in
> >> the proper directory
> >> - even if the contents of the file are still in the OS buffer rather
> >> than synced to disk.  It's only after this moving of blocks that
> >> NameNode.complete file is called.  There are several conditions and
> >> loops in there that I'm not certain this chain is fully reliable in
> >> all cases without a greater understanding of the code.
> >>
> >> Could it be the case that the rename operation itself is not synced
> >> and that ext3 lost the fact that the block files were moved?
> >> Or is there a bug in the close file logic that for some reason the
> >> block files are not always moved into place when a file is closed?
> >>
> >> Thanks for your patience,
> >> Dave
> >>
> >>
> >> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net> wrote:
> >>
> >>> Thanks for the response, Suresh.
> >>>
> >>> I'm not sure that I understand the details properly.  From my
> >>> reading of
> >>> HDFS-744 the hsync API would allow a client to make sure that at any
> >>> point in time its writes so far hit the disk.  For example, for
> >>> HBase it could apply a fsync after adding some edits to its WAL to
> >>> ensure those edits are fully durable for a file which is still open.
> >>>
> >>> However, in this case the dfs file was closed and even renamed.  Is
> >>> it the case that even after a dfs file is closed and renamed that
> >>> the data blocks would still not be synced and would still be stored
> >>> by the datanode in "blocksBeingWritten" rather than in "current"?
> >>> If that is the case, would it be better for the NameNode not to reject
> >>> replicas that are in blocksBeingWritten, especially if it doesn't
> >>> have any other replicas available?
> >>>
> >>> Dave
> >>>
> >>>
> >>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>wrote:
> >>>
> >>>> Yes this is a known issue.
> >>>>
> >>>> The HDFS part of this was addressed in
> >>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and
> >>>> is not available in 1.x  release. I think HBase does not use this
> >>>> API yet.
> >>>>
> >>>>
> >>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
> >>>> wrote:
> >>>>
> >>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On
> >>>> > Saturday
> >>>> the
> >>>> > data center we were in had a total power failure and the cluster
> >>>> > went
> >>>> down
> >>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
> >>>>  We
> >>>> > recovered the data in question from our secondary datacenter, but
> >>>> > I'm trying to understand what happened and whether this is a bug
> >>>> > in HDFS
> >>>> that
> >>>> > should be fixed.
> >>>> >
> >>>> > From what I can tell the file was created and closed by the dfs
> >>>> > client (hbase).  Then HBase renamed it into a new directory and
> >>>> > deleted some
> >>>> other
> >>>> > files containing the same data.  Then the cluster lost power.
> >>>> > After
> >>>> the
> >>>> > cluster was restarted, the datanodes reported into the namenode
> >>>> > but
> >>>> the
> >>>> > blocks for this file appeared as "blocks being written" - the
> >>>> > namenode rejected them and the datanodes deleted the blocks.  At
> >>>> > this point
> >>>> there
> >>>> > were no replicas for the blocks and the files were marked CORRUPT.
> >>>>  The
> >>>> > underlying file systems are ext3.  Some questions that I would
> >>>> > love
> >>>> get
> >>>> > answers for if anyone with deeper understanding of HDFS can chime in:
> >>>> >
> >>>> >  - Is this a known scenario where data loss is expected?  (I
> >>>> > found
> >>>> > HDFS-1539 but that seems different)
> >>>> >  - When are blocks moved from blocksBeingWritten to current?
> >>>> > Does
> >>>> that
> >>>> > happen before a file close operation is acknowledged to a hdfs client?
> >>>> >  - Could it be that the DataNodes actually moved the blocks to
> >>>> current but
> >>>> > after the restart ext3 rewound state somehow (forgive my
> >>>> > ignorance of underlying file system behavior)?
> >>>> >  - Is there any other explanation for how this can happen?
> >>>> >
> >>>> > Here is a sequence of selected relevant log lines from the RS
> >>>> > (HBase Region Server) NN (NameNode) and DN (DataNode - 1 example
> >>>> > of 3 in question).  It includes everything that mentions the
> >>>> > block in
> >>>> question in
> >>>> > the NameNode and one DataNode log.  Please let me know if this
> >>>> > more information that would be helpful.
> >>>> >
> >>>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
> >>>> > Creating
> >>>> >
> >>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e
> >>>> /.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > with permission=rwxrwxrwx
> >>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.allocateBlock:
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
> >>>> > blk_1395839728632046111_357084589 DN 2013-06-29 11:16:06,832 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> >>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
> >>>> > 10.0.5.237:50010
> >>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is
> >>>> > added
> >>>> to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 DN 2013-06-29
> >>>> > 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> >>>> > blk_1395839728632046111_357084589 of size 25418340 from /
> >>>> 10.0.5.237:14327
> >>>>
> >>>> > DN 2013-06-29 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> >>>> > 2 for block blk_1395839728632046111_357084589 terminating NN
> >>>> > 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> > Removing lease on  file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> DIR*
> >>>> > NameSystem.completeFile: file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > RS 2013-06-29 11:16:11,393 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted
> >>>> > file at
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp
> >>>> /6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > to
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e
> >>>> 0cc30af6e64e56ba5a539fdf159c4c
> >>>> > RS 2013-06-29 11:16:11,505 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
> >>>> compaction of 7
> >>>> > file(s) in n of
> >>>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
> >>>> into
> >>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for
> >>>> > store is
> >>>> 24.2m
> >>>> >
> >>>> > -------  CRASH, RESTART ---------
> >>>> >
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
> >>>> but was
> >>>> > rejected: Reported as block being written but is a block of
> >>>> > closed
> >>>> file.
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.1:50010
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size
> >>>> > 16971264
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.5.237:50010
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size
> >>>> > 21913088
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.24:50010
> >>>> > (note the clock on the server running DN is wrong after restart.
> >>>> > I believe timestamps are off by 6 hours:) DN 2013-06-29
> >>>> > 06:07:22,877 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> >>>> > blk_1395839728632046111_357084589 file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> > for deletion DN 2013-06-29 06:07:24,952 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> >>>> > blk_1395839728632046111_357084589 at file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> >
> >>>> >
> >>>> > Thanks,
> >>>> > Dave
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> http://hortonworks.com/download/
> >>>>
> >>>
> >>>
> >>
> >
> >
> > --
> > http://hortonworks.com/download/
> >
> 


Re: data loss after cluster wide power loss

Posted by Michael Segel <mi...@hotmail.com>.
Dave, 

How did you lose power to the entire cluster? 

I realize that this question goes beyond HBase, but it is an Ops question. Do you have redundant power sources and redundant power supplies to the racks and machines in the cluster? 


On Jul 2, 2013, at 7:42 AM, Dave Latham <la...@davelink.net> wrote:

> Hi Uma,
> 
> Thanks for the pointer.  Your case sounds very similar.  The main differences that I see are that in my case it happened on all 3 replicas and the power failure occurred merely seconds after the blocks were finalized.  So I guess the question is whether HDFS can do anything to better recover from such situations.  I'm also curious whether ext4 would be less susceptible than ext3.
> 
> I will definitely look at enabling dfs.datanode.synconclose once we upgrade to a version of hdfs that has it.  I would love to see some performance numbers if anyone has run them.  It also appears that HBase is considering enabling it by default (cf. comments on HBase-5954).
> 
> Dave
> 
> 
> On Tue, Jul 2, 2013 at 12:31 AM, Uma Maheswara Rao G <ma...@huawei.com> wrote:
> Hi Dave,
> 
>  Looks like your analysis is correct. I have faced similar issue some time back.
> See the discussion link: http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results
> On sudden restarts, the OS filesystem can lose edits. A similar thing happened in our case, i.e., after restart blocks were moved back to the BeingWritten directory even though they were finalized.
> After restart they were marked as corrupt. You could set dfs.datanode.synconclose to true to avoid this sort of thing, but that will degrade performance.
> 
> Regards,
> Uma
> 
> -----Original Message-----
> From: ddlatham@gmail.com [mailto:ddlatham@gmail.com] On Behalf Of Dave Latham
> Sent: 01 July 2013 16:08
> To: hdfs-user@hadoop.apache.org
> Cc: hdfs-dev@hadoop.apache.org
> Subject: Re: data loss after cluster wide power loss
> 
> Much appreciated, Suresh.  Let me know if I can provide any more information or if you'd like me to open a JIRA.
> 
> Dave
> 
> 
> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com>wrote:
> 
> > Dave,
> >
> > Thanks for the detailed email. Sorry I did not read all the details
> > you had sent earlier completely (on my phone). As you said, this is
> > not related to data loss related to HBase log and hsync. I think you
> > are right; the rename operation itself might not have hit the disk. I
> > think we should either ensure metadata operation is synced on the
> > datanode or handle it being reported as blockBeingWritten. Let me
> > spend some time debugging this issue.
> >
> > One surprising thing is, all the replicas were reported as
> > blockBeingWritten.
> >
> > Regards,
> > Suresh
> >
> >
> > On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
> >
> >> (Removing hbase list and adding hdfs-dev list as this is pretty
> >> internal stuff).
> >>
> >> Reading through the code a bit:
> >>
> >> FSDataOutputStream.close calls
> >> DFSOutputStream.close calls
> >> DFSOutputStream.closeInternal
> >>  - sets currentPacket.lastPacketInBlock = true
> >>  - then calls
> >> DFSOutputStream.flushInternal
> >>  - enqueues current packet
> >>  - waits for ack
> >>
> >> BlockReceiver.run
> >>  - if (lastPacketInBlock && !receiver.finalized) calls
> >> FSDataset.finalizeBlock calls FSDataset.finalizeBlockInternal calls
> >> FSVolume.addBlock calls FSDir.addBlock calls FSDir.addBlock
> >>  - renames block from "blocksBeingWritten" tmp dir to "current" dest
> >> dir
> >>
> >> This looks to me as I would expect a synchronous chain from a DFS
> >> client to moving the file from blocksBeingWritten to the current dir
> >> so that once the file is closed the block files would be in
> >> the proper directory
> >> - even if the contents of the file are still in the OS buffer rather
> >> than synced to disk.  It's only after this moving of blocks that
> >> NameNode.complete file is called.  There are several conditions and
> >> loops in there that I'm not certain this chain is fully reliable in
> >> all cases without a greater understanding of the code.
> >>
> >> Could it be the case that the rename operation itself is not synced
> >> and that ext3 lost the fact that the block files were moved?
> >> Or is there a bug in the close file logic that for some reason the
> >> block files are not always moved into place when a file is closed?
> >>
> >> Thanks for your patience,
> >> Dave
> >>
> >>
> >> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net> wrote:
> >>
> >>> Thanks for the response, Suresh.
> >>>
> >>> I'm not sure that I understand the details properly.  From my
> >>> reading of
> >>> HDFS-744 the hsync API would allow a client to make sure that at any
> >>> point in time its writes so far hit the disk.  For example, for
> >>> HBase it could apply a fsync after adding some edits to its WAL to
> >>> ensure those edits are fully durable for a file which is still open.
> >>>
> >>> However, in this case the dfs file was closed and even renamed.  Is
> >>> it the case that even after a dfs file is closed and renamed that
> >>> the data blocks would still not be synced and would still be stored
> >>> by the datanode in "blocksBeingWritten" rather than in "current"?
> >>> If that is the case, would it be better for the NameNode not to reject
> >>> replicas that are in blocksBeingWritten, especially if it doesn't
> >>> have any other replicas available?
> >>>
> >>> Dave
> >>>
> >>>
> >>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>wrote:
> >>>
> >>>> Yes this is a known issue.
> >>>>
> >>>> The HDFS part of this was addressed in
> >>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and
> >>>> is not available in 1.x  release. I think HBase does not use this
> >>>> API yet.
> >>>>
> >>>>
> >>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
> >>>> wrote:
> >>>>
> >>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On
> >>>> > Saturday
> >>>> the
> >>>> > data center we were in had a total power failure and the cluster
> >>>> > went
> >>>> down
> >>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
> >>>>  We
> >>>> > recovered the data in question from our secondary datacenter, but
> >>>> > I'm trying to understand what happened and whether this is a bug
> >>>> > in HDFS
> >>>> that
> >>>> > should be fixed.
> >>>> >
> >>>> > From what I can tell the file was created and closed by the dfs
> >>>> > client (hbase).  Then HBase renamed it into a new directory and
> >>>> > deleted some
> >>>> other
> >>>> > files containing the same data.  Then the cluster lost power.
> >>>> > After
> >>>> the
> >>>> > cluster was restarted, the datanodes reported into the namenode
> >>>> > but
> >>>> the
> >>>> > blocks for this file appeared as "blocks being written" - the
> >>>> > namenode rejected them and the datanodes deleted the blocks.  At
> >>>> > this point
> >>>> there
> >>>> > were no replicas for the blocks and the files were marked CORRUPT.
> >>>>  The
> >>>> > underlying file systems are ext3.  Some questions that I would
> >>>> > love
> >>>> get
> >>>> > answers for if anyone with deeper understanding of HDFS can chime in:
> >>>> >
> >>>> >  - Is this a known scenario where data loss is expected?  (I
> >>>> > found
> >>>> > HDFS-1539 but that seems different)
> >>>> >  - When are blocks moved from blocksBeingWritten to current?
> >>>> > Does
> >>>> that
> >>>> > happen before a file close operation is acknowledged to a hdfs client?
> >>>> >  - Could it be that the DataNodes actually moved the blocks to
> >>>> current but
> >>>> > after the restart ext3 rewound state somehow (forgive my
> >>>> > ignorance of underlying file system behavior)?
> >>>> >  - Is there any other explanation for how this can happen?
> >>>> >
> >>>> > Here is a sequence of selected relevant log lines from the RS
> >>>> > (HBase Region Server) NN (NameNode) and DN (DataNode - 1 example
> >>>> > of 3 in question).  It includes everything that mentions the
> >>>> > block in
> >>>> question in
> >>>> > the NameNode and one DataNode log.  Please let me know if this
> >>>> > more information that would be helpful.
> >>>> >
> >>>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
> >>>> > Creating
> >>>> >
> >>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e
> >>>> /.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > with permission=rwxrwxrwx
> >>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.allocateBlock:
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
> >>>> > blk_1395839728632046111_357084589 DN 2013-06-29 11:16:06,832 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> >>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
> >>>> > 10.0.5.237:50010
> >>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is
> >>>> > added
> >>>> to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 DN 2013-06-29
> >>>> > 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> >>>> > blk_1395839728632046111_357084589 of size 25418340 from /
> >>>> 10.0.5.237:14327
> >>>>
> >>>> > DN 2013-06-29 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> >>>> > 2 for block blk_1395839728632046111_357084589 terminating NN
> >>>> > 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> > Removing lease on  file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> DIR*
> >>>> > NameSystem.completeFile: file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > RS 2013-06-29 11:16:11,393 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted
> >>>> > file at
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp
> >>>> /6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > to
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e
> >>>> 0cc30af6e64e56ba5a539fdf159c4c
> >>>> > RS 2013-06-29 11:16:11,505 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
> >>>> compaction of 7
> >>>> > file(s) in n of
> >>>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
> >>>> into
> >>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for
> >>>> > store is
> >>>> 24.2m
> >>>> >
> >>>> > -------  CRASH, RESTART ---------
> >>>> >
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
> >>>> but was
> >>>> > rejected: Reported as block being written but is a block of
> >>>> > closed
> >>>> file.
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.1:50010
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size
> >>>> > 16971264
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.5.237:50010
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size
> >>>> > 21913088
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.24:50010
> >>>> > (note the clock on the server running DN is wrong after restart.
> >>>> > I believe timestamps are off by 6 hours:) DN 2013-06-29
> >>>> > 06:07:22,877 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> >>>> > blk_1395839728632046111_357084589 file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> > for deletion DN 2013-06-29 06:07:24,952 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> >>>> > blk_1395839728632046111_357084589 at file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> >
> >>>> >
> >>>> > Thanks,
> >>>> > Dave
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> http://hortonworks.com/download/
> >>>>
> >>>
> >>>
> >>
> >
> >
> > --
> > http://hortonworks.com/download/
> >
> 
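
For anyone following the HDFS-744 discussion quoted above, here is a minimal client-side sketch of what using the hsync API could look like once it is available. This assumes a Hadoop 2.x-era FSDataOutputStream that implements Syncable's hsync(); the file path is purely illustrative and nothing here is taken from the cluster in question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HsyncSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path wal = new Path("/tmp/example-wal");          // illustrative path only
    FSDataOutputStream out = fs.create(wal, true);    // overwrite if present
    try {
      out.write("edit-1\n".getBytes("UTF-8"));
      out.hflush();  // make the bytes visible to new readers
      out.hsync();   // additionally ask the pipeline datanodes to fsync to disk
    } finally {
      out.close();
    }
  }
}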


Re: data loss after cluster wide power loss

Posted by Dave Latham <la...@davelink.net>.
Hi Uma,

Thanks for the pointer.  Your case sounds very similar.  The main
differences that I see are that in my case it happened on all 3 replicas
and the power failure occurred merely seconds after the blocks were
finalized.  So I guess the question is whether HDFS can do anything to
better recover from such situations.  I'm also curious whether ext4 would
be less susceptible than ext3.

I will definitely look at enabling dfs.datanode.synconclose once we upgrade
to a version of hdfs that has it.  I would love to see some performance
numbers if anyone has run them.  It also appears that HBase is considering
enabling it by default (cf. comments on HBase-5954).
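
As a concrete illustration, here is a minimal sketch of flipping that switch through the standard Hadoop Configuration API. In practice the property would be set in hdfs-site.xml on each DataNode; the key name is the one discussed in this thread, the class below is just a stand-alone example, and the setting only has an effect on releases that actually ship it.

import org.apache.hadoop.conf.Configuration;

public class SyncOnCloseConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Ask DataNodes to fsync block files when a block is finalized:
    // safer across power loss, at the cost of extra write latency.
    conf.setBoolean("dfs.datanode.synconclose", true);
    // Read it back, defaulting to false to mirror the out-of-the-box value.
    System.out.println("dfs.datanode.synconclose = "
        + conf.getBoolean("dfs.datanode.synconclose", false));
  }
}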

Dave


On Tue, Jul 2, 2013 at 12:31 AM, Uma Maheswara Rao G
<ma...@huawei.com>wrote:

> Hi Dave,
>
>  Looks like your analysis is correct. I have faced similar issue some time
> back.
> See the discussion link:
> http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results
> On sudden restarts, the OS filesystem can lose edits. A similar thing
> happened in our case, i.e., after restart blocks were moved back to the
> BeingWritten directory even though they were finalized.
> After restart they were marked as corrupt. You could set
> dfs.datanode.synconclose to true to avoid this sort of thing, but that
> will degrade performance.
>
> Regards,
> Uma
>
> -----Original Message-----
> From: ddlatham@gmail.com [mailto:ddlatham@gmail.com] On Behalf Of Dave
> Latham
> Sent: 01 July 2013 16:08
> To: hdfs-user@hadoop.apache.org
> Cc: hdfs-dev@hadoop.apache.org
> Subject: Re: data loss after cluster wide power loss
>
> Much appreciated, Suresh.  Let me know if I can provide any more
> information or if you'd like me to open a JIRA.
>
> Dave
>
>
> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <suresh@hortonworks.com
> >wrote:
>
> > Dave,
> >
> > Thanks for the detailed email. Sorry I did not read all the details
> > you had sent earlier completely (on my phone). As you said, this is
> > not related to data loss related to HBase log and hsync. I think you
> > are right; the rename operation itself might not have hit the disk. I
> > think we should either ensure metadata operation is synced on the
> > datanode or handle it being reported as blockBeingWritten. Let me
> > spend some time debugging this issue.
> >
> > One surprising thing is, all the replicas were reported as
> > blockBeingWritten.
> >
> > Regards,
> > Suresh
> >
> >
> > On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
> >
> >> (Removing hbase list and adding hdfs-dev list as this is pretty
> >> internal stuff).
> >>
> >> Reading through the code a bit:
> >>
> >> FSDataOutputStream.close calls
> >> DFSOutputStream.close calls
> >> DFSOutputStream.closeInternal
> >>  - sets currentPacket.lastPacketInBlock = true
> >>  - then calls
> >> DFSOutputStream.flushInternal
> >>  - enqueues current packet
> >>  - waits for ack
> >>
> >> BlockReceiver.run
> >>  - if (lastPacketInBlock && !receiver.finalized) calls
> >> FSDataset.finalizeBlock calls FSDataset.finalizeBlockInternal calls
> >> FSVolume.addBlock calls FSDir.addBlock calls FSDir.addBlock
> >>  - renames block from "blocksBeingWritten" tmp dir to "current" dest
> >> dir
> >>
> >> This looks to me as I would expect a synchronous chain from a DFS
> >> client to moving the file from blocksBeingWritten to the current dir
> >> so that once the file is closed the block files would be in
> >> the proper directory
> >> - even if the contents of the file are still in the OS buffer rather
> >> than synced to disk.  It's only after this moving of blocks that
> >> NameNode.complete file is called.  There are several conditions and
> >> loops in there that I'm not certain this chain is fully reliable in
> >> all cases without a greater understanding of the code.
> >>
> >> Could it be the case that the rename operation itself is not synced
> >> and that ext3 lost the fact that the block files were moved?
> >> Or is there a bug in the close file logic that for some reason the
> >> block files are not always moved into place when a file is closed?
> >>
> >> Thanks for your patience,
> >> Dave
> >>
> >>
> >> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net>
> wrote:
> >>
> >>> Thanks for the response, Suresh.
> >>>
> >>> I'm not sure that I understand the details properly.  From my
> >>> reading of
> >>> HDFS-744 the hsync API would allow a client to make sure that at any
> >>> point in time its writes so far hit the disk.  For example, for
> >>> HBase it could apply a fsync after adding some edits to its WAL to
> >>> ensure those edits are fully durable for a file which is still open.
> >>>
> >>> However, in this case the dfs file was closed and even renamed.  Is
> >>> it the case that even after a dfs file is closed and renamed that
> >>> the data blocks would still not be synced and would still be stored
> >>> by the datanode in "blocksBeingWritten" rather than in "current"?
> >>> If that is the case, would it be better for the NameNode not to reject
> >>> replicas that are in blocksBeingWritten, especially if it doesn't
> >>> have any other replicas available?
> >>>
> >>> Dave
> >>>
> >>>
> >>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <
> suresh@hortonworks.com>wrote:
> >>>
> >>>> Yes this is a known issue.
> >>>>
> >>>> The HDFS part of this was addressed in
> >>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and
> >>>> is not available in 1.x  release. I think HBase does not use this
> >>>> API yet.
> >>>>
> >>>>
> >>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
> >>>> wrote:
> >>>>
> >>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On
> >>>> > Saturday
> >>>> the
> >>>> > data center we were in had a total power failure and the cluster
> >>>> > went
> >>>> down
> >>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
> >>>>  We
> >>>> > recovered the data in question from our secondary datacenter, but
> >>>> > I'm trying to understand what happened and whether this is a bug
> >>>> > in HDFS
> >>>> that
> >>>> > should be fixed.
> >>>> >
> >>>> > From what I can tell the file was created and closed by the dfs
> >>>> > client (hbase).  Then HBase renamed it into a new directory and
> >>>> > deleted some
> >>>> other
> >>>> > files containing the same data.  Then the cluster lost power.
> >>>> > After
> >>>> the
> >>>> > cluster was restarted, the datanodes reported into the namenode
> >>>> > but
> >>>> the
> >>>> > blocks for this file appeared as "blocks being written" - the
> >>>> > namenode rejected them and the datanodes deleted the blocks.  At
> >>>> > this point
> >>>> there
> >>>> > were no replicas for the blocks and the files were marked CORRUPT.
> >>>>  The
> >>>> > underlying file systems are ext3.  Some questions that I would
> >>>> > love
> >>>> get
> >>>> > answers for if anyone with deeper understanding of HDFS can chime
> in:
> >>>> >
> >>>> >  - Is this a known scenario where data loss is expected?  (I
> >>>> > found
> >>>> > HDFS-1539 but that seems different)
> >>>> >  - When are blocks moved from blocksBeingWritten to current?
> >>>> > Does
> >>>> that
> >>>> > happen before a file close operation is acknowledged to a hdfs
> client?
> >>>> >  - Could it be that the DataNodes actually moved the blocks to
> >>>> current but
> >>>> > after the restart ext3 rewound state somehow (forgive my
> >>>> > ignorance of underlying file system behavior)?
> >>>> >  - Is there any other explanation for how this can happen?
> >>>> >
> >>>> > Here is a sequence of selected relevant log lines from the RS
> >>>> > (HBase Region Server) NN (NameNode) and DN (DataNode - 1 example
> >>>> > of 3 in question).  It includes everything that mentions the
> >>>> > block in
> >>>> question in
> >>>> > the NameNode and one DataNode log.  Please let me know if this
> >>>> > more information that would be helpful.
> >>>> >
> >>>> > RS 2013-06-29 11:16:06,812 DEBUG
> org.apache.hadoop.hbase.util.FSUtils:
> >>>> > Creating
> >>>> >
> >>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e
> >>>> /.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > with permission=rwxrwxrwx
> >>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.allocateBlock:
> >>>> >
> >>>>
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
> >>>> > blk_1395839728632046111_357084589 DN 2013-06-29 11:16:06,832 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> >>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
> >>>> > 10.0.5.237:50010
> >>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is
> >>>> > added
> >>>> to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 DN 2013-06-29
> >>>> > 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> >>>> > blk_1395839728632046111_357084589 of size 25418340 from /
> >>>> 10.0.5.237:14327
> >>>>
> >>>> > DN 2013-06-29 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> >>>> > 2 for block blk_1395839728632046111_357084589 terminating NN
> >>>> > 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> > Removing lease on  file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> DIR*
> >>>> > NameSystem.completeFile: file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > RS 2013-06-29 11:16:11,393 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted
> >>>> > file at
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp
> >>>> /6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > to
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e
> >>>> 0cc30af6e64e56ba5a539fdf159c4c
> >>>> > RS 2013-06-29 11:16:11,505 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
> >>>> compaction of 7
> >>>> > file(s) in n of
> >>>> >
> users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
> >>>> into
> >>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for
> >>>> > store is
> >>>> 24.2m
> >>>> >
> >>>> > -------  CRASH, RESTART ---------
> >>>> >
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
> >>>> but was
> >>>> > rejected: Reported as block being written but is a block of
> >>>> > closed
> >>>> file.
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.1:50010
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size
> >>>> > 16971264
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.5.237:50010
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size
> >>>> > 21913088
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.24:50010
> >>>> > (note the clock on the server running DN is wrong after restart.
> >>>> > I believe timestamps are off by 6 hours:) DN 2013-06-29
> >>>> > 06:07:22,877 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> >>>> > blk_1395839728632046111_357084589 file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> > for deletion DN 2013-06-29 06:07:24,952 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> >>>> > blk_1395839728632046111_357084589 at file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> >
> >>>> >
> >>>> > Thanks,
> >>>> > Dave
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> http://hortonworks.com/download/
> >>>>
> >>>
> >>>
> >>
> >
> >
> > --
> > http://hortonworks.com/download/
> >
>
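
To make the rename-durability question in the quoted analysis above concrete, here is a minimal local-filesystem sketch in plain java.nio. It is not DataNode code, and the directory and block names are made up; it only illustrates that a rename reliably survives a power loss when the file contents and the affected directory entries have all been synced, which is the kind of guarantee being asked about.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class DurableRenameSketch {
  // Best-effort directory sync: opening a directory read-only and calling
  // force() fsyncs it on Linux, but may be unsupported on other platforms.
  static void fsyncDir(Path dir) throws IOException {
    try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
      ch.force(true);
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmpDir = Paths.get("blocksBeingWritten");   // illustrative names only
    Path curDir = Paths.get("current");
    Files.createDirectories(tmpDir);
    Files.createDirectories(curDir);

    Path blk = tmpDir.resolve("blk_example");
    try (FileChannel ch = FileChannel.open(blk,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
      ch.write(ByteBuffer.wrap("block data".getBytes(StandardCharsets.UTF_8)));
      ch.force(true);                                // sync the file itself
    }

    // The rename alone only updates directory metadata, which the filesystem
    // may still be holding in cache or an unflushed journal at power-loss time.
    Files.move(blk, curDir.resolve("blk_example"), StandardCopyOption.ATOMIC_MOVE);
    fsyncDir(curDir);   // persist the new directory entry
    fsyncDir(tmpDir);   // persist removal of the old entry
  }
}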

Re: data loss after cluster wide power loss

Posted by Dave Latham <la...@davelink.net>.
Hi Uma,

Thanks for the pointer.  Your case sounds very similar.  The main
differences that I see are that in my case it happened on all 3 replicas
and the power failure occurred merely seconds after the blocks were
finalized.  So I guess the question is whether HDFS can do anything to
better recover from such situations.  I'm also curious whether ext4 would
be less susceptible than ext3.

I will definitely look at enabling dfs.datanode.synconclose once we upgrade
to a version of HDFS that has it.  I would love to see some performance
numbers if anyone has run them.  It also appears that HBase is considering
enabling it by default (cf. the comments on HBASE-5954).
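
In case anyone wants to measure it, here is a rough, untested sketch of
timing that overhead from a client (the property itself would be set to
true in hdfs-site.xml on the datanodes; the class name, path, and sizes
below are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Rough sketch: with dfs.datanode.synconclose=true the datanodes should
    // fsync each block file as it is finalized, so any extra cost ought to
    // show up in close().
    public class SyncOnCloseTiming {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/synconclose-test");   // placeholder path
        byte[] buf = new byte[1024 * 1024];
        FSDataOutputStream out = fs.create(p, true);
        for (int i = 0; i < 64; i++) {                // ~64MB, about one block
          out.write(buf);
        }
        long start = System.currentTimeMillis();
        out.close();                                  // blocks are finalized here
        System.out.println("close() took "
            + (System.currentTimeMillis() - start) + " ms");
        fs.delete(p, true);
      }
    }

At the default 64MB block size that would be roughly one fsync per block,
paid at close time.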

Dave


On Tue, Jul 2, 2013 at 12:31 AM, Uma Maheswara Rao G
<ma...@huawei.com>wrote:

> Hi Dave,
>
>  Looks like your analysis is correct. I have faced a similar issue some
> time back.
> See the discussion link:
> http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results
> On sudden restarts, the OS filesystem can lose edits. A similar thing
> happened in our case, i.e., after the restart, blocks were moved back to the
> blocksBeingWritten directory even though they had been finalized.
> After the restart they were marked as corrupt. You could set
> dfs.datanode.synconclose to true to avoid this sort of thing, but that
> will degrade performance.
>
> Regards,
> Uma
>
> -----Original Message-----
> From: ddlatham@gmail.com [mailto:ddlatham@gmail.com] On Behalf Of Dave
> Latham
> Sent: 01 July 2013 16:08
> To: hdfs-user@hadoop.apache.org
> Cc: hdfs-dev@hadoop.apache.org
> Subject: Re: data loss after cluster wide power loss
>
> Much appreciated, Suresh.  Let me know if I can provide any more
> information or if you'd like me to open a JIRA.
>
> Dave
>
>
> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <suresh@hortonworks.com
> >wrote:
>
> > Dave,
> >
> > Thanks for the detailed email. Sorry, I did not completely read all the
> > details you had sent earlier (on my phone). As you said, this is not
> > related to the data loss issue with the HBase log and hsync. I think you
> > are right; the rename operation itself might not have hit the disk. I
> > think we should either ensure the metadata operation is synced on the
> > datanode or handle it being reported as blockBeingWritten. Let me spend
> > some time debugging this issue.
> >
> > One surprising thing is, all the replicas were reported as
> > blockBeingWritten.
> >
> > Regards,
> > Suresh
> >
> >
> > On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
> >
> >> (Removing hbase list and adding hdfs-dev list as this is pretty
> >> internal stuff).
> >>
> >> Reading through the code a bit:
> >>
> >> FSDataOutputStream.close calls
> >> DFSOutputStream.close calls
> >> DFSOutputStream.closeInternal
> >>  - sets currentPacket.lastPacketInBlock = true
> >>  - then calls
> >> DFSOutputStream.flushInternal
> >>  - enqueues current packet
> >>  - waits for ack
> >>
> >> BlockReceiver.run
> >>  - if (lastPacketInBlock && !receiver.finalized) calls
> >> FSDataset.finalizeBlock calls FSDataset.finalizeBlockInternal calls
> >> FSVolume.addBlock calls FSDir.addBlock calls FSDir.addBlock
> >>  - renames block from "blocksBeingWritten" tmp dir to "current" dest
> >> dir
> >>
> >> This looks to me like a synchronous chain from a DFS client close to
> >> moving the block files from blocksBeingWritten to the current dir, so
> >> that once the file is closed the block files should be in the proper
> >> directory - even if the contents of the file are still in the OS buffer
> >> rather than synced to disk.  It's only after this moving of blocks that
> >> NameNode.completeFile is called.  There are several conditions and
> >> loops in there, so I'm not certain this chain is fully reliable in
> >> all cases without a greater understanding of the code.
> >>
> >> Could it be the case that the rename operation itself is not synced
> >> and that ext3 lost the fact that the block files were moved?
> >> Or is there a bug in the close file logic that for some reason the
> >> block files are not always moved into place when a file is closed?
> >>
> >> Thanks for your patience,
> >> Dave
> >>
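
For what it's worth, here is a minimal sketch in plain java.nio (not the
actual HDFS datanode code - the class and method names are made up) of what
a fully durable finalize would have to do: sync the block file's contents
and then the destination directory, so that the rename itself survives a
power loss.

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Illustrative only.  On POSIX filesystems a rename is only durable once
    // the destination directory itself has been fsync'ed.
    public class DurableFinalizeSketch {
      static void finalizeBlock(Path tmpBlock, Path finalBlock) throws IOException {
        // 1. Flush the block contents to disk.
        try (FileChannel ch = FileChannel.open(tmpBlock, StandardOpenOption.WRITE)) {
          ch.force(true);
        }
        // 2. Move it from blocksBeingWritten to current.
        Files.move(tmpBlock, finalBlock);
        // 3. Sync the destination directory so the new entry is persisted.
        //    (Opening a directory for read and calling force() works on Linux;
        //    not every platform allows it.)
        try (FileChannel dir = FileChannel.open(finalBlock.getParent(),
            StandardOpenOption.READ)) {
          dir.force(true);
        }
      }
    }

The directory sync in step 3 is exactly the piece that an unsynced rename on
ext3 could lose.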
> >>
> >> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net>
> wrote:
> >>
> >>> Thanks for the response, Suresh.
> >>>
> >>> I'm not sure that I understand the details properly.  From my
> >>> reading of
> >>> HDFS-744, the hsync API would allow a client to make sure that at any
> >>> point in time its writes so far have hit the disk.  For example, for
> >>> HBase it could apply a fsync after adding some edits to its WAL to
> >>> ensure those edits are fully durable for a file which is still open.
> >>>
> >>> However, in this case the dfs file was closed and even renamed.  Is
> >>> it the case that even after a dfs file is closed and renamed that
> >>> the data blocks would still not be synced and would still be stored
> >>> by the datanode in "blocksBeingWritten" rather than in "current"?
> >>> If that is the case, would it be better for the NameNode not to reject
> >>> replicas that are in blocksBeingWritten, especially if it doesn't
> >>> have any other replicas available?
> >>>
> >>> Dave
> >>>
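
For illustration, usage of the hsync API mentioned above, on a release that
has it, would look roughly like the sketch below (we have not run this; the
path and payload are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Rough sketch of the HDFS-744 hsync call: force WAL edits to disk on
    // the datanodes while the file is still open.
    public class HsyncSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream wal = fs.create(new Path("/tmp/example-wal"), true)) {
          wal.write("some WAL edits".getBytes("UTF-8"));
          wal.hsync();  // edits should now be on disk on the datanodes, file still open
        }
      }
    }

It would not have helped in this case, though, since the file here was
already closed and renamed; the gap is in making the finalize itself durable.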
> >>>
> >>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <
> suresh@hortonworks.com>wrote:
> >>>
> >>>> Yes this is a known issue.
> >>>>
> >>>> The HDFS part of this was addressed in
> >>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and
> >>>> is not available in 1.x  release. I think HBase does not use this
> >>>> API yet.
> >>>>
> >>>>
> >>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
> >>>> wrote:
> >>>>
> >>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On
> >>>> > Saturday
> >>>> the
> >>>> > data center we were in had a total power failure and the cluster
> >>>> > went
> >>>> down
> >>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
> >>>>  We
> >>>> > recovered the data in question from our secondary datacenter, but
> >>>> > I'm trying to understand what happened and whether this is a bug
> >>>> > in HDFS
> >>>> that
> >>>> > should be fixed.
> >>>> >
> >>>> > From what I can tell the file was created and closed by the dfs
> >>>> > client (hbase).  Then HBase renamed it into a new directory and
> >>>> > deleted some
> >>>> other
> >>>> > files containing the same data.  Then the cluster lost power.
> >>>> > After
> >>>> the
> >>>> > cluster was restarted, the datanodes reported into the namenode
> >>>> > but
> >>>> the
> >>>> > blocks for this file appeared as "blocks being written" - the
> >>>> > namenode rejected them and the datanodes deleted the blocks.  At
> >>>> > this point
> >>>> there
> >>>> > were no replicas for the blocks and the files were marked CORRUPT.
> >>>>  The
> >>>> > underlying file systems are ext3.  Some questions that I would
> >>>> > love
> >>>> get
> >>>> > answers for if anyone with deeper understanding of HDFS can chime
> in:
> >>>> >
> >>>> >  - Is this a known scenario where data loss is expected?  (I
> >>>> > found
> >>>> > HDFS-1539 but that seems different)
> >>>> >  - When are blocks moved from blocksBeingWritten to current?
> >>>> > Does
> >>>> that
> >>>> > happen before a file close operation is acknowledged to a hdfs
> client?
> >>>> >  - Could it be that the DataNodes actually moved the blocks to
> >>>> current but
> >>>> > after the restart ext3 rewound state somehow (forgive my
> >>>> > ignorance of underlying file system behavior)?
> >>>> >  - Is there any other explanation for how this can happen?
> >>>> >
> >>>> > Here is a sequence of selected relevant log lines from the RS
> >>>> > (HBase Region Server) NN (NameNode) and DN (DataNode - 1 example
> >>>> > of 3 in question).  It includes everything that mentions the
> >>>> > block in
> >>>> question in
> >>>> > the NameNode and one DataNode log.  Please let me know if this
> >>>> > more information that would be helpful.
> >>>> >
> >>>> > RS 2013-06-29 11:16:06,812 DEBUG
> org.apache.hadoop.hbase.util.FSUtils:
> >>>> > Creating
> >>>> >
> >>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e
> >>>> /.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > with permission=rwxrwxrwx
> >>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.allocateBlock:
> >>>> >
> >>>>
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
> >>>> > blk_1395839728632046111_357084589 DN 2013-06-29 11:16:06,832 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> >>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
> >>>> > 10.0.5.237:50010
> >>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is
> >>>> > added
> >>>> to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 DN 2013-06-29
> >>>> > 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> >>>> > blk_1395839728632046111_357084589 of size 25418340 from /
> >>>> 10.0.5.237:14327
> >>>>
> >>>> > DN 2013-06-29 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> >>>> > 2 for block blk_1395839728632046111_357084589 terminating NN
> >>>> > 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> > Removing lease on  file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> DIR*
> >>>> > NameSystem.completeFile: file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > RS 2013-06-29 11:16:11,393 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted
> >>>> > file at
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp
> >>>> /6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > to
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e
> >>>> 0cc30af6e64e56ba5a539fdf159c4c
> >>>> > RS 2013-06-29 11:16:11,505 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
> >>>> compaction of 7
> >>>> > file(s) in n of
> >>>> >
> users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
> >>>> into
> >>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for
> >>>> > store is
> >>>> 24.2m
> >>>> >
> >>>> > -------  CRASH, RESTART ---------
> >>>> >
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
> >>>> but was
> >>>> > rejected: Reported as block being written but is a block of
> >>>> > closed
> >>>> file.
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.1:50010
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size
> >>>> > 16971264
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.5.237:50010
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size
> >>>> > 21913088
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.24:50010
> >>>> > (note the clock on the server running DN is wrong after restart.
> >>>> > I believe timestamps are off by 6 hours:) DN 2013-06-29
> >>>> > 06:07:22,877 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> >>>> > blk_1395839728632046111_357084589 file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> > for deletion DN 2013-06-29 06:07:24,952 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> >>>> > blk_1395839728632046111_357084589 at file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> >
> >>>> >
> >>>> > Thanks,
> >>>> > Dave
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> http://hortonworks.com/download/
> >>>>
> >>>
> >>>
> >>
> >
> >
> > --
> > http://hortonworks.com/download/
> >
>

Re: data loss after cluster wide power loss

Posted by Azuryy Yu <az...@gmail.com>.
Hi Uma,

I think there is minimal performance degradation if
dfs.datanode.synconclose is set to true.


On Tue, Jul 2, 2013 at 3:31 PM, Uma Maheswara Rao G <ma...@huawei.com>wrote:

> Hi Dave,
>
>  Looks like your analysis is correct. I have faced a similar issue some
> time back.
> See the discussion link:
> http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results
> On sudden restarts, the OS filesystem can lose edits. A similar thing
> happened in our case, i.e., after the restart, blocks were moved back to the
> blocksBeingWritten directory even though they had been finalized.
> After the restart they were marked as corrupt. You could set
> dfs.datanode.synconclose to true to avoid this sort of thing, but that
> will degrade performance.
>
> Regards,
> Uma
>
> -----Original Message-----
> From: ddlatham@gmail.com [mailto:ddlatham@gmail.com] On Behalf Of Dave
> Latham
> Sent: 01 July 2013 16:08
> To: hdfs-user@hadoop.apache.org
> Cc: hdfs-dev@hadoop.apache.org
> Subject: Re: data loss after cluster wide power loss
>
> Much appreciated, Suresh.  Let me know if I can provide any more
> information or if you'd like me to open a JIRA.
>
> Dave
>
>
> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <suresh@hortonworks.com
> >wrote:
>
> > Dave,
> >
> > Thanks for the detailed email. Sorry, I did not completely read all the
> > details you had sent earlier (on my phone). As you said, this is not
> > related to the data loss issue with the HBase log and hsync. I think you
> > are right; the rename operation itself might not have hit the disk. I
> > think we should either ensure the metadata operation is synced on the
> > datanode or handle it being reported as blockBeingWritten. Let me spend
> > some time debugging this issue.
> >
> > One surprising thing is, all the replicas were reported as
> > blockBeingWritten.
> >
> > Regards,
> > Suresh
> >
> >
> > On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
> >
> >> (Removing hbase list and adding hdfs-dev list as this is pretty
> >> internal stuff).
> >>
> >> Reading through the code a bit:
> >>
> >> FSDataOutputStream.close calls
> >> DFSOutputStream.close calls
> >> DFSOutputStream.closeInternal
> >>  - sets currentPacket.lastPacketInBlock = true
> >>  - then calls
> >> DFSOutputStream.flushInternal
> >>  - enqueues current packet
> >>  - waits for ack
> >>
> >> BlockReceiver.run
> >>  - if (lastPacketInBlock && !receiver.finalized) calls
> >> FSDataset.finalizeBlock calls FSDataset.finalizeBlockInternal calls
> >> FSVolume.addBlock calls FSDir.addBlock calls FSDir.addBlock
> >>  - renames block from "blocksBeingWritten" tmp dir to "current" dest
> >> dir
> >>
> >> This looks to me like a synchronous chain from a DFS client close to
> >> moving the block files from blocksBeingWritten to the current dir, so
> >> that once the file is closed the block files should be in the proper
> >> directory - even if the contents of the file are still in the OS buffer
> >> rather than synced to disk.  It's only after this moving of blocks that
> >> NameNode.completeFile is called.  There are several conditions and
> >> loops in there, so I'm not certain this chain is fully reliable in
> >> all cases without a greater understanding of the code.
> >>
> >> Could it be the case that the rename operation itself is not synced
> >> and that ext3 lost the fact that the block files were moved?
> >> Or is there a bug in the close file logic that for some reason the
> >> block files are not always moved into place when a file is closed?
> >>
> >> Thanks for your patience,
> >> Dave
> >>
> >>
> >> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net>
> wrote:
> >>
> >>> Thanks for the response, Suresh.
> >>>
> >>> I'm not sure that I understand the details properly.  From my
> >>> reading of
> >>> HDFS-744, the hsync API would allow a client to make sure that at any
> >>> point in time its writes so far have hit the disk.  For example, for
> >>> HBase it could apply a fsync after adding some edits to its WAL to
> >>> ensure those edits are fully durable for a file which is still open.
> >>>
> >>> However, in this case the dfs file was closed and even renamed.  Is
> >>> it the case that even after a dfs file is closed and renamed that
> >>> the data blocks would still not be synced and would still be stored
> >>> by the datanode in "blocksBeingWritten" rather than in "current"?
> >>> If that is the case, would it be better for the NameNode not to reject
> >>> replicas that are in blocksBeingWritten, especially if it doesn't
> >>> have any other replicas available?
> >>>
> >>> Dave
> >>>
> >>>
> >>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <
> suresh@hortonworks.com>wrote:
> >>>
> >>>> Yes this is a known issue.
> >>>>
> >>>> The HDFS part of this was addressed in
> >>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and
> >>>> is not available in 1.x  release. I think HBase does not use this
> >>>> API yet.
> >>>>
> >>>>
> >>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
> >>>> wrote:
> >>>>
> >>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On
> >>>> > Saturday
> >>>> the
> >>>> > data center we were in had a total power failure and the cluster
> >>>> > went
> >>>> down
> >>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
> >>>>  We
> >>>> > recovered the data in question from our secondary datacenter, but
> >>>> > I'm trying to understand what happened and whether this is a bug
> >>>> > in HDFS
> >>>> that
> >>>> > should be fixed.
> >>>> >
> >>>> > From what I can tell the file was created and closed by the dfs
> >>>> > client (hbase).  Then HBase renamed it into a new directory and
> >>>> > deleted some
> >>>> other
> >>>> > files containing the same data.  Then the cluster lost power.
> >>>> > After
> >>>> the
> >>>> > cluster was restarted, the datanodes reported into the namenode
> >>>> > but
> >>>> the
> >>>> > blocks for this file appeared as "blocks being written" - the
> >>>> > namenode rejected them and the datanodes deleted the blocks.  At
> >>>> > this point
> >>>> there
> >>>> > were no replicas for the blocks and the files were marked CORRUPT.
> >>>>  The
> >>>> > underlying file systems are ext3.  Some questions that I would
> >>>> > love
> >>>> get
> >>>> > answers for if anyone with deeper understanding of HDFS can chime
> in:
> >>>> >
> >>>> >  - Is this a known scenario where data loss is expected?  (I
> >>>> > found
> >>>> > HDFS-1539 but that seems different)
> >>>> >  - When are blocks moved from blocksBeingWritten to current?
> >>>> > Does
> >>>> that
> >>>> > happen before a file close operation is acknowledged to a hdfs
> client?
> >>>> >  - Could it be that the DataNodes actually moved the blocks to
> >>>> current but
> >>>> > after the restart ext3 rewound state somehow (forgive my
> >>>> > ignorance of underlying file system behavior)?
> >>>> >  - Is there any other explanation for how this can happen?
> >>>> >
> >>>> > Here is a sequence of selected relevant log lines from the RS
> >>>> > (HBase Region Server) NN (NameNode) and DN (DataNode - 1 example
> >>>> > of 3 in question).  It includes everything that mentions the
> >>>> > block in
> >>>> question in
> >>>> > the NameNode and one DataNode log.  Please let me know if this
> >>>> > more information that would be helpful.
> >>>> >
> >>>> > RS 2013-06-29 11:16:06,812 DEBUG
> org.apache.hadoop.hbase.util.FSUtils:
> >>>> > Creating
> >>>> >
> >>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e
> >>>> /.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > with permission=rwxrwxrwx
> >>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.allocateBlock:
> >>>> >
> >>>>
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
> >>>> > blk_1395839728632046111_357084589 DN 2013-06-29 11:16:06,832 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> >>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
> >>>> > 10.0.5.237:50010
> >>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is
> >>>> > added
> >>>> to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 DN 2013-06-29
> >>>> > 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> >>>> > blk_1395839728632046111_357084589 of size 25418340 from /
> >>>> 10.0.5.237:14327
> >>>>
> >>>> > DN 2013-06-29 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> >>>> > 2 for block blk_1395839728632046111_357084589 terminating NN
> >>>> > 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> > Removing lease on  file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> DIR*
> >>>> > NameSystem.completeFile: file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > RS 2013-06-29 11:16:11,393 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted
> >>>> > file at
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp
> >>>> /6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > to
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e
> >>>> 0cc30af6e64e56ba5a539fdf159c4c
> >>>> > RS 2013-06-29 11:16:11,505 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
> >>>> compaction of 7
> >>>> > file(s) in n of
> >>>> >
> users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
> >>>> into
> >>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for
> >>>> > store is
> >>>> 24.2m
> >>>> >
> >>>> > -------  CRASH, RESTART ---------
> >>>> >
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
> >>>> but was
> >>>> > rejected: Reported as block being written but is a block of
> >>>> > closed
> >>>> file.
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.1:50010
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size
> >>>> > 16971264
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.5.237:50010
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size
> >>>> > 21913088
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.24:50010
> >>>> > (note the clock on the server running DN is wrong after restart.
> >>>> > I believe timestamps are off by 6 hours:) DN 2013-06-29
> >>>> > 06:07:22,877 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> >>>> > blk_1395839728632046111_357084589 file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> > for deletion DN 2013-06-29 06:07:24,952 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> >>>> > blk_1395839728632046111_357084589 at file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> >
> >>>> >
> >>>> > Thanks,
> >>>> > Dave
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> http://hortonworks.com/download/
> >>>>
> >>>
> >>>
> >>
> >
> >
> > --
> > http://hortonworks.com/download/
> >
>

Re: data loss after cluster wide power loss

Posted by Azuryy Yu <az...@gmail.com>.
Hi Uma,

I think there is minimal performance degradation if
dfs.datanode.synconclose is set to true.

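(For illustration only: a minimal, hypothetical Java sketch of flipping that flag
programmatically; in practice it would normally be set in hdfs-site.xml on the
DataNodes. The class name below is made up for the example.)

    import org.apache.hadoop.conf.Configuration;

    public class SyncOnCloseFlag {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Ask DataNodes to fsync block files to disk when a block is
            // finalized, trading some write latency for crash durability.
            conf.setBoolean("dfs.datanode.synconclose", true);
            System.out.println("dfs.datanode.synconclose = "
                + conf.getBoolean("dfs.datanode.synconclose", false));
        }
    }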

On Tue, Jul 2, 2013 at 3:31 PM, Uma Maheswara Rao G <ma...@huawei.com>wrote:

> Hi Dave,
>
>  Looks like your analysis is correct. I have faced similar issue some time
> back.
> See the discussion link:
> http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results
> On sudden restarts, it can lose the OS filesystem edits. A similar thing
> happened in our case, i.e., after restart blocks were moved back to the
> BeingWritten directory even though they were finalized.
> After restart they were marked as corrupt. You could set
> dfs.datanode.synconclose to true to avoid this sort of thing, but that
> will degrade performance.
>
> Regards,
> Uma
>
> -----Original Message-----
> From: ddlatham@gmail.com [mailto:ddlatham@gmail.com] On Behalf Of Dave
> Latham
> Sent: 01 July 2013 16:08
> To: hdfs-user@hadoop.apache.org
> Cc: hdfs-dev@hadoop.apache.org
> Subject: Re: data loss after cluster wide power loss
>
> Much appreciated, Suresh.  Let me know if I can provide any more
> information or if you'd like me to open a JIRA.
>
> Dave
>
>
> On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <suresh@hortonworks.com
> >wrote:
>
> > Dave,
> >
> > Thanks for the detailed email. Sorry I did not read all the details
> > you had sent earlier completely (on my phone). As you said, this is
> > not related to data loss related to HBase log and hsync. I think you
> > are right; the rename operation itself might not have hit the disk. I
> > think we should either ensure metadata operation is synced on the
> > datanode or handle it being reported as blockBeingWritten. Let me
> > spend some time to debug this issue.
> >
> > One surprising thing is, all the replicas were reported as
> > blockBeingWritten.
> >
> > Regards,
> > Suresh
> >
> >
> > On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
> >
> >> (Removing hbase list and adding hdfs-dev list as this is pretty
> >> internal stuff).
> >>
> >> Reading through the code a bit:
> >>
> >> FSDataOutputStream.close calls
> >> DFSOutputStream.close calls
> >> DFSOutputStream.closeInternal
> >>  - sets currentPacket.lastPacketInBlock = true
> >>  - then calls
> >> DFSOutputStream.flushInternal
> >>  - enqueues current packet
> >>  - waits for ack
> >>
> >> BlockReceiver.run
> >>  - if (lastPacketInBlock && !receiver.finalized) calls
> >> FSDataset.finalizeBlock calls FSDataset.finalizeBlockInternal calls
> >> FSVolume.addBlock calls FSDir.addBlock calls FSDir.addBlock
> >>  - renames block from "blocksBeingWritten" tmp dir to "current" dest
> >> dir
> >>
> >> This looks to me as I would expect a synchronous chain from a DFS
> >> client to moving the file from blocksBeingWritten to the current dir
> >> so that once the file is closed the block files would be in
> >> the proper directory
> >> - even if the contents of the file are still in the OS buffer rather
> >> than synced to disk.  It's only after this moving of blocks that
> >> NameNode.complete file is called.  There are several conditions and
> >> loops in there that I'm not certain this chain is fully reliable in
> >> all cases without a greater understanding of the code.
> >>
> >> Could it be the case that the rename operation itself is not synced
> >> and that ext3 lost the fact that the block files were moved?
> >> Or is there a bug in the close file logic that for some reason the
> >> block files are not always moved into place when a file is closed?
> >>
> >> Thanks for your patience,
> >> Dave
> >>
> >>
> >> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net>
> wrote:
> >>
> >>> Thanks for the response, Suresh.
> >>>
> >>> I'm not sure that I understand the details properly.  From my
> >>> reading of
> >>> HDFS-744 the hsync API would allow a client to make sure that at any
> >>> point in time its writes so far hit the disk.  For example, for
> >>> HBase it could apply a fsync after adding some edits to its WAL to
> >>> ensure those edits are fully durable for a file which is still open.
> >>>
> >>> However, in this case the dfs file was closed and even renamed.  Is
> >>> it the case that even after a dfs file is closed and renamed that
> >>> the data blocks would still not be synced and would still be stored
> >>> by the datanode in "blocksBeingWritten" rather than in "current"?
> >>> If that is the case, would it be better for the NameNode not to reject
> >>> replicas that are in blocksBeingWritten, especially if it doesn't
> >>> have any other replicas available?
> >>>
> >>> Dave
> >>>
> >>>
> >>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <
> suresh@hortonworks.com>wrote:
> >>>
> >>>> Yes this is a known issue.
> >>>>
> >>>> The HDFS part of this was addressed in
> >>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and
> >>>> is not available in 1.x  release. I think HBase does not use this
> >>>> API yet.
> >>>>
> >>>>
> >>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
> >>>> wrote:
> >>>>
> >>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On
> >>>> > Saturday
> >>>> the
> >>>> > data center we were in had a total power failure and the cluster
> >>>> > went
> >>>> down
> >>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
> >>>>  We
> >>>> > recovered the data in question from our secondary datacenter, but
> >>>> > I'm trying to understand what happened and whether this is a bug
> >>>> > in HDFS
> >>>> that
> >>>> > should be fixed.
> >>>> >
> >>>> > From what I can tell the file was created and closed by the dfs
> >>>> > client (hbase).  Then HBase renamed it into a new directory and
> >>>> > deleted some
> >>>> other
> >>>> > files containing the same data.  Then the cluster lost power.
> >>>> > After
> >>>> the
> >>>> > cluster was restarted, the datanodes reported into the namenode
> >>>> > but
> >>>> the
> >>>> > blocks for this file appeared as "blocks being written" - the
> >>>> > namenode rejected them and the datanodes deleted the blocks.  At
> >>>> > this point
> >>>> there
> >>>> > were no replicas for the blocks and the files were marked CORRUPT.
> >>>>  The
> >>>> > underlying file systems are ext3.  Some questions that I would
> >>>> > love
> >>>> get
> >>>> > answers for if anyone with deeper understanding of HDFS can chime
> in:
> >>>> >
> >>>> >  - Is this a known scenario where data loss is expected?  (I
> >>>> > found
> >>>> > HDFS-1539 but that seems different)
> >>>> >  - When are blocks moved from blocksBeingWritten to current?
> >>>> > Does
> >>>> that
> >>>> > happen before a file close operation is acknowledged to a hdfs
> client?
> >>>> >  - Could it be that the DataNodes actually moved the blocks to
> >>>> current but
> >>>> > after the restart ext3 rewound state somehow (forgive my
> >>>> > ignorance of underlying file system behavior)?
> >>>> >  - Is there any other explanation for how this can happen?
> >>>> >
> >>>> > Here is a sequence of selected relevant log lines from the RS
> >>>> > (HBase Region Server) NN (NameNode) and DN (DataNode - 1 example
> >>>> > of 3 in question).  It includes everything that mentions the
> >>>> > block in
> >>>> question in
> >>>> > the NameNode and one DataNode log.  Please let me know if there is
> >>>> > more information that would be helpful.
> >>>> >
> >>>> > RS 2013-06-29 11:16:06,812 DEBUG
> org.apache.hadoop.hbase.util.FSUtils:
> >>>> > Creating
> >>>> >
> >>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e
> >>>> /.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > with permission=rwxrwxrwx
> >>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.allocateBlock:
> >>>> >
> >>>>
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
> >>>> > blk_1395839728632046111_357084589 DN 2013-06-29 11:16:06,832 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> >>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
> >>>> > 10.0.5.237:50010
> >>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is
> >>>> > added
> >>>> to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29
> >>>> > 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
> >>>> added to
> >>>> > blk_1395839728632046111_357084589 size 25418340 DN 2013-06-29
> >>>> > 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> >>>> > blk_1395839728632046111_357084589 of size 25418340 from /
> >>>> 10.0.5.237:14327
> >>>>
> >>>> > DN 2013-06-29 11:16:11,385 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
> >>>> > 2 for block blk_1395839728632046111_357084589 terminating NN
> >>>> > 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> > Removing lease on  file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> DIR*
> >>>> > NameSystem.completeFile: file
> >>>> >
> >>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
> >>>> 56ba5a539fdf159c4c
> >>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> >>>> > RS 2013-06-29 11:16:11,393 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted
> >>>> > file at
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp
> >>>> /6e0cc30af6e64e56ba5a539fdf159c4c
> >>>> > to
> >>>> >
> >>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e
> >>>> 0cc30af6e64e56ba5a539fdf159c4c
> >>>> > RS 2013-06-29 11:16:11,505 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
> >>>> compaction of 7
> >>>> > file(s) in n of
> >>>> >
> users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
> >>>> into
> >>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for
> >>>> > store is
> >>>> 24.2m
> >>>> >
> >>>> > -------  CRASH, RESTART ---------
> >>>> >
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
> >>>> but was
> >>>> > rejected: Reported as block being written but is a block of
> >>>> > closed
> >>>> file.
> >>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.1:50010
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size
> >>>> > 16971264
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.5.237:50010
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addStoredBlock: addStoredBlock request received for
> >>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size
> >>>> > 21913088
> >>>> but
> >>>> > was rejected: Reported as block being written but is a block of
> >>>> closed file.
> >>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> >>>> BLOCK*
> >>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> >>>> invalidSet
> >>>> > of 10.0.6.24:50010
> >>>> > (note the clock on the server running DN is wrong after restart.
> >>>> > I believe timestamps are off by 6 hours:) DN 2013-06-29
> >>>> > 06:07:22,877 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> >>>> > blk_1395839728632046111_357084589 file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> > for deletion DN 2013-06-29 06:07:24,952 INFO
> >>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> >>>> > blk_1395839728632046111_357084589 at file
> >>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >>>> >
> >>>> >
> >>>> > Thanks,
> >>>> > Dave
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> http://hortonworks.com/download/
> >>>>
> >>>
> >>>
> >>
> >
> >
> > --
> > http://hortonworks.com/download/
> >
>

RE: data loss after cluster wide power loss

Posted by Uma Maheswara Rao G <ma...@huawei.com>.
Hi Dave,

 Looks like your analysis is correct. I have faced a similar issue some time back.
See the discussion link: http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results
On sudden restarts, it can lose the OS filesystem edits. A similar thing happened in our case, i.e., after restart blocks were moved back to the BeingWritten directory even though they were finalized.
After restart they were marked as corrupt. You could set dfs.datanode.synconclose to true to avoid this sort of thing, but that will degrade performance.
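
(For illustration only: a rough, hypothetical sketch in plain Java NIO of what a
durable "finalize" would involve — fsync the block file, rename it out of the
blocksBeingWritten directory, then fsync the parent directories so the rename
itself survives a power loss. This is not the actual DataNode code; all names
below are made up, and directory fsync via FileChannel works on Linux but not
everywhere.)

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.*;

    public class DurableFinalizeSketch {
        // Move a finalized block file from the temporary dir into the final
        // dir and force both the data and the directory entries to disk.
        static void finalizeBlock(Path bbwFile, Path currentDir) throws IOException {
            // 1. Flush the block contents themselves.
            try (FileChannel ch = FileChannel.open(bbwFile, StandardOpenOption.WRITE)) {
                ch.force(true);
            }
            // 2. Rename into place.
            Path target = currentDir.resolve(bbwFile.getFileName());
            Files.move(bbwFile, target, StandardCopyOption.ATOMIC_MOVE);
            // 3. fsync both parent directories so the rename is durable.
            fsyncDir(currentDir);
            fsyncDir(bbwFile.getParent());
        }

        static void fsyncDir(Path dir) throws IOException {
            // Opening a directory read-only and calling force() persists its
            // entries on Linux; without this, ext3 can "rewind" the rename.
            try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
                ch.force(true);
            }
        }

        public static void main(String[] args) throws IOException {
            finalizeBlock(Paths.get(args[0]), Paths.get(args[1]));
        }
    }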

Regards,
Uma

-----Original Message-----
From: ddlatham@gmail.com [mailto:ddlatham@gmail.com] On Behalf Of Dave Latham
Sent: 01 July 2013 16:08
To: hdfs-user@hadoop.apache.org
Cc: hdfs-dev@hadoop.apache.org
Subject: Re: data loss after cluster wide power loss

Much appreciated, Suresh.  Let me know if I can provide any more information or if you'd like me to open a JIRA.

Dave


On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com>wrote:

> Dave,
>
> Thanks for the detailed email. Sorry I did not read all the details 
> you had sent earlier completely (on my phone). As you said, this is 
> not related to data loss related to HBase log and hsync. I think you 
> are right; the rename operation itself might not have hit the disk. I 
> think we should either ensure metadata operation is synced on the 
> datanode or handle it being reported as blockBeingWritten. Let me 
> spend some time to debug this issue.
>
> One surprising thing is, all the replicas were reported as 
> blockBeingWritten.
>
> Regards,
> Suresh
>
>
> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
>
>> (Removing hbase list and adding hdfs-dev list as this is pretty 
>> internal stuff).
>>
>> Reading through the code a bit:
>>
>> FSDataOutputStream.close calls
>> DFSOutputStream.close calls
>> DFSOutputStream.closeInternal
>>  - sets currentPacket.lastPacketInBlock = true
>>  - then calls
>> DFSOutputStream.flushInternal
>>  - enqueues current packet
>>  - waits for ack
>>
>> BlockReceiver.run
>>  - if (lastPacketInBlock && !receiver.finalized) calls 
>> FSDataset.finalizeBlock calls FSDataset.finalizeBlockInternal calls 
>> FSVolume.addBlock calls FSDir.addBlock calls FSDir.addBlock
>>  - renames block from "blocksBeingWritten" tmp dir to "current" dest 
>> dir
>>
>> This looks to me as I would expect a synchronous chain from a DFS 
>> client to moving the file from blocksBeingWritten to the current dir 
>> so that once the file is closed the block files would be in
>> the proper directory
>> - even if the contents of the file are still in the OS buffer rather 
>> than synced to disk.  It's only after this moving of blocks that 
>> NameNode.complete file is called.  There are several conditions and 
>> loops in there that I'm not certain this chain is fully reliable in 
>> all cases without a greater understanding of the code.
>>
>> Could it be the case that the rename operation itself is not synced 
>> and that ext3 lost the fact that the block files were moved?
>> Or is there a bug in the close file logic that for some reason the 
>> block files are not always moved into place when a file is closed?
>>
>> Thanks for your patience,
>> Dave
>>
>>
>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net> wrote:
>>
>>> Thanks for the response, Suresh.
>>>
>>> I'm not sure that I understand the details properly.  From my 
>>> reading of
>>> HDFS-744 the hsync API would allow a client to make sure that at any 
>>> point in time its writes so far hit the disk.  For example, for 
>>> HBase it could apply a fsync after adding some edits to its WAL to 
>>> ensure those edits are fully durable for a file which is still open.
>>>
>>> However, in this case the dfs file was closed and even renamed.  Is 
>>> it the case that even after a dfs file is closed and renamed that 
>>> the data blocks would still not be synced and would still be stored 
>>> by the datanode in "blocksBeingWritten" rather than in "current"?  
>>> If that is the case, would it be better for the NameNode not to reject 
>>> replicas that are in blocksBeingWritten, especially if it doesn't 
>>> have any other replicas available?
>>>
>>> Dave
>>>
>>>
>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>wrote:
>>>
>>>> Yes this is a known issue.
>>>>
>>>> The HDFS part of this was addressed in
>>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and 
>>>> is not available in 1.x  release. I think HBase does not use this 
>>>> API yet.
>>>>
>>>>
>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
>>>> wrote:
>>>>
>>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On 
>>>> > Saturday
>>>> the
>>>> > data center we were in had a total power failure and the cluster 
>>>> > went
>>>> down
>>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
>>>>  We
>>>> > recovered the data in question from our secondary datacenter, but 
>>>> > I'm trying to understand what happened and whether this is a bug 
>>>> > in HDFS
>>>> that
>>>> > should be fixed.
>>>> >
>>>> > From what I can tell the file was created and closed by the dfs 
>>>> > client (hbase).  Then HBase renamed it into a new directory and 
>>>> > deleted some
>>>> other
>>>> > files containing the same data.  Then the cluster lost power.  
>>>> > After
>>>> the
>>>> > cluster was restarted, the datanodes reported into the namenode 
>>>> > but
>>>> the
>>>> > blocks for this file appeared as "blocks being written" - the 
>>>> > namenode rejected them and the datanodes deleted the blocks.  At 
>>>> > this point
>>>> there
>>>> > were no replicas for the blocks and the files were marked CORRUPT.
>>>>  The
>>>> > underlying file systems are ext3.  Some questions that I would 
>>>> > love
>>>> get
>>>> > answers for if anyone with deeper understanding of HDFS can chime in:
>>>> >
>>>> >  - Is this a known scenario where data loss is expected?  (I 
>>>> > found
>>>> > HDFS-1539 but that seems different)
>>>> >  - When are blocks moved from blocksBeingWritten to current?  
>>>> > Does
>>>> that
>>>> > happen before a file close operation is acknowledged to a hdfs client?
>>>> >  - Could it be that the DataNodes actually moved the blocks to
>>>> current but
>>>> > after the restart ext3 rewound state somehow (forgive my 
>>>> > ignorance of underlying file system behavior)?
>>>> >  - Is there any other explanation for how this can happen?
>>>> >
>>>> > Here is a sequence of selected relevant log lines from the RS 
>>>> > (HBase Region Server) NN (NameNode) and DN (DataNode - 1 example 
>>>> > of 3 in question).  It includes everything that mentions the 
>>>> > block in
>>>> question in
>>>> > the NameNode and one DataNode log.  Please let me know if there is 
>>>> > more information that would be helpful.
>>>> >
>>>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
>>>> > Creating
>>>> >
>>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e
>>>> /.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > with permission=rwxrwxrwx
>>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.allocateBlock:
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
>>>> > blk_1395839728632046111_357084589 DN 2013-06-29 11:16:06,832 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
>>>> > 10.0.5.237:50010
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is 
>>>> > added
>>>> to
>>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29 
>>>> > 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
>>>> added to
>>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29 
>>>> > 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
>>>> added to
>>>> > blk_1395839728632046111_357084589 size 25418340 DN 2013-06-29 
>>>> > 11:16:11,385 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>>>> > blk_1395839728632046111_357084589 of size 25418340 from /
>>>> 10.0.5.237:14327
>>>>
>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 
>>>> > 2 for block blk_1395839728632046111_357084589 terminating NN 
>>>> > 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > Removing lease on  file
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
>>>> 56ba5a539fdf159c4c
>>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> DIR*
>>>> > NameSystem.completeFile: file
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
>>>> 56ba5a539fdf159c4c
>>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > RS 2013-06-29 11:16:11,393 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted 
>>>> > file at
>>>> >
>>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp
>>>> /6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > to
>>>> >
>>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e
>>>> 0cc30af6e64e56ba5a539fdf159c4c
>>>> > RS 2013-06-29 11:16:11,505 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
>>>> compaction of 7
>>>> > file(s) in n of
>>>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>>>> into
>>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for 
>>>> > store is
>>>> 24.2m
>>>> >
>>>> > -------  CRASH, RESTART ---------
>>>> >
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
>>>> but was
>>>> > rejected: Reported as block being written but is a block of 
>>>> > closed
>>>> file.
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.6.1:50010
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 
>>>> > 16971264
>>>> but
>>>> > was rejected: Reported as block being written but is a block of
>>>> closed file.
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.5.237:50010
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 
>>>> > 21913088
>>>> but
>>>> > was rejected: Reported as block being written but is a block of
>>>> closed file.
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.6.24:50010
>>>> > (note the clock on the server running DN is wrong after restart.  
>>>> > I believe timestamps are off by 6 hours:) DN 2013-06-29 
>>>> > 06:07:22,877 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>>>> > blk_1395839728632046111_357084589 file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 
>>>> > for deletion DN 2013-06-29 06:07:24,952 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>>>> > blk_1395839728632046111_357084589 at file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>> >
>>>> >
>>>> > Thanks,
>>>> > Dave
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> http://hortonworks.com/download/
>>>>
>>>
>>>
>>
>
>
> --
> http://hortonworks.com/download/
>

>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> DIR*
>>>> > NameSystem.completeFile: file
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
>>>> 56ba5a539fdf159c4c
>>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > RS 2013-06-29 11:16:11,393 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted 
>>>> > file at
>>>> >
>>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp
>>>> /6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > to
>>>> >
>>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e
>>>> 0cc30af6e64e56ba5a539fdf159c4c
>>>> > RS 2013-06-29 11:16:11,505 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
>>>> compaction of 7
>>>> > file(s) in n of
>>>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>>>> into
>>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for 
>>>> > store is
>>>> 24.2m
>>>> >
>>>> > -------  CRASH, RESTART ---------
>>>> >
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
>>>> but was
>>>> > rejected: Reported as block being written but is a block of 
>>>> > closed
>>>> file.
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.6.1:50010
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 
>>>> > 16971264
>>>> but
>>>> > was rejected: Reported as block being written but is a block of
>>>> closed file.
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.5.237:50010
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 
>>>> > 21913088
>>>> but
>>>> > was rejected: Reported as block being written but is a block of
>>>> closed file.
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.6.24:50010
>>>> > (note the clock on the server running DN is wrong after restart.  
>>>> > I believe timestamps are off by 6 hours:) DN 2013-06-29 
>>>> > 06:07:22,877 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>>>> > blk_1395839728632046111_357084589 file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 
>>>> > for deletion DN 2013-06-29 06:07:24,952 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>>>> > blk_1395839728632046111_357084589 at file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>> >
>>>> >
>>>> > Thanks,
>>>> > Dave
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> http://hortonworks.com/download/
>>>>
>>>
>>>
>>
>
>
> --
> http://hortonworks.com/download/
>

RE: data loss after cluster wide power loss

Posted by Uma Maheswara Rao G <ma...@huawei.com>.
Hi Dave,

 Looks like your analysis is correct. I faced a similar issue some time back.
See the discussion link: http://markmail.org/message/ruev3aa4x5zh2l4w#query:+page:1+mid:33gcdcu3coodkks3+state:results
On a sudden restart, the OS filesystem can lose recent edits. A similar thing happened in our case, i.e., after the restart the blocks were moved back to the blocksBeingWritten directory even though they had been finalized.
After the restart they were marked as corrupt. You could set dfs.datanode.synconclose to true to avoid this sort of thing, but it will degrade write performance.
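For reference, that property goes in hdfs-site.xml on the DataNodes - a sketch, with the description paraphrased (check hdfs-default.xml for your version):

    <property>
      <name>dfs.datanode.synconclose</name>
      <value>true</value>
      <description>When true, the DataNode fsyncs a block's data and metadata
      files to disk when the block is finalized, so finalized blocks survive a
      sudden power loss. The default is false.</description>
    </property>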

Regards,
Uma

-----Original Message-----
From: ddlatham@gmail.com [mailto:ddlatham@gmail.com] On Behalf Of Dave Latham
Sent: 01 July 2013 16:08
To: hdfs-user@hadoop.apache.org
Cc: hdfs-dev@hadoop.apache.org
Subject: Re: data loss after cluster wide power loss

Much appreciated, Suresh.  Let me know if I can provide any more information or if you'd like me to open a JIRA.

Dave


On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com>wrote:

> Dave,
>
> Thanks for the detailed email. Sorry I did not read all the details 
> you had sent earlier completely (on my phone). As you said, this is 
> not related to data loss related to HBase log and hsync. I think you 
> are right; the rename operation itself might not have hit the disk. I 
> think we should either ensure metadata operation is synced on the 
> datanode or handle it being reported as blockBeingWritten. Let me 
> spend sometime to debug this issue.
>
> One surprising thing is, all the replicas were reported as 
> blockBeingWritten.
>
> Regards,
> Suresh
>
>
> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
>
>> (Removing hbase list and adding hdfs-dev list as this is pretty 
>> internal stuff).
>>
>> Reading through the code a bit:
>>
>> FSDataOutputStream.close calls
>> DFSOutputStream.close calls
>> DFSOutputStream.closeInternal
>>  - sets currentPacket.lastPacketInBlock = true
>>  - then calls
>> DFSOutputStream.flushInternal
>>  - enqueues current packet
>>  - waits for ack
>>
>> BlockReceiver.run
>>  - if (lastPacketInBlock && !receiver.finalized) calls 
>> FSDataset.finalizeBlock calls FSDataset.finalizeBlockInternal calls 
>> FSVolume.addBlock calls FSDir.addBlock calls FSDir.addBlock
>>  - renames block from "blocksBeingWritten" tmp dir to "current" dest 
>> dir
>>
>> This looks to me as I would expect a synchronous chain from a DFS 
>> client to moving the file from blocksBeingWritten to the current dir 
>> so that once the file is closed that it the block files would be in 
>> the proper directory
>> - even if the contents of the file are still in the OS buffer rather 
>> than synced to disk.  It's only after this moving of blocks that 
>> NameNode.complete file is called.  There are several conditions and 
>> loops in there that I'm not certain this chain is fully reliable in 
>> all cases without a greater understanding of the code.
>>
>> Could it be the case that the rename operation itself is not synced 
>> and that ext3 lost the fact that the block files were moved?
>> Or is there a bug in the close file logic that for some reason the 
>> block files are not always moved into place when a file is closed?
>>
>> Thanks for your patience,
>> Dave
>>
>>
>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net> wrote:
>>
>>> Thanks for the response, Suresh.
>>>
>>> I'm not sure that I understand the details properly.  From my 
>>> reading of
>>> HDFS-744 the hsync API would allow a client to make sure that at any 
>>> point in time it's writes so far hit the disk.  For example, for 
>>> HBase it could apply a fsync after adding some edits to its WAL to 
>>> ensure those edits are fully durable for a file which is still open.
>>>
>>> However, in this case the dfs file was closed and even renamed.  Is 
>>> it the case that even after a dfs file is closed and renamed that 
>>> the data blocks would still not be synced and would still be stored 
>>> by the datanode in "blocksBeingWritten" rather than in "current"?  
>>> If that is case, would it be better for the NameNode not to reject 
>>> replicas that are in blocksBeingWritten, especially if it doesn't 
>>> have any other replicas available?
>>>
>>> Dave
>>>
>>>
>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>wrote:
>>>
>>>> Yes this is a known issue.
>>>>
>>>> The HDFS part of this was addressed in
>>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and 
>>>> is not available in 1.x  release. I think HBase does not use this 
>>>> API yet.
>>>>
>>>>
>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
>>>> wrote:
>>>>
>>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On 
>>>> > Saturday
>>>> the
>>>> > data center we were in had a total power failure and the cluster 
>>>> > went
>>>> down
>>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
>>>>  We
>>>> > recovered the data in question from our secondary datacenter, but 
>>>> > I'm trying to understand what happened and whether this is a bug 
>>>> > in HDFS
>>>> that
>>>> > should be fixed.
>>>> >
>>>> > From what I can tell the file was created and closed by the dfs 
>>>> > client (hbase).  Then HBase renamed it into a new directory and 
>>>> > deleted some
>>>> other
>>>> > files containing the same data.  Then the cluster lost power.  
>>>> > After
>>>> the
>>>> > cluster was restarted, the datanodes reported into the namenode 
>>>> > but
>>>> the
>>>> > blocks for this file appeared as "blocks being written" - the 
>>>> > namenode rejected them and the datanodes deleted the blocks.  At 
>>>> > this point
>>>> there
>>>> > were no replicas for the blocks and the files were marked CORRUPT.
>>>>  The
>>>> > underlying file systems are ext3.  Some questions that I would 
>>>> > love
>>>> get
>>>> > answers for if anyone with deeper understanding of HDFS can chime in:
>>>> >
>>>> >  - Is this a known scenario where data loss is expected?  (I 
>>>> > found
>>>> > HDFS-1539 but that seems different)
>>>> >  - When are blocks moved from blocksBeingWritten to current?  
>>>> > Does
>>>> that
>>>> > happen before a file close operation is acknowledged to a hdfs client?
>>>> >  - Could it be that the DataNodes actually moved the blocks to
>>>> current but
>>>> > after the restart ext3 rewound state somehow (forgive my 
>>>> > ignorance of underlying file system behavior)?
>>>> >  - Is there any other explanation for how this can happen?
>>>> >
>>>> > Here is a sequence of selected relevant log lines from the RS 
>>>> > (HBase Region Server) NN (NameNode) and DN (DataNode - 1 example 
>>>> > of 3 in question).  It includes everything that mentions the 
>>>> > block in
>>>> question in
>>>> > the NameNode and one DataNode log.  Please let me know if this 
>>>> > more information that would be helpful.
>>>> >
>>>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
>>>> > Creating
>>>> >
>>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e
>>>> /.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > with permission=rwxrwxrwx
>>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.allocateBlock:
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
>>>> > blk_1395839728632046111_357084589 DN 2013-06-29 11:16:06,832 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
>>>> > 10.0.5.237:50010
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is 
>>>> > added
>>>> to
>>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29 
>>>> > 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
>>>> added to
>>>> > blk_1395839728632046111_357084589 size 25418340 NN 2013-06-29 
>>>> > 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
>>>> added to
>>>> > blk_1395839728632046111_357084589 size 25418340 DN 2013-06-29 
>>>> > 11:16:11,385 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>>>> > blk_1395839728632046111_357084589 of size 25418340 from /
>>>> 10.0.5.237:14327
>>>>
>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 
>>>> > 2 for block blk_1395839728632046111_357084589 terminating NN 
>>>> > 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > Removing lease on  file
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
>>>> 56ba5a539fdf159c4c
>>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> DIR*
>>>> > NameSystem.completeFile: file
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e
>>>> 56ba5a539fdf159c4c
>>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > RS 2013-06-29 11:16:11,393 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted 
>>>> > file at
>>>> >
>>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp
>>>> /6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > to
>>>> >
>>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e
>>>> 0cc30af6e64e56ba5a539fdf159c4c
>>>> > RS 2013-06-29 11:16:11,505 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
>>>> compaction of 7
>>>> > file(s) in n of
>>>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>>>> into
>>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for 
>>>> > store is
>>>> 24.2m
>>>> >
>>>> > -------  CRASH, RESTART ---------
>>>> >
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
>>>> but was
>>>> > rejected: Reported as block being written but is a block of 
>>>> > closed
>>>> file.
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.6.1:50010
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 
>>>> > 16971264
>>>> but
>>>> > was rejected: Reported as block being written but is a block of
>>>> closed file.
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.5.237:50010
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 
>>>> > 21913088
>>>> but
>>>> > was rejected: Reported as block being written but is a block of
>>>> closed file.
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.6.24:50010
>>>> > (note the clock on the server running DN is wrong after restart.  
>>>> > I believe timestamps are off by 6 hours:) DN 2013-06-29 
>>>> > 06:07:22,877 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>>>> > blk_1395839728632046111_357084589 file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 
>>>> > for deletion DN 2013-06-29 06:07:24,952 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>>>> > blk_1395839728632046111_357084589 at file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>> >
>>>> >
>>>> > Thanks,
>>>> > Dave
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> http://hortonworks.com/download/
>>>>
>>>
>>>
>>
>
>
> --
> http://hortonworks.com/download/
>

Re: data loss after cluster wide power loss

Posted by Dave Latham <la...@davelink.net>.
Much appreciated, Suresh.  Let me know if I can provide any more
information or if you'd like me to open a JIRA.

Dave


On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com>wrote:

> Dave,
>
> Thanks for the detailed email. Sorry I did not read all the details you
> had sent earlier completely (on my phone). As you said, this is not related
> to data loss related to HBase log and hsync. I think you are right; the
> rename operation itself might not have hit the disk. I think we should
> either ensure metadata operation is synced on the datanode or handle it
> being reported as blockBeingWritten. Let me spend sometime to debug this
> issue.
>
> One surprising thing is, all the replicas were reported as
> blockBeingWritten.
>
> Regards,
> Suresh
>
>
> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
>
>> (Removing hbase list and adding hdfs-dev list as this is pretty internal
>> stuff).
>>
>> Reading through the code a bit:
>>
>> FSDataOutputStream.close calls
>> DFSOutputStream.close calls
>> DFSOutputStream.closeInternal
>>  - sets currentPacket.lastPacketInBlock = true
>>  - then calls
>> DFSOutputStream.flushInternal
>>  - enqueues current packet
>>  - waits for ack
>>
>> BlockReceiver.run
>>  - if (lastPacketInBlock && !receiver.finalized) calls
>> FSDataset.finalizeBlock calls
>> FSDataset.finalizeBlockInternal calls
>> FSVolume.addBlock calls
>> FSDir.addBlock calls
>> FSDir.addBlock
>>  - renames block from "blocksBeingWritten" tmp dir to "current" dest dir
>>
>> This looks to me as I would expect a synchronous chain from a DFS client
>> to moving the file from blocksBeingWritten to the current dir so that once
>> the file is closed that it the block files would be in the proper directory
>> - even if the contents of the file are still in the OS buffer rather than
>> synced to disk.  It's only after this moving of blocks that
>> NameNode.complete file is called.  There are several conditions and loops
>> in there that I'm not certain this chain is fully reliable in all cases
>> without a greater understanding of the code.
>>
>> Could it be the case that the rename operation itself is not synced and
>> that ext3 lost the fact that the block files were moved?
>> Or is there a bug in the close file logic that for some reason the block
>> files are not always moved into place when a file is closed?
>>
>> Thanks for your patience,
>> Dave
>>
>>
>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net> wrote:
>>
>>> Thanks for the response, Suresh.
>>>
>>> I'm not sure that I understand the details properly.  From my reading of
>>> HDFS-744 the hsync API would allow a client to make sure that at any point
>>> in time it's writes so far hit the disk.  For example, for HBase it could
>>> apply a fsync after adding some edits to its WAL to ensure those edits are
>>> fully durable for a file which is still open.
>>>
>>> However, in this case the dfs file was closed and even renamed.  Is it
>>> the case that even after a dfs file is closed and renamed that the data
>>> blocks would still not be synced and would still be stored by the datanode
>>> in "blocksBeingWritten" rather than in "current"?  If that is case, would
>>> it be better for the NameNode not to reject replicas that are in
>>> blocksBeingWritten, especially if it doesn't have any other replicas
>>> available?
>>>
>>> Dave
>>>
>>>
>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>wrote:
>>>
>>>> Yes this is a known issue.
>>>>
>>>> The HDFS part of this was addressed in
>>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is
>>>> not
>>>> available in 1.x  release. I think HBase does not use this API yet.
>>>>
>>>>
>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
>>>> wrote:
>>>>
>>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On Saturday
>>>> the
>>>> > data center we were in had a total power failure and the cluster went
>>>> down
>>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
>>>>  We
>>>> > recovered the data in question from our secondary datacenter, but I'm
>>>> > trying to understand what happened and whether this is a bug in HDFS
>>>> that
>>>> > should be fixed.
>>>> >
>>>> > From what I can tell the file was created and closed by the dfs client
>>>> > (hbase).  Then HBase renamed it into a new directory and deleted some
>>>> other
>>>> > files containing the same data.  Then the cluster lost power.  After
>>>> the
>>>> > cluster was restarted, the datanodes reported into the namenode but
>>>> the
>>>> > blocks for this file appeared as "blocks being written" - the namenode
>>>> > rejected them and the datanodes deleted the blocks.  At this point
>>>> there
>>>> > were no replicas for the blocks and the files were marked CORRUPT.
>>>>  The
>>>> > underlying file systems are ext3.  Some questions that I would love
>>>> get
>>>> > answers for if anyone with deeper understanding of HDFS can chime in:
>>>> >
>>>> >  - Is this a known scenario where data loss is expected?  (I found
>>>> > HDFS-1539 but that seems different)
>>>> >  - When are blocks moved from blocksBeingWritten to current?  Does
>>>> that
>>>> > happen before a file close operation is acknowledged to a hdfs client?
>>>> >  - Could it be that the DataNodes actually moved the blocks to
>>>> current but
>>>> > after the restart ext3 rewound state somehow (forgive my ignorance of
>>>> > underlying file system behavior)?
>>>> >  - Is there any other explanation for how this can happen?
>>>> >
>>>> > Here is a sequence of selected relevant log lines from the RS (HBase
>>>> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
>>>> > question).  It includes everything that mentions the block in
>>>> question in
>>>> > the NameNode and one DataNode log.  Please let me know if this more
>>>> > information that would be helpful.
>>>> >
>>>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
>>>> > Creating
>>>> >
>>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > with permission=rwxrwxrwx
>>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.allocateBlock:
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
>>>> > blk_1395839728632046111_357084589
>>>> > DN 2013-06-29 11:16:06,832 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
>>>> > 10.0.5.237:50010
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added
>>>> to
>>>> > blk_1395839728632046111_357084589 size 25418340
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
>>>> added to
>>>> > blk_1395839728632046111_357084589 size 25418340
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
>>>> added to
>>>> > blk_1395839728632046111_357084589 size 25418340
>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>>>> > blk_1395839728632046111_357084589 of size 25418340 from /
>>>> 10.0.5.237:14327
>>>>
>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
>>>> > block blk_1395839728632046111_357084589 terminating
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > Removing lease on  file
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> DIR*
>>>> > NameSystem.completeFile: file
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > RS 2013-06-29 11:16:11,393 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at
>>>> >
>>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > to
>>>> >
>>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > RS 2013-06-29 11:16:11,505 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
>>>> compaction of 7
>>>> > file(s) in n of
>>>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>>>> into
>>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is
>>>> 24.2m
>>>> >
>>>> > -------  CRASH, RESTART ---------
>>>> >
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
>>>> but was
>>>> > rejected: Reported as block being written but is a block of closed
>>>> file.
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.6.1:50010
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264
>>>> but
>>>> > was rejected: Reported as block being written but is a block of
>>>> closed file.
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.5.237:50010
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088
>>>> but
>>>> > was rejected: Reported as block being written but is a block of
>>>> closed file.
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.6.24:50010
>>>> > (note the clock on the server running DN is wrong after restart.  I
>>>> > believe timestamps are off by 6 hours:)
>>>> > DN 2013-06-29 06:07:22,877 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>>>> > blk_1395839728632046111_357084589 file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for
>>>> > deletion
>>>> > DN 2013-06-29 06:07:24,952 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>>>> > blk_1395839728632046111_357084589 at file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>> >
>>>> >
>>>> > Thanks,
>>>> > Dave
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> http://hortonworks.com/download/
>>>>
>>>
>>>
>>
>
>
> --
> http://hortonworks.com/download/
>

Re: data loss after cluster wide power loss

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com> wrote:
> Dave,
>
> Thanks for the detailed email. Sorry I did not read all the details you had
> sent earlier completely (on my phone). As you said, this is not related to
> data loss related to HBase log and hsync. I think you are right; the rename
> operation itself might not have hit the disk. I think we should either
> ensure metadata operation is synced on the datanode or handle it being
> reported as blockBeingWritten. Let me spend sometime to debug this issue.

In theory, ext3 is journaled, so all metadata operations should be
durable in the case of a power outage.  It is only data operations
that should be possible to lose.  It is the same for ext4.  (Assuming
you are not using nonstandard mount options.)

In practice, it is possible that your hard disks didn't actually
persist the data that they said they did.  Rumor has it that some
drives ignore the SATA FLUSH CACHE command in some cases, since it
makes them look bad in benchmarks.  In that case, there is nothing the
filesystem or any other software can do.

There was also a bug in older Linux kernels where the kernel would not
actually send FLUSH CACHE.  Since the poster is using ext3 and
hadoop-1, it's possible he's using an antique kernel as well.  I know
for sure this affected LVM -- it used to ignore barriers until fairly
recently.

In Ceph, we used to recommend disabling the hard drive write cache if
your kernel was older than 2.6.33.  You can read the recommendation
for yourself here:
http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
 This will have an impact on performance, however.

An uninterruptible power supply is not a bad idea.

I am curious:
what kernel version are you using?
Are you using LVM?

Colin


>
> One surprising thing is, all the replicas were reported as
> blockBeingWritten.
>
> Regards,
> Suresh
>
>
> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
>>
>> (Removing hbase list and adding hdfs-dev list as this is pretty internal
>> stuff).
>>
>> Reading through the code a bit:
>>
>> FSDataOutputStream.close calls
>> DFSOutputStream.close calls
>> DFSOutputStream.closeInternal
>>  - sets currentPacket.lastPacketInBlock = true
>>  - then calls
>> DFSOutputStream.flushInternal
>>  - enqueues current packet
>>  - waits for ack
>>
>> BlockReceiver.run
>>  - if (lastPacketInBlock && !receiver.finalized) calls
>> FSDataset.finalizeBlock calls
>> FSDataset.finalizeBlockInternal calls
>> FSVolume.addBlock calls
>> FSDir.addBlock calls
>> FSDir.addBlock
>>  - renames block from "blocksBeingWritten" tmp dir to "current" dest dir
>>
>> This looks to me as I would expect a synchronous chain from a DFS client
>> to moving the file from blocksBeingWritten to the current dir so that once
>> the file is closed that it the block files would be in the proper directory
>> - even if the contents of the file are still in the OS buffer rather than
>> synced to disk.  It's only after this moving of blocks that
>> NameNode.complete file is called.  There are several conditions and loops in
>> there that I'm not certain this chain is fully reliable in all cases without
>> a greater understanding of the code.
>>
>> Could it be the case that the rename operation itself is not synced and
>> that ext3 lost the fact that the block files were moved?
>> Or is there a bug in the close file logic that for some reason the block
>> files are not always moved into place when a file is closed?
>>
>> Thanks for your patience,
>> Dave
>>
>>
>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net> wrote:
>>>
>>> Thanks for the response, Suresh.
>>>
>>> I'm not sure that I understand the details properly.  From my reading of
>>> HDFS-744 the hsync API would allow a client to make sure that at any point
>>> in time it's writes so far hit the disk.  For example, for HBase it could
>>> apply a fsync after adding some edits to its WAL to ensure those edits are
>>> fully durable for a file which is still open.
>>>
>>> However, in this case the dfs file was closed and even renamed.  Is it
>>> the case that even after a dfs file is closed and renamed that the data
>>> blocks would still not be synced and would still be stored by the datanode
>>> in "blocksBeingWritten" rather than in "current"?  If that is case, would it
>>> be better for the NameNode not to reject replicas that are in
>>> blocksBeingWritten, especially if it doesn't have any other replicas
>>> available?
>>>
>>> Dave
>>>
>>>
>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>
>>> wrote:
>>>>
>>>> Yes this is a known issue.
>>>>
>>>> The HDFS part of this was addressed in
>>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is
>>>> not
>>>> available in 1.x  release. I think HBase does not use this API yet.
>>>>
>>>>
>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net> wrote:
>>>>
>>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On Saturday
>>>> > the
>>>> > data center we were in had a total power failure and the cluster went
>>>> > down
>>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
>>>> > We
>>>> > recovered the data in question from our secondary datacenter, but I'm
>>>> > trying to understand what happened and whether this is a bug in HDFS
>>>> > that
>>>> > should be fixed.
>>>> >
>>>> > From what I can tell the file was created and closed by the dfs client
>>>> > (hbase).  Then HBase renamed it into a new directory and deleted some
>>>> > other
>>>> > files containing the same data.  Then the cluster lost power.  After
>>>> > the
>>>> > cluster was restarted, the datanodes reported into the namenode but
>>>> > the
>>>> > blocks for this file appeared as "blocks being written" - the namenode
>>>> > rejected them and the datanodes deleted the blocks.  At this point
>>>> > there
>>>> > were no replicas for the blocks and the files were marked CORRUPT.
>>>> > The
>>>> > underlying file systems are ext3.  Some questions that I would love
>>>> > get
>>>> > answers for if anyone with deeper understanding of HDFS can chime in:
>>>> >
>>>> >  - Is this a known scenario where data loss is expected?  (I found
>>>> > HDFS-1539 but that seems different)
>>>> >  - When are blocks moved from blocksBeingWritten to current?  Does
>>>> > that
>>>> > happen before a file close operation is acknowledged to a hdfs client?
>>>> >  - Could it be that the DataNodes actually moved the blocks to current
>>>> > but
>>>> > after the restart ext3 rewound state somehow (forgive my ignorance of
>>>> > underlying file system behavior)?
>>>> >  - Is there any other explanation for how this can happen?
>>>> >
>>>> > Here is a sequence of selected relevant log lines from the RS (HBase
>>>> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
>>>> > question).  It includes everything that mentions the block in question
>>>> > in
>>>> > the NameNode and one DataNode log.  Please let me know if this more
>>>> > information that would be helpful.
>>>> >
>>>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
>>>> > Creating
>>>> >
>>>> > file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > with permission=rwxrwxrwx
>>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.allocateBlock:
>>>> >
>>>> > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
>>>> > blk_1395839728632046111_357084589
>>>> > DN 2013-06-29 11:16:06,832 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
>>>> > 10.0.5.237:50010
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added
>>>> > to
>>>> > blk_1395839728632046111_357084589 size 25418340
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added
>>>> > to
>>>> > blk_1395839728632046111_357084589 size 25418340
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added
>>>> > to
>>>> > blk_1395839728632046111_357084589 size 25418340
>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>>>> > blk_1395839728632046111_357084589 of size 25418340 from
>>>> > /10.0.5.237:14327
>>>>
>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
>>>> > block blk_1395839728632046111_357084589 terminating
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > Removing lease on  file
>>>> >
>>>> > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > DIR*
>>>> > NameSystem.completeFile: file
>>>> >
>>>> > /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > RS 2013-06-29 11:16:11,393 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at
>>>> >
>>>> > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > to
>>>> >
>>>> > hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > RS 2013-06-29 11:16:11,505 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major compaction
>>>> > of 7
>>>> > file(s) in n of
>>>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>>>> > into
>>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is
>>>> > 24.2m
>>>> >
>>>> > -------  CRASH, RESTART ---------
>>>> >
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but
>>>> > was
>>>> > rejected: Reported as block being written but is a block of closed
>>>> > file.
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> > invalidSet
>>>> > of 10.0.6.1:50010
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264
>>>> > but
>>>> > was rejected: Reported as block being written but is a block of closed
>>>> > file.
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> > invalidSet
>>>> > of 10.0.5.237:50010
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but
>>>> > was rejected: Reported as block being written but is a block of closed
>>>> > file.
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> > invalidSet
>>>> > of 10.0.6.24:50010
>>>> > (note the clock on the server running DN is wrong after restart.  I
>>>> > believe timestamps are off by 6 hours:)
>>>> > DN 2013-06-29 06:07:22,877 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>>>> > blk_1395839728632046111_357084589 file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for
>>>> > deletion
>>>> > DN 2013-06-29 06:07:24,952 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>>>> > blk_1395839728632046111_357084589 at file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>> >
>>>> >
>>>> > Thanks,
>>>> > Dave
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> http://hortonworks.com/download/
>>>
>>>
>>
>
>
>
> --
> http://hortonworks.com/download/

Re: data loss after cluster wide power loss

Posted by Dave Latham <la...@davelink.net>.
Much appreciated, Suresh.  Let me know if I can provide any more
information or if you'd like me to open a JIRA.

Dave


On Mon, Jul 1, 2013 at 8:48 PM, Suresh Srinivas <su...@hortonworks.com>wrote:

> Dave,
>
> Thanks for the detailed email. Sorry I did not read all the details you
> had sent earlier completely (on my phone). As you said, this is not related
> to data loss related to HBase log and hsync. I think you are right; the
> rename operation itself might not have hit the disk. I think we should
> either ensure metadata operation is synced on the datanode or handle it
> being reported as blockBeingWritten. Let me spend sometime to debug this
> issue.
>
> One surprising thing is, all the replicas were reported as
> blockBeingWritten.
>
> Regards,
> Suresh
>
>
> On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:
>
>> (Removing hbase list and adding hdfs-dev list as this is pretty internal
>> stuff).
>>
>> Reading through the code a bit:
>>
>> FSDataOutputStream.close calls
>> DFSOutputStream.close calls
>> DFSOutputStream.closeInternal
>>  - sets currentPacket.lastPacketInBlock = true
>>  - then calls
>> DFSOutputStream.flushInternal
>>  - enqueues current packet
>>  - waits for ack
>>
>> BlockReceiver.run
>>  - if (lastPacketInBlock && !receiver.finalized) calls
>> FSDataset.finalizeBlock calls
>> FSDataset.finalizeBlockInternal calls
>> FSVolume.addBlock calls
>> FSDir.addBlock calls
>> FSDir.addBlock
>>  - renames block from "blocksBeingWritten" tmp dir to "current" dest dir
>>
>> This looks to me like what I would expect: a synchronous chain from the
>> DFS client's close call to moving the block files from blocksBeingWritten
>> to the current dir, so that once the file is closed the block files are in
>> the proper directory - even if the contents of the files are still in the
>> OS buffer rather than synced to disk.  It's only after this move of the
>> blocks that NameNode.completeFile is called.  There are several conditions
>> and loops in there, though, so I'm not certain this chain is fully
>> reliable in all cases without a greater understanding of the code.
>>
>> Could it be the case that the rename operation itself is not synced and
>> that ext3 lost the fact that the block files were moved?
>> Or is there a bug in the close file logic that for some reason the block
>> files are not always moved into place when a file is closed?
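
A minimal sketch of what a crash-safe version of that final rename could
look like on ext3 (the class name and arguments below are made up; this is
not the actual FSDir/FSDataset code).  rename() by itself only updates
directory entries in the page cache, so the destination directory also has
to be fsync'd for the move to survive a power loss:

    import java.io.File;
    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.StandardOpenOption;

    class DurableBlockMoveSketch {
      // Move a finalized block file from blocksBeingWritten to current and
      // make the move itself durable, not just the file contents.
      static void moveDurably(File bbwBlockFile, File currentDir)
          throws IOException {
        File dest = new File(currentDir, bbwBlockFile.getName());
        if (!bbwBlockFile.renameTo(dest)) {
          throw new IOException("rename failed: " + bbwBlockFile + " -> " + dest);
        }
        // fsync the destination directory so the new entry itself is on disk.
        // Opening a directory read-only and calling force() works on
        // Linux/ext3, though the Java API does not guarantee it everywhere.
        try (FileChannel dir = FileChannel.open(currentDir.toPath(),
            StandardOpenOption.READ)) {
          dir.force(true);
        }
      }
    }

Without that final force() the rename can sit only in memory, which would be
consistent with the datanodes coming back after the power loss with the
blocks still under blocksBeingWritten.
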
>>
>> Thanks for your patience,
>> Dave
>>
>>
>> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net> wrote:
>>
>>> Thanks for the response, Suresh.
>>>
>>> I'm not sure that I understand the details properly.  From my reading of
>>> HDFS-744, the hsync API would allow a client to make sure that, at any
>>> point in time, its writes so far have hit the disk.  For example, HBase
>>> could issue an hsync after adding some edits to its WAL to ensure those
>>> edits are fully durable for a file which is still open.
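
As a rough illustration with the Hadoop 2.x client API (the path below is
made up, and this is not HBase's actual WAL code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WalHsyncSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example-wal"))) {
          out.write("edit-1\n".getBytes("UTF-8"));
          out.hflush(); // flush to the datanodes so new readers see the edit
          out.hsync();  // HDFS-744: also ask the datanodes to fsync to disk
        }
      }
    }
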
>>>
>>> However, in this case the dfs file was closed and even renamed.  Is it
>>> the case that even after a dfs file is closed and renamed, the data
>>> blocks would still not be synced and would still be stored by the
>>> datanode in "blocksBeingWritten" rather than in "current"?  If that is
>>> the case, would it be better for the NameNode not to reject replicas that
>>> are in blocksBeingWritten, especially if it doesn't have any other
>>> replicas available?
>>>
>>> Dave
>>>
>>>
>>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>wrote:
>>>
>>>> Yes this is a known issue.
>>>>
>>>> The HDFS part of this was addressed in
>>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is not
>>>> available in the 1.x releases. I think HBase does not use this API yet.
>>>>
>>>>
>>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net>
>>>> wrote:
>>>>
>>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On Saturday
>>>> the
>>>> > data center we were in had a total power failure and the cluster went
>>>> down
>>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
>>>>  We
>>>> > recovered the data in question from our secondary datacenter, but I'm
>>>> > trying to understand what happened and whether this is a bug in HDFS
>>>> that
>>>> > should be fixed.
>>>> >
>>>> > From what I can tell the file was created and closed by the dfs client
>>>> > (hbase).  Then HBase renamed it into a new directory and deleted some
>>>> other
>>>> > files containing the same data.  Then the cluster lost power.  After
>>>> the
>>>> > cluster was restarted, the datanodes reported into the namenode but
>>>> the
>>>> > blocks for this file appeared as "blocks being written" - the namenode
>>>> > rejected them and the datanodes deleted the blocks.  At this point
>>>> there
>>>> > were no replicas for the blocks and the files were marked CORRUPT.
>>>>  The
>>>> > underlying file systems are ext3.  Some questions that I would love
>>>> get
>>>> > answers for if anyone with deeper understanding of HDFS can chime in:
>>>> >
>>>> >  - Is this a known scenario where data loss is expected?  (I found
>>>> > HDFS-1539 but that seems different)
>>>> >  - When are blocks moved from blocksBeingWritten to current?  Does
>>>> that
>>>> > happen before a file close operation is acknowledged to a hdfs client?
>>>> >  - Could it be that the DataNodes actually moved the blocks to
>>>> current but
>>>> > after the restart ext3 rewound state somehow (forgive my ignorance of
>>>> > underlying file system behavior)?
>>>> >  - Is there any other explanation for how this can happen?
>>>> >
>>>> > Here is a sequence of selected relevant log lines from the RS (HBase
>>>> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
>>>> > question).  It includes everything that mentions the block in
>>>> question in
>>>> > the NameNode and one DataNode log.  Please let me know if this more
>>>> > information that would be helpful.
>>>> >
>>>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
>>>> > Creating
>>>> >
>>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > with permission=rwxrwxrwx
>>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.allocateBlock:
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
>>>> > blk_1395839728632046111_357084589
>>>> > DN 2013-06-29 11:16:06,832 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
>>>> > 10.0.5.237:50010
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added
>>>> to
>>>> > blk_1395839728632046111_357084589 size 25418340
>>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is
>>>> added to
>>>> > blk_1395839728632046111_357084589 size 25418340
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
>>>> added to
>>>> > blk_1395839728632046111_357084589 size 25418340
>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>>>> > blk_1395839728632046111_357084589 of size 25418340 from /
>>>> 10.0.5.237:14327
>>>>
>>>> > DN 2013-06-29 11:16:11,385 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
>>>> > block blk_1395839728632046111_357084589 terminating
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> > Removing lease on  file
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>>> DIR*
>>>> > NameSystem.completeFile: file
>>>> >
>>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>>>> > RS 2013-06-29 11:16:11,393 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at
>>>> >
>>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > to
>>>> >
>>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
>>>> > RS 2013-06-29 11:16:11,505 INFO
>>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major
>>>> compaction of 7
>>>> > file(s) in n of
>>>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>>>> into
>>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is
>>>> 24.2m
>>>> >
>>>> > -------  CRASH, RESTART ---------
>>>> >
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112
>>>> but was
>>>> > rejected: Reported as block being written but is a block of closed
>>>> file.
>>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.6.1:50010
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264
>>>> but
>>>> > was rejected: Reported as block being written but is a block of
>>>> closed file.
>>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.5.237:50010
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088
>>>> but
>>>> > was rejected: Reported as block being written but is a block of
>>>> closed file.
>>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>>> BLOCK*
>>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>>> invalidSet
>>>> > of 10.0.6.24:50010
>>>> > (note the clock on the server running DN is wrong after restart.  I
>>>> > believe timestamps are off by 6 hours:)
>>>> > DN 2013-06-29 06:07:22,877 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>>>> > blk_1395839728632046111_357084589 file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for
>>>> > deletion
>>>> > DN 2013-06-29 06:07:24,952 INFO
>>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>>>> > blk_1395839728632046111_357084589 at file
>>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>>> >
>>>> >
>>>> > Thanks,
>>>> > Dave
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> http://hortonworks.com/download/
>>>>
>>>
>>>
>>
>
>
> --
> http://hortonworks.com/download/
>

Re: data loss after cluster wide power loss

Posted by Suresh Srinivas <su...@hortonworks.com>.
Dave,

Thanks for the detailed email. Sorry, I did not read all the details you
sent earlier completely (I was on my phone). As you said, this is not
related to the HBase WAL / hsync data loss issue. I think you are right; the
rename operation itself might not have hit the disk. I think we should
either ensure the metadata operation is synced on the datanode or handle it
being reported as blockBeingWritten. Let me spend some time debugging this
issue.

One surprising thing is, all the replicas were reported as
blockBeingWritten.

Regards,
Suresh


On Mon, Jul 1, 2013 at 6:03 PM, Dave Latham <la...@davelink.net> wrote:

> (Removing hbase list and adding hdfs-dev list as this is pretty internal
> stuff).
>
> Reading through the code a bit:
>
> FSDataOutputStream.close calls
> DFSOutputStream.close calls
> DFSOutputStream.closeInternal
>  - sets currentPacket.lastPacketInBlock = true
>  - then calls
> DFSOutputStream.flushInternal
>  - enqueues current packet
>  - waits for ack
>
> BlockReceiver.run
>  - if (lastPacketInBlock && !receiver.finalized) calls
> FSDataset.finalizeBlock calls
> FSDataset.finalizeBlockInternal calls
> FSVolume.addBlock calls
> FSDir.addBlock calls
> FSDir.addBlock
>  - renames block from "blocksBeingWritten" tmp dir to "current" dest dir
>
> This looks to me like what I would expect: a synchronous chain from the
> DFS client's close call to moving the block files from blocksBeingWritten
> to the current dir, so that once the file is closed the block files are in
> the proper directory - even if the contents of the files are still in the
> OS buffer rather than synced to disk.  It's only after this move of the
> blocks that NameNode.completeFile is called.  There are several conditions
> and loops in there, though, so I'm not certain this chain is fully reliable
> in all cases without a greater understanding of the code.
>
> Could it be the case that the rename operation itself is not synced and
> that ext3 lost the fact that the block files were moved?
> Or is there a bug in the close file logic that for some reason the block
> files are not always moved into place when a file is closed?
>
> Thanks for your patience,
> Dave
>
>
> On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net> wrote:
>
>> Thanks for the response, Suresh.
>>
>> I'm not sure that I understand the details properly.  From my reading of
>> HDFS-744, the hsync API would allow a client to make sure that, at any
>> point in time, its writes so far have hit the disk.  For example, HBase
>> could issue an hsync after adding some edits to its WAL to ensure those
>> edits are fully durable for a file which is still open.
>>
>> However, in this case the dfs file was closed and even renamed.  Is it
>> the case that even after a dfs file is closed and renamed that the data
>> blocks would still not be synced and would still be stored by the datanode
>> in "blocksBeingWritten" rather than in "current"?  If that is case, would
>> it be better for the NameNode not to reject replicas that are in
>> blocksBeingWritten, especially if it doesn't have any other replicas
>> available?
>>
>> Dave
>>
>>
>> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>wrote:
>>
>>> Yes this is a known issue.
>>>
>>> The HDFS part of this was addressed in
>>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is
>>> not
>>> available in 1.x  release. I think HBase does not use this API yet.
>>>
>>>
>>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net> wrote:
>>>
>>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On Saturday
>>> the
>>> > data center we were in had a total power failure and the cluster went
>>> down
>>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.
>>>  We
>>> > recovered the data in question from our secondary datacenter, but I'm
>>> > trying to understand what happened and whether this is a bug in HDFS
>>> that
>>> > should be fixed.
>>> >
>>> > From what I can tell the file was created and closed by the dfs client
>>> > (hbase).  Then HBase renamed it into a new directory and deleted some
>>> other
>>> > files containing the same data.  Then the cluster lost power.  After
>>> the
>>> > cluster was restarted, the datanodes reported into the namenode but the
>>> > blocks for this file appeared as "blocks being written" - the namenode
>>> > rejected them and the datanodes deleted the blocks.  At this point
>>> there
>>> > were no replicas for the blocks and the files were marked CORRUPT.  The
>>> > underlying file systems are ext3.  Some questions that I would love get
>>> > answers for if anyone with deeper understanding of HDFS can chime in:
>>> >
>>> >  - Is this a known scenario where data loss is expected?  (I found
>>> > HDFS-1539 but that seems different)
>>> >  - When are blocks moved from blocksBeingWritten to current?  Does that
>>> > happen before a file close operation is acknowledged to a hdfs client?
>>> >  - Could it be that the DataNodes actually moved the blocks to current
>>> but
>>> > after the restart ext3 rewound state somehow (forgive my ignorance of
>>> > underlying file system behavior)?
>>> >  - Is there any other explanation for how this can happen?
>>> >
>>> > Here is a sequence of selected relevant log lines from the RS (HBase
>>> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
>>> > question).  It includes everything that mentions the block in question
>>> in
>>> > the NameNode and one DataNode log.  Please let me know if this more
>>> > information that would be helpful.
>>> >
>>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
>>> > Creating
>>> >
>>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>> > with permission=rwxrwxrwx
>>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
>>> BLOCK*
>>> > NameSystem.allocateBlock:
>>> >
>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
>>> > blk_1395839728632046111_357084589
>>> > DN 2013-06-29 11:16:06,832 INFO
>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
>>> > 10.0.5.237:50010
>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>> BLOCK*
>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added
>>> to
>>> > blk_1395839728632046111_357084589 size 25418340
>>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>>> BLOCK*
>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added
>>> to
>>> > blk_1395839728632046111_357084589 size 25418340
>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>> BLOCK*
>>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is
>>> added to
>>> > blk_1395839728632046111_357084589 size 25418340
>>> > DN 2013-06-29 11:16:11,385 INFO
>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>>> > blk_1395839728632046111_357084589 of size 25418340 from /
>>> 10.0.5.237:14327
>>>
>>> > DN 2013-06-29 11:16:11,385 INFO
>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
>>> > block blk_1395839728632046111_357084589 terminating
>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>> > Removing lease on  file
>>> >
>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
>>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>>> DIR*
>>> > NameSystem.completeFile: file
>>> >
>>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>>> > RS 2013-06-29 11:16:11,393 INFO
>>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at
>>> >
>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>>> > to
>>> >
>>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
>>> > RS 2013-06-29 11:16:11,505 INFO
>>> > org.apache.hadoop.hbase.regionserver.Store: Completed major compaction
>>> of 7
>>> > file(s) in n of
>>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>>> into
>>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is
>>> 24.2m
>>> >
>>> > -------  CRASH, RESTART ---------
>>> >
>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>> BLOCK*
>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but
>>> was
>>> > rejected: Reported as block being written but is a block of closed
>>> file.
>>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>>> BLOCK*
>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>> invalidSet
>>> > of 10.0.6.1:50010
>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>> BLOCK*
>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264
>>> but
>>> > was rejected: Reported as block being written but is a block of closed
>>> file.
>>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>>> BLOCK*
>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>> invalidSet
>>> > of 10.0.5.237:50010
>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>> BLOCK*
>>> > NameSystem.addStoredBlock: addStoredBlock request received for
>>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but
>>> > was rejected: Reported as block being written but is a block of closed
>>> file.
>>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>>> BLOCK*
>>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>>> invalidSet
>>> > of 10.0.6.24:50010
>>> > (note the clock on the server running DN is wrong after restart.  I
>>> > believe timestamps are off by 6 hours:)
>>> > DN 2013-06-29 06:07:22,877 INFO
>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>>> > blk_1395839728632046111_357084589 file
>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for
>>> > deletion
>>> > DN 2013-06-29 06:07:24,952 INFO
>>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>>> > blk_1395839728632046111_357084589 at file
>>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>>> >
>>> >
>>> > Thanks,
>>> > Dave
>>> >
>>>
>>>
>>>
>>> --
>>> http://hortonworks.com/download/
>>>
>>
>>
>


-- 
http://hortonworks.com/download/

Re: data loss after cluster wide power loss

Posted by Dave Latham <la...@davelink.net>.
(Removing hbase list and adding hdfs-dev list as this is pretty internal
stuff).

Reading through the code a bit:

FSDataOutputStream.close calls
DFSOutputStream.close calls
DFSOutputStream.closeInternal
 - sets currentPacket.lastPacketInBlock = true
 - then calls
DFSOutputStream.flushInternal
 - enqueues current packet
 - waits for ack

BlockReceiver.run
 - if (lastPacketInBlock && !receiver.finalized) calls
FSDataset.finalizeBlock calls
FSDataset.finalizeBlockInternal calls
FSVolume.addBlock calls
FSDir.addBlock calls
FSDir.addBlock
 - renames block from "blocksBeingWritten" tmp dir to "current" dest dir

This looks to me like a synchronous chain from the DFS client's close through
to moving the block files from blocksBeingWritten to the current dir, so that
once the file is closed the block files should already be in the proper
directory - even if their contents are still in the OS buffer rather than
synced to disk.  It's only after this move that NameNode.completeFile is
called.  There are several conditions and loops in there, though, so without
a deeper understanding of the code I'm not certain the chain is fully
reliable in all cases.
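
To make the concern concrete, here is a tiny, hypothetical sketch (not the
actual FSDir.addBlock code; the names are invented) of what finalizing a
block by a plain rename amounts to - nothing in it forces the directory
change to disk, so a power loss shortly afterwards could leave the block file
back under blocksBeingWritten on ext3:

    import java.io.File;
    import java.io.IOException;

    // Hypothetical illustration only; names do not match HDFS internals.
    class FinalizeByRenameOnly {
      static void finalizeBlock(File bbwFile, File currentDir) throws IOException {
        File dest = new File(currentDir, bbwFile.getName());
        // renameTo updates directory metadata in memory (and the fs journal),
        // but nothing here fsyncs the file or either directory, so the move
        // may not yet be on disk when power is lost.
        if (!bbwFile.renameTo(dest)) {
          throw new IOException("rename failed: " + bbwFile + " -> " + dest);
        }
      }
    }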

Could it be the case that the rename operation itself is not synced and
that ext3 lost the fact that the block files were moved?
Or is there a bug in the close file logic that for some reason the block
files are not always moved into place when a file is closed?

Thanks for your patience,
Dave


On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net> wrote:

> Thanks for the response, Suresh.
>
> I'm not sure that I understand the details properly.  From my reading of
> HDFS-744 the hsync API would allow a client to make sure that at any point
> in time it's writes so far hit the disk.  For example, for HBase it could
> apply a fsync after adding some edits to its WAL to ensure those edits are
> fully durable for a file which is still open.
>
> However, in this case the dfs file was closed and even renamed.  Is it the
> case that even after a dfs file is closed and renamed that the data blocks
> would still not be synced and would still be stored by the datanode in
> "blocksBeingWritten" rather than in "current"?  If that is case, would it
> be better for the NameNode not to reject replicas that are in
> blocksBeingWritten, especially if it doesn't have any other replicas
> available?
>
> Dave
>
>
> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>wrote:
>
>> Yes this is a known issue.
>>
>> The HDFS part of this was addressed in
>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is not
>> available in 1.x  release. I think HBase does not use this API yet.
>>
>>
>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net> wrote:
>>
>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On Saturday
>> the
>> > data center we were in had a total power failure and the cluster went
>> down
>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.  We
>> > recovered the data in question from our secondary datacenter, but I'm
>> > trying to understand what happened and whether this is a bug in HDFS
>> that
>> > should be fixed.
>> >
>> > From what I can tell the file was created and closed by the dfs client
>> > (hbase).  Then HBase renamed it into a new directory and deleted some
>> other
>> > files containing the same data.  Then the cluster lost power.  After the
>> > cluster was restarted, the datanodes reported into the namenode but the
>> > blocks for this file appeared as "blocks being written" - the namenode
>> > rejected them and the datanodes deleted the blocks.  At this point there
>> > were no replicas for the blocks and the files were marked CORRUPT.  The
>> > underlying file systems are ext3.  Some questions that I would love get
>> > answers for if anyone with deeper understanding of HDFS can chime in:
>> >
>> >  - Is this a known scenario where data loss is expected?  (I found
>> > HDFS-1539 but that seems different)
>> >  - When are blocks moved from blocksBeingWritten to current?  Does that
>> > happen before a file close operation is acknowledged to a hdfs client?
>> >  - Could it be that the DataNodes actually moved the blocks to current
>> but
>> > after the restart ext3 rewound state somehow (forgive my ignorance of
>> > underlying file system behavior)?
>> >  - Is there any other explanation for how this can happen?
>> >
>> > Here is a sequence of selected relevant log lines from the RS (HBase
>> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
>> > question).  It includes everything that mentions the block in question
>> in
>> > the NameNode and one DataNode log.  Please let me know if this more
>> > information that would be helpful.
>> >
>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
>> > Creating
>> >
>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>> > with permission=rwxrwxrwx
>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.allocateBlock:
>> >
>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
>> > blk_1395839728632046111_357084589
>> > DN 2013-06-29 11:16:06,832 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
>> > 10.0.5.237:50010
>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to
>> > blk_1395839728632046111_357084589 size 25418340
>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added
>> to
>> > blk_1395839728632046111_357084589 size 25418340
>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added
>> to
>> > blk_1395839728632046111_357084589 size 25418340
>> > DN 2013-06-29 11:16:11,385 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>> > blk_1395839728632046111_357084589 of size 25418340 from /
>> 10.0.5.237:14327
>> > DN 2013-06-29 11:16:11,385 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
>> > block blk_1395839728632046111_357084589 terminating
>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>> > Removing lease on  file
>> >
>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR*
>> > NameSystem.completeFile: file
>> >
>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>> > RS 2013-06-29 11:16:11,393 INFO
>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at
>> >
>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>> > to
>> >
>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
>> > RS 2013-06-29 11:16:11,505 INFO
>> > org.apache.hadoop.hbase.regionserver.Store: Completed major compaction
>> of 7
>> > file(s) in n of
>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>> into
>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is
>> 24.2m
>> >
>> > -------  CRASH, RESTART ---------
>> >
>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: addStoredBlock request received for
>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but
>> was
>> > rejected: Reported as block being written but is a block of closed file.
>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>> invalidSet
>> > of 10.0.6.1:50010
>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: addStoredBlock request received for
>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but
>> > was rejected: Reported as block being written but is a block of closed
>> file.
>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>> invalidSet
>> > of 10.0.5.237:50010
>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: addStoredBlock request received for
>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but
>> > was rejected: Reported as block being written but is a block of closed
>> file.
>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>> invalidSet
>> > of 10.0.6.24:50010
>> > (note the clock on the server running DN is wrong after restart.  I
>> > believe timestamps are off by 6 hours:)
>> > DN 2013-06-29 06:07:22,877 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>> > blk_1395839728632046111_357084589 file
>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for
>> > deletion
>> > DN 2013-06-29 06:07:24,952 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>> > blk_1395839728632046111_357084589 at file
>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>> >
>> >
>> > Thanks,
>> > Dave
>> >
>>
>>
>>
>> --
>> http://hortonworks.com/download/
>>
>
>

Re: data loss after cluster wide power loss

Posted by Dave Latham <la...@davelink.net>.
(Removing hbase list and adding hdfs-dev list as this is pretty internal
stuff).

Reading through the code a bit:

FSDataOutputStream.close calls
DFSOutputStream.close calls
DFSOutputStream.closeInternal
 - sets currentPacket.lastPacketInBlock = true
 - then calls
DFSOutputStream.flushInternal
 - enqueues current packet
 - waits for ack

BlockReceiver.run
 - if (lastPacketInBlock && !receiver.finalized) calls
FSDataset.finalizeBlock calls
FSDataset.finalizeBlockInternal calls
FSVolume.addBlock calls
FSDir.addBlock calls
FSDir.addBlock
 - renames block from "blocksBeingWritten" tmp dir to "current" dest dir

This looks to me as I would expect a synchronous chain from a DFS client to
moving the file from blocksBeingWritten to the current dir so that once the
file is closed that it the block files would be in the proper directory -
even if the contents of the file are still in the OS buffer rather than
synced to disk.  It's only after this moving of blocks that
NameNode.complete file is called.  There are several conditions and loops
in there that I'm not certain this chain is fully reliable in all cases
without a greater understanding of the code.

Could it be the case that the rename operation itself is not synced and
that ext3 lost the fact that the block files were moved?
Or is there a bug in the close file logic that for some reason the block
files are not always moved into place when a file is closed?

Thanks for your patience,
Dave


On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net> wrote:

> Thanks for the response, Suresh.
>
> I'm not sure that I understand the details properly.  From my reading of
> HDFS-744 the hsync API would allow a client to make sure that at any point
> in time it's writes so far hit the disk.  For example, for HBase it could
> apply a fsync after adding some edits to its WAL to ensure those edits are
> fully durable for a file which is still open.
>
> However, in this case the dfs file was closed and even renamed.  Is it the
> case that even after a dfs file is closed and renamed that the data blocks
> would still not be synced and would still be stored by the datanode in
> "blocksBeingWritten" rather than in "current"?  If that is case, would it
> be better for the NameNode not to reject replicas that are in
> blocksBeingWritten, especially if it doesn't have any other replicas
> available?
>
> Dave
>
>
> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>wrote:
>
>> Yes this is a known issue.
>>
>> The HDFS part of this was addressed in
>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is not
>> available in 1.x  release. I think HBase does not use this API yet.
>>
>>
>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net> wrote:
>>
>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On Saturday
>> the
>> > data center we were in had a total power failure and the cluster went
>> down
>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.  We
>> > recovered the data in question from our secondary datacenter, but I'm
>> > trying to understand what happened and whether this is a bug in HDFS
>> that
>> > should be fixed.
>> >
>> > From what I can tell the file was created and closed by the dfs client
>> > (hbase).  Then HBase renamed it into a new directory and deleted some
>> other
>> > files containing the same data.  Then the cluster lost power.  After the
>> > cluster was restarted, the datanodes reported into the namenode but the
>> > blocks for this file appeared as "blocks being written" - the namenode
>> > rejected them and the datanodes deleted the blocks.  At this point there
>> > were no replicas for the blocks and the files were marked CORRUPT.  The
>> > underlying file systems are ext3.  Some questions that I would love get
>> > answers for if anyone with deeper understanding of HDFS can chime in:
>> >
>> >  - Is this a known scenario where data loss is expected?  (I found
>> > HDFS-1539 but that seems different)
>> >  - When are blocks moved from blocksBeingWritten to current?  Does that
>> > happen before a file close operation is acknowledged to a hdfs client?
>> >  - Could it be that the DataNodes actually moved the blocks to current
>> but
>> > after the restart ext3 rewound state somehow (forgive my ignorance of
>> > underlying file system behavior)?
>> >  - Is there any other explanation for how this can happen?
>> >
>> > Here is a sequence of selected relevant log lines from the RS (HBase
>> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
>> > question).  It includes everything that mentions the block in question
>> in
>> > the NameNode and one DataNode log.  Please let me know if this more
>> > information that would be helpful.
>> >
>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
>> > Creating
>> >
>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>> > with permission=rwxrwxrwx
>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.allocateBlock:
>> >
>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
>> > blk_1395839728632046111_357084589
>> > DN 2013-06-29 11:16:06,832 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
>> > 10.0.5.237:50010
>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to
>> > blk_1395839728632046111_357084589 size 25418340
>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added
>> to
>> > blk_1395839728632046111_357084589 size 25418340
>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added
>> to
>> > blk_1395839728632046111_357084589 size 25418340
>> > DN 2013-06-29 11:16:11,385 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>> > blk_1395839728632046111_357084589 of size 25418340 from /
>> 10.0.5.237:14327
>> > DN 2013-06-29 11:16:11,385 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
>> > block blk_1395839728632046111_357084589 terminating
>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>> > Removing lease on  file
>> >
>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR*
>> > NameSystem.completeFile: file
>> >
>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>> > RS 2013-06-29 11:16:11,393 INFO
>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at
>> >
>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>> > to
>> >
>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
>> > RS 2013-06-29 11:16:11,505 INFO
>> > org.apache.hadoop.hbase.regionserver.Store: Completed major compaction
>> of 7
>> > file(s) in n of
>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>> into
>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is
>> 24.2m
>> >
>> > -------  CRASH, RESTART ---------
>> >
>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: addStoredBlock request received for
>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but
>> was
>> > rejected: Reported as block being written but is a block of closed file.
>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>> invalidSet
>> > of 10.0.6.1:50010
>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: addStoredBlock request received for
>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but
>> > was rejected: Reported as block being written but is a block of closed
>> file.
>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>> invalidSet
>> > of 10.0.5.237:50010
>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: addStoredBlock request received for
>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but
>> > was rejected: Reported as block being written but is a block of closed
>> file.
>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>> invalidSet
>> > of 10.0.6.24:50010
>> > (note the clock on the server running DN is wrong after restart.  I
>> > believe timestamps are off by 6 hours:)
>> > DN 2013-06-29 06:07:22,877 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>> > blk_1395839728632046111_357084589 file
>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for
>> > deletion
>> > DN 2013-06-29 06:07:24,952 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>> > blk_1395839728632046111_357084589 at file
>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>> >
>> >
>> > Thanks,
>> > Dave
>> >
>>
>>
>>
>> --
>> http://hortonworks.com/download/
>>
>
>

Re: data loss after cluster wide power loss

Posted by Dave Latham <la...@davelink.net>.
(Removing hbase list and adding hdfs-dev list as this is pretty internal
stuff).

Reading through the code a bit:

FSDataOutputStream.close calls
DFSOutputStream.close calls
DFSOutputStream.closeInternal
 - sets currentPacket.lastPacketInBlock = true
 - then calls
DFSOutputStream.flushInternal
 - enqueues current packet
 - waits for ack

BlockReceiver.run
 - if (lastPacketInBlock && !receiver.finalized) calls
FSDataset.finalizeBlock calls
FSDataset.finalizeBlockInternal calls
FSVolume.addBlock calls
FSDir.addBlock calls
FSDir.addBlock
 - renames block from "blocksBeingWritten" tmp dir to "current" dest dir

This looks to me like a synchronous chain from the DFS client closing a
file to the block files being moved from blocksBeingWritten to the current
dir, so that once the file is closed the block files should be in the
proper directory - even if the contents of the files are still in the OS
buffer rather than synced to disk.  It's only after this moving of blocks
that the NameNode's completeFile is called.  There are several conditions
and loops in there, though, so without a greater understanding of the code
I'm not certain this chain is fully reliable in all cases.
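
To make that last step concrete, here is a minimal sketch (in plain Java)
of what finalizing a replica amounts to on the DataNode's local filesystem.
This is a simplification for illustration, not the actual FSDir.addBlock
code - the real code also moves the block's .meta file and hashes blocks
into subdirectories - and the volumeDir/blockFileName names are made up:

import java.io.File;
import java.io.IOException;

public class FinalizeBlockSketch {
  // Simplified illustration only - not the actual FSDir.addBlock code.
  static File finalizeBlock(File volumeDir, String blockFileName) throws IOException {
    File src = new File(new File(volumeDir, "blocksBeingWritten"), blockFileName);
    File dst = new File(new File(volumeDir, "current"), blockFileName);
    // renameTo only rewrites directory entries; it does not force the
    // file data or the directory metadata to stable storage.
    if (!src.renameTo(dst)) {
      throw new IOException("Failed to rename " + src + " to " + dst);
    }
    return dst;
  }
}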

Could it be the case that the rename operation itself is not synced and
that ext3 lost the fact that the block files were moved?
Or is there a bug in the file close logic such that, for some reason, the
block files are not always moved into place when a file is closed?
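
For comparison, here is a sketch of what a rename that is forced to disk
would look like.  This is not HDFS code and not something HDFS 1.0.2 does -
it only illustrates the directory fsync whose absence is being asked about,
using the Java 7 NIO API; whether force() works on a directory handle is
platform dependent:

import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

public class DurableRenameSketch {
  // Sketch of the missing step under discussion: after renaming, fsync
  // the destination directory so the new entry survives a power failure.
  static void renameDurably(Path src, Path dst) throws Exception {
    Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
    try (FileChannel dir = FileChannel.open(dst.getParent(), StandardOpenOption.READ)) {
      // On Linux this fsyncs the directory; on some platforms opening a
      // directory channel or calling force() may throw IOException.
      dir.force(true);
    }
  }

  public static void main(String[] args) throws Exception {
    renameDurably(Paths.get(args[0]), Paths.get(args[1]));
  }
}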

Thanks for your patience,
Dave


On Mon, Jul 1, 2013 at 3:35 PM, Dave Latham <la...@davelink.net> wrote:

> Thanks for the response, Suresh.
>
> I'm not sure that I understand the details properly.  From my reading of
> HDFS-744 the hsync API would allow a client to make sure that at any point
> in time it's writes so far hit the disk.  For example, for HBase it could
> apply a fsync after adding some edits to its WAL to ensure those edits are
> fully durable for a file which is still open.
>
> However, in this case the dfs file was closed and even renamed.  Is it the
> case that even after a dfs file is closed and renamed that the data blocks
> would still not be synced and would still be stored by the datanode in
> "blocksBeingWritten" rather than in "current"?  If that is case, would it
> be better for the NameNode not to reject replicas that are in
> blocksBeingWritten, especially if it doesn't have any other replicas
> available?
>
> Dave
>
>
> On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>wrote:
>
>> Yes this is a known issue.
>>
>> The HDFS part of this was addressed in
>> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is not
>> available in 1.x  release. I think HBase does not use this API yet.
>>
>>
>> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net> wrote:
>>
>> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On Saturday
>> the
>> > data center we were in had a total power failure and the cluster went
>> down
>> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.  We
>> > recovered the data in question from our secondary datacenter, but I'm
>> > trying to understand what happened and whether this is a bug in HDFS
>> that
>> > should be fixed.
>> >
>> > From what I can tell the file was created and closed by the dfs client
>> > (hbase).  Then HBase renamed it into a new directory and deleted some
>> other
>> > files containing the same data.  Then the cluster lost power.  After the
>> > cluster was restarted, the datanodes reported into the namenode but the
>> > blocks for this file appeared as "blocks being written" - the namenode
>> > rejected them and the datanodes deleted the blocks.  At this point there
>> > were no replicas for the blocks and the files were marked CORRUPT.  The
>> > underlying file systems are ext3.  Some questions that I would love get
>> > answers for if anyone with deeper understanding of HDFS can chime in:
>> >
>> >  - Is this a known scenario where data loss is expected?  (I found
>> > HDFS-1539 but that seems different)
>> >  - When are blocks moved from blocksBeingWritten to current?  Does that
>> > happen before a file close operation is acknowledged to a hdfs client?
>> >  - Could it be that the DataNodes actually moved the blocks to current
>> but
>> > after the restart ext3 rewound state somehow (forgive my ignorance of
>> > underlying file system behavior)?
>> >  - Is there any other explanation for how this can happen?
>> >
>> > Here is a sequence of selected relevant log lines from the RS (HBase
>> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
>> > question).  It includes everything that mentions the block in question
>> in
>> > the NameNode and one DataNode log.  Please let me know if this more
>> > information that would be helpful.
>> >
>> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
>> > Creating
>> >
>> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>> > with permission=rwxrwxrwx
>> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.allocateBlock:
>> >
>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
>> > blk_1395839728632046111_357084589
>> > DN 2013-06-29 11:16:06,832 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
>> > 10.0.5.237:50010
>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to
>> > blk_1395839728632046111_357084589 size 25418340
>> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added
>> to
>> > blk_1395839728632046111_357084589 size 25418340
>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added
>> to
>> > blk_1395839728632046111_357084589 size 25418340
>> > DN 2013-06-29 11:16:11,385 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
>> > blk_1395839728632046111_357084589 of size 25418340 from /
>> 10.0.5.237:14327
>> > DN 2013-06-29 11:16:11,385 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
>> > block blk_1395839728632046111_357084589 terminating
>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
>> > Removing lease on  file
>> >
>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>> > from client DFSClient_hb_rs_hs745,60020,1372470111932
>> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR*
>> > NameSystem.completeFile: file
>> >
>> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
>> > RS 2013-06-29 11:16:11,393 INFO
>> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at
>> >
>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
>> > to
>> >
>> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
>> > RS 2013-06-29 11:16:11,505 INFO
>> > org.apache.hadoop.hbase.regionserver.Store: Completed major compaction
>> of 7
>> > file(s) in n of
>> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
>> into
>> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is
>> 24.2m
>> >
>> > -------  CRASH, RESTART ---------
>> >
>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: addStoredBlock request received for
>> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but
>> was
>> > rejected: Reported as block being written but is a block of closed file.
>> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>> invalidSet
>> > of 10.0.6.1:50010
>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: addStoredBlock request received for
>> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but
>> > was rejected: Reported as block being written but is a block of closed
>> file.
>> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>> invalidSet
>> > of 10.0.5.237:50010
>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addStoredBlock: addStoredBlock request received for
>> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but
>> > was rejected: Reported as block being written but is a block of closed
>> file.
>> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
>> BLOCK*
>> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
>> invalidSet
>> > of 10.0.6.24:50010
>> > (note the clock on the server running DN is wrong after restart.  I
>> > believe timestamps are off by 6 hours:)
>> > DN 2013-06-29 06:07:22,877 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
>> > blk_1395839728632046111_357084589 file
>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for
>> > deletion
>> > DN 2013-06-29 06:07:24,952 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
>> > blk_1395839728632046111_357084589 at file
>> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>> >
>> >
>> > Thanks,
>> > Dave
>> >
>>
>>
>>
>> --
>> http://hortonworks.com/download/
>>
>
>

Re: data loss after cluster wide power loss

Posted by Dave Latham <la...@davelink.net>.
Thanks for the response, Suresh.

I'm not sure that I understand the details properly.  From my reading of
HDFS-744, the hsync API would allow a client to make sure that at any point
in time its writes so far have hit the disk.  For example, HBase could call
hsync after adding some edits to its WAL to ensure those edits are fully
durable for a file which is still open.
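
As a rough sketch of what that could look like against the Hadoop 2.x
client API (the WAL path and edit bytes below are made up for illustration,
and hsync itself comes from HDFS-744, so it is not available to a 1.x
client):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalHsyncSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical WAL path, for illustration only.
    Path wal = new Path("/hbase/.logs/example-wal");
    FSDataOutputStream out = fs.create(wal);
    out.write("some wal edit".getBytes("UTF-8"));
    // hsync (Hadoop 2.x) asks the pipeline datanodes to fsync the data to
    // disk while the file is still open, so the edits can survive a
    // cluster-wide power loss.
    out.hsync();
    out.close();
  }
}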

However, in this case the dfs file was closed and even renamed.  Is it the
case that even after a dfs file is closed and renamed, the data blocks
would still not be synced and would still be stored by the datanode in
"blocksBeingWritten" rather than in "current"?  If that is the case, would
it be better for the NameNode not to reject replicas that are in
blocksBeingWritten, especially if it doesn't have any other replicas
available?

Dave


On Mon, Jul 1, 2013 at 3:16 PM, Suresh Srinivas <su...@hortonworks.com>wrote:

> Yes this is a known issue.
>
> The HDFS part of this was addressed in
> https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is not
> available in 1.x  release. I think HBase does not use this API yet.
>
>
> On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net> wrote:
>
> > We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On Saturday the
> > data center we were in had a total power failure and the cluster went
> down
> > hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.  We
> > recovered the data in question from our secondary datacenter, but I'm
> > trying to understand what happened and whether this is a bug in HDFS that
> > should be fixed.
> >
> > From what I can tell the file was created and closed by the dfs client
> > (hbase).  Then HBase renamed it into a new directory and deleted some
> other
> > files containing the same data.  Then the cluster lost power.  After the
> > cluster was restarted, the datanodes reported into the namenode but the
> > blocks for this file appeared as "blocks being written" - the namenode
> > rejected them and the datanodes deleted the blocks.  At this point there
> > were no replicas for the blocks and the files were marked CORRUPT.  The
> > underlying file systems are ext3.  Some questions that I would love get
> > answers for if anyone with deeper understanding of HDFS can chime in:
> >
> >  - Is this a known scenario where data loss is expected?  (I found
> > HDFS-1539 but that seems different)
> >  - When are blocks moved from blocksBeingWritten to current?  Does that
> > happen before a file close operation is acknowledged to a hdfs client?
> >  - Could it be that the DataNodes actually moved the blocks to current
> but
> > after the restart ext3 rewound state somehow (forgive my ignorance of
> > underlying file system behavior)?
> >  - Is there any other explanation for how this can happen?
> >
> > Here is a sequence of selected relevant log lines from the RS (HBase
> > Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
> > question).  It includes everything that mentions the block in question in
> > the NameNode and one DataNode log.  Please let me know if this more
> > information that would be helpful.
> >
> > RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
> > Creating
> >
> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> > with permission=rwxrwxrwx
> > NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange:
> BLOCK*
> > NameSystem.allocateBlock:
> >
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
> > blk_1395839728632046111_357084589
> > DN 2013-06-29 11:16:06,832 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> > blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
> > 10.0.5.237:50010
> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> BLOCK*
> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to
> > blk_1395839728632046111_357084589 size 25418340
> > NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange:
> BLOCK*
> > NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to
> > blk_1395839728632046111_357084589 size 25418340
> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> BLOCK*
> > NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added
> to
> > blk_1395839728632046111_357084589 size 25418340
> > DN 2013-06-29 11:16:11,385 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> > blk_1395839728632046111_357084589 of size 25418340 from /
> 10.0.5.237:14327
> > DN 2013-06-29 11:16:11,385 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
> > block blk_1395839728632046111_357084589 terminating
> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> > Removing lease on  file
> >
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> > from client DFSClient_hb_rs_hs745,60020,1372470111932
> > NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR*
> > NameSystem.completeFile: file
> >
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> > is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> > RS 2013-06-29 11:16:11,393 INFO
> > org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at
> >
> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> > to
> >
> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
> > RS 2013-06-29 11:16:11,505 INFO
> > org.apache.hadoop.hbase.regionserver.Store: Completed major compaction
> of 7
> > file(s) in n of
> > users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e.
> into
> > 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is
> 24.2m
> >
> > -------  CRASH, RESTART ---------
> >
> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> BLOCK*
> > NameSystem.addStoredBlock: addStoredBlock request received for
> > blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but
> was
> > rejected: Reported as block being written but is a block of closed file.
> > NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange:
> BLOCK*
> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> invalidSet
> > of 10.0.6.1:50010
> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> BLOCK*
> > NameSystem.addStoredBlock: addStoredBlock request received for
> > blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but
> > was rejected: Reported as block being written but is a block of closed
> file.
> > NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange:
> BLOCK*
> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> invalidSet
> > of 10.0.5.237:50010
> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> BLOCK*
> > NameSystem.addStoredBlock: addStoredBlock request received for
> > blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but
> > was rejected: Reported as block being written but is a block of closed
> file.
> > NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange:
> BLOCK*
> > NameSystem.addToInvalidates: blk_1395839728632046111 is added to
> invalidSet
> > of 10.0.6.24:50010
> > (note the clock on the server running DN is wrong after restart.  I
> > believe timestamps are off by 6 hours:)
> > DN 2013-06-29 06:07:22,877 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> > blk_1395839728632046111_357084589 file
> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for
> > deletion
> > DN 2013-06-29 06:07:24,952 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> > blk_1395839728632046111_357084589 at file
> > /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
> >
> >
> > Thanks,
> > Dave
> >
>
>
>
> --
> http://hortonworks.com/download/
>

Re: data loss after cluster wide power loss

Posted by Suresh Srinivas <su...@hortonworks.com>.
Yes, this is a known issue.

The HDFS part of this was addressed in
https://issues.apache.org/jira/browse/HDFS-744 for 2.0.2-alpha and is not
available in the 1.x release. I think HBase does not use this API yet.


On Mon, Jul 1, 2013 at 3:00 PM, Dave Latham <la...@davelink.net> wrote:

> We're running HBase over HDFS 1.0.2 on about 1000 nodes.  On Saturday the
> data center we were in had a total power failure and the cluster went down
> hard.  When we brought it back up, HDFS reported 4 files as CORRUPT.  We
> recovered the data in question from our secondary datacenter, but I'm
> trying to understand what happened and whether this is a bug in HDFS that
> should be fixed.
>
> From what I can tell the file was created and closed by the dfs client
> (hbase).  Then HBase renamed it into a new directory and deleted some other
> files containing the same data.  Then the cluster lost power.  After the
> cluster was restarted, the datanodes reported into the namenode but the
> blocks for this file appeared as "blocks being written" - the namenode
> rejected them and the datanodes deleted the blocks.  At this point there
> were no replicas for the blocks and the files were marked CORRUPT.  The
> underlying file systems are ext3.  Some questions that I would love get
> answers for if anyone with deeper understanding of HDFS can chime in:
>
>  - Is this a known scenario where data loss is expected?  (I found
> HDFS-1539 but that seems different)
>  - When are blocks moved from blocksBeingWritten to current?  Does that
> happen before a file close operation is acknowledged to a hdfs client?
>  - Could it be that the DataNodes actually moved the blocks to current but
> after the restart ext3 rewound state somehow (forgive my ignorance of
> underlying file system behavior)?
>  - Is there any other explanation for how this can happen?
>
> Here is a sequence of selected relevant log lines from the RS (HBase
> Region Server) NN (NameNode) and DN (DataNode - 1 example of 3 in
> question).  It includes everything that mentions the block in question in
> the NameNode and one DataNode log.  Please let me know if this more
> information that would be helpful.
>
> RS 2013-06-29 11:16:06,812 DEBUG org.apache.hadoop.hbase.util.FSUtils:
> Creating
> file=hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> with permission=rwxrwxrwx
> NN 2013-06-29 11:16:06,830 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c.
> blk_1395839728632046111_357084589
> DN 2013-06-29 11:16:06,832 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_1395839728632046111_357084589 src: /10.0.5.237:14327 dest: /
> 10.0.5.237:50010
> NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 10.0.6.1:50010 is added to
> blk_1395839728632046111_357084589 size 25418340
> NN 2013-06-29 11:16:11,370 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 10.0.6.24:50010 is added to
> blk_1395839728632046111_357084589 size 25418340
> NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 10.0.5.237:50010 is added to
> blk_1395839728632046111_357084589 size 25418340
> DN 2013-06-29 11:16:11,385 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> blk_1395839728632046111_357084589 of size 25418340 from /10.0.5.237:14327
> DN 2013-06-29 11:16:11,385 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for
> block blk_1395839728632046111_357084589 terminating
> NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange:
> Removing lease on  file
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> from client DFSClient_hb_rs_hs745,60020,1372470111932
> NN 2013-06-29 11:16:11,385 INFO org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.completeFile: file
> /hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> is closed by DFSClient_hb_rs_hs745,60020,1372470111932
> RS 2013-06-29 11:16:11,393 INFO
> org.apache.hadoop.hbase.regionserver.Store: Renaming compacted file at
> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/.tmp/6e0cc30af6e64e56ba5a539fdf159c4c
> to
> hdfs://hm3:9000/hbase/users-6/b5b0820cde759ae68e333b2f4015bb7e/n/6e0cc30af6e64e56ba5a539fdf159c4c
> RS 2013-06-29 11:16:11,505 INFO
> org.apache.hadoop.hbase.regionserver.Store: Completed major compaction of 7
> file(s) in n of
> users-6,\x12\xBDp\xA3,1359426311784.b5b0820cde759ae68e333b2f4015bb7e. into
> 6e0cc30af6e64e56ba5a539fdf159c4c, size=24.2m; total size for store is 24.2m
>
> -------  CRASH, RESTART ---------
>
> NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: addStoredBlock request received for
> blk_1395839728632046111_357084589 on 10.0.6.1:50010 size 21978112 but was
> rejected: Reported as block being written but is a block of closed file.
> NN 2013-06-29 12:01:19,743 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet
> of 10.0.6.1:50010
> NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: addStoredBlock request received for
> blk_1395839728632046111_357084589 on 10.0.5.237:50010 size 16971264 but
> was rejected: Reported as block being written but is a block of closed file.
> NN 2013-06-29 12:01:20,155 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet
> of 10.0.5.237:50010
> NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: addStoredBlock request received for
> blk_1395839728632046111_357084589 on 10.0.6.24:50010 size 21913088 but
> was rejected: Reported as block being written but is a block of closed file.
> NN 2013-06-29 12:01:20,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_1395839728632046111 is added to invalidSet
> of 10.0.6.24:50010
> (note the clock on the server running DN is wrong after restart.  I
> believe timestamps are off by 6 hours:)
> DN 2013-06-29 06:07:22,877 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
> blk_1395839728632046111_357084589 file
> /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111 for
> deletion
> DN 2013-06-29 06:07:24,952 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
> blk_1395839728632046111_357084589 at file
> /data/hadoop/dfs/data/blocksBeingWritten/blk_1395839728632046111
>
>
> Thanks,
> Dave
>



-- 
http://hortonworks.com/download/
