Posted to user@hbase.apache.org by Varun Sharma <va...@pinterest.com> on 2013/04/19 03:01:26 UTC

Slow region server recoveries

Hi,

We are facing problems with really slow HBase region server recoveries (~20
minutes). Version is hbase 0.94.3 compiled with hadoop.profile=2.0.

Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched and stale
node timeouts configured correctly. Time for dead node detection is still
10 minutes.
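
For reference, the stale-node settings involved are along these lines (a
sketch of the relevant hdfs-site.xml entries, not our exact values; the
property names come from HDFS 3703 / HDFS 3912, the 20 second interval is
only illustrative):

  dfs.namenode.stale.datanode.interval = 20000
      (ms without a heartbeat before the NN treats a DN as stale)
  dfs.namenode.avoid.read.stale.datanode = true
      (stale DNs are ordered last in the block locations returned for reads)
  dfs.namenode.avoid.write.stale.datanode = true
      (stale DNs are avoided when choosing targets for writes)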

We see that our region server, while trying to read an HLog, is stuck there
for a long time. Logs here:

2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
connect to /10.156.194.251:50010 for file
/hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
for block
BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
15000 millis timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.156.192.173:52818 remote=/
10.156.194.251:50010]

I would think that HDFS 3703 would make the server fail fast and go to the
third datanode. Currently, the recovery seems way too slow for production
usage...

Varun

Re: Slow region server recoveries

Posted by Ted Yu <yu...@gmail.com>.
Copying CDH Users mailing list.

On Thu, Apr 18, 2013 at 6:37 PM, Varun Sharma <va...@pinterest.com> wrote:

> I am wondering if DFSClient caches the data node for a long period of time
> ?
>
> Varun
>
>
> On Thu, Apr 18, 2013 at 6:01 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Hi,
> >
> > We are facing problems with really slow HBase region server recoveries ~
> > 20 minuted. Version is hbase 0.94.3 compiled with hadoop.profile=2.0.
> >
> > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched and stale
> > node timeouts configured correctly. Time for dead node detection is still
> > 10 minutes.
> >
> > We see that our region server is trying to read an HLog is stuck there
> for
> > a long time. Logs here:
> >
> > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> > connect to /10.156.194.251:50010 for file
> >
> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> > for block
> >
> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> > 15000 millis timeout while waiting for channel to be ready for read. ch :
> > java.nio.channels.SocketChannel[connected local=/10.156.192.173:52818
> remote=/
> > 10.156.194.251:50010]
> >
> > I would think that HDFS 3703 would make the server fail fast and go to
> the
> > third datanode. Currently, the recovery seems way too slow for production
> > usage...
> >
> > Varun
> >
>

Re: Slow region server recoveries

Posted by Varun Sharma <va...@pinterest.com>.
I am wondering if DFSClient caches the datanode for a long period of time?

Varun


On Thu, Apr 18, 2013 at 6:01 PM, Varun Sharma <va...@pinterest.com> wrote:

> Hi,
>
> We are facing problems with really slow HBase region server recoveries ~
> 20 minuted. Version is hbase 0.94.3 compiled with hadoop.profile=2.0.
>
> Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched and stale
> node timeouts configured correctly. Time for dead node detection is still
> 10 minutes.
>
> We see that our region server is trying to read an HLog is stuck there for
> a long time. Logs here:
>
> 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /10.156.194.251:50010 for file
> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> for block
> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> 15000 millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.156.192.173:52818remote=/
> 10.156.194.251:50010]
>
> I would think that HDFS 3703 would make the server fail fast and go to the
> third datanode. Currently, the recovery seems way too slow for production
> usage...
>
> Varun
>

Re: Slow region server recoveries

Posted by Ted Yu <yu...@gmail.com>.
Varun:
Thanks for trying out HBASE-8354.

Can you move the text in the Environment section of HBASE-8389 to the Description?

If you have a patch for HBASE-8389, can you upload it?

Cheers

On Sun, Apr 21, 2013 at 10:38 AM, Varun Sharma <va...@pinterest.com> wrote:

> Hi Ted, Nicholas,
>
> Thanks for the comments. We found some issues with lease recovery and I
> patched HBASE 8354 to ensure we don't see data loss. Could you please look
> at HDFS 4721 and HBASE 8389 ?
>
> Thanks
> Varun
>
>
> On Sat, Apr 20, 2013 at 10:52 AM, Varun Sharma <va...@pinterest.com>
> wrote:
>
> > The important thing to note is the block for this rogue WAL is
> > UNDER_RECOVERY state. I have repeatedly asked HDFS dev if the stale node
> > thing kicks in correctly for UNDER_RECOVERY blocks but failed.
> >
> >
> > On Sat, Apr 20, 2013 at 10:47 AM, Varun Sharma <varun@pinterest.com
> >wrote:
> >
> >> Hi Nicholas,
> >>
> >> Regarding the following, I think this is not a recovery - the file below
> >> is an HFile and is being accessed on a get request. On this cluster, I
> >> don't have block locality. I see these exceptions for a while and then
> they
> >> are gone, which means the stale node thing kicks in.
> >>
> >> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> >> connect to /10.156.194.94:50010 for file
> >> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/
> >> 02f639bb43944d4ba9abcf58287831c0
> >> for block
> >>
> >> This is the real bummer. The stale datanode is 1st even 90 seconds
> >> afterwards.
> >>
> >> *2013-04-19 00:28:35*,777 WARN
> >> org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of
> >> hdfs://
> >>
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141failed,
> returning error
> >> java.io.IOException: Cannot obtain block length for
> >>
> LocatedBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056;
> >> getBlockSize()=0; corrupt=false; offset=0; locs=*[10.156.194.94:50010,
> >> 10.156.192.106:50010, 10.156.195.38:50010]}*
> >> >---at
> >>
> org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:238)
> >> >---at
> >>
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:182)
> >> >---at
> >> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:124)
> >> >---at
> >> org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:117)
> >> >---at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1080)
> >> >---at
> >>
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
> >> >---at
> >>
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:78)
> >> >---at
> >>
> org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1787)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
> >> >---at
> >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1707)
> >> >---at
> >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
> >> >---at
> >> org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:717)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
> >> >---at java.lang.Thread.run(Thread.java:662)
> >>
> >>
> >>
> >> On Sat, Apr 20, 2013 at 1:16 AM, Nicolas Liochon <nkeywal@gmail.com
> >wrote:
> >>
> >>> Hi,
> >>>
> >>> I looked at it again with a fresh eye. As Varun was saying, the root
> >>> cause
> >>> is the wrong order of the block locations.
> >>>
> >>> The root cause of the root cause is actually simple: HBASE started the
> >>> recovery while the node was not yet stale from an HDFS pov.
> >>>
> >>> Varun mentioned this timing:
> >>> Lost Beat: 27:30
> >>> Became stale: 27:50 - * this is a guess and reverse engineered (stale
> >>> timeout 20 seconds)
> >>> Became dead: 37:51
> >>>
> >>> But the  recovery started at 27:13 (15 seconds before we have this log
> >>> line)
> >>> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to
> >>> connect to /10.156.194.94:50010 for file
> >>>
> >>>
> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0
> >>> for block
> >>>
> >>>
> BP-696828882-10.168.7.226-1364886167971:blk_-5977178030490858298_99853:java.net.SocketTimeoutException:
> >>> 15000 millis timeout while waiting for channel to be ready for connect.
> >>> ch
> >>> : java.nio.channels.SocketChannel[connection-pending remote=/
> >>> 10.156.194.94:50010]
> >>>
> >>> So when we took the blocks from the NN, the datanode was not stale, so
> >>> you
> >>> have the wrong (random) order.
> >>>
> >>> ZooKeeper can expire a session before the timeout. I don't know why it
> >>> does this in this case, but I don't consider it a ZK bug: if ZK knows
> >>> that a node is dead, it's its role to expire the session. There is
> >>> something more fishy: we started the recovery while the datanode was
> >>> still responding to heartbeats. I don't know why. Maybe the OS was able
> >>> to kill -15 the RS before vanishing away.
> >>>
> >>> Anyway, we then have an exception when we try to connect, because the
> RS
> >>> does not have a TCP connection to this datanode. And this is retried
> many
> >>> times.
> >>>
> >>> You would not have this with trunk, because HBASE-6435 reorders the
> >>> blocks
> >>> inside the client, using an information not available to the NN,
> >>> excluding
> >>> the datanode of the region server under recovery.
> >>>
> >>> Some conclusions:
> >>>  - we should likely backport hbase-6435 to 0.94.
> >>>  - I will revive HDFS-3706 and HDFS-3705 (the non hacky way to get
> >>> hbase-6435).
> >>>  - There are some things that could be better in HDFS. I will see.
> >>>  - I'm worried by the SocketTimeoutException. We should get a
> >>> NoRouteToHost at some point, and we don't. That's also why it takes
> >>> ages. I think it's an AWS thing, but it brings two issues: it's slow,
> >>> and, in HBase, you don't know if the operation could have been executed
> >>> or not, so it adds complexity to some scenarios. If someone with enough
> >>> network and AWS knowledge could clarify this point, it would be great.
> >>>
> >>>  Cheers,
> >>>
> >>>  Nicolas
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, Apr 19, 2013 at 10:10 PM, Varun Sharma <va...@pinterest.com>
> >>> wrote:
> >>>
> >>> > This is 0.94.3 hbase...
> >>> >
> >>> >
> >>> > On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma <va...@pinterest.com>
> >>> wrote:
> >>> >
> >>> > > Hi Ted,
> >>> > >
> >>> > > I had a long offline discussion with nicholas on this. Looks like
> the
> >>> > last
> >>> > > block which was still being written to, took an enormous time to
> >>> > recover.
> >>> > > Here's what happened.
> >>> > > a) Master split tasks and region servers process them
> >>> > > b) Region server tries to recover lease for each WAL log - most
> >>> cases are
> >>> > > noop since they are already rolled over/finalized
> >>> > > c) The last file lease recovery takes some time since the crashing
> >>> server
> >>> > > was writing to it and had a lease on it - but basically we have the
> >>> > lease 1
> >>> > > minute after the server was lost
> >>> > > d) Now we start the recovery for this but we end up hitting the
> stale
> >>> > data
> >>> > > node which is puzzling.
> >>> > >
> >>> > > It seems that we did not hit the stale datanode when we were trying
> >>> to
> >>> > > recover the finalized WAL blocks with trivial lease recovery.
> >>> However,
> >>> > for
> >>> > > the final block, we hit the stale datanode. Any clue why this might
> >>> be
> >>> > > happening ?
> >>> > >
> >>> > > Varun
> >>> > >
> >>> > >
> >>> > > On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <yu...@gmail.com>
> >>> wrote:
> >>> > >
> >>> > >> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
> >>> > >>
> >>> > >> Here is the criteria for stale node checking to kick in (from
> >>> > >>
> >>> > >>
> >>> >
> >>>
> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
> >>> > >> ):
> >>> > >>
> >>> > >> +   * Check if the datanode is in stale state. Here if
> >>> > >> +   * the namenode has not received heartbeat msg from a
> >>> > >> +   * datanode for more than staleInterval (default value is
> >>> > >> +   * {@link
> >>> > >>
> DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
> >>> > >> +   * the datanode will be treated as stale node.
> >>> > >>
> >>> > >>
> >>> > >> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <
> varun@pinterest.com
> >>> >
> >>> > >> wrote:
> >>> > >>
> >>> > >> > Is there a place to upload these logs ?
> >>> > >> >
> >>> > >> >
> >>> > >> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <
> >>> varun@pinterest.com>
> >>> > >> > wrote:
> >>> > >> >
> >>> > >> > > Hi Nicholas,
> >>> > >> > >
> >>> > >> > > Attached are the namenode, dn logs (of one of the healthy
> >>> replicas
> >>> > of
> >>> > >> the
> >> > > WAL block) and the rs logs which got stuck doing the log
> split.
> >>> > Action
> >>> > >> > > begins at 2013-04-19 00:27*.
> >>> > >> > >
> >> > > Also, the rogue block is 5723958680970112840_174056. It's very
> >>> > >> interesting
> >>> > >> > > to trace this guy through the HDFS logs (dn and nn).
> >>> > >> > >
> >>> > >> > > Btw, do you know what the UNDER_RECOVERY stage is for, in
> HDFS ?
> >>> > Also
> >>> > >> > does
> >>> > >> > > the stale node stuff kick in for that state ?
> >>> > >> > >
> >>> > >> > > Thanks
> >>> > >> > > Varun
> >>> > >> > >
> >>> > >> > >
> >>> > >> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <
> >>> nkeywal@gmail.com
> >>> > >> > >wrote:
> >>> > >> > >
> >>> > >> > >> Thanks for the detailed scenario and analysis. I'm going to
> >>> have a
> >>> > >> look.
> >>> > >> > >> I can't access the logs (
> >>> ec2-107-20-237-30.compute-1.amazonaws.com
> >>> > >> > >> timeouts), could you please send them directly to me?
> >>> > >> > >>
> >>> > >> > >> Thanks,
> >>> > >> > >>
> >>> > >> > >> Nicolas
> >>> > >> > >>
> >>> > >> > >>
> >>> > >> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <
> >>> > varun@pinterest.com>
> >>> > >> > >> wrote:
> >>> > >> > >>
> >>> > >> > >> > Hi Nicholas,
> >>> > >> > >> >
> >>> > >> > >> > Here is the failure scenario, I have dug up the logs.
> >>> > >> > >> >
> >>> > >> > >> > A machine fails and stops accepting/transmitting traffic.
> The
> >>> > >> HMaster
> >>> > >> > >> > starts the distributed split for 13 tasks. There are 12
> >>> region
> >>> > >> > servers.
> >>> > >> > >> 12
> >>> > >> > >> > tasks succeed but the 13th one takes a looong time.
> >>> > >> > >> >
> >>> > >> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout
> >>> is 20
> >>> > >> > >> seconds.
> >>> > >> > >> > Both patches are there.
> >>> > >> > >> >
> >>> > >> > >> > a) Machine fails around 27:30
> >>> > >> > >> > b) Master starts the split around 27:40 and submits the
> >>> tasks.
> >>> > The
> >>> > >> one
> >>> > >> > >> task
> >>> > >> > >> > which fails seems to be the one which contains the WAL
> being
> >>> > >> currently
> >>> > >> > >> > written to:
> >>> > >> > >> >
> >>> > >> > >> > 2013-04-19 00:27:44,325 INFO
> >>> > >> > >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter:
> >>> Splitting
> >>> > >> hlog:
> >>> > >> > >> > hdfs://
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> >>> > >> > >> > ,
> >>> > >> > >> > length=0
> >>> > >> > >> >
> >>> > >> > >> > Basically this region server picks up the task but finds
> the
> >>> > >> length of
> >>> > >> > >> this
> >>> > >> > >> > file to be 0 and drops. This happens again
> >>> > >> > >> >
> >>> > >> > >> > c) Finally another region server picks up the task but it
> >>> ends up
> >>> > >> > going
> >>> > >> > >> to
> >>> > >> > >> > the bad datanode which should not happen because of the
> stale
> >>> > node
> >>> > >> > >> timeout)
> >>> > >> > >> > Unfortunately it hits the 45 retries and a connect timeout
> >>> of 20
> >>> > >> > seconds
> >>> > >> > >> > every time. This delays recovery significantly. Now I guess
> >>> > >> reducing #
> >>> > >> > >> of
> >>> > >> > >> > retries to 1 is one possibility.
> >>> > >> > >> > But then the namenode logs are very interesting.
> >>> > >> > >> >
> >>> > >> > >> > d) Namenode seems to be in cyclic lease recovery loop until
> >>> the
> >>> > >> node
> >>> > >> > is
> >>> > >> > >> > marked dead. There is this one last block which exhibits
> >>> this.
> >>> > >> > >> >
> >>> > >> > >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK*
> blk_-*
> >>> > >> > >> > 5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
> >>> > >> > >> > primaryNodeIndex=1,
> >>> > >> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010
> |RBW],
> >>> > >> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> >>> > >> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
> recovery
> >>> > >> started,
> >>> > >> > >> > primary=10.156.192.106:50010
> >>> > >> > >> > 2013-04-19 00:28:09,744 WARN
> >>> org.apache.hadoop.hdfs.StateChange:
> >>> > >> DIR*
> >>> > >> > >> > NameSystem.internalReleaseLease: File
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> >>> > >> > >> > has not been closed. Lease recovery is in progress.
> >>> RecoveryId =
> >>> > >> > 174413
> >>> > >> > >> for
> >>> > >> > >> > block
> >>> > blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
> >>> > >> > >> > primaryNodeIndex=1,
> >>> > >> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010
> |RBW],
> >>> > >> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> >>> > >> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
> >>> > >> > >> >
> >>> > >> > >> > I see this over and over again in the logs until the
> >>> datanode is
> >>> > >> > marked
> >>> > >> > >> > dead. It seems to be cycling through the replicas for this
> >>> WAL
> >>> > >> block
> >>> > >> > and
> >>> > >> > >> > trying to add it to the recovery list. I looked at the code
> >>> and
> >>> > it
> >>> > >> > says:
> >>> > >> > >> >
> >>> > >> > >> >       // Cannot close file right now, since the last block
> >>> > requires
> >>> > >> > >> > recovery.
> >>> > >> > >> >       // This may potentially cause infinite loop in lease
> >>> > recovery
> >>> > >> > >> >       // if there are no valid replicas on data-nodes.
> >>> > >> > >> >       NameNode.stateChangeLog.warn(
> >>> > >> > >> >                 "DIR* NameSystem.internalReleaseLease: " +
> >>> > >> > >> >                 "File " + src + " has not been closed." +
> >>> > >> > >> >                " Lease recovery is in progress. " +
> >>> > >> > >> >                 "RecoveryId = " + blockRecoveryId + " for
> >>> block
> >>> > " +
> >>> > >> > >> > lastBlock);
> >>> > >> > >> >       break;
> >>> > >> > >> >
> >>> > >> > >> > Eventually for this block, we get
> >>> > >> > >> >
> >>> > >> > >> > 2013-04-19 00:41:20,736 INFO
> >>> > >> > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
> >>> > >> > >> > *5723958680970112840_174056*, newgenerationstamp=174413,
> >>> > >> > >> > newlength=119148648, newtargets=[10.156.192.106:50010,
> >>> > >> > >> 10.156.195.38:50010
> >>> > >> > >> > ],
> >>> > >> > >> > closeFile=true, deleteBlock=false)
> >>> > >> > >> > 2013-04-19 00:41:20,736 ERROR
> >>> > >> > >> > org.apache.hadoop.security.UserGroupInformation:
> >>> > >> > >> PriviledgedActionException
> >>> > >> > >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> >>> > >> > >> > not found
> >>> > >> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server:
> >>> IPC
> >>> > >> Server
> >>> > >> > >> > handler 35 on 8020, call
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
> >>> > >> > >> > from 10.156.192.106:53271: error: java.io.IOException:
> Block
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> >>> > >> > >> > not found
> >>> > >> > >> >
> >>> > >> > >> > On the datanode side, i see a call for recover blocks - I
> see
> >>> > that
> >>> > >> a
> >>> > >> > >> write
> >>> > >> > >> > pipeline is there, which gets terminated with some socket
> >>> > >> timeouts...
> >>> > >> > >> >
> >>> > >> > >> > 00:28:11,471 INFO
> >>> > org.apache.hadoop.hdfs.server.datanode.DataNode:
> >>> > >> > >> NameNode
> >>> > >> > >> > at
> >>> > ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020calls
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
> >>> > >> > >> > targets=[10.156.194.94:50010, 10.156.192.106:50010,
> >>> > >> > 10.156.195.38:50010
> >>> > >> > >> ],
> >>> > >> > >> > newGenerationStamp=174413)
> >>> > >> > >> >
> >>> > >> > >> > Not sure but this looks like a case where data could be
> lost
> >>>   ?
> >>> > >> > >> >
> >>> > >> > >> > Varun
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <
> >>> > >> nkeywal@gmail.com>
> >>> > >> > >> > wrote:
> >>> > >> > >> >
> >>> > >> > >> > > Hey Varun,
> >>> > >> > >> > >
> >>> > >> > >> > > Could you please share the logs and the configuration
> >>> (hdfs /
> >>> > >> hbase
> >>> > >> > >> > > settings + cluster description). What's the failure
> >>> scenario?
> >>> > >> > >> > > From an HDFS pov, HDFS 3703 does not change the dead node
> >>> > status.
> >>> > >> > But
> >>> > >> > >> > these
> >>> > >> > >> > > node will be given the lowest priority when reading.
> >>> > >> > >> > >
> >>> > >> > >> > >
> >>> > >> > >> > > Cheers,
> >>> > >> > >> > >
> >>> > >> > >> > > Nicolas
> >>> > >> > >> > >
> >>> > >> > >> > >
> >>> > >> > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <
> >>> > >> varun@pinterest.com>
> >>> > >> > >> > wrote:
> >>> > >> > >> > >
> >>> > >> > >> > > > Hi,
> >>> > >> > >> > > >
> >>> > >> > >> > > > We are facing problems with really slow HBase region
> >>> server
> >>> > >> > >> recoveries
> >>> > >> > >> > ~
> >>> > >> > >> > > 20
> >>> > >> > >> > > > minuted. Version is hbase 0.94.3 compiled with
> >>> > >> hadoop.profile=2.0.
> >>> > >> > >> > > >
> >>> > >> > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912
> >>> > patched
> >>> > >> and
> >>> > >> > >> > stale
> >>> > >> > >> > > > node timeouts configured correctly. Time for dead node
> >>> > >> detection
> >>> > >> > is
> >>> > >> > >> > still
> >>> > >> > >> > > > 10 minutes.
> >>> > >> > >> > > >
> >>> > >> > >> > > > We see that our region server is trying to read an HLog
> >>> is
> >>> > >> stuck
> >>> > >> > >> there
> >>> > >> > >> > > for
> >>> > >> > >> > > > a long time. Logs here:
> >>> > >> > >> > > >
> >>> > >> > >> > > > 2013-04-12 21:14:30,248 WARN
> >>> > org.apache.hadoop.hdfs.DFSClient:
> >>> > >> > >> Failed
> >>> > >> > >> > to
> >>> > >> > >> > > > connect to /10.156.194.251:50010 for file
> >>> > >> > >> > > >
> >>> > >> > >> > > >
> >>> > >> > >> > >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> >>> > >> > >> > > > for block
> >>> > >> > >> > > >
> >>> > >> > >> > > >
> >>> > >> > >> > >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> >>> > >> > >> > > > 15000 millis timeout while waiting for channel to be
> >>> ready
> >>> > for
> >>> > >> > read.
> >>> > >> > >> > ch :
> >>> > >> > >> > > > java.nio.channels.SocketChannel[connected local=/
> >>> > >> > >> 10.156.192.173:52818
> >>> > >> > >> > > > remote=/
> >>> > >> > >> > > > 10.156.194.251:50010]
> >>> > >> > >> > > >
> >>> > >> > >> > > > I would think that HDFS 3703 would make the server fail
> >>> fast
> >>> > >> and
> >>> > >> > go
> >>> > >> > >> to
> >>> > >> > >> > > the
> >>> > >> > >> > > > third datanode. Currently, the recovery seems way too
> >>> slow
> >>> > for
> >>> > >> > >> > production
> >>> > >> > >> > > > usage...
> >>> > >> > >> > > >
> >>> > >> > >> > > > Varun
> >>> > >> > >> > > >
> >>> > >> > >> > >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> > >
> >>> > >> > >
> >>> > >> >
> >>> > >>
> >>> > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>

Re: Slow region server recoveries

Posted by Nicolas Liochon <nk...@gmail.com>.
The recovery starts at 27:40 (master log: 00:27:40,011), so before the
datanode is known as stale.
But the first attempt is cancelled, and a new one starts at 28:10, before
being cancelled again at 28:35 (that's HBASE-6738).
These two attempts should see the datanode as stale. It seems it's not the
case. I don't know why the locations are not reordered. A possibility is the
stale ratio is not configured, or the other nodes were stale as well. Or HDFS
doesn't see the node as stale.

In any case, the story would be different with HBASE-8276 (i.e. HBASE-6738
backport to 94.7) + HBASE-6435.
If you have a reproduction scenario, it's worth trying.
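
To make the HBASE-6435 idea concrete, here is a minimal sketch of the
client-side reordering (not the actual patch; the class and method names are
made up for illustration). The client takes the replica hosts the NN returned
and simply tries the host of the region server under recovery last:

import java.util.ArrayList;
import java.util.List;

public final class ReorderSketch {
  // Return the replica hosts with every replica that lives on the dead
  // region server's host moved to the end, keeping the NN's order otherwise.
  static List<String> demoteDeadRsHost(List<String> replicaHosts, String deadRsHost) {
    List<String> preferred = new ArrayList<String>();
    List<String> demoted = new ArrayList<String>();
    for (String host : replicaHosts) {
      if (host.equals(deadRsHost)) {
        demoted.add(host);     // the replica co-located with the dead RS: try it last
      } else {
        preferred.add(host);   // healthy replicas keep their original order
      }
    }
    preferred.addAll(demoted);
    return preferred;
  }
}

As far as I understand, the real patch applies this to the located blocks of
files under the .logs directory, using the server name encoded in the WAL path.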

For HBASE-8389, yes, we need this to work...

Nicolas
<https://issues.apache.org/jira/browse/HBASE-8389>




On Sun, Apr 21, 2013 at 7:38 PM, Varun Sharma <va...@pinterest.com> wrote:

> Hi Ted, Nicholas,
>
> Thanks for the comments. We found some issues with lease recovery and I
> patched HBASE 8354 to ensure we don't see data loss. Could you please look
> at HDFS 4721 and HBASE 8389 ?
>
> Thanks
> Varun
>
>
> On Sat, Apr 20, 2013 at 10:52 AM, Varun Sharma <va...@pinterest.com>
> wrote:
>
> > The important thing to note is the block for this rogue WAL is
> > UNDER_RECOVERY state. I have repeatedly asked HDFS dev if the stale node
> > thing kicks in correctly for UNDER_RECOVERY blocks but failed.
> >
> >
> > On Sat, Apr 20, 2013 at 10:47 AM, Varun Sharma <varun@pinterest.com
> >wrote:
> >
> >> Hi Nicholas,
> >>
> >> Regarding the following, I think this is not a recovery - the file below
> >> is an HFIle and is being accessed on a get request. On this cluster, I
> >> don't have block locality. I see these exceptions for a while and then
> they
> >> are gone, which means the stale node thing kicks in.
> >>
> >> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> >> connect to /10.156.194.94:50010 for file
> >> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/
> >> 02f639bb43944d4ba9abcf58287831c0
> >> for block
> >>
> >> This is the real bummer. The stale datanode is 1st even 90 seconds
> >> afterwards.
> >>
> >> *2013-04-19 00:28:35*,777 WARN
> >> org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of
> >> hdfs://
> >>
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141failed,
> returning error
> >> java.io.IOException: Cannot obtain block length for
> >>
> LocatedBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056;
> >> getBlockSize()=0; corrupt=false; offset=0; locs=*[10.156.194.94:50010,
> >> 10.156.192.106:50010, 10.156.195.38:50010]}*
> >> >---at
> >>
> org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:238)
> >> >---at
> >>
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:182)
> >> >---at
> >> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:124)
> >> >---at
> >> org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:117)
> >> >---at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1080)
> >> >---at
> >>
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
> >> >---at
> >>
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:78)
> >> >---at
> >>
> org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1787)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
> >> >---at
> >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1707)
> >> >---at
> >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
> >> >---at
> >> org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:717)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
> >> >---at
> >>
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
> >> >---at java.lang.Thread.run(Thread.java:662)
> >>
> >>
> >>
> >> On Sat, Apr 20, 2013 at 1:16 AM, Nicolas Liochon <nkeywal@gmail.com
> >wrote:
> >>
> >>> Hi,
> >>>
> >>> I looked at it again with a fresh eye. As Varun was saying, the root
> >>> cause
> >>> is the wrong order of the block locations.
> >>>
> >>> The root cause of the root cause is actually simple: HBASE started the
> >>> recovery while the node was not yet stale from an HDFS pov.
> >>>
> >>> Varun mentioned this timing:
> >>> Lost Beat: 27:30
> >>> Became stale: 27:50 - * this is a guess and reverse engineered (stale
> >>> timeout 20 seconds)
> >>> Became dead: 37:51
> >>>
> >>> But the  recovery started at 27:13 (15 seconds before we have this log
> >>> line)
> >>> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to
> >>> connect to /10.156.194.94:50010 for file
> >>>
> >>>
> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0
> >>> for block
> >>>
> >>>
> BP-696828882-10.168.7.226-1364886167971:blk_-5977178030490858298_99853:java.net.SocketTimeoutException:
> >>> 15000 millis timeout while waiting for channel to be ready for connect.
> >>> ch
> >>> : java.nio.channels.SocketChannel[connection-pending remote=/
> >>> 10.156.194.94:50010]
> >>>
> >>> So when we took the blocks from the NN, the datanode was not stale, so
> >>> you
> >>> have the wrong (random) order.
> >>>
> >>> ZooKeeper can expire a session before the timeout. I don't what why it
> >>> does
> >>> this in this case, but I don't consider it as a ZK bug: if ZK knows
> that
> >>> a
> >>> node is dead, it's its role to expire the session. There is something
> >>> more
> >>> fishy: we started the recovery while the datanode was still responding
> to
> >>> heartbeat. I don't know why. Maybe the OS has been able to kill 15 the
> RS
> >>> before vanishing away.
> >>>
> >>> Anyway, we then have an exception when we try to connect, because the
> RS
> >>> does not have a TCP connection to this datanode. And this is retried
> many
> >>> times.
> >>>
> >>> You would not have this with trunk, because HBASE-6435 reorders the
> >>> blocks
> >>> inside the client, using an information not available to the NN,
> >>> excluding
> >>> the datanode of the region server under recovery.
> >>>
> >>> Some conclusions:
> >>>  - we should likely backport hbase-6435 to 0.94.
> >>>  - I will revive HDFS-3706 and HDFS-3705 (the non hacky way to get
> >>> hbase-6435).
> >>>  - There are some stuff that could be better in HDFS. I will see.
> >>>  - I'm worried by the SocketTimeoutException. We should get
> NoRouteToHost
> >>> at a moment, and we don't. That's also why it takes ages. I think it's
> an
> >>> AWS thing, but it brings to issue: it's slow, and, in HBase, you don't
> >>> know
> >>> if the operation could have been executed or not, so it adds complexity
> >>> to
> >>> some scenarios. If someone with enough network and AWS knowledge could
> >>> clarify this point it would be great.
> >>>
> >>>  Cheers,
> >>>
> >>>  Nicolas
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, Apr 19, 2013 at 10:10 PM, Varun Sharma <va...@pinterest.com>
> >>> wrote:
> >>>
> >>> > This is 0.94.3 hbase...
> >>> >
> >>> >
> >>> > On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma <va...@pinterest.com>
> >>> wrote:
> >>> >
> >>> > > Hi Ted,
> >>> > >
> >>> > > I had a long offline discussion with nicholas on this. Looks like
> the
> >>> > last
> >>> > > block which was still being written too, took an enormous time to
> >>> > recover.
> >>> > > Here's what happened.
> >>> > > a) Master split tasks and region servers process them
> >>> > > b) Region server tries to recover lease for each WAL log - most
> >>> cases are
> >>> > > noop since they are already rolled over/finalized
> >>> > > c) The last file lease recovery takes some time since the crashing
> >>> server
> >>> > > was writing to it and had a lease on it - but basically we have the
> >>> > lease 1
> >>> > > minute after the server was lost
> >>> > > d) Now we start the recovery for this but we end up hitting the
> stale
> >>> > data
> >>> > > node which is puzzling.
> >>> > >
> >>> > > It seems that we did not hit the stale datanode when we were trying
> >>> to
> >>> > > recover the finalized WAL blocks with trivial lease recovery.
> >>> However,
> >>> > for
> >>> > > the final block, we hit the stale datanode. Any clue why this might
> >>> be
> >>> > > happening ?
> >>> > >
> >>> > > Varun
> >>> > >
> >>> > >
> >>> > > On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <yu...@gmail.com>
> >>> wrote:
> >>> > >
> >>> > >> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
> >>> > >>
> >>> > >> Here is the criteria for stale node checking to kick in (from
> >>> > >>
> >>> > >>
> >>> >
> >>>
> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
> >>> > >> ):
> >>> > >>
> >>> > >> +   * Check if the datanode is in stale state. Here if
> >>> > >> +   * the namenode has not received heartbeat msg from a
> >>> > >> +   * datanode for more than staleInterval (default value is
> >>> > >> +   * {@link
> >>> > >>
> DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
> >>> > >> +   * the datanode will be treated as stale node.
> >>> > >>
> >>> > >>
> >>> > >> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <
> varun@pinterest.com
> >>> >
> >>> > >> wrote:
> >>> > >>
> >>> > >> > Is there a place to upload these logs ?
> >>> > >> >
> >>> > >> >
> >>> > >> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <
> >>> varun@pinterest.com>
> >>> > >> > wrote:
> >>> > >> >
> >>> > >> > > Hi Nicholas,
> >>> > >> > >
> >>> > >> > > Attached are the namenode, dn logs (of one of the healthy
> >>> replicas
> >>> > of
> >>> > >> the
> >>> > >> > > WAL block) and the rs logs which got stuch doing the log
> split.
> >>> > Action
> >>> > >> > > begins at 2013-04-19 00:27*.
> >>> > >> > >
> >>> > >> > > Also, the rogue block is 5723958680970112840_174056. Its very
> >>> > >> interesting
> >>> > >> > > to trace this guy through the HDFS logs (dn and nn).
> >>> > >> > >
> >>> > >> > > Btw, do you know what the UNDER_RECOVERY stage is for, in
> HDFS ?
> >>> > Also
> >>> > >> > does
> >>> > >> > > the stale node stuff kick in for that state ?
> >>> > >> > >
> >>> > >> > > Thanks
> >>> > >> > > Varun
> >>> > >> > >
> >>> > >> > >
> >>> > >> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <
> >>> nkeywal@gmail.com
> >>> > >> > >wrote:
> >>> > >> > >
> >>> > >> > >> Thanks for the detailed scenario and analysis. I'm going to
> >>> have a
> >>> > >> look.
> >>> > >> > >> I can't access the logs (
> >>> ec2-107-20-237-30.compute-1.amazonaws.com
> >>> > >> > >> timeouts), could you please send them directly to me?
> >>> > >> > >>
> >>> > >> > >> Thanks,
> >>> > >> > >>
> >>> > >> > >> Nicolas
> >>> > >> > >>
> >>> > >> > >>
> >>> > >> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <
> >>> > varun@pinterest.com>
> >>> > >> > >> wrote:
> >>> > >> > >>
> >>> > >> > >> > Hi Nicholas,
> >>> > >> > >> >
> >>> > >> > >> > Here is the failure scenario, I have dug up the logs.
> >>> > >> > >> >
> >>> > >> > >> > A machine fails and stops accepting/transmitting traffic.
> The
> >>> > >> HMaster
> >>> > >> > >> > starts the distributed split for 13 tasks. There are 12
> >>> region
> >>> > >> > servers.
> >>> > >> > >> 12
> >>> > >> > >> > tasks succeed but the 13th one takes a looong time.
> >>> > >> > >> >
> >>> > >> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout
> >>> is 20
> >>> > >> > >> seconds.
> >>> > >> > >> > Both patches are there.
> >>> > >> > >> >
> >>> > >> > >> > a) Machine fails around 27:30
> >>> > >> > >> > b) Master starts the split around 27:40 and submits the
> >>> tasks.
> >>> > The
> >>> > >> one
> >>> > >> > >> task
> >>> > >> > >> > which fails seems to be the one which contains the WAL
> being
> >>> > >> currently
> >>> > >> > >> > written to:
> >>> > >> > >> >
> >>> > >> > >> > 2013-04-19 00:27:44,325 INFO
> >>> > >> > >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter:
> >>> Splitting
> >>> > >> hlog:
> >>> > >> > >> > hdfs://
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> >>> > >> > >> > ,
> >>> > >> > >> > length=0
> >>> > >> > >> >
> >>> > >> > >> > Basically this region server picks up the task but finds
> the
> >>> > >> length of
> >>> > >> > >> this
> >>> > >> > >> > file to be 0 and drops. This happens again
> >>> > >> > >> >
> >>> > >> > >> > c) Finally another region server picks up the task but it
> >>> ends up
> >>> > >> > going
> >>> > >> > >> to
> >>> > >> > >> > the bad datanode which should not happen because of the
> stale
> >>> > node
> >>> > >> > >> timeout)
> >>> > >> > >> > Unfortunately it hits the 45 retries and a connect timeout
> >>> of 20
> >>> > >> > seconds
> >>> > >> > >> > every time. This delays recovery significantly. Now I guess
> >>> > >> reducing #
> >>> > >> > >> of
> >>> > >> > >> > retries to 1 is one possibility.
> >>> > >> > >> > But then the namenode logs are very interesting.
> >>> > >> > >> >
> >>> > >> > >> > d) Namenode seems to be in cyclic lease recovery loop until
> >>> the
> >>> > >> node
> >>> > >> > is
> >>> > >> > >> > marked dead. There is this one last block which exhibits
> >>> this.
> >>> > >> > >> >
> >>> > >> > >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK*
> blk_-*
> >>> > >> > >> > 5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
> >>> > >> > >> > primaryNodeIndex=1,
> >>> > >> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010
> |RBW],
> >>> > >> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> >>> > >> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
> recovery
> >>> > >> started,
> >>> > >> > >> > primary=10.156.192.106:50010
> >>> > >> > >> > 2013-04-19 00:28:09,744 WARN
> >>> org.apache.hadoop.hdfs.StateChange:
> >>> > >> DIR*
> >>> > >> > >> > NameSystem.internalReleaseLease: File
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> >>> > >> > >> > has not been closed. Lease recovery is in progress.
> >>> RecoveryId =
> >>> > >> > 174413
> >>> > >> > >> for
> >>> > >> > >> > block
> >>> > blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
> >>> > >> > >> > primaryNodeIndex=1,
> >>> > >> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010
> |RBW],
> >>> > >> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> >>> > >> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
> >>> > >> > >> >
> >>> > >> > >> > I see this over and over again in the logs until the
> >>> datanode is
> >>> > >> > marked
> >>> > >> > >> > dead. It seems to be cycling through the replicas for this
> >>> WAL
> >>> > >> block
> >>> > >> > and
> >>> > >> > >> > trying to add it to the recovery list. I looked at the code
> >>> and
> >>> > it
> >>> > >> > says:
> >>> > >> > >> >
> >>> > >> > >> >       // Cannot close file right now, since the last block
> >>> > requires
> >>> > >> > >> > recovery.
> >>> > >> > >> >       // This may potentially cause infinite loop in lease
> >>> > recovery
> >>> > >> > >> >       // if there are no valid replicas on data-nodes.
> >>> > >> > >> >       NameNode.stateChangeLog.warn(
> >>> > >> > >> >                 "DIR* NameSystem.internalReleaseLease: " +
> >>> > >> > >> >                 "File " + src + " has not been closed." +
> >>> > >> > >> >                " Lease recovery is in progress. " +
> >>> > >> > >> >                 "RecoveryId = " + blockRecoveryId + " for
> >>> block
> >>> > " +
> >>> > >> > >> > lastBlock);
> >>> > >> > >> >       break;
> >>> > >> > >> >
> >>> > >> > >> > Eventually for this block, we get
> >>> > >> > >> >
> >>> > >> > >> > 2013-04-19 00:41:20,736 INFO
> >>> > >> > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
> >>> > >> > >> > *5723958680970112840_174056*, newgenerationstamp=174413,
> >>> > >> > >> > newlength=119148648, newtargets=[10.156.192.106:50010,
> >>> > >> > >> 10.156.195.38:50010
> >>> > >> > >> > ],
> >>> > >> > >> > closeFile=true, deleteBlock=false)
> >>> > >> > >> > 2013-04-19 00:41:20,736 ERROR
> >>> > >> > >> > org.apache.hadoop.security.UserGroupInformation:
> >>> > >> > >> PriviledgedActionException
> >>> > >> > >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> >>> > >> > >> > not found
> >>> > >> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server:
> >>> IPC
> >>> > >> Server
> >>> > >> > >> > handler 35 on 8020, call
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
> >>> > >> > >> > from 10.156.192.106:53271: error: java.io.IOException:
> Block
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> >>> > >> > >> > not found
> >>> > >> > >> >
> >>> > >> > >> > On the datanode side, i see a call for recover blocks - I
> see
> >>> > that
> >>> > >> a
> >>> > >> > >> write
> >>> > >> > >> > pipeline is there, which gets terminated with some socket
> >>> > >> timeouts...
> >>> > >> > >> >
> >>> > >> > >> > 00:28:11,471 INFO
> >>> > org.apache.hadoop.hdfs.server.datanode.DataNode:
> >>> > >> > >> NameNode
> >>> > >> > >> > at
> >>> > ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020calls
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
> >>> > >> > >> > targets=[10.156.194.94:50010, 10.156.192.106:50010,
> >>> > >> > 10.156.195.38:50010
> >>> > >> > >> ],
> >>> > >> > >> > newGenerationStamp=174413)
> >>> > >> > >> >
> >>> > >> > >> > Not sure but this looks like a case where data could be
> lost
> >>>   ?
> >>> > >> > >> >
> >>> > >> > >> > Varun
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <
> >>> > >> nkeywal@gmail.com>
> >>> > >> > >> > wrote:
> >>> > >> > >> >
> >>> > >> > >> > > Hey Varun,
> >>> > >> > >> > >
> >>> > >> > >> > > Could you please share the logs and the configuration
> >>> (hdfs /
> >>> > >> hbase
> >>> > >> > >> > > settings + cluster description). What's the failure
> >>> scenario?
> >>> > >> > >> > > From an HDFS pov, HDFS 3703 does not change the dead node
> >>> > status.
> >>> > >> > But
> >>> > >> > >> > these
> >>> > >> > >> > > node will be given the lowest priority when reading.
> >>> > >> > >> > >
> >>> > >> > >> > >
> >>> > >> > >> > > Cheers,
> >>> > >> > >> > >
> >>> > >> > >> > > Nicolas
> >>> > >> > >> > >
> >>> > >> > >> > >
> >>> > >> > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <
> >>> > >> varun@pinterest.com>
> >>> > >> > >> > wrote:
> >>> > >> > >> > >
> >>> > >> > >> > > > Hi,
> >>> > >> > >> > > >
> >>> > >> > >> > > > We are facing problems with really slow HBase region
> >>> server
> >>> > >> > >> recoveries
> >>> > >> > >> > ~
> >>> > >> > >> > > 20
> >>> > >> > >> > > > minuted. Version is hbase 0.94.3 compiled with
> >>> > >> hadoop.profile=2.0.
> >>> > >> > >> > > >
> >>> > >> > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912
> >>> > patched
> >>> > >> and
> >>> > >> > >> > stale
> >>> > >> > >> > > > node timeouts configured correctly. Time for dead node
> >>> > >> detection
> >>> > >> > is
> >>> > >> > >> > still
> >>> > >> > >> > > > 10 minutes.
> >>> > >> > >> > > >
> >>> > >> > >> > > > We see that our region server is trying to read an HLog
> >>> is
> >>> > >> stuck
> >>> > >> > >> there
> >>> > >> > >> > > for
> >>> > >> > >> > > > a long time. Logs here:
> >>> > >> > >> > > >
> >>> > >> > >> > > > 2013-04-12 21:14:30,248 WARN
> >>> > org.apache.hadoop.hdfs.DFSClient:
> >>> > >> > >> Failed
> >>> > >> > >> > to
> >>> > >> > >> > > > connect to /10.156.194.251:50010 for file
> >>> > >> > >> > > >
> >>> > >> > >> > > >
> >>> > >> > >> > >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> >>> > >> > >> > > > for block
> >>> > >> > >> > > >
> >>> > >> > >> > > >
> >>> > >> > >> > >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> >>> > >> > >> > > > 15000 millis timeout while waiting for channel to be
> >>> ready
> >>> > for
> >>> > >> > read.
> >>> > >> > >> > ch :
> >>> > >> > >> > > > java.nio.channels.SocketChannel[connected local=/
> >>> > >> > >> 10.156.192.173:52818
> >>> > >> > >> > > > remote=/
> >>> > >> > >> > > > 10.156.194.251:50010]
> >>> > >> > >> > > >
> >>> > >> > >> > > > I would think that HDFS 3703 would make the server fail
> >>> fast
> >>> > >> and
> >>> > >> > go
> >>> > >> > >> to
> >>> > >> > >> > > the
> >>> > >> > >> > > > third datanode. Currently, the recovery seems way too
> >>> slow
> >>> > for
> >>> > >> > >> > production
> >>> > >> > >> > > > usage...
> >>> > >> > >> > > >
> >>> > >> > >> > > > Varun
> >>> > >> > >> > > >
> >>> > >> > >> > >
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> > >
> >>> > >> > >
> >>> > >> >
> >>> > >>
> >>> > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>

Re: Slow region server recoveries

Posted by Varun Sharma <va...@pinterest.com>.
Hi Ted, Nicholas,

Thanks for the comments. We found some issues with lease recovery and I
patched HBASE-8354 to ensure we don't see data loss. Could you please look
at HDFS-4721 and HBASE-8389?
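
For anyone following along, the lease recovery step before WAL splitting is
essentially a retry loop around DistributedFileSystem#recoverLease(). A
minimal sketch (this is not the HBASE-8354 patch itself; the retry interval
is just illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public final class LeaseRecoverySketch {
  // Keep asking the NN to recover the lease on a WAL until recoverLease()
  // reports the file as closed. Until then the last block can sit in
  // UNDER_RECOVERY, which is what we are seeing in this thread.
  static void recoverWalLease(DistributedFileSystem dfs, Path wal)
      throws java.io.IOException, InterruptedException {
    while (!dfs.recoverLease(wal)) {
      Thread.sleep(1000);   // back off before retrying
    }
  }
}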

Thanks
Varun


On Sat, Apr 20, 2013 at 10:52 AM, Varun Sharma <va...@pinterest.com> wrote:

> The important thing to note is the block for this rogue WAL is
> UNDER_RECOVERY state. I have repeatedly asked HDFS dev if the stale node
> thing kicks in correctly for UNDER_RECOVERY blocks but failed.
>
>
> On Sat, Apr 20, 2013 at 10:47 AM, Varun Sharma <va...@pinterest.com>wrote:
>
>> Hi Nicholas,
>>
>> Regarding the following, I think this is not a recovery - the file below
>> is an HFIle and is being accessed on a get request. On this cluster, I
>> don't have block locality. I see these exceptions for a while and then they
>> are gone, which means the stale node thing kicks in.
>>
>> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
>> connect to /10.156.194.94:50010 for file
>> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/
>> 02f639bb43944d4ba9abcf58287831c0
>> for block
>>
>> This is the real bummer. The stale datanode is 1st even 90 seconds
>> afterwards.
>>
>> *2013-04-19 00:28:35*,777 WARN
>> org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of
>> hdfs://
>> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141failed, returning error
>> java.io.IOException: Cannot obtain block length for
>> LocatedBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056;
>> getBlockSize()=0; corrupt=false; offset=0; locs=*[10.156.194.94:50010,
>> 10.156.192.106:50010, 10.156.195.38:50010]}*
>> >---at
>> org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:238)
>> >---at
>> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:182)
>> >---at
>> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:124)
>> >---at
>> org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:117)
>> >---at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1080)
>> >---at
>> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
>> >---at
>> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:78)
>> >---at
>> org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1787)
>> >---at
>> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
>> >---at
>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1707)
>> >---at
>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>> >---at
>> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>> >---at
>> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
>> >---at
>> org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:717)
>> >---at
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
>> >---at
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
>> >---at
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
>> >---at
>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
>> >---at
>> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
>> >---at
>> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
>> >---at
>> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
>> >---at
>> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
>> >---at java.lang.Thread.run(Thread.java:662)
>>
>>
>>
>> On Sat, Apr 20, 2013 at 1:16 AM, Nicolas Liochon <nk...@gmail.com>wrote:
>>
>>> Hi,
>>>
>>> I looked at it again with a fresh eye. As Varun was saying, the root
>>> cause
>>> is the wrong order of the block locations.
>>>
>>> The root cause of the root cause is actually simple: HBASE started the
>>> recovery while the node was not yet stale from an HDFS pov.
>>>
>>> Varun mentioned this timing:
>>> Lost Beat: 27:30
>>> Became stale: 27:50 - * this is a guess and reverse engineered (stale
>>> timeout 20 seconds)
>>> Became dead: 37:51
>>>
>>> But the  recovery started at 27:13 (15 seconds before we have this log
>>> line)
>>> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
>>> connect to /10.156.194.94:50010 for file
>>>
>>> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0
>>> for block
>>>
>>> BP-696828882-10.168.7.226-1364886167971:blk_-5977178030490858298_99853:java.net.SocketTimeoutException:
>>> 15000 millis timeout while waiting for channel to be ready for connect.
>>> ch
>>> : java.nio.channels.SocketChannel[connection-pending remote=/
>>> 10.156.194.94:50010]
>>>
>>> So when we took the blocks from the NN, the datanode was not stale, so
>>> you
>>> have the wrong (random) order.
>>>
>>> ZooKeeper can expire a session before the timeout. I don't what why it
>>> does
>>> this in this case, but I don't consider it as a ZK bug: if ZK knows that
>>> a
>>> node is dead, it's its role to expire the session. There is something
>>> more
>>> fishy: we started the recovery while the datanode was still responding to
>>> heartbeat. I don't know why. Maybe the OS has been able to kill 15 the RS
>>> before vanishing away.
>>>
>>> Anyway, we then have an exception when we try to connect, because the RS
>>> does not have a TCP connection to this datanode. And this is retried many
>>> times.
>>>
>>> You would not have this with trunk, because HBASE-6435 reorders the
>>> blocks
>>> inside the client, using an information not available to the NN,
>>> excluding
>>> the datanode of the region server under recovery.
>>>
>>> Some conclusions:
>>>  - we should likely backport hbase-6435 to 0.94.
>>>  - I will revive HDFS-3706 and HDFS-3705 (the non hacky way to get
>>> hbase-6435).
>>>  - There are some stuff that could be better in HDFS. I will see.
>>>  - I'm worried by the SocketTimeoutException. We should get NoRouteToHost
>>> at a moment, and we don't. That's also why it takes ages. I think it's an
>>> AWS thing, but it brings to issue: it's slow, and, in HBase, you don't
>>> know
>>> if the operation could have been executed or not, so it adds complexity
>>> to
>>> some scenarios. If someone with enough network and AWS knowledge could
>>> clarify this point it would be great.
>>>
>>>  Cheers,
>>>
>>>  Nicolas
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Apr 19, 2013 at 10:10 PM, Varun Sharma <va...@pinterest.com>
>>> wrote:
>>>
>>> > This is 0.94.3 hbase...
>>> >
>>> >
>>> > On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma <va...@pinterest.com>
>>> wrote:
>>> >
>>> > > Hi Ted,
>>> > >
>>> > > I had a long offline discussion with nicholas on this. Looks like the
>>> > last
>>> > > block which was still being written too, took an enormous time to
>>> > recover.
>>> > > Here's what happened.
>>> > > a) Master split tasks and region servers process them
>>> > > b) Region server tries to recover lease for each WAL log - most
>>> cases are
>>> > > noop since they are already rolled over/finalized
>>> > > c) The last file lease recovery takes some time since the crashing
>>> server
>>> > > was writing to it and had a lease on it - but basically we have the
>>> > lease 1
>>> > > minute after the server was lost
>>> > > d) Now we start the recovery for this but we end up hitting the stale
>>> > data
>>> > > node which is puzzling.
>>> > >
>>> > > It seems that we did not hit the stale datanode when we were trying
>>> to
>>> > > recover the finalized WAL blocks with trivial lease recovery.
>>> However,
>>> > for
>>> > > the final block, we hit the stale datanode. Any clue why this might
>>> be
>>> > > happening ?
>>> > >
>>> > > Varun
>>> > >
>>> > >
>>> > > On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <yu...@gmail.com>
>>> wrote:
>>> > >
>>> > >> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
>>> > >>
>>> > >> Here is the criteria for stale node checking to kick in (from
>>> > >>
>>> > >>
>>> >
>>> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
>>> > >> ):
>>> > >>
>>> > >> +   * Check if the datanode is in stale state. Here if
>>> > >> +   * the namenode has not received heartbeat msg from a
>>> > >> +   * datanode for more than staleInterval (default value is
>>> > >> +   * {@link
>>> > >> DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
>>> > >> +   * the datanode will be treated as stale node.
>>> > >>
>>> > >>
>>> > >> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <varun@pinterest.com
>>> >
>>> > >> wrote:
>>> > >>
>>> > >> > Is there a place to upload these logs ?
>>> > >> >
>>> > >> >
>>> > >> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <
>>> varun@pinterest.com>
>>> > >> > wrote:
>>> > >> >
>>> > >> > > Hi Nicholas,
>>> > >> > >
>>> > >> > > Attached are the namenode, dn logs (of one of the healthy
>>> replicas
>>> > of
>>> > >> the
>>> > >> > > WAL block) and the rs logs which got stuch doing the log split.
>>> > Action
>>> > >> > > begins at 2013-04-19 00:27*.
>>> > >> > >
>>> > >> > > Also, the rogue block is 5723958680970112840_174056. Its very
>>> > >> interesting
>>> > >> > > to trace this guy through the HDFS logs (dn and nn).
>>> > >> > >
>>> > >> > > Btw, do you know what the UNDER_RECOVERY stage is for, in HDFS ?
>>> > Also
>>> > >> > does
>>> > >> > > the stale node stuff kick in for that state ?
>>> > >> > >
>>> > >> > > Thanks
>>> > >> > > Varun
>>> > >> > >
>>> > >> > >
>>> > >> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <
>>> nkeywal@gmail.com
>>> > >> > >wrote:
>>> > >> > >
>>> > >> > >> Thanks for the detailed scenario and analysis. I'm going to
>>> have a
>>> > >> look.
>>> > >> > >> I can't access the logs (
>>> ec2-107-20-237-30.compute-1.amazonaws.com
>>> > >> > >> timeouts), could you please send them directly to me?
>>> > >> > >>
>>> > >> > >> Thanks,
>>> > >> > >>
>>> > >> > >> Nicolas
>>> > >> > >>
>>> > >> > >>
>>> > >> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <
>>> > varun@pinterest.com>
>>> > >> > >> wrote:
>>> > >> > >>
>>> > >> > >> > Hi Nicholas,
>>> > >> > >> >
>>> > >> > >> > Here is the failure scenario, I have dug up the logs.
>>> > >> > >> >
>>> > >> > >> > A machine fails and stops accepting/transmitting traffic. The
>>> > >> HMaster
>>> > >> > >> > starts the distributed split for 13 tasks. There are 12
>>> region
>>> > >> > servers.
>>> > >> > >> 12
>>> > >> > >> > tasks succeed but the 13th one takes a looong time.
>>> > >> > >> >
>>> > >> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout
>>> is 20
>>> > >> > >> seconds.
>>> > >> > >> > Both patches are there.
>>> > >> > >> >
>>> > >> > >> > a) Machine fails around 27:30
>>> > >> > >> > b) Master starts the split around 27:40 and submits the
>>> tasks.
>>> > The
>>> > >> one
>>> > >> > >> task
>>> > >> > >> > which fails seems to be the one which contains the WAL being
>>> > >> currently
>>> > >> > >> > written to:
>>> > >> > >> >
>>> > >> > >> > 2013-04-19 00:27:44,325 INFO
>>> > >> > >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter:
>>> Splitting
>>> > >> hlog:
>>> > >> > >> > hdfs://
>>> > >> > >> >
>>> > >> > >> >
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
>>> > >> > >> > ,
>>> > >> > >> > length=0
>>> > >> > >> >
>>> > >> > >> > Basically this region server picks up the task but finds the
>>> > >> length of
>>> > >> > >> this
>>> > >> > >> > file to be 0 and drops. This happens again
>>> > >> > >> >
>>> > >> > >> > c) Finally another region server picks up the task but it
>>> ends up
>>> > >> > going
>>> > >> > >> to
>>> > >> > >> > the bad datanode which should not happen because of the stale
>>> > node
>>> > >> > >> timeout)
>>> > >> > >> > Unfortunately it hits the 45 retries and a connect timeout
>>> of 20
>>> > >> > seconds
>>> > >> > >> > every time. This delays recovery significantly. Now I guess
>>> > >> reducing #
>>> > >> > >> of
>>> > >> > >> > retries to 1 is one possibility.
>>> > >> > >> > But then the namenode logs are very interesting.
>>> > >> > >> >
>>> > >> > >> > d) Namenode seems to be in cyclic lease recovery loop until
>>> the
>>> > >> node
>>> > >> > is
>>> > >> > >> > marked dead. There is this one last block which exhibits
>>> this.
>>> > >> > >> >
>>> > >> > >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-*
>>> > >> > >> > 5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
>>> > >> > >> > primaryNodeIndex=1,
>>> > >> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>>> > >> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>>> > >> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery
>>> > >> started,
>>> > >> > >> > primary=10.156.192.106:50010
>>> > >> > >> > 2013-04-19 00:28:09,744 WARN
>>> org.apache.hadoop.hdfs.StateChange:
>>> > >> DIR*
>>> > >> > >> > NameSystem.internalReleaseLease: File
>>> > >> > >> >
>>> > >> > >> >
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
>>> > >> > >> > has not been closed. Lease recovery is in progress.
>>> RecoveryId =
>>> > >> > 174413
>>> > >> > >> for
>>> > >> > >> > block
>>> > blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
>>> > >> > >> > primaryNodeIndex=1,
>>> > >> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>>> > >> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>>> > >> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
>>> > >> > >> >
>>> > >> > >> > I see this over and over again in the logs until the
>>> datanode is
>>> > >> > marked
>>> > >> > >> > dead. It seems to be cycling through the replicas for this
>>> WAL
>>> > >> block
>>> > >> > and
>>> > >> > >> > trying to add it to the recovery list. I looked at the code
>>> and
>>> > it
>>> > >> > says:
>>> > >> > >> >
>>> > >> > >> >       // Cannot close file right now, since the last block
>>> > requires
>>> > >> > >> > recovery.
>>> > >> > >> >       // This may potentially cause infinite loop in lease
>>> > recovery
>>> > >> > >> >       // if there are no valid replicas on data-nodes.
>>> > >> > >> >       NameNode.stateChangeLog.warn(
>>> > >> > >> >                 "DIR* NameSystem.internalReleaseLease: " +
>>> > >> > >> >                 "File " + src + " has not been closed." +
>>> > >> > >> >                " Lease recovery is in progress. " +
>>> > >> > >> >                 "RecoveryId = " + blockRecoveryId + " for
>>> block
>>> > " +
>>> > >> > >> > lastBlock);
>>> > >> > >> >       break;
>>> > >> > >> >
>>> > >> > >> > Eventually for this block, we get
>>> > >> > >> >
>>> > >> > >> > 2013-04-19 00:41:20,736 INFO
>>> > >> > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>> > >> > >> >
>>> > >> > >> >
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
>>> > >> > >> > *5723958680970112840_174056*, newgenerationstamp=174413,
>>> > >> > >> > newlength=119148648, newtargets=[10.156.192.106:50010,
>>> > >> > >> 10.156.195.38:50010
>>> > >> > >> > ],
>>> > >> > >> > closeFile=true, deleteBlock=false)
>>> > >> > >> > 2013-04-19 00:41:20,736 ERROR
>>> > >> > >> > org.apache.hadoop.security.UserGroupInformation:
>>> > >> > >> PriviledgedActionException
>>> > >> > >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
>>> > >> > >> >
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>>> > >> > >> > not found
>>> > >> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server:
>>> IPC
>>> > >> Server
>>> > >> > >> > handler 35 on 8020, call
>>> > >> > >> >
>>> > >> > >> >
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
>>> > >> > >> > from 10.156.192.106:53271: error: java.io.IOException: Block
>>> > >> > >> >
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>>> > >> > >> > not found
>>> > >> > >> >
>>> > >> > >> > On the datanode side, i see a call for recover blocks - I see
>>> > that
>>> > >> a
>>> > >> > >> write
>>> > >> > >> > pipeline is there, which gets terminated with some socket
>>> > >> timeouts...
>>> > >> > >> >
>>> > >> > >> > 00:28:11,471 INFO
>>> > org.apache.hadoop.hdfs.server.datanode.DataNode:
>>> > >> > >> NameNode
>>> > >> > >> > at
>>> > ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020calls
>>> > >> > >> >
>>> > >> > >> >
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
>>> > >> > >> > targets=[10.156.194.94:50010, 10.156.192.106:50010,
>>> > >> > 10.156.195.38:50010
>>> > >> > >> ],
>>> > >> > >> > newGenerationStamp=174413)
>>> > >> > >> >
>>> > >> > >> > Not sure but this looks like a case where data could be lost
>>>   ?
>>> > >> > >> >
>>> > >> > >> > Varun
>>> > >> > >> >
>>> > >> > >> >
>>> > >> > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <
>>> > >> nkeywal@gmail.com>
>>> > >> > >> > wrote:
>>> > >> > >> >
>>> > >> > >> > > Hey Varun,
>>> > >> > >> > >
>>> > >> > >> > > Could you please share the logs and the configuration
>>> (hdfs /
>>> > >> hbase
>>> > >> > >> > > settings + cluster description). What's the failure
>>> scenario?
>>> > >> > >> > > From an HDFS pov, HDFS 3703 does not change the dead node
>>> > status.
>>> > >> > But
>>> > >> > >> > these
>>> > >> > >> > > node will be given the lowest priority when reading.
>>> > >> > >> > >
>>> > >> > >> > >
>>> > >> > >> > > Cheers,
>>> > >> > >> > >
>>> > >> > >> > > Nicolas
>>> > >> > >> > >
>>> > >> > >> > >
>>> > >> > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <
>>> > >> varun@pinterest.com>
>>> > >> > >> > wrote:
>>> > >> > >> > >
>>> > >> > >> > > > Hi,
>>> > >> > >> > > >
>>> > >> > >> > > > We are facing problems with really slow HBase region
>>> server
>>> > >> > >> recoveries
>>> > >> > >> > ~
>>> > >> > >> > > 20
>>> > >> > >> > > > minuted. Version is hbase 0.94.3 compiled with
>>> > >> hadoop.profile=2.0.
>>> > >> > >> > > >
>>> > >> > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912
>>> > patched
>>> > >> and
>>> > >> > >> > stale
>>> > >> > >> > > > node timeouts configured correctly. Time for dead node
>>> > >> detection
>>> > >> > is
>>> > >> > >> > still
>>> > >> > >> > > > 10 minutes.
>>> > >> > >> > > >
>>> > >> > >> > > > We see that our region server is trying to read an HLog
>>> is
>>> > >> stuck
>>> > >> > >> there
>>> > >> > >> > > for
>>> > >> > >> > > > a long time. Logs here:
>>> > >> > >> > > >
>>> > >> > >> > > > 2013-04-12 21:14:30,248 WARN
>>> > org.apache.hadoop.hdfs.DFSClient:
>>> > >> > >> Failed
>>> > >> > >> > to
>>> > >> > >> > > > connect to /10.156.194.251:50010 for file
>>> > >> > >> > > >
>>> > >> > >> > > >
>>> > >> > >> > >
>>> > >> > >> >
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
>>> > >> > >> > > > for block
>>> > >> > >> > > >
>>> > >> > >> > > >
>>> > >> > >> > >
>>> > >> > >> >
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
>>> > >> > >> > > > 15000 millis timeout while waiting for channel to be
>>> ready
>>> > for
>>> > >> > read.
>>> > >> > >> > ch :
>>> > >> > >> > > > java.nio.channels.SocketChannel[connected local=/
>>> > >> > >> 10.156.192.173:52818
>>> > >> > >> > > > remote=/
>>> > >> > >> > > > 10.156.194.251:50010]
>>> > >> > >> > > >
>>> > >> > >> > > > I would think that HDFS 3703 would make the server fail
>>> fast
>>> > >> and
>>> > >> > go
>>> > >> > >> to
>>> > >> > >> > > the
>>> > >> > >> > > > third datanode. Currently, the recovery seems way too
>>> slow
>>> > for
>>> > >> > >> > production
>>> > >> > >> > > > usage...
>>> > >> > >> > > >
>>> > >> > >> > > > Varun
>>> > >> > >> > > >
>>> > >> > >> > >
>>> > >> > >> >
>>> > >> > >>
>>> > >> > >
>>> > >> > >
>>> > >> >
>>> > >>
>>> > >
>>> > >
>>> >
>>>
>>
>>
>

Re: Slow region server recoveries

Posted by Varun Sharma <va...@pinterest.com>.
The important thing to note is that the block for this rogue WAL is in the
UNDER_RECOVERY state. I have repeatedly asked the HDFS devs whether the stale
node check kicks in correctly for UNDER_RECOVERY blocks, but without success.
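
For concreteness, this is roughly the behaviour I would expect for the
replicas of an UNDER_RECOVERY block as well. This is only a sketch to
illustrate the question, not actual HDFS code; I'm assuming getLastUpdate()
is the heartbeat timestamp accessor on DatanodeInfo:

import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class StaleNodeSketch {
  // Illustrative sketch only - not the actual HDFS implementation.
  // A node is "stale" if the NN has not seen a heartbeat for staleIntervalMs.
  static boolean isStale(DatanodeInfo node, long staleIntervalMs) {
    return System.currentTimeMillis() - node.getLastUpdate() > staleIntervalMs;
  }

  // Keep the original replica order, but push stale replicas to the end so a
  // reader (or a lease recovery) tries the healthy replicas first.
  static void deprioritizeStale(DatanodeInfo[] locs, final long staleIntervalMs) {
    Arrays.sort(locs, new Comparator<DatanodeInfo>() {
      public int compare(DatanodeInfo a, DatanodeInfo b) {
        int sa = isStale(a, staleIntervalMs) ? 1 : 0;
        int sb = isStale(b, staleIntervalMs) ? 1 : 0;
        return sa - sb; // non-stale first; Arrays.sort is stable for objects
      }
    });
  }
}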


On Sat, Apr 20, 2013 at 10:47 AM, Varun Sharma <va...@pinterest.com> wrote:

> Hi Nicholas,
>
> Regarding the following, I think this is not a recovery - the file below
> is an HFIle and is being accessed on a get request. On this cluster, I
> don't have block locality. I see these exceptions for a while and then they
> are gone, which means the stale node thing kicks in.
>
> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /10.156.194.94:50010 for file
> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/
> 02f639bb43944d4ba9abcf58287831c0
> for block
>
> This is the real bummer. The stale datanode is 1st even 90 seconds
> afterwards.
>
> *2013-04-19 00:28:35*,777 WARN
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of
> hdfs://
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141failed, returning error
> java.io.IOException: Cannot obtain block length for
> LocatedBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056;
> getBlockSize()=0; corrupt=false; offset=0; locs=*[10.156.194.94:50010,
> 10.156.192.106:50010, 10.156.195.38:50010]}*
> >---at
> org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:238)
> >---at
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:182)
> >---at
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:124)
> >---at
> org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:117)
> >---at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1080)
> >---at
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
> >---at
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:78)
> >---at
> org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1787)
> >---at
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
> >---at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1707)
> >---at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
> >---at
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
> >---at
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
> >---at
> org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:717)
> >---at
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
> >---at
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
> >---at
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
> >---at
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
> >---at
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
> >---at
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
> >---at
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
> >---at
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
> >---at java.lang.Thread.run(Thread.java:662)
>
>
>
> On Sat, Apr 20, 2013 at 1:16 AM, Nicolas Liochon <nk...@gmail.com>wrote:
>
>> Hi,
>>
>> I looked at it again with a fresh eye. As Varun was saying, the root cause
>> is the wrong order of the block locations.
>>
>> The root cause of the root cause is actually simple: HBASE started the
>> recovery while the node was not yet stale from an HDFS pov.
>>
>> Varun mentioned this timing:
>> Lost Beat: 27:30
>> Became stale: 27:50 - * this is a guess and reverse engineered (stale
>> timeout 20 seconds)
>> Became dead: 37:51
>>
>> But the  recovery started at 27:13 (15 seconds before we have this log
>> line)
>> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
>> connect to /10.156.194.94:50010 for file
>>
>> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0
>> for block
>>
>> BP-696828882-10.168.7.226-1364886167971:blk_-5977178030490858298_99853:java.net.SocketTimeoutException:
>> 15000 millis timeout while waiting for channel to be ready for connect. ch
>> : java.nio.channels.SocketChannel[connection-pending remote=/
>> 10.156.194.94:50010]
>>
>> So when we took the blocks from the NN, the datanode was not stale, so you
>> have the wrong (random) order.
>>
>> ZooKeeper can expire a session before the timeout. I don't what why it
>> does
>> this in this case, but I don't consider it as a ZK bug: if ZK knows that a
>> node is dead, it's its role to expire the session. There is something more
>> fishy: we started the recovery while the datanode was still responding to
>> heartbeat. I don't know why. Maybe the OS has been able to kill 15 the RS
>> before vanishing away.
>>
>> Anyway, we then have an exception when we try to connect, because the RS
>> does not have a TCP connection to this datanode. And this is retried many
>> times.
>>
>> You would not have this with trunk, because HBASE-6435 reorders the blocks
>> inside the client, using an information not available to the NN, excluding
>> the datanode of the region server under recovery.
>>
>> Some conclusions:
>>  - we should likely backport hbase-6435 to 0.94.
>>  - I will revive HDFS-3706 and HDFS-3705 (the non hacky way to get
>> hbase-6435).
>>  - There are some stuff that could be better in HDFS. I will see.
>>  - I'm worried by the SocketTimeoutException. We should get NoRouteToHost
>> at a moment, and we don't. That's also why it takes ages. I think it's an
>> AWS thing, but it brings to issue: it's slow, and, in HBase, you don't
>> know
>> if the operation could have been executed or not, so it adds complexity to
>> some scenarios. If someone with enough network and AWS knowledge could
>> clarify this point it would be great.
>>
>>  Cheers,
>>
>>  Nicolas
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Apr 19, 2013 at 10:10 PM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>> > This is 0.94.3 hbase...
>> >
>> >
>> > On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma <va...@pinterest.com>
>> wrote:
>> >
>> > > Hi Ted,
>> > >
>> > > I had a long offline discussion with nicholas on this. Looks like the
>> > last
>> > > block which was still being written too, took an enormous time to
>> > recover.
>> > > Here's what happened.
>> > > a) Master split tasks and region servers process them
>> > > b) Region server tries to recover lease for each WAL log - most cases
>> are
>> > > noop since they are already rolled over/finalized
>> > > c) The last file lease recovery takes some time since the crashing
>> server
>> > > was writing to it and had a lease on it - but basically we have the
>> > lease 1
>> > > minute after the server was lost
>> > > d) Now we start the recovery for this but we end up hitting the stale
>> > data
>> > > node which is puzzling.
>> > >
>> > > It seems that we did not hit the stale datanode when we were trying to
>> > > recover the finalized WAL blocks with trivial lease recovery. However,
>> > for
>> > > the final block, we hit the stale datanode. Any clue why this might be
>> > > happening ?
>> > >
>> > > Varun
>> > >
>> > >
>> > > On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <yu...@gmail.com> wrote:
>> > >
>> > >> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
>> > >>
>> > >> Here is the criteria for stale node checking to kick in (from
>> > >>
>> > >>
>> >
>> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
>> > >> ):
>> > >>
>> > >> +   * Check if the datanode is in stale state. Here if
>> > >> +   * the namenode has not received heartbeat msg from a
>> > >> +   * datanode for more than staleInterval (default value is
>> > >> +   * {@link
>> > >> DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
>> > >> +   * the datanode will be treated as stale node.
>> > >>
>> > >>
>> > >> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <va...@pinterest.com>
>> > >> wrote:
>> > >>
>> > >> > Is there a place to upload these logs ?
>> > >> >
>> > >> >
>> > >> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <
>> varun@pinterest.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Hi Nicholas,
>> > >> > >
>> > >> > > Attached are the namenode, dn logs (of one of the healthy
>> replicas
>> > of
>> > >> the
>> > >> > > WAL block) and the rs logs which got stuch doing the log split.
>> > Action
>> > >> > > begins at 2013-04-19 00:27*.
>> > >> > >
>> > >> > > Also, the rogue block is 5723958680970112840_174056. Its very
>> > >> interesting
>> > >> > > to trace this guy through the HDFS logs (dn and nn).
>> > >> > >
>> > >> > > Btw, do you know what the UNDER_RECOVERY stage is for, in HDFS ?
>> > Also
>> > >> > does
>> > >> > > the stale node stuff kick in for that state ?
>> > >> > >
>> > >> > > Thanks
>> > >> > > Varun
>> > >> > >
>> > >> > >
>> > >> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <
>> nkeywal@gmail.com
>> > >> > >wrote:
>> > >> > >
>> > >> > >> Thanks for the detailed scenario and analysis. I'm going to
>> have a
>> > >> look.
>> > >> > >> I can't access the logs (
>> ec2-107-20-237-30.compute-1.amazonaws.com
>> > >> > >> timeouts), could you please send them directly to me?
>> > >> > >>
>> > >> > >> Thanks,
>> > >> > >>
>> > >> > >> Nicolas
>> > >> > >>
>> > >> > >>
>> > >> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <
>> > varun@pinterest.com>
>> > >> > >> wrote:
>> > >> > >>
>> > >> > >> > Hi Nicholas,
>> > >> > >> >
>> > >> > >> > Here is the failure scenario, I have dug up the logs.
>> > >> > >> >
>> > >> > >> > A machine fails and stops accepting/transmitting traffic. The
>> > >> HMaster
>> > >> > >> > starts the distributed split for 13 tasks. There are 12 region
>> > >> > servers.
>> > >> > >> 12
>> > >> > >> > tasks succeed but the 13th one takes a looong time.
>> > >> > >> >
>> > >> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout is
>> 20
>> > >> > >> seconds.
>> > >> > >> > Both patches are there.
>> > >> > >> >
>> > >> > >> > a) Machine fails around 27:30
>> > >> > >> > b) Master starts the split around 27:40 and submits the tasks.
>> > The
>> > >> one
>> > >> > >> task
>> > >> > >> > which fails seems to be the one which contains the WAL being
>> > >> currently
>> > >> > >> > written to:
>> > >> > >> >
>> > >> > >> > 2013-04-19 00:27:44,325 INFO
>> > >> > >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter:
>> Splitting
>> > >> hlog:
>> > >> > >> > hdfs://
>> > >> > >> >
>> > >> > >> >
>> > >> > >>
>> > >> >
>> > >>
>> >
>> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
>> > >> > >> > ,
>> > >> > >> > length=0
>> > >> > >> >
>> > >> > >> > Basically this region server picks up the task but finds the
>> > >> length of
>> > >> > >> this
>> > >> > >> > file to be 0 and drops. This happens again
>> > >> > >> >
>> > >> > >> > c) Finally another region server picks up the task but it
>> ends up
>> > >> > going
>> > >> > >> to
>> > >> > >> > the bad datanode which should not happen because of the stale
>> > node
>> > >> > >> timeout)
>> > >> > >> > Unfortunately it hits the 45 retries and a connect timeout of
>> 20
>> > >> > seconds
>> > >> > >> > every time. This delays recovery significantly. Now I guess
>> > >> reducing #
>> > >> > >> of
>> > >> > >> > retries to 1 is one possibility.
>> > >> > >> > But then the namenode logs are very interesting.
>> > >> > >> >
>> > >> > >> > d) Namenode seems to be in cyclic lease recovery loop until
>> the
>> > >> node
>> > >> > is
>> > >> > >> > marked dead. There is this one last block which exhibits this.
>> > >> > >> >
>> > >> > >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-*
>> > >> > >> > 5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
>> > >> > >> > primaryNodeIndex=1,
>> > >> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>> > >> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>> > >> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery
>> > >> started,
>> > >> > >> > primary=10.156.192.106:50010
>> > >> > >> > 2013-04-19 00:28:09,744 WARN
>> org.apache.hadoop.hdfs.StateChange:
>> > >> DIR*
>> > >> > >> > NameSystem.internalReleaseLease: File
>> > >> > >> >
>> > >> > >> >
>> > >> > >>
>> > >> >
>> > >>
>> >
>> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
>> > >> > >> > has not been closed. Lease recovery is in progress.
>> RecoveryId =
>> > >> > 174413
>> > >> > >> for
>> > >> > >> > block
>> > blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
>> > >> > >> > primaryNodeIndex=1,
>> > >> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>> > >> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>> > >> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
>> > >> > >> >
>> > >> > >> > I see this over and over again in the logs until the datanode
>> is
>> > >> > marked
>> > >> > >> > dead. It seems to be cycling through the replicas for this WAL
>> > >> block
>> > >> > and
>> > >> > >> > trying to add it to the recovery list. I looked at the code
>> and
>> > it
>> > >> > says:
>> > >> > >> >
>> > >> > >> >       // Cannot close file right now, since the last block
>> > requires
>> > >> > >> > recovery.
>> > >> > >> >       // This may potentially cause infinite loop in lease
>> > recovery
>> > >> > >> >       // if there are no valid replicas on data-nodes.
>> > >> > >> >       NameNode.stateChangeLog.warn(
>> > >> > >> >                 "DIR* NameSystem.internalReleaseLease: " +
>> > >> > >> >                 "File " + src + " has not been closed." +
>> > >> > >> >                " Lease recovery is in progress. " +
>> > >> > >> >                 "RecoveryId = " + blockRecoveryId + " for
>> block
>> > " +
>> > >> > >> > lastBlock);
>> > >> > >> >       break;
>> > >> > >> >
>> > >> > >> > Eventually for this block, we get
>> > >> > >> >
>> > >> > >> > 2013-04-19 00:41:20,736 INFO
>> > >> > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>> > >> > >> >
>> > >> > >> >
>> > >> > >>
>> > >> >
>> > >>
>> >
>> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
>> > >> > >> > *5723958680970112840_174056*, newgenerationstamp=174413,
>> > >> > >> > newlength=119148648, newtargets=[10.156.192.106:50010,
>> > >> > >> 10.156.195.38:50010
>> > >> > >> > ],
>> > >> > >> > closeFile=true, deleteBlock=false)
>> > >> > >> > 2013-04-19 00:41:20,736 ERROR
>> > >> > >> > org.apache.hadoop.security.UserGroupInformation:
>> > >> > >> PriviledgedActionException
>> > >> > >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
>> > >> > >> >
>> > >> > >>
>> > >> >
>> > >>
>> >
>> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>> > >> > >> > not found
>> > >> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC
>> > >> Server
>> > >> > >> > handler 35 on 8020, call
>> > >> > >> >
>> > >> > >> >
>> > >> > >>
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
>> > >> > >> > from 10.156.192.106:53271: error: java.io.IOException: Block
>> > >> > >> >
>> > >> > >>
>> > >> >
>> > >>
>> >
>> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>> > >> > >> > not found
>> > >> > >> >
>> > >> > >> > On the datanode side, i see a call for recover blocks - I see
>> > that
>> > >> a
>> > >> > >> write
>> > >> > >> > pipeline is there, which gets terminated with some socket
>> > >> timeouts...
>> > >> > >> >
>> > >> > >> > 00:28:11,471 INFO
>> > org.apache.hadoop.hdfs.server.datanode.DataNode:
>> > >> > >> NameNode
>> > >> > >> > at
>> > ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020calls
>> > >> > >> >
>> > >> > >> >
>> > >> > >>
>> > >> >
>> > >>
>> >
>> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
>> > >> > >> > targets=[10.156.194.94:50010, 10.156.192.106:50010,
>> > >> > 10.156.195.38:50010
>> > >> > >> ],
>> > >> > >> > newGenerationStamp=174413)
>> > >> > >> >
>> > >> > >> > Not sure but this looks like a case where data could be lost
>>   ?
>> > >> > >> >
>> > >> > >> > Varun
>> > >> > >> >
>> > >> > >> >
>> > >> > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <
>> > >> nkeywal@gmail.com>
>> > >> > >> > wrote:
>> > >> > >> >
>> > >> > >> > > Hey Varun,
>> > >> > >> > >
>> > >> > >> > > Could you please share the logs and the configuration (hdfs
>> /
>> > >> hbase
>> > >> > >> > > settings + cluster description). What's the failure
>> scenario?
>> > >> > >> > > From an HDFS pov, HDFS 3703 does not change the dead node
>> > status.
>> > >> > But
>> > >> > >> > these
>> > >> > >> > > node will be given the lowest priority when reading.
>> > >> > >> > >
>> > >> > >> > >
>> > >> > >> > > Cheers,
>> > >> > >> > >
>> > >> > >> > > Nicolas
>> > >> > >> > >
>> > >> > >> > >
>> > >> > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <
>> > >> varun@pinterest.com>
>> > >> > >> > wrote:
>> > >> > >> > >
>> > >> > >> > > > Hi,
>> > >> > >> > > >
>> > >> > >> > > > We are facing problems with really slow HBase region
>> server
>> > >> > >> recoveries
>> > >> > >> > ~
>> > >> > >> > > 20
>> > >> > >> > > > minuted. Version is hbase 0.94.3 compiled with
>> > >> hadoop.profile=2.0.
>> > >> > >> > > >
>> > >> > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912
>> > patched
>> > >> and
>> > >> > >> > stale
>> > >> > >> > > > node timeouts configured correctly. Time for dead node
>> > >> detection
>> > >> > is
>> > >> > >> > still
>> > >> > >> > > > 10 minutes.
>> > >> > >> > > >
>> > >> > >> > > > We see that our region server is trying to read an HLog is
>> > >> stuck
>> > >> > >> there
>> > >> > >> > > for
>> > >> > >> > > > a long time. Logs here:
>> > >> > >> > > >
>> > >> > >> > > > 2013-04-12 21:14:30,248 WARN
>> > org.apache.hadoop.hdfs.DFSClient:
>> > >> > >> Failed
>> > >> > >> > to
>> > >> > >> > > > connect to /10.156.194.251:50010 for file
>> > >> > >> > > >
>> > >> > >> > > >
>> > >> > >> > >
>> > >> > >> >
>> > >> > >>
>> > >> >
>> > >>
>> >
>> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
>> > >> > >> > > > for block
>> > >> > >> > > >
>> > >> > >> > > >
>> > >> > >> > >
>> > >> > >> >
>> > >> > >>
>> > >> >
>> > >>
>> >
>> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
>> > >> > >> > > > 15000 millis timeout while waiting for channel to be ready
>> > for
>> > >> > read.
>> > >> > >> > ch :
>> > >> > >> > > > java.nio.channels.SocketChannel[connected local=/
>> > >> > >> 10.156.192.173:52818
>> > >> > >> > > > remote=/
>> > >> > >> > > > 10.156.194.251:50010]
>> > >> > >> > > >
>> > >> > >> > > > I would think that HDFS 3703 would make the server fail
>> fast
>> > >> and
>> > >> > go
>> > >> > >> to
>> > >> > >> > > the
>> > >> > >> > > > third datanode. Currently, the recovery seems way too slow
>> > for
>> > >> > >> > production
>> > >> > >> > > > usage...
>> > >> > >> > > >
>> > >> > >> > > > Varun
>> > >> > >> > > >
>> > >> > >> > >
>> > >> > >> >
>> > >> > >>
>> > >> > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Re: Slow region server recoveries

Posted by Varun Sharma <va...@pinterest.com>.
Hi Nicholas,

Regarding the following, I think this is not a recovery - the file below is
an HFile and is being accessed on a get request. On this cluster, I don't
have block locality. I see these exceptions for a while and then they are
gone, which means the stale node check kicks in.

2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
connect to /10.156.194.94:50010 for file
/hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/
02f639bb43944d4ba9abcf58287831c0
for block

This is the real bummer. The stale datanode is still listed first, even 90
seconds afterwards.

*2013-04-19 00:28:35*,777 WARN
org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of
hdfs://
ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
failed, returning error
java.io.IOException: Cannot obtain block length for
LocatedBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056;
getBlockSize()=0; corrupt=false; offset=0; locs=*[10.156.194.94:50010,
10.156.192.106:50010, 10.156.195.38:50010]}*
>---at
org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:238)
>---at
org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:182)
>---at
org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:124)
>---at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:117)
>---at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1080)
>---at
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:245)
>---at
org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:78)
>---at
org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1787)
>---at
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
>---at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1707)
>---at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
>---at
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>---at
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
>---at
org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:717)
>---at
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
>---at
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
>---at
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
>---at
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
>---at
org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
>---at
org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
>---at
org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
>---at
org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
>---at java.lang.Thread.run(Thread.java:662)
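
For what it's worth, here is the quick sketch I've been using to dump the
replica order the NN hands back for a given file. It only uses the public
FileSystem API, nothing HBase specific; the path argument would be the HFile
or WAL from the logs above, and I ask for a Long.MAX_VALUE range because the
reported length of the WAL under recovery is 0:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DumpReplicaOrder {
  // Quick diagnostic sketch; assumes the Configuration picks up the cluster
  // settings (core-site.xml / hdfs-site.xml on the classpath) or that args[0]
  // is a full hdfs:// URI.
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]); // file to inspect
    FileSystem fs = path.getFileSystem(new Configuration());
    FileStatus st = fs.getFileStatus(path);
    // One line per block, hosts printed in the order the NN returned them.
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, Long.MAX_VALUE)) {
      System.out.println(loc.getOffset() + " -> " + Arrays.toString(loc.getHosts()));
    }
  }
}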



On Sat, Apr 20, 2013 at 1:16 AM, Nicolas Liochon <nk...@gmail.com> wrote:

> Hi,
>
> I looked at it again with a fresh eye. As Varun was saying, the root cause
> is the wrong order of the block locations.
>
> The root cause of the root cause is actually simple: HBASE started the
> recovery while the node was not yet stale from an HDFS pov.
>
> Varun mentioned this timing:
> Lost Beat: 27:30
> Became stale: 27:50 - * this is a guess and reverse engineered (stale
> timeout 20 seconds)
> Became dead: 37:51
>
> But the  recovery started at 27:13 (15 seconds before we have this log
> line)
> 2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /10.156.194.94:50010 for file
>
> /hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0
> for block
>
> BP-696828882-10.168.7.226-1364886167971:blk_-5977178030490858298_99853:java.net.SocketTimeoutException:
> 15000 millis timeout while waiting for channel to be ready for connect. ch
> : java.nio.channels.SocketChannel[connection-pending remote=/
> 10.156.194.94:50010]
>
> So when we took the blocks from the NN, the datanode was not stale, so you
> have the wrong (random) order.
>
> ZooKeeper can expire a session before the timeout. I don't what why it does
> this in this case, but I don't consider it as a ZK bug: if ZK knows that a
> node is dead, it's its role to expire the session. There is something more
> fishy: we started the recovery while the datanode was still responding to
> heartbeat. I don't know why. Maybe the OS has been able to kill 15 the RS
> before vanishing away.
>
> Anyway, we then have an exception when we try to connect, because the RS
> does not have a TCP connection to this datanode. And this is retried many
> times.
>
> You would not have this with trunk, because HBASE-6435 reorders the blocks
> inside the client, using an information not available to the NN, excluding
> the datanode of the region server under recovery.
>
> Some conclusions:
>  - we should likely backport hbase-6435 to 0.94.
>  - I will revive HDFS-3706 and HDFS-3705 (the non hacky way to get
> hbase-6435).
>  - There are some stuff that could be better in HDFS. I will see.
>  - I'm worried by the SocketTimeoutException. We should get NoRouteToHost
> at a moment, and we don't. That's also why it takes ages. I think it's an
> AWS thing, but it brings to issue: it's slow, and, in HBase, you don't know
> if the operation could have been executed or not, so it adds complexity to
> some scenarios. If someone with enough network and AWS knowledge could
> clarify this point it would be great.
>
>  Cheers,
>
>  Nicolas
>
>
>
>
>
>
>
>
>
> On Fri, Apr 19, 2013 at 10:10 PM, Varun Sharma <va...@pinterest.com>
> wrote:
>
> > This is 0.94.3 hbase...
> >
> >
> > On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> >
> > > Hi Ted,
> > >
> > > I had a long offline discussion with nicholas on this. Looks like the
> > last
> > > block which was still being written too, took an enormous time to
> > recover.
> > > Here's what happened.
> > > a) Master split tasks and region servers process them
> > > b) Region server tries to recover lease for each WAL log - most cases
> are
> > > noop since they are already rolled over/finalized
> > > c) The last file lease recovery takes some time since the crashing
> server
> > > was writing to it and had a lease on it - but basically we have the
> > lease 1
> > > minute after the server was lost
> > > d) Now we start the recovery for this but we end up hitting the stale
> > data
> > > node which is puzzling.
> > >
> > > It seems that we did not hit the stale datanode when we were trying to
> > > recover the finalized WAL blocks with trivial lease recovery. However,
> > for
> > > the final block, we hit the stale datanode. Any clue why this might be
> > > happening ?
> > >
> > > Varun
> > >
> > >
> > > On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > >> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
> > >>
> > >> Here is the criteria for stale node checking to kick in (from
> > >>
> > >>
> >
> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
> > >> ):
> > >>
> > >> +   * Check if the datanode is in stale state. Here if
> > >> +   * the namenode has not received heartbeat msg from a
> > >> +   * datanode for more than staleInterval (default value is
> > >> +   * {@link
> > >> DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
> > >> +   * the datanode will be treated as stale node.
> > >>
> > >>
> > >> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <va...@pinterest.com>
> > >> wrote:
> > >>
> > >> > Is there a place to upload these logs ?
> > >> >
> > >> >
> > >> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <varun@pinterest.com
> >
> > >> > wrote:
> > >> >
> > >> > > Hi Nicholas,
> > >> > >
> > >> > > Attached are the namenode, dn logs (of one of the healthy replicas
> > of
> > >> the
> > >> > > WAL block) and the rs logs which got stuch doing the log split.
> > Action
> > >> > > begins at 2013-04-19 00:27*.
> > >> > >
> > >> > > Also, the rogue block is 5723958680970112840_174056. Its very
> > >> interesting
> > >> > > to trace this guy through the HDFS logs (dn and nn).
> > >> > >
> > >> > > Btw, do you know what the UNDER_RECOVERY stage is for, in HDFS ?
> > Also
> > >> > does
> > >> > > the stale node stuff kick in for that state ?
> > >> > >
> > >> > > Thanks
> > >> > > Varun
> > >> > >
> > >> > >
> > >> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <
> nkeywal@gmail.com
> > >> > >wrote:
> > >> > >
> > >> > >> Thanks for the detailed scenario and analysis. I'm going to have
> a
> > >> look.
> > >> > >> I can't access the logs (
> ec2-107-20-237-30.compute-1.amazonaws.com
> > >> > >> timeouts), could you please send them directly to me?
> > >> > >>
> > >> > >> Thanks,
> > >> > >>
> > >> > >> Nicolas
> > >> > >>
> > >> > >>
> > >> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <
> > varun@pinterest.com>
> > >> > >> wrote:
> > >> > >>
> > >> > >> > Hi Nicholas,
> > >> > >> >
> > >> > >> > Here is the failure scenario, I have dug up the logs.
> > >> > >> >
> > >> > >> > A machine fails and stops accepting/transmitting traffic. The
> > >> HMaster
> > >> > >> > starts the distributed split for 13 tasks. There are 12 region
> > >> > servers.
> > >> > >> 12
> > >> > >> > tasks succeed but the 13th one takes a looong time.
> > >> > >> >
> > >> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout is
> 20
> > >> > >> seconds.
> > >> > >> > Both patches are there.
> > >> > >> >
> > >> > >> > a) Machine fails around 27:30
> > >> > >> > b) Master starts the split around 27:40 and submits the tasks.
> > The
> > >> one
> > >> > >> task
> > >> > >> > which fails seems to be the one which contains the WAL being
> > >> currently
> > >> > >> > written to:
> > >> > >> >
> > >> > >> > 2013-04-19 00:27:44,325 INFO
> > >> > >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter:
> Splitting
> > >> hlog:
> > >> > >> > hdfs://
> > >> > >> >
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> > >> > >> > ,
> > >> > >> > length=0
> > >> > >> >
> > >> > >> > Basically this region server picks up the task but finds the
> > >> length of
> > >> > >> this
> > >> > >> > file to be 0 and drops. This happens again
> > >> > >> >
> > >> > >> > c) Finally another region server picks up the task but it ends
> up
> > >> > going
> > >> > >> to
> > >> > >> > the bad datanode which should not happen because of the stale
> > node
> > >> > >> timeout)
> > >> > >> > Unfortunately it hits the 45 retries and a connect timeout of
> 20
> > >> > seconds
> > >> > >> > every time. This delays recovery significantly. Now I guess
> > >> reducing #
> > >> > >> of
> > >> > >> > retries to 1 is one possibility.
> > >> > >> > But then the namenode logs are very interesting.
> > >> > >> >
> > >> > >> > d) Namenode seems to be in cyclic lease recovery loop until the
> > >> node
> > >> > is
> > >> > >> > marked dead. There is this one last block which exhibits this.
> > >> > >> >
> > >> > >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-*
> > >> > >> > 5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
> > >> > >> > primaryNodeIndex=1,
> > >> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> > >> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> > >> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery
> > >> started,
> > >> > >> > primary=10.156.192.106:50010
> > >> > >> > 2013-04-19 00:28:09,744 WARN
> org.apache.hadoop.hdfs.StateChange:
> > >> DIR*
> > >> > >> > NameSystem.internalReleaseLease: File
> > >> > >> >
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> > >> > >> > has not been closed. Lease recovery is in progress. RecoveryId
> =
> > >> > 174413
> > >> > >> for
> > >> > >> > block
> > blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
> > >> > >> > primaryNodeIndex=1,
> > >> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> > >> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> > >> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
> > >> > >> >
> > >> > >> > I see this over and over again in the logs until the datanode
> is
> > >> > marked
> > >> > >> > dead. It seems to be cycling through the replicas for this WAL
> > >> block
> > >> > and
> > >> > >> > trying to add it to the recovery list. I looked at the code and
> > it
> > >> > says:
> > >> > >> >
> > >> > >> >       // Cannot close file right now, since the last block
> > requires
> > >> > >> > recovery.
> > >> > >> >       // This may potentially cause infinite loop in lease
> > recovery
> > >> > >> >       // if there are no valid replicas on data-nodes.
> > >> > >> >       NameNode.stateChangeLog.warn(
> > >> > >> >                 "DIR* NameSystem.internalReleaseLease: " +
> > >> > >> >                 "File " + src + " has not been closed." +
> > >> > >> >                " Lease recovery is in progress. " +
> > >> > >> >                 "RecoveryId = " + blockRecoveryId + " for block
> > " +
> > >> > >> > lastBlock);
> > >> > >> >       break;
> > >> > >> >
> > >> > >> > Eventually for this block, we get
> > >> > >> >
> > >> > >> > 2013-04-19 00:41:20,736 INFO
> > >> > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > >> > >> >
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
> > >> > >> > *5723958680970112840_174056*, newgenerationstamp=174413,
> > >> > >> > newlength=119148648, newtargets=[10.156.192.106:50010,
> > >> > >> 10.156.195.38:50010
> > >> > >> > ],
> > >> > >> > closeFile=true, deleteBlock=false)
> > >> > >> > 2013-04-19 00:41:20,736 ERROR
> > >> > >> > org.apache.hadoop.security.UserGroupInformation:
> > >> > >> PriviledgedActionException
> > >> > >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> > >> > >> > not found
> > >> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC
> > >> Server
> > >> > >> > handler 35 on 8020, call
> > >> > >> >
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
> > >> > >> > from 10.156.192.106:53271: error: java.io.IOException: Block
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> > >> > >> > not found
> > >> > >> >
> > >> > >> > On the datanode side, i see a call for recover blocks - I see
> > that
> > >> a
> > >> > >> write
> > >> > >> > pipeline is there, which gets terminated with some socket
> > >> timeouts...
> > >> > >> >
> > >> > >> > 00:28:11,471 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >> > >> NameNode
> > >> > >> > at
> > ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020calls
> > >> > >> >
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
> > >> > >> > targets=[10.156.194.94:50010, 10.156.192.106:50010,
> > >> > 10.156.195.38:50010
> > >> > >> ],
> > >> > >> > newGenerationStamp=174413)
> > >> > >> >
> > >> > >> > Not sure but this looks like a case where data could be lost
> ?
> > >> > >> >
> > >> > >> > Varun
> > >> > >> >
> > >> > >> >
> > >> > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <
> > >> nkeywal@gmail.com>
> > >> > >> > wrote:
> > >> > >> >
> > >> > >> > > Hey Varun,
> > >> > >> > >
> > >> > >> > > Could you please share the logs and the configuration (hdfs /
> > >> hbase
> > >> > >> > > settings + cluster description). What's the failure scenario?
> > >> > >> > > From an HDFS pov, HDFS 3703 does not change the dead node
> > status.
> > >> > But
> > >> > >> > these
> > >> > >> > > node will be given the lowest priority when reading.
> > >> > >> > >
> > >> > >> > >
> > >> > >> > > Cheers,
> > >> > >> > >
> > >> > >> > > Nicolas
> > >> > >> > >
> > >> > >> > >
> > >> > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <
> > >> varun@pinterest.com>
> > >> > >> > wrote:
> > >> > >> > >
> > >> > >> > > > Hi,
> > >> > >> > > >
> > >> > >> > > > We are facing problems with really slow HBase region server
> > >> > >> recoveries
> > >> > >> > ~
> > >> > >> > > 20
> > >> > >> > > > minuted. Version is hbase 0.94.3 compiled with
> > >> hadoop.profile=2.0.
> > >> > >> > > >
> > >> > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912
> > patched
> > >> and
> > >> > >> > stale
> > >> > >> > > > node timeouts configured correctly. Time for dead node
> > >> detection
> > >> > is
> > >> > >> > still
> > >> > >> > > > 10 minutes.
> > >> > >> > > >
> > >> > >> > > > We see that our region server is trying to read an HLog is
> > >> stuck
> > >> > >> there
> > >> > >> > > for
> > >> > >> > > > a long time. Logs here:
> > >> > >> > > >
> > >> > >> > > > 2013-04-12 21:14:30,248 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > >> > >> Failed
> > >> > >> > to
> > >> > >> > > > connect to /10.156.194.251:50010 for file
> > >> > >> > > >
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> > >> > >> > > > for block
> > >> > >> > > >
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> > >> > >> > > > 15000 millis timeout while waiting for channel to be ready
> > for
> > >> > read.
> > >> > >> > ch :
> > >> > >> > > > java.nio.channels.SocketChannel[connected local=/
> > >> > >> 10.156.192.173:52818
> > >> > >> > > > remote=/
> > >> > >> > > > 10.156.194.251:50010]
> > >> > >> > > >
> > >> > >> > > > I would think that HDFS 3703 would make the server fail
> fast
> > >> and
> > >> > go
> > >> > >> to
> > >> > >> > > the
> > >> > >> > > > third datanode. Currently, the recovery seems way too slow
> > for
> > >> > >> > production
> > >> > >> > > > usage...
> > >> > >> > > >
> > >> > >> > > > Varun
> > >> > >> > > >
> > >> > >> > >
> > >> > >> >
> > >> > >>
> > >> > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Slow region server recoveries

Posted by Nicolas Liochon <nk...@gmail.com>.
Hi,

I looked at it again with a fresh eye. As Varun was saying, the root cause
is the wrong order of the block locations.

The root cause of the root cause is actually simple: HBase started the
recovery while the node was not yet stale from an HDFS pov.

Varun mentioned this timing:
Lost Beat: 27:30
Became stale: 27:50 - * this is a guess and reverse engineered (stale
timeout 20 seconds)
Became dead: 37:51

But the  recovery started at 27:13 (15 seconds before we have this log
line)
2013-04-19 00:27:28,432 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
connect to /10.156.194.94:50010 for file
/hbase/feeds/1479495ad2a02dceb41f093ebc29fe4f/home/02f639bb43944d4ba9abcf58287831c0
for block
BP-696828882-10.168.7.226-1364886167971:blk_-5977178030490858298_99853:java.net.SocketTimeoutException:
15000 millis timeout while waiting for channel to be ready for connect. ch
: java.nio.channels.SocketChannel[connection-pending remote=/
10.156.194.94:50010]

So when we took the blocks from the NN, the datanode was not stale, so you
have the wrong (random) order.

ZooKeeper can expire a session before the timeout. I don't know why it does
this in this case, but I don't consider it a ZK bug: if ZK knows that a
node is dead, it's its role to expire the session. There is something more
fishy: we started the recovery while the datanode was still responding to
heartbeats. I don't know why. Maybe the OS was able to kill -15 the RS
before vanishing.

Anyway, we then have an exception when we try to connect, because the RS
does not have a TCP connection to this datanode. And this is retried many
times.
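
As a side note, the 45 retries with a 20 second connect timeout mentioned
earlier in the thread look like the Hadoop IPC client defaults (the block
length of the open WAL is fetched from the datanode over IPC). If that's
confirmed, the knobs would be the two below - the values are just an example
and the keys should be double checked against CDH 4.2:

<property>
  <name>ipc.client.connect.timeout</name>
  <value>3000</value>   <!-- illustrative; default 20000 ms -->
</property>
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>3</value>      <!-- illustrative; default 45 -->
</property>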

You would not have this with trunk, because HBASE-6435 reorders the blocks
inside the client, using information not available to the NN, and excludes
the datanode of the region server under recovery.
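
Conceptually the reordering is something like the sketch below. It's a
simplified illustration, not the actual HBASE-6435 code (which hooks into the
client's namenode calls); deadRsHost would be the host of the region server
being recovered:

import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.protocol.LocatedBlocks;

public class ReorderSketch {
  // Simplified sketch of the HBASE-6435 idea - not the actual patch.
  // Push the replica hosted on the dead region server's datanode to the end
  // of each location list, so recovery reads from the healthy replicas first.
  // This uses knowledge the NN does not have: which server we are recovering.
  static void deprioritize(LocatedBlocks blocks, String deadRsHost) {
    for (LocatedBlock lb : blocks.getLocatedBlocks()) {
      DatanodeInfo[] locs = lb.getLocations();
      for (int i = 0; i < locs.length - 1; i++) {
        if (deadRsHost.equals(locs[i].getHostName())) {
          DatanodeInfo suspect = locs[i];
          // Shift the remaining replicas left and put the suspect one last.
          System.arraycopy(locs, i + 1, locs, i, locs.length - i - 1);
          locs[locs.length - 1] = suspect;
        }
      }
    }
  }
}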

Some conclusions:
 - we should likely backport hbase-6435 to 0.94.
 - I will revive HDFS-3706 and HDFS-3705 (the non hacky way to get
hbase-6435).
 - There are some things that could be better in HDFS. I will see.
 - I'm worried by the SocketTimeoutException. We should get NoRouteToHost
at some point, and we don't. That's also why it takes ages. I think it's an
AWS thing, but it brings two issues: it's slow, and, in HBase, you don't know
whether the operation could have been executed or not, so it adds complexity
to some scenarios. If someone with enough network and AWS knowledge could
clarify this point, it would be great.

 Cheers,

 Nicolas









On Fri, Apr 19, 2013 at 10:10 PM, Varun Sharma <va...@pinterest.com> wrote:

> This is 0.94.3 hbase...
>
>
> On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Hi Ted,
> >
> > I had a long offline discussion with nicholas on this. Looks like the
> last
> > block which was still being written too, took an enormous time to
> recover.
> > Here's what happened.
> > a) Master split tasks and region servers process them
> > b) Region server tries to recover lease for each WAL log - most cases are
> > noop since they are already rolled over/finalized
> > c) The last file lease recovery takes some time since the crashing server
> > was writing to it and had a lease on it - but basically we have the
> lease 1
> > minute after the server was lost
> > d) Now we start the recovery for this but we end up hitting the stale
> data
> > node which is puzzling.
> >
> > It seems that we did not hit the stale datanode when we were trying to
> > recover the finalized WAL blocks with trivial lease recovery. However,
> for
> > the final block, we hit the stale datanode. Any clue why this might be
> > happening ?
> >
> > Varun
> >
> >
> > On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
> >>
> >> Here is the criteria for stale node checking to kick in (from
> >>
> >>
> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
> >> ):
> >>
> >> +   * Check if the datanode is in stale state. Here if
> >> +   * the namenode has not received heartbeat msg from a
> >> +   * datanode for more than staleInterval (default value is
> >> +   * {@link
> >> DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
> >> +   * the datanode will be treated as stale node.
> >>
> >>
> >> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <va...@pinterest.com>
> >> wrote:
> >>
> >> > Is there a place to upload these logs ?
> >> >
> >> >
> >> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <va...@pinterest.com>
> >> > wrote:
> >> >
> >> > > Hi Nicholas,
> >> > >
> >> > > Attached are the namenode, dn logs (of one of the healthy replicas
> of
> >> the
> >> > > WAL block) and the rs logs which got stuch doing the log split.
> Action
> >> > > begins at 2013-04-19 00:27*.
> >> > >
> >> > > Also, the rogue block is 5723958680970112840_174056. Its very
> >> interesting
> >> > > to trace this guy through the HDFS logs (dn and nn).
> >> > >
> >> > > Btw, do you know what the UNDER_RECOVERY stage is for, in HDFS ?
> Also
> >> > does
> >> > > the stale node stuff kick in for that state ?
> >> > >
> >> > > Thanks
> >> > > Varun
> >> > >
> >> > >
> >> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <nkeywal@gmail.com
> >> > >wrote:
> >> > >
> >> > >> Thanks for the detailed scenario and analysis. I'm going to have a
> >> look.
> >> > >> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
> >> > >> timeouts), could you please send them directly to me?
> >> > >>
> >> > >> Thanks,
> >> > >>
> >> > >> Nicolas
> >> > >>
> >> > >>
> >> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <
> varun@pinterest.com>
> >> > >> wrote:
> >> > >>
> >> > >> > Hi Nicholas,
> >> > >> >
> >> > >> > Here is the failure scenario, I have dug up the logs.
> >> > >> >
> >> > >> > A machine fails and stops accepting/transmitting traffic. The
> >> HMaster
> >> > >> > starts the distributed split for 13 tasks. There are 12 region
> >> > servers.
> >> > >> 12
> >> > >> > tasks succeed but the 13th one takes a looong time.
> >> > >> >
> >> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout is 20
> >> > >> seconds.
> >> > >> > Both patches are there.
> >> > >> >
> >> > >> > a) Machine fails around 27:30
> >> > >> > b) Master starts the split around 27:40 and submits the tasks.
> The
> >> one
> >> > >> task
> >> > >> > which fails seems to be the one which contains the WAL being
> >> currently
> >> > >> > written to:
> >> > >> >
> >> > >> > 2013-04-19 00:27:44,325 INFO
> >> > >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting
> >> hlog:
> >> > >> > hdfs://
> >> > >> >
> >> > >> >
> >> > >>
> >> >
> >>
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> >> > >> > ,
> >> > >> > length=0
> >> > >> >
> >> > >> > Basically this region server picks up the task but finds the
> >> length of
> >> > >> this
> >> > >> > file to be 0 and drops. This happens again
> >> > >> >
> >> > >> > c) Finally another region server picks up the task but it ends up
> >> > going
> >> > >> to
> >> > >> > the bad datanode which should not happen because of the stale
> node
> >> > >> timeout)
> >> > >> > Unfortunately it hits the 45 retries and a connect timeout of 20
> >> > seconds
> >> > >> > every time. This delays recovery significantly. Now I guess
> >> reducing #
> >> > >> of
> >> > >> > retries to 1 is one possibility.
> >> > >> > But then the namenode logs are very interesting.
> >> > >> >
> >> > >> > d) Namenode seems to be in cyclic lease recovery loop until the
> >> node
> >> > is
> >> > >> > marked dead. There is this one last block which exhibits this.
> >> > >> >
> >> > >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-*
> >> > >> > 5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
> >> > >> > primaryNodeIndex=1,
> >> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> >> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> >> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery
> >> started,
> >> > >> > primary=10.156.192.106:50010
> >> > >> > 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange:
> >> DIR*
> >> > >> > NameSystem.internalReleaseLease: File
> >> > >> >
> >> > >> >
> >> > >>
> >> >
> >>
> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> >> > >> > has not been closed. Lease recovery is in progress. RecoveryId =
> >> > 174413
> >> > >> for
> >> > >> > block
> blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
> >> > >> > primaryNodeIndex=1,
> >> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> >> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> >> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
> >> > >> >
> >> > >> > I see this over and over again in the logs until the datanode is
> >> > marked
> >> > >> > dead. It seems to be cycling through the replicas for this WAL
> >> block
> >> > and
> >> > >> > trying to add it to the recovery list. I looked at the code and
> it
> >> > says:
> >> > >> >
> >> > >> >       // Cannot close file right now, since the last block
> requires
> >> > >> > recovery.
> >> > >> >       // This may potentially cause infinite loop in lease
> recovery
> >> > >> >       // if there are no valid replicas on data-nodes.
> >> > >> >       NameNode.stateChangeLog.warn(
> >> > >> >                 "DIR* NameSystem.internalReleaseLease: " +
> >> > >> >                 "File " + src + " has not been closed." +
> >> > >> >                " Lease recovery is in progress. " +
> >> > >> >                 "RecoveryId = " + blockRecoveryId + " for block
> " +
> >> > >> > lastBlock);
> >> > >> >       break;
> >> > >> >
> >> > >> > Eventually for this block, we get
> >> > >> >
> >> > >> > 2013-04-19 00:41:20,736 INFO
> >> > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> >> > >> >
> >> > >> >
> >> > >>
> >> >
> >>
> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
> >> > >> > *5723958680970112840_174056*, newgenerationstamp=174413,
> >> > >> > newlength=119148648, newtargets=[10.156.192.106:50010,
> >> > >> 10.156.195.38:50010
> >> > >> > ],
> >> > >> > closeFile=true, deleteBlock=false)
> >> > >> > 2013-04-19 00:41:20,736 ERROR
> >> > >> > org.apache.hadoop.security.UserGroupInformation:
> >> > >> PriviledgedActionException
> >> > >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
> >> > >> >
> >> > >>
> >> >
> >>
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> >> > >> > not found
> >> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC
> >> Server
> >> > >> > handler 35 on 8020, call
> >> > >> >
> >> > >> >
> >> > >>
> >> >
> >>
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
> >> > >> > from 10.156.192.106:53271: error: java.io.IOException: Block
> >> > >> >
> >> > >>
> >> >
> >>
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> >> > >> > not found
> >> > >> >
> >> > >> > On the datanode side, i see a call for recover blocks - I see
> that
> >> a
> >> > >> write
> >> > >> > pipeline is there, which gets terminated with some socket
> >> timeouts...
> >> > >> >
> >> > >> > 00:28:11,471 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> > >> NameNode
> >> > >> > at
> ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020calls
> >> > >> >
> >> > >> >
> >> > >>
> >> >
> >>
> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
> >> > >> > targets=[10.156.194.94:50010, 10.156.192.106:50010,
> >> > 10.156.195.38:50010
> >> > >> ],
> >> > >> > newGenerationStamp=174413)
> >> > >> >
> >> > >> > Not sure but this looks like a case where data could be lost   ?
> >> > >> >
> >> > >> > Varun
> >> > >> >
> >> > >> >
> >> > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <
> >> nkeywal@gmail.com>
> >> > >> > wrote:
> >> > >> >
> >> > >> > > Hey Varun,
> >> > >> > >
> >> > >> > > Could you please share the logs and the configuration (hdfs /
> >> hbase
> >> > >> > > settings + cluster description). What's the failure scenario?
> >> > >> > > From an HDFS pov, HDFS 3703 does not change the dead node
> status.
> >> > But
> >> > >> > these
> >> > >> > > node will be given the lowest priority when reading.
> >> > >> > >
> >> > >> > >
> >> > >> > > Cheers,
> >> > >> > >
> >> > >> > > Nicolas
> >> > >> > >
> >> > >> > >
> >> > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <
> >> varun@pinterest.com>
> >> > >> > wrote:
> >> > >> > >
> >> > >> > > > Hi,
> >> > >> > > >
> >> > >> > > > We are facing problems with really slow HBase region server
> >> > >> recoveries
> >> > >> > ~
> >> > >> > > 20
> >> > >> > > > minuted. Version is hbase 0.94.3 compiled with
> >> hadoop.profile=2.0.
> >> > >> > > >
> >> > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912
> patched
> >> and
> >> > >> > stale
> >> > >> > > > node timeouts configured correctly. Time for dead node
> >> detection
> >> > is
> >> > >> > still
> >> > >> > > > 10 minutes.
> >> > >> > > >
> >> > >> > > > We see that our region server is trying to read an HLog is
> >> stuck
> >> > >> there
> >> > >> > > for
> >> > >> > > > a long time. Logs here:
> >> > >> > > >
> >> > >> > > > 2013-04-12 21:14:30,248 WARN
> org.apache.hadoop.hdfs.DFSClient:
> >> > >> Failed
> >> > >> > to
> >> > >> > > > connect to /10.156.194.251:50010 for file
> >> > >> > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> >> > >> > > > for block
> >> > >> > > >
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> >> > >> > > > 15000 millis timeout while waiting for channel to be ready
> for
> >> > read.
> >> > >> > ch :
> >> > >> > > > java.nio.channels.SocketChannel[connected local=/
> >> > >> 10.156.192.173:52818
> >> > >> > > > remote=/
> >> > >> > > > 10.156.194.251:50010]
> >> > >> > > >
> >> > >> > > > I would think that HDFS 3703 would make the server fail fast
> >> and
> >> > go
> >> > >> to
> >> > >> > > the
> >> > >> > > > third datanode. Currently, the recovery seems way too slow
> for
> >> > >> > production
> >> > >> > > > usage...
> >> > >> > > >
> >> > >> > > > Varun
> >> > >> > > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>

Re: Slow region server recoveries

Posted by Varun Sharma <va...@pinterest.com>.
This is 0.94.3 hbase...


On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma <va...@pinterest.com> wrote:

> Hi Ted,
>
> I had a long offline discussion with nicholas on this. Looks like the last
> block which was still being written too, took an enormous time to recover.
> Here's what happened.
> a) Master split tasks and region servers process them
> b) Region server tries to recover lease for each WAL log - most cases are
> noop since they are already rolled over/finalized
> c) The last file lease recovery takes some time since the crashing server
> was writing to it and had a lease on it - but basically we have the lease 1
> minute after the server was lost
> d) Now we start the recovery for this but we end up hitting the stale data
> node which is puzzling.
>
> It seems that we did not hit the stale datanode when we were trying to
> recover the finalized WAL blocks with trivial lease recovery. However, for
> the final block, we hit the stale datanode. Any clue why this might be
> happening ?
>
> Varun
>
>
> On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <yu...@gmail.com> wrote:
>
>> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
>>
>> Here is the criteria for stale node checking to kick in (from
>>
>> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
>> ):
>>
>> +   * Check if the datanode is in stale state. Here if
>> +   * the namenode has not received heartbeat msg from a
>> +   * datanode for more than staleInterval (default value is
>> +   * {@link
>> DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
>> +   * the datanode will be treated as stale node.
>>
>>
>> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>> > Is there a place to upload these logs ?
>> >
>> >
>> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <va...@pinterest.com>
>> > wrote:
>> >
>> > > Hi Nicholas,
>> > >
>> > > Attached are the namenode, dn logs (of one of the healthy replicas of
>> the
>> > > WAL block) and the rs logs which got stuch doing the log split. Action
>> > > begins at 2013-04-19 00:27*.
>> > >
>> > > Also, the rogue block is 5723958680970112840_174056. Its very
>> interesting
>> > > to trace this guy through the HDFS logs (dn and nn).
>> > >
>> > > Btw, do you know what the UNDER_RECOVERY stage is for, in HDFS ? Also
>> > does
>> > > the stale node stuff kick in for that state ?
>> > >
>> > > Thanks
>> > > Varun
>> > >
>> > >
>> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <nkeywal@gmail.com
>> > >wrote:
>> > >
>> > >> Thanks for the detailed scenario and analysis. I'm going to have a
>> look.
>> > >> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
>> > >> timeouts), could you please send them directly to me?
>> > >>
>> > >> Thanks,
>> > >>
>> > >> Nicolas
>> > >>
>> > >>
>> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <va...@pinterest.com>
>> > >> wrote:
>> > >>
>> > >> > Hi Nicholas,
>> > >> >
>> > >> > Here is the failure scenario, I have dug up the logs.
>> > >> >
>> > >> > A machine fails and stops accepting/transmitting traffic. The
>> HMaster
>> > >> > starts the distributed split for 13 tasks. There are 12 region
>> > servers.
>> > >> 12
>> > >> > tasks succeed but the 13th one takes a looong time.
>> > >> >
>> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout is 20
>> > >> seconds.
>> > >> > Both patches are there.
>> > >> >
>> > >> > a) Machine fails around 27:30
>> > >> > b) Master starts the split around 27:40 and submits the tasks. The
>> one
>> > >> task
>> > >> > which fails seems to be the one which contains the WAL being
>> currently
>> > >> > written to:
>> > >> >
>> > >> > 2013-04-19 00:27:44,325 INFO
>> > >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting
>> hlog:
>> > >> > hdfs://
>> > >> >
>> > >> >
>> > >>
>> >
>> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
>> > >> > ,
>> > >> > length=0
>> > >> >
>> > >> > Basically this region server picks up the task but finds the
>> length of
>> > >> this
>> > >> > file to be 0 and drops. This happens again
>> > >> >
>> > >> > c) Finally another region server picks up the task but it ends up
>> > going
>> > >> to
>> > >> > the bad datanode which should not happen because of the stale node
>> > >> timeout)
>> > >> > Unfortunately it hits the 45 retries and a connect timeout of 20
>> > seconds
>> > >> > every time. This delays recovery significantly. Now I guess
>> reducing #
>> > >> of
>> > >> > retries to 1 is one possibility.
>> > >> > But then the namenode logs are very interesting.
>> > >> >
>> > >> > d) Namenode seems to be in cyclic lease recovery loop until the
>> node
>> > is
>> > >> > marked dead. There is this one last block which exhibits this.
>> > >> >
>> > >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-*
>> > >> > 5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
>> > >> > primaryNodeIndex=1,
>> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery
>> started,
>> > >> > primary=10.156.192.106:50010
>> > >> > 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange:
>> DIR*
>> > >> > NameSystem.internalReleaseLease: File
>> > >> >
>> > >> >
>> > >>
>> >
>> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
>> > >> > has not been closed. Lease recovery is in progress. RecoveryId =
>> > 174413
>> > >> for
>> > >> > block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
>> > >> > primaryNodeIndex=1,
>> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
>> > >> >
>> > >> > I see this over and over again in the logs until the datanode is
>> > marked
>> > >> > dead. It seems to be cycling through the replicas for this WAL
>> block
>> > and
>> > >> > trying to add it to the recovery list. I looked at the code and it
>> > says:
>> > >> >
>> > >> >       // Cannot close file right now, since the last block requires
>> > >> > recovery.
>> > >> >       // This may potentially cause infinite loop in lease recovery
>> > >> >       // if there are no valid replicas on data-nodes.
>> > >> >       NameNode.stateChangeLog.warn(
>> > >> >                 "DIR* NameSystem.internalReleaseLease: " +
>> > >> >                 "File " + src + " has not been closed." +
>> > >> >                " Lease recovery is in progress. " +
>> > >> >                 "RecoveryId = " + blockRecoveryId + " for block " +
>> > >> > lastBlock);
>> > >> >       break;
>> > >> >
>> > >> > Eventually for this block, we get
>> > >> >
>> > >> > 2013-04-19 00:41:20,736 INFO
>> > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>> > >> >
>> > >> >
>> > >>
>> >
>> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
>> > >> > *5723958680970112840_174056*, newgenerationstamp=174413,
>> > >> > newlength=119148648, newtargets=[10.156.192.106:50010,
>> > >> 10.156.195.38:50010
>> > >> > ],
>> > >> > closeFile=true, deleteBlock=false)
>> > >> > 2013-04-19 00:41:20,736 ERROR
>> > >> > org.apache.hadoop.security.UserGroupInformation:
>> > >> PriviledgedActionException
>> > >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
>> > >> >
>> > >>
>> >
>> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>> > >> > not found
>> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC
>> Server
>> > >> > handler 35 on 8020, call
>> > >> >
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
>> > >> > from 10.156.192.106:53271: error: java.io.IOException: Block
>> > >> >
>> > >>
>> >
>> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>> > >> > not found
>> > >> >
>> > >> > On the datanode side, i see a call for recover blocks - I see that
>> a
>> > >> write
>> > >> > pipeline is there, which gets terminated with some socket
>> timeouts...
>> > >> >
>> > >> > 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>> > >> NameNode
>> > >> > at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020calls
>> > >> >
>> > >> >
>> > >>
>> >
>> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
>> > >> > targets=[10.156.194.94:50010, 10.156.192.106:50010,
>> > 10.156.195.38:50010
>> > >> ],
>> > >> > newGenerationStamp=174413)
>> > >> >
>> > >> > Not sure but this looks like a case where data could be lost   ?
>> > >> >
>> > >> > Varun
>> > >> >
>> > >> >
>> > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <
>> nkeywal@gmail.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Hey Varun,
>> > >> > >
>> > >> > > Could you please share the logs and the configuration (hdfs /
>> hbase
>> > >> > > settings + cluster description). What's the failure scenario?
>> > >> > > From an HDFS pov, HDFS 3703 does not change the dead node status.
>> > But
>> > >> > these
>> > >> > > node will be given the lowest priority when reading.
>> > >> > >
>> > >> > >
>> > >> > > Cheers,
>> > >> > >
>> > >> > > Nicolas
>> > >> > >
>> > >> > >
>> > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <
>> varun@pinterest.com>
>> > >> > wrote:
>> > >> > >
>> > >> > > > Hi,
>> > >> > > >
>> > >> > > > We are facing problems with really slow HBase region server
>> > >> recoveries
>> > >> > ~
>> > >> > > 20
>> > >> > > > minuted. Version is hbase 0.94.3 compiled with
>> hadoop.profile=2.0.
>> > >> > > >
>> > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched
>> and
>> > >> > stale
>> > >> > > > node timeouts configured correctly. Time for dead node
>> detection
>> > is
>> > >> > still
>> > >> > > > 10 minutes.
>> > >> > > >
>> > >> > > > We see that our region server is trying to read an HLog is
>> stuck
>> > >> there
>> > >> > > for
>> > >> > > > a long time. Logs here:
>> > >> > > >
>> > >> > > > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient:
>> > >> Failed
>> > >> > to
>> > >> > > > connect to /10.156.194.251:50010 for file
>> > >> > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
>> > >> > > > for block
>> > >> > > >
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> >
>> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
>> > >> > > > 15000 millis timeout while waiting for channel to be ready for
>> > read.
>> > >> > ch :
>> > >> > > > java.nio.channels.SocketChannel[connected local=/
>> > >> 10.156.192.173:52818
>> > >> > > > remote=/
>> > >> > > > 10.156.194.251:50010]
>> > >> > > >
>> > >> > > > I would think that HDFS 3703 would make the server fail fast
>> and
>> > go
>> > >> to
>> > >> > > the
>> > >> > > > third datanode. Currently, the recovery seems way too slow for
>> > >> > production
>> > >> > > > usage...
>> > >> > > >
>> > >> > > > Varun
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Re: Slow region server recoveries

Posted by Varun Sharma <va...@pinterest.com>.
Hi Ted,

I had a long offline discussion with Nicholas on this. It looks like the last
block, which was still being written to, took an enormous time to recover.
Here's what happened:
a) The master splits the tasks and the region servers process them
b) Each region server tries to recover the lease for each WAL log - in most
cases this is a noop since the logs are already rolled over/finalized
c) The last file's lease recovery takes some time since the crashing server
was writing to it and held a lease on it - but basically we have the lease 1
minute after the server was lost (a rough sketch of this step is below)
d) Now we start the recovery for this block but we end up hitting the stale
datanode, which is puzzling.
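
For reference, a minimal sketch of the recover-lease step from (b)/(c),
illustrative only - this is not the actual HLogSplitter/FSHDFSUtils code and
the one-second poll interval is made up:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RecoverWalLease {
  // recoverLease() returns true once the file is closed. For an already
  // finalized WAL this is essentially a noop; for the WAL that was still
  // being written when the RS died, the NN has to run block recovery
  // first - that is the slow path discussed in this thread.
  static void recoverLease(DistributedFileSystem dfs, Path wal) throws Exception {
    while (!dfs.recoverLease(wal)) {
      Thread.sleep(1000L); // hypothetical poll interval
    }
  }
}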

It seems that we did not hit the stale datanode when we were trying to
recover the finalized WAL blocks with trivial lease recovery. However, for
the final block, we hit the stale datanode. Any clue why this might be
happening ?

Varun


On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <yu...@gmail.com> wrote:

> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
>
> Here is the criteria for stale node checking to kick in (from
>
> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
> ):
>
> +   * Check if the datanode is in stale state. Here if
> +   * the namenode has not received heartbeat msg from a
> +   * datanode for more than staleInterval (default value is
> +   * {@link
> DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
> +   * the datanode will be treated as stale node.
>
>
> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <va...@pinterest.com>
> wrote:
>
> > Is there a place to upload these logs ?
> >
> >
> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <va...@pinterest.com>
> > wrote:
> >
> > > Hi Nicholas,
> > >
> > > Attached are the namenode, dn logs (of one of the healthy replicas of
> the
> > > WAL block) and the rs logs which got stuch doing the log split. Action
> > > begins at 2013-04-19 00:27*.
> > >
> > > Also, the rogue block is 5723958680970112840_174056. Its very
> interesting
> > > to trace this guy through the HDFS logs (dn and nn).
> > >
> > > Btw, do you know what the UNDER_RECOVERY stage is for, in HDFS ? Also
> > does
> > > the stale node stuff kick in for that state ?
> > >
> > > Thanks
> > > Varun
> > >
> > >
> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <nkeywal@gmail.com
> > >wrote:
> > >
> > >> Thanks for the detailed scenario and analysis. I'm going to have a
> look.
> > >> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
> > >> timeouts), could you please send them directly to me?
> > >>
> > >> Thanks,
> > >>
> > >> Nicolas
> > >>
> > >>
> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <va...@pinterest.com>
> > >> wrote:
> > >>
> > >> > Hi Nicholas,
> > >> >
> > >> > Here is the failure scenario, I have dug up the logs.
> > >> >
> > >> > A machine fails and stops accepting/transmitting traffic. The
> HMaster
> > >> > starts the distributed split for 13 tasks. There are 12 region
> > servers.
> > >> 12
> > >> > tasks succeed but the 13th one takes a looong time.
> > >> >
> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout is 20
> > >> seconds.
> > >> > Both patches are there.
> > >> >
> > >> > a) Machine fails around 27:30
> > >> > b) Master starts the split around 27:40 and submits the tasks. The
> one
> > >> task
> > >> > which fails seems to be the one which contains the WAL being
> currently
> > >> > written to:
> > >> >
> > >> > 2013-04-19 00:27:44,325 INFO
> > >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting
> hlog:
> > >> > hdfs://
> > >> >
> > >> >
> > >>
> >
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> > >> > ,
> > >> > length=0
> > >> >
> > >> > Basically this region server picks up the task but finds the length
> of
> > >> this
> > >> > file to be 0 and drops. This happens again
> > >> >
> > >> > c) Finally another region server picks up the task but it ends up
> > going
> > >> to
> > >> > the bad datanode which should not happen because of the stale node
> > >> timeout)
> > >> > Unfortunately it hits the 45 retries and a connect timeout of 20
> > seconds
> > >> > every time. This delays recovery significantly. Now I guess
> reducing #
> > >> of
> > >> > retries to 1 is one possibility.
> > >> > But then the namenode logs are very interesting.
> > >> >
> > >> > d) Namenode seems to be in cyclic lease recovery loop until the node
> > is
> > >> > marked dead. There is this one last block which exhibits this.
> > >> >
> > >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-*
> > >> > 5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
> > >> > primaryNodeIndex=1,
> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery
> started,
> > >> > primary=10.156.192.106:50010
> > >> > 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange:
> DIR*
> > >> > NameSystem.internalReleaseLease: File
> > >> >
> > >> >
> > >>
> >
> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> > >> > has not been closed. Lease recovery is in progress. RecoveryId =
> > 174413
> > >> for
> > >> > block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
> > >> > primaryNodeIndex=1,
> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
> > >> >
> > >> > I see this over and over again in the logs until the datanode is
> > marked
> > >> > dead. It seems to be cycling through the replicas for this WAL block
> > and
> > >> > trying to add it to the recovery list. I looked at the code and it
> > says:
> > >> >
> > >> >       // Cannot close file right now, since the last block requires
> > >> > recovery.
> > >> >       // This may potentially cause infinite loop in lease recovery
> > >> >       // if there are no valid replicas on data-nodes.
> > >> >       NameNode.stateChangeLog.warn(
> > >> >                 "DIR* NameSystem.internalReleaseLease: " +
> > >> >                 "File " + src + " has not been closed." +
> > >> >                " Lease recovery is in progress. " +
> > >> >                 "RecoveryId = " + blockRecoveryId + " for block " +
> > >> > lastBlock);
> > >> >       break;
> > >> >
> > >> > Eventually for this block, we get
> > >> >
> > >> > 2013-04-19 00:41:20,736 INFO
> > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > >> >
> > >> >
> > >>
> >
> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
> > >> > *5723958680970112840_174056*, newgenerationstamp=174413,
> > >> > newlength=119148648, newtargets=[10.156.192.106:50010,
> > >> 10.156.195.38:50010
> > >> > ],
> > >> > closeFile=true, deleteBlock=false)
> > >> > 2013-04-19 00:41:20,736 ERROR
> > >> > org.apache.hadoop.security.UserGroupInformation:
> > >> PriviledgedActionException
> > >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
> > >> >
> > >>
> >
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> > >> > not found
> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC
> Server
> > >> > handler 35 on 8020, call
> > >> >
> > >> >
> > >>
> >
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
> > >> > from 10.156.192.106:53271: error: java.io.IOException: Block
> > >> >
> > >>
> >
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> > >> > not found
> > >> >
> > >> > On the datanode side, i see a call for recover blocks - I see that a
> > >> write
> > >> > pipeline is there, which gets terminated with some socket
> timeouts...
> > >> >
> > >> > 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >> NameNode
> > >> > at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020calls
> > >> >
> > >> >
> > >>
> >
> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
> > >> > targets=[10.156.194.94:50010, 10.156.192.106:50010,
> > 10.156.195.38:50010
> > >> ],
> > >> > newGenerationStamp=174413)
> > >> >
> > >> > Not sure but this looks like a case where data could be lost   ?
> > >> >
> > >> > Varun
> > >> >
> > >> >
> > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <
> nkeywal@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hey Varun,
> > >> > >
> > >> > > Could you please share the logs and the configuration (hdfs /
> hbase
> > >> > > settings + cluster description). What's the failure scenario?
> > >> > > From an HDFS pov, HDFS 3703 does not change the dead node status.
> > But
> > >> > these
> > >> > > node will be given the lowest priority when reading.
> > >> > >
> > >> > >
> > >> > > Cheers,
> > >> > >
> > >> > > Nicolas
> > >> > >
> > >> > >
> > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <
> varun@pinterest.com>
> > >> > wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > >
> > >> > > > We are facing problems with really slow HBase region server
> > >> recoveries
> > >> > ~
> > >> > > 20
> > >> > > > minuted. Version is hbase 0.94.3 compiled with
> hadoop.profile=2.0.
> > >> > > >
> > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched
> and
> > >> > stale
> > >> > > > node timeouts configured correctly. Time for dead node detection
> > is
> > >> > still
> > >> > > > 10 minutes.
> > >> > > >
> > >> > > > We see that our region server is trying to read an HLog is stuck
> > >> there
> > >> > > for
> > >> > > > a long time. Logs here:
> > >> > > >
> > >> > > > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient:
> > >> Failed
> > >> > to
> > >> > > > connect to /10.156.194.251:50010 for file
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> > >> > > > for block
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> > >> > > > 15000 millis timeout while waiting for channel to be ready for
> > read.
> > >> > ch :
> > >> > > > java.nio.channels.SocketChannel[connected local=/
> > >> 10.156.192.173:52818
> > >> > > > remote=/
> > >> > > > 10.156.194.251:50010]
> > >> > > >
> > >> > > > I would think that HDFS 3703 would make the server fail fast and
> > go
> > >> to
> > >> > > the
> > >> > > > third datanode. Currently, the recovery seems way too slow for
> > >> > production
> > >> > > > usage...
> > >> > > >
> > >> > > > Varun
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Slow region server recoveries

Posted by Varun Sharma <va...@pinterest.com>.
Here is the snippet:
2013-04-19 00:27:38,337 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Recover RBW replica
BP-696828882-10.168.7.226-1364886167971:blk_40107897639761277_174072
2013-04-19 00:27:38,337 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
Recovering ReplicaBeingWritten, blk_40107897639761277_174072, RBW
2013-04-19 00:28:11,471 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at
ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020 calls
recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
targets=[*10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010*],
newGenerationStamp=174413)
2013-04-19 00:41:20,716 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: blk_-5723958680970112840_174056, recoveryId=174413,
replica=ReplicaBeingWritten, blk_-5723958680970112840_174056, RBW
2013-04-19 00:41:20,716 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: changing replica state for
blk_-5723958680970112840_174056 from RBW to RUR
2013-04-19 00:41:20,733 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
updateReplica:
BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
recoveryId=174413, length=119148648, replica=ReplicaUnderRecovery,
blk_-5723958680970112840_174056, RUR
2013-04-19 00:41:20,745 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: recoverBlocks FAILED:
RecoveringBlock{BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056;
getBlockSize()=0; corrupt=false; offset=-1; locs=[10.156.194.94:50010,
10.156.192.106:50010, 10.156.195.38:50010]}
2013-04-19 00:41:23,733 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: blk_-5723958680970112840_174056, recoveryId=174418,
replica=FinalizedReplica, blk_-5723958680970112840_174413, FINALIZED
2013-04-19 00:41:23,733 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: changing replica state for
blk_-5723958680970112840_174056 from FINALIZED to RUR
2013-04-19 00:41:23,736 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
updateReplica:
BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174413,
recoveryId=174418, length=119148648, replica=ReplicaUnderRecovery,
blk_-5723958680970112840_174413, RUR

Block recovery takes a long time and eventually seems to fail during the
recoverBlock() call - all three datanodes (including the failed/stale one)
are still listed as targets.
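
As a rough sanity check on the timeline (my arithmetic, not something taken
from the logs): with the 45 retries and the ~20 second connect timeout per
attempt mentioned earlier in this thread, the worst case on the reading
region server alone is on the order of 45 x 20 s = 900 s, i.e. roughly 15
minutes - which is consistent with recoveries in the ~20 minute range rather
than seconds.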



On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <yu...@gmail.com> wrote:

> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
>
> Here is the criteria for stale node checking to kick in (from
>
> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
> ):
>
> +   * Check if the datanode is in stale state. Here if
> +   * the namenode has not received heartbeat msg from a
> +   * datanode for more than staleInterval (default value is
> +   * {@link
> DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
> +   * the datanode will be treated as stale node.
>
>
> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <va...@pinterest.com>
> wrote:
>
> > Is there a place to upload these logs ?
> >
> >
> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <va...@pinterest.com>
> > wrote:
> >
> > > Hi Nicholas,
> > >
> > > Attached are the namenode, dn logs (of one of the healthy replicas of
> the
> > > WAL block) and the rs logs which got stuch doing the log split. Action
> > > begins at 2013-04-19 00:27*.
> > >
> > > Also, the rogue block is 5723958680970112840_174056. Its very
> interesting
> > > to trace this guy through the HDFS logs (dn and nn).
> > >
> > > Btw, do you know what the UNDER_RECOVERY stage is for, in HDFS ? Also
> > does
> > > the stale node stuff kick in for that state ?
> > >
> > > Thanks
> > > Varun
> > >
> > >
> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <nkeywal@gmail.com
> > >wrote:
> > >
> > >> Thanks for the detailed scenario and analysis. I'm going to have a
> look.
> > >> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
> > >> timeouts), could you please send them directly to me?
> > >>
> > >> Thanks,
> > >>
> > >> Nicolas
> > >>
> > >>
> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <va...@pinterest.com>
> > >> wrote:
> > >>
> > >> > Hi Nicholas,
> > >> >
> > >> > Here is the failure scenario, I have dug up the logs.
> > >> >
> > >> > A machine fails and stops accepting/transmitting traffic. The
> HMaster
> > >> > starts the distributed split for 13 tasks. There are 12 region
> > servers.
> > >> 12
> > >> > tasks succeed but the 13th one takes a looong time.
> > >> >
> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout is 20
> > >> seconds.
> > >> > Both patches are there.
> > >> >
> > >> > a) Machine fails around 27:30
> > >> > b) Master starts the split around 27:40 and submits the tasks. The
> one
> > >> task
> > >> > which fails seems to be the one which contains the WAL being
> currently
> > >> > written to:
> > >> >
> > >> > 2013-04-19 00:27:44,325 INFO
> > >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting
> hlog:
> > >> > hdfs://
> > >> >
> > >> >
> > >>
> >
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> > >> > ,
> > >> > length=0
> > >> >
> > >> > Basically this region server picks up the task but finds the length
> of
> > >> this
> > >> > file to be 0 and drops. This happens again
> > >> >
> > >> > c) Finally another region server picks up the task but it ends up
> > going
> > >> to
> > >> > the bad datanode which should not happen because of the stale node
> > >> timeout)
> > >> > Unfortunately it hits the 45 retries and a connect timeout of 20
> > seconds
> > >> > every time. This delays recovery significantly. Now I guess
> reducing #
> > >> of
> > >> > retries to 1 is one possibility.
> > >> > But then the namenode logs are very interesting.
> > >> >
> > >> > d) Namenode seems to be in cyclic lease recovery loop until the node
> > is
> > >> > marked dead. There is this one last block which exhibits this.
> > >> >
> > >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-*
> > >> > 5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
> > >> > primaryNodeIndex=1,
> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery
> started,
> > >> > primary=10.156.192.106:50010
> > >> > 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange:
> DIR*
> > >> > NameSystem.internalReleaseLease: File
> > >> >
> > >> >
> > >>
> >
> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> > >> > has not been closed. Lease recovery is in progress. RecoveryId =
> > 174413
> > >> for
> > >> > block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
> > >> > primaryNodeIndex=1,
> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
> > >> >
> > >> > I see this over and over again in the logs until the datanode is
> > marked
> > >> > dead. It seems to be cycling through the replicas for this WAL block
> > and
> > >> > trying to add it to the recovery list. I looked at the code and it
> > says:
> > >> >
> > >> >       // Cannot close file right now, since the last block requires
> > >> > recovery.
> > >> >       // This may potentially cause infinite loop in lease recovery
> > >> >       // if there are no valid replicas on data-nodes.
> > >> >       NameNode.stateChangeLog.warn(
> > >> >                 "DIR* NameSystem.internalReleaseLease: " +
> > >> >                 "File " + src + " has not been closed." +
> > >> >                " Lease recovery is in progress. " +
> > >> >                 "RecoveryId = " + blockRecoveryId + " for block " +
> > >> > lastBlock);
> > >> >       break;
> > >> >
> > >> > Eventually for this block, we get
> > >> >
> > >> > 2013-04-19 00:41:20,736 INFO
> > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > >> >
> > >> >
> > >>
> >
> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
> > >> > *5723958680970112840_174056*, newgenerationstamp=174413,
> > >> > newlength=119148648, newtargets=[10.156.192.106:50010,
> > >> 10.156.195.38:50010
> > >> > ],
> > >> > closeFile=true, deleteBlock=false)
> > >> > 2013-04-19 00:41:20,736 ERROR
> > >> > org.apache.hadoop.security.UserGroupInformation:
> > >> PriviledgedActionException
> > >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
> > >> >
> > >>
> >
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> > >> > not found
> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC
> Server
> > >> > handler 35 on 8020, call
> > >> >
> > >> >
> > >>
> >
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
> > >> > from 10.156.192.106:53271: error: java.io.IOException: Block
> > >> >
> > >>
> >
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> > >> > not found
> > >> >
> > >> > On the datanode side, i see a call for recover blocks - I see that a
> > >> write
> > >> > pipeline is there, which gets terminated with some socket
> timeouts...
> > >> >
> > >> > 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >> NameNode
> > >> > at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020calls
> > >> >
> > >> >
> > >>
> >
> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
> > >> > targets=[10.156.194.94:50010, 10.156.192.106:50010,
> > 10.156.195.38:50010
> > >> ],
> > >> > newGenerationStamp=174413)
> > >> >
> > >> > Not sure but this looks like a case where data could be lost   ?
> > >> >
> > >> > Varun
> > >> >
> > >> >
> > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <
> nkeywal@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hey Varun,
> > >> > >
> > >> > > Could you please share the logs and the configuration (hdfs /
> hbase
> > >> > > settings + cluster description). What's the failure scenario?
> > >> > > From an HDFS pov, HDFS 3703 does not change the dead node status.
> > But
> > >> > these
> > >> > > node will be given the lowest priority when reading.
> > >> > >
> > >> > >
> > >> > > Cheers,
> > >> > >
> > >> > > Nicolas
> > >> > >
> > >> > >
> > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <
> varun@pinterest.com>
> > >> > wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > >
> > >> > > > We are facing problems with really slow HBase region server
> > >> recoveries
> > >> > ~
> > >> > > 20
> > >> > > > minuted. Version is hbase 0.94.3 compiled with
> hadoop.profile=2.0.
> > >> > > >
> > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched
> and
> > >> > stale
> > >> > > > node timeouts configured correctly. Time for dead node detection
> > is
> > >> > still
> > >> > > > 10 minutes.
> > >> > > >
> > >> > > > We see that our region server is trying to read an HLog is stuck
> > >> there
> > >> > > for
> > >> > > > a long time. Logs here:
> > >> > > >
> > >> > > > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient:
> > >> Failed
> > >> > to
> > >> > > > connect to /10.156.194.251:50010 for file
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> > >> > > > for block
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> > >> > > > 15000 millis timeout while waiting for channel to be ready for
> > read.
> > >> > ch :
> > >> > > > java.nio.channels.SocketChannel[connected local=/
> > >> 10.156.192.173:52818
> > >> > > > remote=/
> > >> > > > 10.156.194.251:50010]
> > >> > > >
> > >> > > > I would think that HDFS 3703 would make the server fail fast and
> > go
> > >> to
> > >> > > the
> > >> > > > third datanode. Currently, the recovery seems way too slow for
> > >> > production
> > >> > > > usage...
> > >> > > >
> > >> > > > Varun
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Slow region server recoveries

Posted by Ted Yu <yu...@gmail.com>.
Can you show the snippet from the DN log which mentioned UNDER_RECOVERY?

Here are the criteria for the stale node check to kick in (from
https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
):

+   * Check if the datanode is in stale state. Here if
+   * the namenode has not received heartbeat msg from a
+   * datanode for more than staleInterval (default value is
+   * {@link DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
+   * the datanode will be treated as stale node.
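
In other words, the check boils down to comparing the age of the last
heartbeat against staleInterval. A minimal illustrative sketch (the names
here are made up, this is not the real DatanodeManager code):

// Illustrative sketch of the stale-node test quoted above; not real HDFS code.
public class StaleNodeCheck {
  /** True if the NN has not heard a heartbeat from the DN within staleInterval. */
  static boolean isStale(long lastHeartbeatMs, long nowMs, long staleIntervalMs) {
    return (nowMs - lastHeartbeatMs) > staleIntervalMs;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    // With the 20 second stale interval used in this thread, a DN whose last
    // heartbeat is 25 seconds old is stale and should only be read as a last resort.
    System.out.println(isStale(now - 25000L, now, 20000L)); // prints true
  }
}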


On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <va...@pinterest.com> wrote:

> Is there a place to upload these logs ?
>
>
> On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <va...@pinterest.com>
> wrote:
>
> > Hi Nicholas,
> >
> > Attached are the namenode, dn logs (of one of the healthy replicas of the
> > WAL block) and the rs logs which got stuch doing the log split. Action
> > begins at 2013-04-19 00:27*.
> >
> > Also, the rogue block is 5723958680970112840_174056. Its very interesting
> > to trace this guy through the HDFS logs (dn and nn).
> >
> > Btw, do you know what the UNDER_RECOVERY stage is for, in HDFS ? Also
> does
> > the stale node stuff kick in for that state ?
> >
> > Thanks
> > Varun
> >
> >
> > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <nkeywal@gmail.com
> >wrote:
> >
> >> Thanks for the detailed scenario and analysis. I'm going to have a look.
> >> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
> >> timeouts), could you please send them directly to me?
> >>
> >> Thanks,
> >>
> >> Nicolas
> >>
> >>
> >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <va...@pinterest.com>
> >> wrote:
> >>
> >> > Hi Nicholas,
> >> >
> >> > Here is the failure scenario, I have dug up the logs.
> >> >
> >> > A machine fails and stops accepting/transmitting traffic. The HMaster
> >> > starts the distributed split for 13 tasks. There are 12 region
> servers.
> >> 12
> >> > tasks succeed but the 13th one takes a looong time.
> >> >
> >> > Zookeeper timeout is set to 30 seconds. Stale node timeout is 20
> >> seconds.
> >> > Both patches are there.
> >> >
> >> > a) Machine fails around 27:30
> >> > b) Master starts the split around 27:40 and submits the tasks. The one
> >> task
> >> > which fails seems to be the one which contains the WAL being currently
> >> > written to:
> >> >
> >> > 2013-04-19 00:27:44,325 INFO
> >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog:
> >> > hdfs://
> >> >
> >> >
> >>
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> >> > ,
> >> > length=0
> >> >
> >> > Basically this region server picks up the task but finds the length of
> >> this
> >> > file to be 0 and drops. This happens again
> >> >
> >> > c) Finally another region server picks up the task but it ends up
> going
> >> to
> >> > the bad datanode which should not happen because of the stale node
> >> timeout)
> >> > Unfortunately it hits the 45 retries and a connect timeout of 20
> seconds
> >> > every time. This delays recovery significantly. Now I guess reducing #
> >> of
> >> > retries to 1 is one possibility.
> >> > But then the namenode logs are very interesting.
> >> >
> >> > d) Namenode seems to be in cyclic lease recovery loop until the node
> is
> >> > marked dead. There is this one last block which exhibits this.
> >> >
> >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-*
> >> > 5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
> >> > primaryNodeIndex=1,
> >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery started,
> >> > primary=10.156.192.106:50010
> >> > 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange: DIR*
> >> > NameSystem.internalReleaseLease: File
> >> >
> >> >
> >>
> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> >> > has not been closed. Lease recovery is in progress. RecoveryId =
> 174413
> >> for
> >> > block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
> >> > primaryNodeIndex=1,
> >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
> >> >
> >> > I see this over and over again in the logs until the datanode is
> marked
> >> > dead. It seems to be cycling through the replicas for this WAL block
> and
> >> > trying to add it to the recovery list. I looked at the code and it
> says:
> >> >
> >> >       // Cannot close file right now, since the last block requires
> >> > recovery.
> >> >       // This may potentially cause infinite loop in lease recovery
> >> >       // if there are no valid replicas on data-nodes.
> >> >       NameNode.stateChangeLog.warn(
> >> >                 "DIR* NameSystem.internalReleaseLease: " +
> >> >                 "File " + src + " has not been closed." +
> >> >                " Lease recovery is in progress. " +
> >> >                 "RecoveryId = " + blockRecoveryId + " for block " +
> >> > lastBlock);
> >> >       break;
> >> >
> >> > Eventually for this block, we get
> >> >
> >> > 2013-04-19 00:41:20,736 INFO
> >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> >> >
> >> >
> >>
> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
> >> > *5723958680970112840_174056*, newgenerationstamp=174413,
> >> > newlength=119148648, newtargets=[10.156.192.106:50010,
> >> 10.156.195.38:50010
> >> > ],
> >> > closeFile=true, deleteBlock=false)
> >> > 2013-04-19 00:41:20,736 ERROR
> >> > org.apache.hadoop.security.UserGroupInformation:
> >> PriviledgedActionException
> >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
> >> >
> >>
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> >> > not found
> >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC Server
> >> > handler 35 on 8020, call
> >> >
> >> >
> >>
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
> >> > from 10.156.192.106:53271: error: java.io.IOException: Block
> >> >
> >>
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> >> > not found
> >> >
> >> > On the datanode side, i see a call for recover blocks - I see that a
> >> write
> >> > pipeline is there, which gets terminated with some socket timeouts...
> >> >
> >> > 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> NameNode
> >> > at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020 calls
> >> >
> >> >
> >>
> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
> >> > targets=[10.156.194.94:50010, 10.156.192.106:50010,
> 10.156.195.38:50010
> >> ],
> >> > newGenerationStamp=174413)
> >> >
> >> > Not sure but this looks like a case where data could be lost   ?
> >> >
> >> > Varun
> >> >
> >> >
> >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <nk...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hey Varun,
> >> > >
> >> > > Could you please share the logs and the configuration (hdfs / hbase
> >> > > settings + cluster description). What's the failure scenario?
> >> > > From an HDFS pov, HDFS 3703 does not change the dead node status.
> But
> >> > these
> >> > > node will be given the lowest priority when reading.
> >> > >
> >> > >
> >> > > Cheers,
> >> > >
> >> > > Nicolas
> >> > >
> >> > >
> >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <va...@pinterest.com>
> >> > wrote:
> >> > >
> >> > > > Hi,
> >> > > >
> >> > > > We are facing problems with really slow HBase region server
> >> recoveries
> >> > ~
> >> > > 20
> >> > > > minuted. Version is hbase 0.94.3 compiled with hadoop.profile=2.0.
> >> > > >
> >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched and
> >> > stale
> >> > > > node timeouts configured correctly. Time for dead node detection
> is
> >> > still
> >> > > > 10 minutes.
> >> > > >
> >> > > > We see that our region server is trying to read an HLog is stuck
> >> there
> >> > > for
> >> > > > a long time. Logs here:
> >> > > >
> >> > > > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient:
> >> Failed
> >> > to
> >> > > > connect to /10.156.194.251:50010 for file
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> >> > > > for block
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> >> > > > 15000 millis timeout while waiting for channel to be ready for
> read.
> >> > ch :
> >> > > > java.nio.channels.SocketChannel[connected local=/
> >> 10.156.192.173:52818
> >> > > > remote=/
> >> > > > 10.156.194.251:50010]
> >> > > >
> >> > > > I would think that HDFS 3703 would make the server fail fast and
> go
> >> to
> >> > > the
> >> > > > third datanode. Currently, the recovery seems way too slow for
> >> > production
> >> > > > usage...
> >> > > >
> >> > > > Varun
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: Slow region server recoveries

Posted by Varun Sharma <va...@pinterest.com>.
Is there a place to upload these logs ?


On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <va...@pinterest.com> wrote:

> Hi Nicholas,
>
> Attached are the namenode, dn logs (of one of the healthy replicas of the
> WAL block) and the rs logs which got stuch doing the log split. Action
> begins at 2013-04-19 00:27*.
>
> Also, the rogue block is 5723958680970112840_174056. Its very interesting
> to trace this guy through the HDFS logs (dn and nn).
>
> Btw, do you know what the UNDER_RECOVERY stage is for, in HDFS ? Also does
> the stale node stuff kick in for that state ?
>
> Thanks
> Varun
>
>
> On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <nk...@gmail.com>wrote:
>
>> Thanks for the detailed scenario and analysis. I'm going to have a look.
>> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
>> timeouts), could you please send them directly to me?
>>
>> Thanks,
>>
>> Nicolas
>>
>>
>> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>> > Hi Nicholas,
>> >
>> > Here is the failure scenario, I have dug up the logs.
>> >
>> > A machine fails and stops accepting/transmitting traffic. The HMaster
>> > starts the distributed split for 13 tasks. There are 12 region servers.
>> 12
>> > tasks succeed but the 13th one takes a looong time.
>> >
>> > Zookeeper timeout is set to 30 seconds. Stale node timeout is 20
>> seconds.
>> > Both patches are there.
>> >
>> > a) Machine fails around 27:30
>> > b) Master starts the split around 27:40 and submits the tasks. The one
>> task
>> > which fails seems to be the one which contains the WAL being currently
>> > written to:
>> >
>> > 2013-04-19 00:27:44,325 INFO
>> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog:
>> > hdfs://
>> >
>> >
>> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
>> > ,
>> > length=0
>> >
>> > Basically this region server picks up the task but finds the length of
>> this
>> > file to be 0 and drops. This happens again
>> >
>> > c) Finally another region server picks up the task but it ends up going
>> to
>> > the bad datanode which should not happen because of the stale node
>> timeout)
>> > Unfortunately it hits the 45 retries and a connect timeout of 20 seconds
>> > every time. This delays recovery significantly. Now I guess reducing #
>> of
>> > retries to 1 is one possibility.
>> > But then the namenode logs are very interesting.
>> >
>> > d) Namenode seems to be in cyclic lease recovery loop until the node is
>> > marked dead. There is this one last block which exhibits this.
>> >
>> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-*
>> > 5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
>> > primaryNodeIndex=1,
>> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery started,
>> > primary=10.156.192.106:50010
>> > 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange: DIR*
>> > NameSystem.internalReleaseLease: File
>> >
>> >
>> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
>> > has not been closed. Lease recovery is in progress. RecoveryId = 174413
>> for
>> > block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
>> > primaryNodeIndex=1,
>> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
>> >
>> > I see this over and over again in the logs until the datanode is marked
>> > dead. It seems to be cycling through the replicas for this WAL block and
>> > trying to add it to the recovery list. I looked at the code and it says:
>> >
>> >       // Cannot close file right now, since the last block requires
>> > recovery.
>> >       // This may potentially cause infinite loop in lease recovery
>> >       // if there are no valid replicas on data-nodes.
>> >       NameNode.stateChangeLog.warn(
>> >                 "DIR* NameSystem.internalReleaseLease: " +
>> >                 "File " + src + " has not been closed." +
>> >                " Lease recovery is in progress. " +
>> >                 "RecoveryId = " + blockRecoveryId + " for block " +
>> > lastBlock);
>> >       break;
>> >
>> > Eventually for this block, we get
>> >
>> > 2013-04-19 00:41:20,736 INFO
>> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>> >
>> >
>> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
>> > *5723958680970112840_174056*, newgenerationstamp=174413,
>> > newlength=119148648, newtargets=[10.156.192.106:50010,
>> 10.156.195.38:50010
>> > ],
>> > closeFile=true, deleteBlock=false)
>> > 2013-04-19 00:41:20,736 ERROR
>> > org.apache.hadoop.security.UserGroupInformation:
>> PriviledgedActionException
>> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
>> >
>> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>> > not found
>> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC Server
>> > handler 35 on 8020, call
>> >
>> >
>> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
>> > from 10.156.192.106:53271: error: java.io.IOException: Block
>> >
>> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>> > not found
>> >
>> > On the datanode side, i see a call for recover blocks - I see that a
>> write
>> > pipeline is there, which gets terminated with some socket timeouts...
>> >
>> > 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>> NameNode
>> > at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020 calls
>> >
>> >
>> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
>> > targets=[10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010
>> ],
>> > newGenerationStamp=174413)
>> >
>> > Not sure but this looks like a case where data could be lost   ?
>> >
>> > Varun
>> >
>> >
>> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <nk...@gmail.com>
>> > wrote:
>> >
>> > > Hey Varun,
>> > >
>> > > Could you please share the logs and the configuration (hdfs / hbase
>> > > settings + cluster description). What's the failure scenario?
>> > > From an HDFS pov, HDFS 3703 does not change the dead node status. But
>> > these
>> > > node will be given the lowest priority when reading.
>> > >
>> > >
>> > > Cheers,
>> > >
>> > > Nicolas
>> > >
>> > >
>> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <va...@pinterest.com>
>> > wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > We are facing problems with really slow HBase region server
>> recoveries
>> > ~
>> > > 20
>> > > > minuted. Version is hbase 0.94.3 compiled with hadoop.profile=2.0.
>> > > >
>> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched and
>> > stale
>> > > > node timeouts configured correctly. Time for dead node detection is
>> > still
>> > > > 10 minutes.
>> > > >
>> > > > We see that our region server is trying to read an HLog is stuck
>> there
>> > > for
>> > > > a long time. Logs here:
>> > > >
>> > > > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient:
>> Failed
>> > to
>> > > > connect to /10.156.194.251:50010 for file
>> > > >
>> > > >
>> > >
>> >
>> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
>> > > > for block
>> > > >
>> > > >
>> > >
>> >
>> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
>> > > > 15000 millis timeout while waiting for channel to be ready for read.
>> > ch :
>> > > > java.nio.channels.SocketChannel[connected local=/
>> 10.156.192.173:52818
>> > > > remote=/
>> > > > 10.156.194.251:50010]
>> > > >
>> > > > I would think that HDFS 3703 would make the server fail fast and go
>> to
>> > > the
>> > > > third datanode. Currently, the recovery seems way too slow for
>> > production
>> > > > usage...
>> > > >
>> > > > Varun
>> > > >
>> > >
>> >
>>
>
>

Re: Slow region server recoveries

Posted by Nicolas Liochon <nk...@gmail.com>.
Thanks for the detailed scenario and analysis. I'm going to have a look.
I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
times out); could you please send them directly to me?

Thanks,

Nicolas


On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <va...@pinterest.com> wrote:

> Hi Nicholas,
>
> Here is the failure scenario, I have dug up the logs.
>
> A machine fails and stops accepting/transmitting traffic. The HMaster
> starts the distributed split for 13 tasks. There are 12 region servers. 12
> tasks succeed but the 13th one takes a looong time.
>
> Zookeeper timeout is set to 30 seconds. Stale node timeout is 20 seconds.
> Both patches are there.
>
> a) Machine fails around 27:30
> b) Master starts the split around 27:40 and submits the tasks. The one task
> which fails seems to be the one which contains the WAL being currently
> written to:
>
> 2013-04-19 00:27:44,325 INFO
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog:
> hdfs://
>
> ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> ,
> length=0
>
> Basically this region server picks up the task but finds the length of this
> file to be 0 and drops. This happens again
>
> c) Finally another region server picks up the task but it ends up going to
> the bad datanode which should not happen because of the stale node timeout)
> Unfortunately it hits the 45 retries and a connect timeout of 20 seconds
> every time. This delays recovery significantly. Now I guess reducing # of
> retries to 1 is one possibility.
> But then the namenode logs are very interesting.
>
> d) Namenode seems to be in cyclic lease recovery loop until the node is
> marked dead. There is this one last block which exhibits this.
>
> 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-*
> 5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
> primaryNodeIndex=1,
> replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery started,
> primary=10.156.192.106:50010
> 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.internalReleaseLease: File
>
> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> has not been closed. Lease recovery is in progress. RecoveryId = 174413 for
> block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
> primaryNodeIndex=1,
> replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
>
> I see this over and over again in the logs until the datanode is marked
> dead. It seems to be cycling through the replicas for this WAL block and
> trying to add it to the recovery list. I looked at the code and it says:
>
>       // Cannot close file right now, since the last block requires
> recovery.
>       // This may potentially cause infinite loop in lease recovery
>       // if there are no valid replicas on data-nodes.
>       NameNode.stateChangeLog.warn(
>                 "DIR* NameSystem.internalReleaseLease: " +
>                 "File " + src + " has not been closed." +
>                " Lease recovery is in progress. " +
>                 "RecoveryId = " + blockRecoveryId + " for block " +
> lastBlock);
>       break;
>
> Eventually for this block, we get
>
> 2013-04-19 00:41:20,736 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>
> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
> *5723958680970112840_174056*, newgenerationstamp=174413,
> newlength=119148648, newtargets=[10.156.192.106:50010, 10.156.195.38:50010
> ],
> closeFile=true, deleteBlock=false)
> 2013-04-19 00:41:20,736 ERROR
> org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
> as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> not found
> 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 35 on 8020, call
>
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
> from 10.156.192.106:53271: error: java.io.IOException: Block
> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> not found
>
> On the datanode side, i see a call for recover blocks - I see that a write
> pipeline is there, which gets terminated with some socket timeouts...
>
> 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode
> at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020 calls
>
> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
> targets=[10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010],
> newGenerationStamp=174413)
>
> Not sure but this looks like a case where data could be lost   ?
>
> Varun
>
>
> On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <nk...@gmail.com>
> wrote:
>
> > Hey Varun,
> >
> > Could you please share the logs and the configuration (hdfs / hbase
> > settings + cluster description). What's the failure scenario?
> > From an HDFS pov, HDFS 3703 does not change the dead node status. But
> these
> > node will be given the lowest priority when reading.
> >
> >
> > Cheers,
> >
> > Nicolas
> >
> >
> > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <va...@pinterest.com>
> wrote:
> >
> > > Hi,
> > >
> > > We are facing problems with really slow HBase region server recoveries
> ~
> > 20
> > > minuted. Version is hbase 0.94.3 compiled with hadoop.profile=2.0.
> > >
> > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched and
> stale
> > > node timeouts configured correctly. Time for dead node detection is
> still
> > > 10 minutes.
> > >
> > > We see that our region server is trying to read an HLog is stuck there
> > for
> > > a long time. Logs here:
> > >
> > > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed
> to
> > > connect to /10.156.194.251:50010 for file
> > >
> > >
> >
> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> > > for block
> > >
> > >
> >
> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> > > 15000 millis timeout while waiting for channel to be ready for read.
> ch :
> > > java.nio.channels.SocketChannel[connected local=/10.156.192.173:52818
> > > remote=/
> > > 10.156.194.251:50010]
> > >
> > > I would think that HDFS 3703 would make the server fail fast and go to
> > the
> > > third datanode. Currently, the recovery seems way too slow for
> production
> > > usage...
> > >
> > > Varun
> > >
> >
>

Re: Slow region server recoveries

Posted by Varun Sharma <va...@pinterest.com>.
Hi Nicholas,

Here is the failure scenario, I have dug up the logs.

A machine fails and stops accepting/transmitting traffic. The HMaster
starts the distributed split for 13 tasks. There are 12 region servers. 12
tasks succeed but the 13th one takes a looong time.

Zookeeper timeout is set to 30 seconds. Stale node timeout is 20 seconds.
Both patches are there.
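
For reference, a minimal sketch of what "stale node timeout is configured"
means here. The property names are the standard HDFS-3703 / HDFS-3912 keys;
the values are illustrative, not a dump of our hdfs-site.xml, and setting
them programmatically like this is just for compactness:

import org.apache.hadoop.conf.Configuration;

public class StaleNodeSettings {
  public static Configuration sketch() {
    Configuration conf = new Configuration();
    // HDFS-3703: NN orders stale datanodes last when handing out block locations.
    conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
    // HDFS-3912: NN also avoids stale datanodes for new write pipelines.
    conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
    // A node is considered stale after 20s without a heartbeat.
    conf.setLong("dfs.namenode.stale.datanode.interval", 20 * 1000L);
    return conf;
  }
}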

a) Machine fails around 27:30
b) Master starts the split around 27:40 and submits the tasks. The one task
which fails seems to be the one which contains the WAL being currently
written to:

2013-04-19 00:27:44,325 INFO
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog:
hdfs://
ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141,
length=0

Basically this region server picks up the task but finds the length of this
file to be 0 and drops it. This happens again.

c) Finally another region server picks up the task, but it ends up going to
the bad datanode (which should not happen because of the stale node timeout).
Unfortunately it hits the 45 retries and a connect timeout of 20 seconds
every time. This delays recovery significantly. Now I guess reducing the
number of retries to 1 is one possibility; the rough cost arithmetic is
sketched below. But then the namenode logs are very interesting.
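
Back-of-the-envelope sketch of that retry cost, assuming the 45 retries and
20 second connect timeout above map to the stock Hadoop IPC client defaults
(ipc.client.connect.max.retries.on.timeouts = 45,
ipc.client.connect.timeout = 20000 ms) - the key names are real, the mapping
to this code path is my assumption:

public class RetryCost {
  public static void main(String[] args) {
    int connectTimeoutMs = 20000; // ipc.client.connect.timeout (default)
    int retriesOnTimeout = 45;    // ipc.client.connect.max.retries.on.timeouts (default)
    long worstCaseMs = (long) connectTimeoutMs * retriesOnTimeout;
    // ~15 minutes spent on a single unresponsive datanode, which lines up
    // with the ~20 minute recoveries once split scheduling is added on top.
    System.out.println("Worst case per dead datanode: "
        + (worstCaseMs / 60000.0) + " minutes");
  }
}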

d) The namenode seems to be stuck in a cyclic lease recovery loop until the
node is marked dead. There is one last block which exhibits this.

2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-*
5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
primaryNodeIndex=1,
replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
ReplicaUnderConstruction[10.156.192.106:50010|RBW],
ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery started,
primary=10.156.192.106:50010
2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.internalReleaseLease: File
/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
has not been closed. Lease recovery is in progress. RecoveryId = 174413 for
block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
primaryNodeIndex=1,
replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
ReplicaUnderConstruction[10.156.192.106:50010|RBW],
ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}

I see this over and over again in the logs until the datanode is marked
dead. It seems to be cycling through the replicas for this WAL block and
trying to add it to the recovery list. I looked at the code and it says:

      // Cannot close file right now, since the last block requires
recovery.
      // This may potentially cause infinite loop in lease recovery
      // if there are no valid replicas on data-nodes.
      NameNode.stateChangeLog.warn(
                "DIR* NameSystem.internalReleaseLease: " +
                "File " + src + " has not been closed." +
               " Lease recovery is in progress. " +
                "RecoveryId = " + blockRecoveryId + " for block " +
lastBlock);
      break;
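
For context, the split worker keeps nudging the namenode into this path with
something like the loop below - a sketch modelled on what HBase's log
splitting does before reading the WAL, not the exact FSHDFSUtils code:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoveryLoop {
  // Ask the NN to recover the lease on the WAL until it reports the file as
  // closed; every unsuccessful call can trigger another block recovery round
  // like the one logged above.
  public static void waitForLease(DistributedFileSystem dfs, Path wal)
      throws IOException, InterruptedException {
    while (!dfs.recoverLease(wal)) {
      Thread.sleep(1000); // NN is still cycling through replicas; retry
    }
  }
}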

Eventually for this block, we get

2013-04-19 00:41:20,736 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-
*5723958680970112840_174056*, newgenerationstamp=174413,
newlength=119148648, newtargets=[10.156.192.106:50010, 10.156.195.38:50010],
closeFile=true, deleteBlock=false)
2013-04-19 00:41:20,736 ERROR
org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
(=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
not found
2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 35 on 8020, call
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
from 10.156.192.106:53271: error: java.io.IOException: Block
(=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
not found

On the datanode side, I see a call to recoverBlock - there is a write
pipeline, which gets terminated by socket timeouts, and note that the dead
node 10.156.194.94 is still listed among the recovery targets...

00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode
at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020 calls
recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
targets=[10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010],
newGenerationStamp=174413)

Not sure, but this looks like a case where data could be lost?

Varun


On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <nk...@gmail.com> wrote:

> Hey Varun,
>
> Could you please share the logs and the configuration (hdfs / hbase
> settings + cluster description). What's the failure scenario?
> From an HDFS pov, HDFS 3703 does not change the dead node status. But these
> node will be given the lowest priority when reading.
>
>
> Cheers,
>
> Nicolas
>
>
> On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Hi,
> >
> > We are facing problems with really slow HBase region server recoveries ~
> 20
> > minuted. Version is hbase 0.94.3 compiled with hadoop.profile=2.0.
> >
> > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched and stale
> > node timeouts configured correctly. Time for dead node detection is still
> > 10 minutes.
> >
> > We see that our region server is trying to read an HLog is stuck there
> for
> > a long time. Logs here:
> >
> > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> > connect to /10.156.194.251:50010 for file
> >
> >
> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> > for block
> >
> >
> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> > 15000 millis timeout while waiting for channel to be ready for read. ch :
> > java.nio.channels.SocketChannel[connected local=/10.156.192.173:52818
> > remote=/
> > 10.156.194.251:50010]
> >
> > I would think that HDFS 3703 would make the server fail fast and go to
> the
> > third datanode. Currently, the recovery seems way too slow for production
> > usage...
> >
> > Varun
> >
>

Re: Slow region server recoveries

Posted by Nicolas Liochon <nk...@gmail.com>.
Hey Varun,

Could you please share the logs and the configuration (hdfs / hbase
settings + cluster description)? What's the failure scenario?
From an HDFS pov, HDFS 3703 does not change the dead node status, but these
nodes will be given the lowest priority when reading.
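
To make "lowest priority" concrete, here is a toy illustration of the
ordering (not the actual DatanodeManager code): stale replicas stay in the
located-block list but are moved to the end, so the DFSClient only tries
them after the healthy ones fail.

import java.util.ArrayList;
import java.util.List;

public class StaleLastOrder {

  public static class Replica {
    public final String host;
    public final boolean stale;
    public Replica(String host, boolean stale) {
      this.host = host;
      this.stale = stale;
    }
  }

  public static List<Replica> order(List<Replica> located) {
    List<Replica> fresh = new ArrayList<Replica>();
    List<Replica> stale = new ArrayList<Replica>();
    for (Replica r : located) {
      (r.stale ? stale : fresh).add(r);
    }
    fresh.addAll(stale); // stale nodes are deprioritized, not excluded
    return fresh;
  }
}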


Cheers,

Nicolas


On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <va...@pinterest.com> wrote:

> Hi,
>
> We are facing problems with really slow HBase region server recoveries ~ 20
> minuted. Version is hbase 0.94.3 compiled with hadoop.profile=2.0.
>
> Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched and stale
> node timeouts configured correctly. Time for dead node detection is still
> 10 minutes.
>
> We see that our region server is trying to read an HLog is stuck there for
> a long time. Logs here:
>
> 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
> connect to /10.156.194.251:50010 for file
>
> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> for block
>
> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> 15000 millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.156.192.173:52818
> remote=/
> 10.156.194.251:50010]
>
> I would think that HDFS 3703 would make the server fail fast and go to the
> third datanode. Currently, the recovery seems way too slow for production
> usage...
>
> Varun
>