Posted to user@hbase.apache.org by Ted Yu <yu...@gmail.com> on 2013/04/19 23:18:58 UTC

Slow region server recoveries due to lease recovery going to stale data node

I think this issue would be more appropriate for the hdfs-dev@ mailing list.

Putting user@hbase as Bcc.

---------- Forwarded message ----------
From: Varun Sharma <va...@pinterest.com>
Date: Fri, Apr 19, 2013 at 1:10 PM
Subject: Re: Slow region server recoveries
To: user@hbase.apache.org


This is 0.94.3 hbase...


On Fri, Apr 19, 2013 at 1:09 PM, Varun Sharma <va...@pinterest.com> wrote:

> Hi Ted,
>
> I had a long offline discussion with Nicholas on this. Looks like the last
> block, which was still being written to, took an enormous time to recover.
> Here's what happened:
> a) Master creates the split tasks and region servers process them
> b) Region server tries to recover the lease for each WAL log - most cases are
> no-ops since the logs are already rolled over/finalized
> c) The last file's lease recovery takes some time since the crashing server
> was writing to it and had a lease on it - but basically we have the lease
> about 1 minute after the server was lost (the recovery call is sketched below)
> d) Now we start the recovery for this block but we end up hitting the stale
> data node, which is puzzling.
>
> It seems that we did not hit the stale datanode when we were trying to
> recover the finalized WAL blocks with trivial lease recovery. However, for
> the final block, we hit the stale datanode. Any clue why this might be
> happening?
>
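> For reference, the lease recovery in (b)/(c) boils down to something like the
> sketch below. This assumes the stock DistributedFileSystem.recoverLease() API;
> it is only an illustration, not the exact HBase 0.94 code path (which wraps
> and retries a call like this):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> // Rough sketch of per-WAL lease recovery during log splitting (illustrative).
> public class WalLeaseRecoverySketch {
>   static void recoverWalLease(Configuration conf, Path wal) throws Exception {
>     FileSystem fs = wal.getFileSystem(conf);
>     if (!(fs instanceof DistributedFileSystem)) {
>       return; // nothing to recover on non-HDFS filesystems
>     }
>     DistributedFileSystem dfs = (DistributedFileSystem) fs;
>     // recoverLease() returns true once the file is closed. For WALs that are
>     // already rolled/finalized this is effectively a no-op; for the file the
>     // dead server was still writing, it triggers block recovery and we poll.
>     while (!dfs.recoverLease(wal)) {
>       Thread.sleep(1000);
>     }
>   }
> }
>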
> Varun
>
>
> On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <yu...@gmail.com> wrote:
>
>> Can you show a snippet from the DN log which mentions UNDER_RECOVERY?
>>
>> Here is the criterion for stale node checking to kick in (from
>> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch):
>>
>> +   * Check if the datanode is in stale state. Here if
>> +   * the namenode has not received heartbeat msg from a
>> +   * datanode for more than staleInterval (default value is
>> +   * {@link DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
>> +   * the datanode will be treated as stale node.
>>
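>> In plain terms, the check is just "no heartbeat for longer than the
>> configured interval". A minimal paraphrase (names are illustrative, not the
>> exact HDFS source):
>>
>> // Illustrative paraphrase of the stale check quoted above; not the real code.
>> class StaleCheckSketch {
>>   // staleIntervalMillis = dfs.namenode.stale.datanode.interval (30 s by default, IIRC)
>>   static boolean isStale(long lastHeartbeatMillis, long staleIntervalMillis) {
>>     return (System.currentTimeMillis() - lastHeartbeatMillis) >= staleIntervalMillis;
>>   }
>> }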
>>
>> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>> > Is there a place to upload these logs?
>> >
>> >
>> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <va...@pinterest.com>
>> > wrote:
>> >
>> > > Hi Nicholas,
>> > >
>> > > Attached are the namenode, dn logs (of one of the healthy replicas of the
>> > > WAL block) and the rs logs which got stuck doing the log split. Action
>> > > begins at 2013-04-19 00:27*.
>> > >
>> > > Also, the rogue block is 5723958680970112840_174056. It's very interesting
>> > > to trace this guy through the HDFS logs (dn and nn).
>> > >
>> > > Btw, do you know what the UNDER_RECOVERY stage is for in HDFS? Also, does
>> > > the stale node stuff kick in for that state?
>> > >
>> > > Thanks
>> > > Varun
>> > >
>> > >
>> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <nkeywal@gmail.com>
>> > > wrote:
>> > >
>> > >> Thanks for the detailed scenario and analysis. I'm going to have a look.
>> > >> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
>> > >> timeouts), could you please send them directly to me?
>> > >>
>> > >> Thanks,
>> > >>
>> > >> Nicolas
>> > >>
>> > >>
>> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <va...@pinterest.com>
>> > >> wrote:
>> > >>
>> > >> > Hi Nicholas,
>> > >> >
>> > >> > Here is the failure scenario, I have dug up the logs.
>> > >> >
>> > >> > A machine fails and stops accepting/transmitting traffic. The HMaster
>> > >> > starts the distributed split for 13 tasks. There are 12 region servers.
>> > >> > 12 tasks succeed but the 13th one takes a looong time.
>> > >> >
>> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout is 20 seconds.
>> > >> > Both patches are there.
>> > >> >
>> > >> > a) Machine fails around 27:30
>> > >> > b) Master starts the split around 27:40 and submits the tasks. The one
>> > >> > task which fails seems to be the one which contains the WAL being
>> > >> > currently written to:
>> > >> >
>> > >> > 2013-04-19 00:27:44,325 INFO
>> > >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog:
>> > >> > hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141,
>> > >> > length=0
>> > >> >
>> > >> > Basically this region server picks up the task but finds the length of
>> > >> > this file to be 0 and drops it. This happens again.
>> > >> >
>> > >> > c) Finally another region server picks up the task but it ends up going
>> > >> > to the bad datanode (which should not happen because of the stale node
>> > >> > timeout). Unfortunately it hits the 45 retries and a connect timeout of
>> > >> > 20 seconds every time. This delays recovery significantly. Now I guess
>> > >> > reducing the number of retries to 1 is one possibility (the knobs involved
>> > >> > are sketched at the end of this message).
>> > >> > But then the namenode logs are very interesting.
>> > >> >
>> > >> > d) Namenode seems to be in a cyclic lease recovery loop until the node
>> > >> > is marked dead. There is this one last block which exhibits this.
>> > >> >
>> > >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK*
>> > >> > blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
>> > >> > primaryNodeIndex=1,
>> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery started,
>> > >> > primary=10.156.192.106:50010
>> > >> > 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange: DIR*
>> > >> > NameSystem.internalReleaseLease: File
>> > >> > /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
>> > >> > has not been closed. Lease recovery is in progress. RecoveryId = 174413
>> > >> > for block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
>> > >> > primaryNodeIndex=1,
>> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
>> > >> >
>> > >> > I see this over and over again in the logs until the datanode is marked
>> > >> > dead. It seems to be cycling through the replicas for this WAL block and
>> > >> > trying to add it to the recovery list. I looked at the code and it says:
>> > >> >
>> > >> >       // Cannot close file right now, since the last block requires recovery.
>> > >> >       // This may potentially cause infinite loop in lease recovery
>> > >> >       // if there are no valid replicas on data-nodes.
>> > >> >       NameNode.stateChangeLog.warn(
>> > >> >           "DIR* NameSystem.internalReleaseLease: " +
>> > >> >           "File " + src + " has not been closed." +
>> > >> >           " Lease recovery is in progress. " +
>> > >> >           "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
>> > >> >       break;
>> > >> >
>> > >> > Eventually for this block, we get
>> > >> >
>> > >> > 2013-04-19 00:41:20,736 INFO
>> > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>> > >> > commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
>> > >> > newgenerationstamp=174413, newlength=119148648,
>> > >> > newtargets=[10.156.192.106:50010, 10.156.195.38:50010],
>> > >> > closeFile=true, deleteBlock=false)
>> > >> > 2013-04-19 00:41:20,736 ERROR
>> > >> > org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
>> > >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
>> > >> > (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>> > >> > not found
>> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC Server
>> > >> > handler 35 on 8020, call
>> > >> > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
>> > >> > from 10.156.192.106:53271: error: java.io.IOException: Block
>> > >> > (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>> > >> > not found
>> > >> >
>> > >> > On the datanode side, I see a call to recover blocks - I see that a write
>> > >> > pipeline is there, which gets terminated with some socket timeouts...
>> > >> >
>> > >> > 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode
>> > >> > at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020 calls
>> > >> > recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
>> > >> > targets=[10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010],
>> > >> > newGenerationStamp=174413)
>> > >> >
>> > >> > Not sure, but this looks like a case where data could be lost?
>> > >> >
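>> > >> > Coming back to the retries in (c): if those 45 attempts with a 20 second
>> > >> > connect timeout are the stock Hadoop IPC client settings (which I believe
>> > >> > are ipc.client.connect.max.retries.on.timeouts = 45 and
>> > >> > ipc.client.connect.timeout = 20000 ms, i.e. up to 15 minutes spent on a
>> > >> > dead node), then lowering them would look roughly like this - a sketch
>> > >> > only, key names assumed from stock Hadoop 2, please verify against CDH:
>> > >> >
>> > >> > import org.apache.hadoop.conf.Configuration;
>> > >> >
>> > >> > // Sketch: fail fast on unreachable nodes instead of 45 x 20 s of retries.
>> > >> > public class IpcRetrySketch {
>> > >> >   public static Configuration tuned() {
>> > >> >     Configuration conf = new Configuration();
>> > >> >     // Retries after a timed-out connect attempt (assumed default: 45).
>> > >> >     conf.setInt("ipc.client.connect.max.retries.on.timeouts", 1);
>> > >> >     // Per-attempt connect timeout in ms (assumed default: 20000).
>> > >> >     conf.setInt("ipc.client.connect.timeout", 5000);
>> > >> >     return conf;
>> > >> >   }
>> > >> > }
>> > >> >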
>> > >> > Varun
>> > >> >
>> > >> >
>> > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <nkeywal@gmail.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Hey Varun,
>> > >> > >
>> > >> > > Could you please share the logs and the configuration (hdfs / hbase
>> > >> > > settings + cluster description)? What's the failure scenario?
>> > >> > > From an HDFS pov, HDFS 3703 does not change the dead node status. But
>> > >> > > these nodes will be given the lowest priority when reading.
>> > >> > >
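>> > >> > > For reference, the namenode-side knobs that drive that read ordering, as I
>> > >> > > understand the HDFS-3703 patch (double-check the key names against your
>> > >> > > CDH build):
>> > >> > >
>> > >> > > import org.apache.hadoop.conf.Configuration;
>> > >> > >
>> > >> > > // Sketch of the HDFS-3703 settings; key names taken from the patch.
>> > >> > > public class StaleReadConfSketch {
>> > >> > >   public static Configuration staleAwareReads() {
>> > >> > >     Configuration conf = new Configuration();
>> > >> > >     // Treat a datanode as stale after 20 s without a heartbeat.
>> > >> > >     conf.setLong("dfs.namenode.stale.datanode.interval", 20000L);
>> > >> > >     // Order stale datanodes last in the locations returned to readers.
>> > >> > >     conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
>> > >> > >     return conf;
>> > >> > >   }
>> > >> > > }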
>> > >> > >
>> > >> > > Cheers,
>> > >> > >
>> > >> > > Nicolas
>> > >> > >
>> > >> > >
>> > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <varun@pinterest.com>
>> > >> > > wrote:
>> > >> > >
>> > >> > > > Hi,
>> > >> > > >
>> > >> > > > We are facing problems with really slow HBase region server recoveries,
>> > >> > > > around 20 minutes. Version is hbase 0.94.3 compiled with
>> > >> > > > hadoop.profile=2.0.
>> > >> > > >
>> > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched and
>> > >> > > > stale node timeouts configured correctly. Time for dead node detection
>> > >> > > > is still 10 minutes.
>> > >> > > >
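>> > >> > > > (For reference, that ~10 minutes matches the namenode's dead-node formula
>> > >> > > > with stock defaults, as I understand the heartbeat code:
>> > >> > > >
>> > >> > > >   dead interval = 2 * dfs.namenode.heartbeat.recheck-interval
>> > >> > > >                   + 10 * dfs.heartbeat.interval
>> > >> > > >                 = 2 * 300 s + 10 * 3 s = 630 s, i.e. about 10.5 minutes.)
>> > >> > > >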
>> > >> > > > We see that our region server is trying to read an HLog and is stuck
>> > >> > > > there for a long time. Logs here:
>> > >> > > >
>> > >> > > > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed to
>> > >> > > > connect to /10.156.194.251:50010 for file
>> > >> > > > /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
>> > >> > > > for block
>> > >> > > > BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
>> > >> > > > 15000 millis timeout while waiting for channel to be ready for read. ch :
>> > >> > > > java.nio.channels.SocketChannel[connected local=/10.156.192.173:52818
>> > >> > > > remote=/10.156.194.251:50010]
>> > >> > > >
>> > >> > > > I would think that HDFS 3703 would make the server fail fast and go to
>> > >> > > > the third datanode. Currently, the recovery seems way too slow for
>> > >> > > > production usage...
>> > >> > > >
>> > >> > > > Varun
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>