Posted to user@hbase.apache.org by Suraj Varma <sv...@gmail.com> on 2012/07/03 01:56:50 UTC

Re: HMaster not failing over dead RegionServers

This looks like it is trying to reach a datanode ... doesn't it?
> 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /10.125.18.129:50020. Already tried 14 time(s).

Is this from a master log or from a region server log? (I'm guessing the
above is from a region server log while trying to replay hlogs.)

Some time back, we had a similar symptom (HLog splitting taking a long
time due to the retries) and found that even though the datanode had
died, the namenode had not yet detected it. This leads to the region
server retrying against dead datanodes over and over, stretching out
the splitting process.

See this thread:
http://www.mail-archive.com/core-user@hadoop.apache.org/msg10033.html

We found that, by default, it takes on the order of 10-15 minutes for a
datanode death to be detected by the NN ... and until then the NN keeps
handing back the dead DN as a valid replica location when the RS tries
to read the hlogs.
The parameters in question are heartbeat.recheck.interval and
dfs.heartbeat.interval ... tweaking these down made the recovery
much faster.
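
For illustration only, lowering these would look something like the
following in hdfs-site.xml (the numbers are made-up examples, not what
we ran with; test on your own cluster before relying on them):

  <!-- hdfs-site.xml: illustrative values only -->
  <property>
    <name>heartbeat.recheck.interval</name>
    <!-- NN recheck period in ms; stock default is 300000 (5 min) -->
    <value>45000</value>
  </property>
  <property>
    <name>dfs.heartbeat.interval</name>
    <!-- DN heartbeat period in seconds; stock default is 3 -->
    <value>3</value>
  </property>

If I remember right, the NN declares a DN dead after roughly
2 * recheck + 10 * heartbeat, so the stock defaults work out to about
ten and a half minutes, while the values above bring it down to around
two minutes.
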
Also, hbase.rpc.timeout and zookeeper.session.timeout are two other
configurations that need to be tweaked down from their defaults for
quick recovery.
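
Same idea on the HBase side; roughly (again, example numbers only, and
if memory serves the defaults in that era were around 60000 ms for
hbase.rpc.timeout and 180000 ms for zookeeper.session.timeout):

  <!-- hbase-site.xml: illustrative values only -->
  <property>
    <name>zookeeper.session.timeout</name>
    <!-- ms; a shorter session means the master notices a dead RS sooner -->
    <value>30000</value>
  </property>
  <property>
    <name>hbase.rpc.timeout</name>
    <!-- ms; how long HBase waits on an RPC before giving up -->
    <value>30000</value>
  </property>

One caveat: the ZooKeeper ensemble's own maxSessionTimeout caps whatever
session timeout the client asks for, so the ZK server config may need a
matching change.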

Not sure if this is the case in your error, but it might be something
to investigate ...
--Suraj


On Sat, Jun 30, 2012 at 8:53 AM, Jimmy Xiang <jx...@cloudera.com> wrote:
> Bryan,
>
> The master could not detect whether the region server was dead.
> How do you set the zookeeper session timeout?
>
> Thanks,
> Jimmy
>
> On Sat, Jun 30, 2012 at 8:09 AM, Stack <st...@duboce.net> wrote:
>> On Sat, Jun 30, 2012 at 7:04 AM, Bryan Beaudreault
>> <bb...@hubspot.com> wrote:
>>> 12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: /10.125.18.129:50020. Already tried 14 time(s).
>>>
>>
>> This was one of the servers that went down?
>>
>>> It was not following through on splitting the HLog files and didn't appear
>>> to be moving regions off failed hosts.  After giving it about 20 minutes to
>>> try to right itself, I tried restarting the service.  The restart script
>>> just hung for a while printing dots, and nothing apparent was happening in
>>> the logs at the time.
>>
>> Can we see the log, Bryan?
>>
>> You might take a thread dump when it's hung up the next time, Bryan (would be
>> something for us to do a looksee on).
>>
>>> Finally I kill -9'd the process so that another
>>> master could take over.  The new master seemed to start splitting logs, but
>>> eventually got into the same state, printing the above message.
>>>
>>
>> You think it's a particular log?
>>
>>
>>> Eventually it all worked out, but it took WAY too long (almost an hour, all
>>> said).  Is this something that is tunable?
>>
>> Have the RS carry fewer WALs?  It's a configuration.
>>
>>> They should have instantly been
>>> removed from the list instead of retrying so many times.  Each server was
>>> retried upwards of 30-40 times.
>>>
>>
>> Yeah, that's a bit silly.
>>
>> We're working on the MTTR in general.  Your logs would be of interest
>> to a few of us, if it's ok for someone else to take a look.
>>
>> St.Ack
>>
>>> I am running cdh3u2 (0.90.4).
>>>
>>> Thanks,
>>>
>>> Bryan

Re: HMaster not failing over dead RegionServers

Posted by Bryan Beaudreault <bb...@hubspot.com>.
Thanks a bunch for the insight.  This message was actually coming from the
master, but it still needs to grab the HLog files from HDFS, so I can still
see it being what you mentioned.  I'm going to look into tuning these
parameters down in preparation for future failures.
