Posted to user@hbase.apache.org by Dejan Menges <de...@gmail.com> on 2015/04/13 10:11:53 UTC

How server gets into failed servers list?

Hi,

We recently had some issues with HDFS: a hardware problem with one of the
nodes, nodes died, and HDFS recovered, but we then figured out that
something was wrong with HBase. Checking the HMaster log, we saw that a
bunch of our region servers had ended up in the famous failed servers list,
and this kept going on until we restarted every one of them.

Are we doing something wrong? Is it possible to tune this somehow, so that
once a server is in this list it gets forgotten, or something like that?

Main question: how does HMaster decide that a server should be in the
failed servers list at all, and what exactly does this mean?

I looked into the HBase book and googled, but besides some generic answers
I wasn't able to find anything more internal.

Thanks in advance!

Re: How server gets into failed servers list?

Posted by Esteban Gutierrez <es...@cloudera.com>.
Thanks Dejan,

Please keep us posted!

cheers,
esteban.


--
Cloudera, Inc.



Re: How server gets into failed servers list?

Posted by Dejan Menges <de...@gmail.com>.
Hi Esteban,

Thanks for pointing that out. I will try to collect all the logs tomorrow,
take a deeper look, and post the specific errors here. Yes, the good news
is that all the logs are preserved.

Thanks a lot,
Dejan


Re: How server gets into failed servers list?

Posted by Esteban Gutierrez <es...@cloudera.com>.
Hi Dejan,

Do you have the logs from any of those failed region servers? Usually, in
case of a critical failure the RS will shut itself down, or, if the RS
"hangs" for a long time, the master will start processing the expiration of
that RS and will reject the RS with a YouAreDeadException if it tries to
reconnect. The HBase master and RS logs will tell us for sure.
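
To make that sequence concrete, here is a toy model of it in Python. It is
only a sketch, not HBase code: ToyMaster, report, expire_silent_servers, and
the 30-second timeout are all hypothetical names and values chosen for
illustration, and the exception class is a local stand-in. As far as I
understand, real HBase notices a hung RS through its ZooKeeper session
expiring rather than through a timer like the one below.

```python
# Toy model only: class and method names are hypothetical, not HBase APIs.
class YouAreDeadException(Exception):
    """Stand-in for the exception the master uses to reject an expired server."""


class ToyMaster:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # how long a silent server is tolerated
        self.last_seen = {}          # server name -> time of last report
        self.dead = set()            # servers whose expiration was processed

    def report(self, server, now):
        """A region server checks in; one already declared dead is rejected."""
        if server in self.dead:
            raise YouAreDeadException(server)
        self.last_seen[server] = now

    def expire_silent_servers(self, now):
        """Move every server that has been silent too long to the dead set."""
        for server, seen in list(self.last_seen.items()):
            if now - seen > self.timeout_s:
                del self.last_seen[server]
                self.dead.add(server)


master = ToyMaster(timeout_s=30)
master.report("rs1", now=0)
master.report("rs2", now=0)
master.report("rs2", now=25)        # rs2 keeps reporting, rs1 "hangs"
master.expire_silent_servers(now=40)

try:
    master.report("rs1", now=41)    # rs1 wakes up and tries to reconnect
    rejected = False
except YouAreDeadException:
    rejected = True

print(sorted(master.dead), rejected)
```

In this model rs1 is expired while it is silent and is then refused when it
comes back, while rs2 stays registered because it kept reporting.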

thanks,
esteban.


--
Cloudera, Inc.

