Posted to user@hbase.apache.org by ishwar ramani <rv...@gmail.com> on 2010/06/11 21:48:03 UTC

hbase cluster cold start: master and region server did not connect!

Hi,

I have an HBase/Hadoop cluster set up. Six days ago we did a cold restart of
our system.
I recently noticed that an HBase query was timing out with

org.apache.hadoop.hbase.client.NoServerForRegionException: Timed out trying
to locate root region


I looked at the master logs and none of the region servers had connected

2010-06-04 00:00:21,510 INFO org.apache.hadoop.hbase.master.ServerManager: 0
region servers, 0 dead, average load NaN
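
A summary line like the one above can be checked mechanically, which is
one way to catch "0 region servers" without reading the log by hand. The
following is a minimal Python sketch, assuming the ServerManager summary
keeps the wording quoted above; the function name region_server_counts is
illustrative and not part of HBase.

```python
import re

# Matches the ServerManager summary quoted above. Whitespace is matched
# loosely because the line may be wrapped when it is pasted or mailed.
SUMMARY = re.compile(
    r"ServerManager:\s*(\d+)\s+region\s+servers,\s*(\d+)\s+dead"
)

def region_server_counts(log_text):
    """Return (live, dead) from the last summary in log_text, or None."""
    matches = SUMMARY.findall(log_text)
    if not matches:
        return None
    live, dead = matches[-1]
    return int(live), int(dead)
```

A monitoring job could raise an alert whenever the first number is 0 for
longer than the expected startup time.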


The master wrote the following to stderr when it started

java.io.EOFException
....
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Could not
complete write to file /hbase/devLogsTable/1225469767/oldlogfile.log by
DFSClient_-107490689

The regionservers have been trying to connect with the master ever since
with the error

2010-06-03 14:33:28,960 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to
master. Retrying. Error was: java.net.ConnectException: Connection refused


All the region server and master processes are running now, but none of
the region servers are connected to the master.


My first question is how to monitor this problem. I monitor the processes,
and they are all fine, and none of the logs report an error.
How do I check the general health of the cluster?


My second question is why did this happen?

thanks
ishwar

Re: hbase cluster cold start: master and region server did not connect!

Posted by ishwar ramani <rv...@gmail.com>.
Hi Jean,

It happened again today during a server restart, which involved a Hadoop
start followed by an HBase start.
There was also an exception when the HBase master came up, while reading a
file from Hadoop. I am not sure whether that is the problem.
I have pasted those logs too.


Current state of the system: master, zookeeper, region servers are all up.
But region servers are not connected to master.

Here are the logs ....


1. logs on hbase master and hadoop namenode.
hbase-master.out: http://pastebin.com/6a88nRh5
hadoop-namenode: http://pastebin.com/wHP5uQBh

2.  syslog on hbase master.
http://pastebin.com/S9KVVsSf

3. syslog on the hbase regionservers. I posted one; the other is the same.
http://pastebin.com/kR42Xt2t


I did a netstat -tna to confirm that master is listening on port
127.0.0.121:60000

I then restarted only the regionservers, and they were able to connect fine.


thanks
ishwar


On Fri, Jun 11, 2010 at 12:56 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> You can check the general health by using the webui, it runs on the
> master node at port 60010.
>
> For the errors, the context you gave is so limited that giving any
> meaningful answer is impossible. Please post full logs on a web server
> or on pastebin.com (or your preferred code pasting site) if it fits.
>
> J-D

Re: hbase cluster cold start: master and region server did not connect!

Posted by Jean-Daniel Cryans <jd...@apache.org>.
You can check the general health by using the webui; it runs on the
master node at port 60010.

For the errors, the context you gave is so limited that giving any
meaningful answer is impossible. Please post full logs on a web server
or on pastebin.com (or your preferred code pasting site) if it fits.

J-D
