Posted to user@hbase.apache.org by Dan Brodsky <da...@gmail.com> on 2012/10/17 15:01:25 UTC

Regionservers not connecting to master

Good morning,

I have a 10 node Hadoop/Hbase cluster, plus a namenode VM, plus three
Zookeeper quorum peers (one on the namenode, one on a dedicated ZK
peer VM, and one on a third box). All 10 HDFS datanodes are also Hbase
regionservers.

Several weeks ago, we had six HDFS datanodes go offline suddenly (with
no meaningful error messages), and since then, I have been unable to
get all 10 regionservers to connect to the Hbase master. I've tried
bringing the cluster down and rebooting all the boxes, but no joy. The
machines are all running, and hbase-regionserver appears to start
normally on each one.

Right now, my master status page (http://namenode:60010) shows 3
regionservers online. There are also dozens of regions in transition
listed on the status page (in the PENDING_OPEN state), but each of
those is on one of the regionservers already online.

The 7 other regionservers' log files show a successful connection to
one ZK peer, followed by a regular trail of these messages:

2012-10-17 12:36:08,394 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=8.17
MB, free=987.67 MB, max=995.84 MB, blocks=0, accesses=0, hits=0,
hitRatio=0cachingAccesses=0, cachingHits=0,
cachingHitsRatio=0evictions=0, evicted=0, evictedPerRun=NaN

If I had to wager a guess, it seems like the 7 offline regionservers
are not connecting to other ZK peers, but there isn't anything in the
ZK logs to indicate why.

Thoughts?

Dan

RE: Regionservers not connecting to master

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.
Can you try starting any of the regionservers that are not connecting at
all? Maybe start 2 of them.
Observe the master logs and see whether they say
'Waiting for RegionServers to checkin'.

Just to confirm: are your ZK IP and port correct throughout the cluster? If
this is a multitenant cluster, maybe the other regionservers are connecting
to some other ZK cluster?
Wild guess :)
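
One way to run the ZK address check is sketched below in Python: pull the
ZooKeeper settings out of each node's hbase-site.xml and compare them against
one reference node. The property names hbase.zookeeper.quorum and
hbase.zookeeper.property.clientPort are the standard HBase ones; the function
names and the node names in the usage note are made up for illustration, and
collecting the files from each node is left to whatever tooling is at hand.

```python
import xml.etree.ElementTree as ET

# Standard HBase property names for the ZooKeeper ensemble and client port.
ZK_PROPERTIES = {
    "hbase.zookeeper.quorum",
    "hbase.zookeeper.property.clientPort",
}


def zk_settings(hbase_site_xml):
    """Return the ZooKeeper-related properties found in hbase-site.xml text."""
    root = ET.fromstring(hbase_site_xml)
    found = {}
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        if name in ZK_PROPERTIES:
            found[name] = prop.findtext("value", default="").strip()
    return found


def mismatched(settings_by_node, reference):
    """Return the nodes whose ZK settings differ from the reference node's."""
    expected = settings_by_node[reference]
    return sorted(
        node for node, settings in settings_by_node.items()
        if settings != expected
    )
```

Usage would look roughly like mismatched({"rs1": zk_settings(text1), "rs2":
zk_settings(text2)}, reference="rs1"), where text1 and text2 are the file
contents read from each (hypothetical) node. An empty list means every node
points at the same ensemble; it says nothing about whether that ensemble is
itself healthy.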

Regards
Ram


Re: Regionservers not connecting to master

Posted by Dan Brodsky <da...@gmail.com>.
Nope. I'm honestly not sure how the files changed, but I will keep an eye
on it.


On Fri, Nov 2, 2012 at 2:22 PM, Kevin O'dell <ke...@cloudera.com> wrote:

> Do you use Puppet?

Re: Regionservers not connecting to master

Posted by Kevin O'dell <ke...@cloudera.com>.
Do you use Puppet?

On Fri, Nov 2, 2012 at 1:13 PM, Dan Brodsky <da...@gmail.com> wrote:

> Ram,
>
> I wanted to follow up with you, since your comment below helped me.
>
> It turns out that the ZK configuration files somehow got changed (reverted
> to their default values?), and I'm not sure who/when/how. The zoo.cfg files
> didn't have the list of quorum peers, and the myid files that told each ZK
> peer their ordinal value had been deleted. So, effectively, I had three ZK
> standalone servers, instead of one quorum.
>
> Problem fixed, Hbase is happy again.
>
> Cheers,
>
> Dan



-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera

Re: Regionservers not connecting to master

Posted by ramkrishna vasudevan <ra...@gmail.com>.
Nice...Thanks for your follow up.

Regards
Ram

On Fri, Nov 2, 2012 at 11:43 PM, Dan Brodsky <da...@gmail.com> wrote:

> Ram,
>
> I wanted to follow up with you, since your comment below helped me.
>
> It turns out that the ZK configuration files somehow got changed (reverted
> to their default values?), and I'm not sure who/when/how. The zoo.cfg files
> didn't have the list of quorum peers, and the myid files that told each ZK
> peer their ordinal value had been deleted. So, effectively, I had three ZK
> standalone servers, instead of one quorum.
>
> Problem fixed, Hbase is happy again.
>
> Cheers,
>
> Dan

Re: Regionservers not connecting to master

Posted by Dan Brodsky <da...@gmail.com>.
Ram,

I wanted to follow up with you, since your comment below helped me.

It turns out that the ZK configuration files somehow got changed (reverted
to their default values?), and I'm not sure who/when/how. The zoo.cfg files
didn't have the list of quorum peers, and the myid files that told each ZK
peer their ordinal value had been deleted. So, effectively, I had three ZK
standalone servers, instead of one quorum.
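
For anyone who wants to catch this earlier, here is a minimal Python sketch
of the check, assuming the usual ZooKeeper layout: quorum members are listed
in zoo.cfg as server.N=host:port:port lines, and each peer's id lives in a
file named myid inside its dataDir. The function names and problem messages
are illustrative, not part of any ZooKeeper tool.

```python
import re
from pathlib import Path

# Matches quorum member lines such as "server.1=zk1:2888:3888".
SERVER_LINE = re.compile(r"^server\.(\d+)\s*=\s*(\S+)$")


def parse_zoo_cfg(text):
    """Return (data_dir, {server_id: address}) parsed from zoo.cfg text."""
    data_dir = None
    servers = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = SERVER_LINE.match(line)
        if match:
            servers[int(match.group(1))] = match.group(2)
            continue
        key, _, value = line.partition("=")
        if key.strip() == "dataDir":
            data_dir = value.strip()
    return data_dir, servers


def diagnose(cfg_text, myid_text):
    """Return a list of problems; an empty list means nothing obvious is wrong."""
    _, servers = parse_zoo_cfg(cfg_text)
    problems = []
    if not servers:
        problems.append("no server.N entries: peer would run standalone")
    if myid_text is None:
        problems.append("myid file is missing")
    else:
        try:
            my_id = int(myid_text.strip())
        except ValueError:
            problems.append("myid does not contain an integer")
        else:
            if servers and my_id not in servers:
                problems.append("myid %d has no matching server.N line" % my_id)
    return problems


def diagnose_files(cfg_path):
    """Read zoo.cfg and the myid file it points at, then run diagnose()."""
    cfg_text = Path(cfg_path).read_text()
    data_dir, _ = parse_zoo_cfg(cfg_text)
    myid_path = Path(data_dir) / "myid" if data_dir else None
    has_myid = myid_path is not None and myid_path.is_file()
    return diagnose(cfg_text, myid_path.read_text() if has_myid else None)
```

Running diagnose_files on each peer's zoo.cfg (the path varies by install)
from cron or a monitoring check would flag a reverted config like the one
described above before the regionservers notice.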

Problem fixed, Hbase is happy again.

Cheers,

Dan



On Wed, Oct 17, 2012 at 9:12 AM, Ramkrishna.S.Vasudevan <
ramkrishna.vasudevan@huawei.com> wrote:

> Can you try starting any of the regionservers that are not connecting at
> all? Maybe start 2 of them.
> Observe the master logs and see whether they say
> 'Waiting for RegionServers to checkin'.
>
> Just to confirm: are your ZK IP and port correct throughout the cluster? If
> this is a multitenant cluster, maybe the other regionservers are connecting
> to some other ZK cluster?
> Wild guess :)
>
> Regards
> Ram

RE: Regionservers not connecting to master

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.
Just check your /etc/hosts files.  I have not worked with VMs, so I can't
pin down the problem more precisely.
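
A quick way to do that from each box, sketched in Python: resolve every
cluster hostname and flag names that fail to resolve or that come back as a
loopback address (a hostname mapped to 127.x in /etc/hosts is a commonly
reported source of HBase and ZooKeeper connection trouble). The function
names are illustrative and the hostnames you pass in are your own; nothing
here is specific to this cluster.

```python
import socket


def resolve_all(hostnames, resolver=socket.gethostbyname):
    """Map each hostname to its resolved IPv4 address, or None on failure."""
    results = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are both OSError subclasses.
            results[name] = None
    return results


def suspicious(results):
    """Return names that did not resolve or that resolved to loopback."""
    return sorted(
        name for name, address in results.items()
        if address is None or address.startswith("127.")
    )
```

Calling suspicious(resolve_all([...])) with the master, ZK peer, and
regionserver hostnames on every machine, then comparing the results between
machines, shows whether they all agree on who is who. The resolver argument
exists only so the logic can be exercised without touching the network.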

Regards
Ram

> -----Original Message-----
> From: Dan Brodsky [mailto:danbrodsky@gmail.com]
> Sent: Wednesday, October 17, 2012 11:05 PM
> To: user@hbase.apache.org
> Subject: Re: Regionservers not connecting to master
> 
> Well, slight change: only 1 of the ZK peers happens to work. When an RS
> connects to the other 2, it doesn't go further than that. The 1 ZK
> node that happens to work is the one that runs on the same VM as the
> master.
> 
> Sounds like it could be network connectivity issues, so I'm going to
> investigate that a bit further, but other suggestions are welcome.

Re: Regionservers not connecting to master

Posted by Dan Brodsky <da...@gmail.com>.
Well, slight change: only 1 of the ZK peers happens to work. When an RS
connects to the other 2, it doesn't go further than that. The 1 ZK
node that happens to work is the one that runs on the same VM as the
master.

Sounds like it could be network connectivity issues, so I'm going to
investigate that a bit further, but other suggestions are welcome.


On Wed, Oct 17, 2012 at 1:29 PM, Dan Brodsky <da...@gmail.com> wrote:
> Ram,
>
> Thanks for your suggestions.
>
> The datanodes are all built using the same image, so I know they're
> all pointed to the same ZK nodes.
>
> I monitored all three ZK logs, the master log, and the regionserver
> log for each RS I was trying to bring back online. I'm glad I have a
> big screen. :-) Here is what I found:
>
> Whenever a regionserver connects to one particular ZK peer *first*, it
> never goes online. The ZK log shows a successful connection
> negotiating a timeout value, and the RS's log shows a successful ZK
> connection, but then it just sits there.
>
> When a regionserver starts up and connects to one of the other two ZK
> peers first, it connects to a second one successfully, then contacts
> the master, and it comes up and all is happy.
>
> So the problem of regionservers not connecting to master only happens
> when the RS tries one particular ZK node as its first ZK connection.
> But the logs aren't helpful for diagnosing further than that.
>
> Additional thoughts?

Re: Regionservers not connecting to master

Posted by Dan Brodsky <da...@gmail.com>.
Ram,

Thanks for your suggestions.

The datanodes are all built using the same image, so I know they're
all pointed to the same ZK nodes.

I monitored all three ZK logs, the master log, and the regionserver
log for each RS I was trying to bring back online. I'm glad I have a
big screen. :-) Here is what I found:

Whenever a regionserver connects to one particular ZK peer *first*, it
never goes online. The ZK log shows a successful connection
negotiating a timeout value, and the RS's log shows a successful ZK
connection, but then it just sits there.

When a regionserver starts up and connects to one of the other two ZK
peers first, it connects to a second one successfully, then contacts
the master, and it comes up and all is happy.

So the problem of regionservers not connecting to master only happens
when the RS tries one particular ZK node as its first ZK connection.
But the logs aren't helpful for diagnosing further than that.
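
When the logs run dry like this, one more thing worth trying is to ask each
ZK peer directly what mode it thinks it is in. Below is a minimal Python
sketch that sends ZooKeeper's four-letter "srvr" command to a peer and pulls
out the "Mode:" line of the reply (leader, follower, or standalone). The
function names are illustrative, port 2181 is only the conventional default,
and newer ZooKeeper releases may require four-letter commands to be enabled
in the server config before they answer.

```python
import re
import socket


def four_letter_word(host, port=2181, command="srvr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Extract the value of the 'Mode:' line from a reply, or None if absent."""
    match = re.search(r"^Mode:\s*(\S+)", reply, re.MULTILINE)
    return match.group(1) if match else None


def peer_modes(hosts, port=2181):
    """Return {host: mode}; mode is None if the peer is unreachable or silent."""
    modes = {}
    for host in hosts:
        try:
            modes[host] = parse_mode(four_letter_word(host, port))
        except OSError:
            modes[host] = None
    return modes
```

A healthy three-peer quorum should report one leader and two followers. If
every peer answers "standalone", the peers are not forming a quorum at all,
which would explain clients behaving differently depending on which peer
they reach first.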

Additional thoughts?


On Wed, Oct 17, 2012 at 9:12 AM, Ramkrishna.S.Vasudevan
<ra...@huawei.com> wrote:
> Can you try starting any of the regionservers that are not connecting at
> all? Maybe start 2 of them.
> Observe the master logs and see whether they say
> 'Waiting for RegionServers to checkin'.
>
> Just to confirm: are your ZK IP and port correct throughout the cluster? If
> this is a multitenant cluster, maybe the other regionservers are connecting
> to some other ZK cluster?
> Wild guess :)
>
> Regards
> Ram