Posted to user@hbase.apache.org by Jay Wilson <re...@circle-cross-jn.com> on 2012/07/06 05:25:25 UTC

HBASE -- RS expire?

Finally my HMaster has stabilized and been running for 7 hours.  I
believe my networking issues are behind me now.  Thank you everyone for
the help.

The new issue is that my RSes keep dying after about 20 minutes.  Again,
the cluster is idle.  No jobs are running, and I get this on all of my
RSes at almost the same time:

2012-07-05 19:34:05,283 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server devrackA-04/172.18.0.5:2181
2012-07-05 19:34:05,288 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to devrackA-04/172.18.0.5:2181, initiating session
2012-07-05 19:34:05,301 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server devrackA-04/172.18.0.5:2181, sessionid
= 0x13858fc240f0003, negotiated timeout = 180000
2012-07-05 19:34:05,399 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown
hook thread: Shutdownhook:regionserver60020
2012-07-05 20:06:40,279 INFO org.apache.zookeeper.ClientCnxn: Unable to
read additional data from server sessionid 0x13858fc240f0003, likely
server has closed socket, closing socket connection and attempting reconnect
2012-07-05 20:06:40,573 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server devrackA-03/172.18.0.4:2181
2012-07-05 20:06:40,574 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to devrackA-03/172.18.0.4:2181, initiating session
2012-07-05 20:06:40,578 INFO org.apache.zookeeper.ClientCnxn: Unable to
reconnect to ZooKeeper service, session 0x13858fc240f0003 has expired,
closing socket connection
2012-07-05 20:06:40,586 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
server serverName=devrackB-07,60020,1341542045088, load=(requests=0,
regions=0, usedHeap=0, maxHeap=0): regionserver:60020-0x13858fc240f0003
regionserver:60020-0x13858fc240f0003 received expired from ZooKeeper,
aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
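
One quick check on the log above is to measure how long the session actually lived before the socket dropped. The following is a minimal sketch added for illustration, not part of the original post: the helper name parse_ts is made up, and the two timestamps are copied from the "Session establishment complete" and "Unable to read additional data" lines.

```python
from datetime import datetime

LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # log timestamp with a millisecond suffix


def parse_ts(text: str) -> datetime:
    """Parse a timestamp such as '2012-07-05 19:34:05,301' (hypothetical helper)."""
    return datetime.strptime(text, LOG_TS_FORMAT)


established = parse_ts("2012-07-05 19:34:05,301")  # session establishment complete
dropped = parse_ts("2012-07-05 20:06:40,279")      # unable to read additional data

lifetime_s = (dropped - established).total_seconds()
negotiated_timeout_s = 180000 / 1000  # "negotiated timeout = 180000" is in milliseconds

print(f"session lived {lifetime_s:.3f} s")
print(f"that is {lifetime_s / negotiated_timeout_s:.1f}x the negotiated timeout")
```

For this log the lifetime comes out to roughly 32.5 minutes, so the session survived many timeout periods on an idle cluster before it was lost.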

Could the fact that the cluster is idle cause the sessions to expire?
It's almost as if a timer pops, the sessions expire, and then the RSes
can reconnect.  Is there a timer I need to adjust?

Could this be related to a TCP or IP timer that needs to be adjusted?
Does the session go into a FIN_WAIT state and then close?

Thank you
---
Jay Wilson

RE: HBASE -- RS expire?

Posted by Pablo Musa <pa...@psafe.com>.
> Is your ZK managed by HBase or are you managing it yourself?

Is it a good or bad option?
What are the pros and cons of each one?

Thanks,
Pablo

Re: HBASE -- RS expire?

Posted by re...@circle-cross-jn.com.

	My ZK is managed by HBase.

	Yes, all nodes in the cluster can see my 3 ZKs.  All of them are on
the same subnet (172.18/16) and physically on the same switch.

	---

	Jay Wilson 

----- Original Message -----
From: user@hbase.apache.org
To:
Cc:
Sent:Thu, 5 Jul 2012 21:41:52 -0700
Subject:Re: HBASE -- RS expire?

 Is your ZK managed by HBase or are you managing it yourself?

 BTW - All ZK nodes should be reachable by all nodes in the cluster.

 The YouAreDeadException would be in RS logs if at all.



Re: HBASE -- RS expire?

Posted by Amandeep Khurana <am...@gmail.com>.
Is your ZK managed by HBase or are you managing it yourself?

BTW - All ZK nodes should be reachable by all nodes in the cluster.

The YouAreDeadException would be in RS logs if at all.


On Thursday, July 5, 2012 at 9:38 PM, Jay Wilson wrote:

> Funny you mention that. I asked the techs to set it up that way.
> 
> I went to pull my ZK logs and found that 1 RS is still running. What is
> interesting is that this RS is connected to ZK on devrackA-05. The 2 RSes
> that died were connected to ZK on devrackA-03. devrackA-03 has ZK and
> HMaster on it.
> 
> I did not find the YouAreDeadException in the ZK logs. What I found was:
> 
> 2012-07-05 20:06:40,577 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Accepted socket connection from /172.18.0.72:54449
> 2012-07-05 20:06:40,579 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Client attempting to renew session 0x13858fc240f0003 at /172.18.0.72:54449
> 2012-07-05 20:06:40,579 INFO org.apache.zookeeper.server.quorum.Learner:
> Revalidating client: 87918032693690371
> 2012-07-05 20:06:40,580 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Invalid session 0x13858fc240f0003 for client /172.18.0.72:54449,
> probably expired
> 
> In the RS logs I can see it attempt to reconnect with ZK on devrackA-03,
> get rejected, and then attempt ZK on devrackA-04.
> 
> ---
> Jay Wilson



Re: HBASE -- RS expire?

Posted by Jay Wilson <re...@circle-cross-jn.com>.
Funny you mention that.  I asked the techs to set it up that way.

I went to pull my ZK logs and found that 1 RS is still running.  What is
interesting is that this RS is connected to ZK on devrackA-05.  The 2 RSes
that died were connected to ZK on devrackA-03.  devrackA-03 has ZK and
HMaster on it.

I did not find the YouAreDeadException in the ZK logs.  What I found was:

2012-07-05 20:06:40,577 INFO org.apache.zookeeper.server.NIOServerCnxn:
Accepted socket connection from /172.18.0.72:54449
2012-07-05 20:06:40,579 INFO org.apache.zookeeper.server.NIOServerCnxn:
Client attempting to renew session 0x13858fc240f0003 at /172.18.0.72:54449
2012-07-05 20:06:40,579 INFO org.apache.zookeeper.server.quorum.Learner:
Revalidating client: 87918032693690371
2012-07-05 20:06:40,580 INFO org.apache.zookeeper.server.NIOServerCnxn:
Invalid session 0x13858fc240f0003 for client /172.18.0.72:54449,
probably expired

In the RS logs I can see it attempt to reconnect with ZK on devrackA-03,
get rejected, and then attempt ZK on devrackA-04.
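
The ZK server log prints the client as a decimal number ("Revalidating client: 87918032693690371"), while the RS log prints the session ID in hex (0x13858fc240f0003). A small check, added here as an illustration only, confirms that both lines refer to the same session. The variable names are made up and do not come from any HBase or ZooKeeper tool.

```python
rs_log_session_id = "0x13858fc240f0003"  # as printed by ClientCnxn in the RS log
zk_log_client_id = "87918032693690371"   # as printed by Learner in the ZK log

session_from_rs = int(rs_log_session_id, 16)  # int() accepts the 0x prefix with base 16
session_from_zk = int(zk_log_client_id)

same_session = session_from_rs == session_from_zk
print(hex(session_from_zk), same_session)
```

So the "Invalid session ... probably expired" entry on the ZK side is about the same session that the aborting RS reports.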

---
Jay Wilson



On 7/5/2012 9:08 PM, Amandeep Khurana wrote:
> The timeout can be configured using the session timeout configuration. The default for that is 180s, but that means that if the RS doesn't heartbeat to ZK for 180s, it's considered dead. Unless the machines are really loaded or GCs are pausing the RS processes, I don't see any other reason except the network. I'm assuming you gave ZK a dedicated disk so it could write its edit logs (based on a previous thread). 



Re: HBASE -- RS expire?

Posted by Amandeep Khurana <am...@gmail.com>.
The timeout can be configured using the session timeout configuration. The default for that is 180s, but that means that if the RS doesn't heartbeat to ZK for 180s, it's considered dead. Unless the machines are really loaded or GCs are pausing the RS processes, I don't see any other reason except the network. I'm assuming you gave ZK a dedicated disk so it could write its edit logs (based on a previous thread). 
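
For reference, HBase requests the session timeout through the zookeeper.session.timeout property (in milliseconds), and the ZooKeeper server may clamp the requested value to its own minSessionTimeout and maxSessionTimeout bounds, which default to 2 and 20 times tickTime. The sketch below models that clamping under those assumptions. The function name and the tick value are illustrative and are not taken from this cluster's configuration.

```python
def negotiate_session_timeout(requested_ms, tick_ms, min_ms=None, max_ms=None):
    """Model how a server clamps a client's requested session timeout.

    Hypothetical helper: bounds default to 2x and 20x the tick time
    unless explicit minimum or maximum values are supplied.
    """
    lower = 2 * tick_ms if min_ms is None else min_ms
    upper = 20 * tick_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))


# Assumed tick of 2000 ms: a 180 s request would be cut down to 40 s.
clamped = negotiate_session_timeout(180_000, tick_ms=2_000)

# With the server maximum raised to 180 s, the full request is granted.
granted = negotiate_session_timeout(180_000, tick_ms=2_000, max_ms=180_000)

print(clamped, granted)
```

The RS log in this thread shows "negotiated timeout = 180000", which suggests the server-side maximum on this cluster is at least 180 s, so the full timeout was in effect.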


On Thursday, July 5, 2012 at 9:03 PM, Jay Wilson wrote:

> I don't see that in the RS logs. Would I see that in the ZK logs?
> 
> At this point there is no network. Just a switch. I reduced the number
> of nodes to 40 and had all of them placed on the same switch with a
> single vlan. I even had the network techs use a completely different
> switch just to be safe.
> 
> Is there some heartbeat timer I can tweak?
> 
> ---
> Jay Wilson



Re: HBASE -- RS expire?

Posted by Jay Wilson <re...@circle-cross-jn.com>.
I don't see that in the RS logs.  Would I see that in the ZK logs?

At this point there is no network.  Just a switch.  I reduced the number
of nodes to 40 and had all of them placed on the same switch with a
single vlan.  I even had the network techs use a completely different
switch just to be safe.

Is there some heartbeat timer I can tweak?

---
Jay Wilson

On 7/5/2012 8:34 PM, Amandeep Khurana wrote:
> 
> 
> On Thursday, July 5, 2012 at 8:25 PM, Jay Wilson wrote:
> 
>> Finally my HMaster has stabilized and been running for 7 hours. I
>> believe my networking issues are behind me now. Thank you everyone for
>> the help.
>>
>>
> 
> Awesome.
> 
> Looks like the same issue is biting you with the RS too. The RS isn't heartbeating to ZK and the ZK session expires, causing the RS to die.
> Do you see a YouAreDeadException in the logs? 



Re: HBASE -- RS expire?

Posted by Amandeep Khurana <am...@gmail.com>.

On Thursday, July 5, 2012 at 8:25 PM, Jay Wilson wrote:

> Finally my HMaster has stabilized and been running for 7 hours. I
> believe my networking issues are behind me now. Thank you everyone for
> the help.
> 
> 

Awesome.

Looks like the same issue is biting you with the RS too. The RS isn't heartbeating to ZK and the ZK session expires, causing the RS to die.
Do you see a YouAreDeadException in the logs? 
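
The mechanism described here can be sketched as a toy model: the server remembers when each session was last heard from and expires any session that stays silent for longer than the timeout. This is a simplified illustration with made-up class and method names, not ZooKeeper's actual session tracker.

```python
class ToySessionTracker:
    """Toy model of heartbeat-based session expiry (hypothetical, not ZooKeeper code)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # session id -> time of the most recent heartbeat

    def heartbeat(self, session_id, now_s):
        """Record that a session was heard from at time now_s."""
        self.last_seen[session_id] = now_s

    def expire(self, now_s):
        """Remove and return the sessions that have been silent too long."""
        dead = sorted(
            sid for sid, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )
        for sid in dead:
            del self.last_seen[sid]
        return dead


tracker = ToySessionTracker(timeout_s=180)
tracker.heartbeat("rs-a", now_s=0)
tracker.heartbeat("rs-b", now_s=0)
tracker.heartbeat("rs-a", now_s=120)  # rs-a keeps pinging, rs-b goes silent

expired = tracker.expire(now_s=200)
print(expired)
```

In this model an expired session is gone for good: a client that reconnects afterwards has nothing to resume, which matches the "session 0x13858fc240f0003 has expired" line in the RS log.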