Posted to user@hbase.apache.org by "Murali Krishna. P" <mu...@yahoo.com> on 2009/06/28 17:23:27 UTC

Region servers going down frequently (0.20 alpha)

Hi,
  I am repeatedly running into this issue where all the region servers try to restart but fail to come up. All the region servers seem to hit the same kind of exception, which causes this state.

My cluster is as follows:
node1 : Master, NN, DN, RS, TT, XX
node2: Zookeeper, JT, DN, RS, TT, XX
node3: DN, RS, TT, XX

where XX is my own HBase client with around 150 threads writing to a common table.

The setup works fine for some time and then goes down (after 20 to 30 minutes). Here is the sequence in the region server logs:

	* RS gets a ZooKeeper event:
Got ZooKeeper event, state: Disconnected, type: None, path: null
	* RS retries 'Processing message' and gets LeaseStillHeldException:
2009-06-28 02:14:17,013 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 1)
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
	* After 10 retries, RS gets another ZooKeeper event:
Got ZooKeeper event, state: Expired, type: None, path: null
2009-06-28 02:14:17,751 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
2009-06-28 02:14:17,751 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Restarting Region Server
	* RS decides to restart, but logs errors like this:
2009-06-28 02:14:17,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 280 on 60020, call exists([B@75880048, row=724b330295375ad0ba68fa85325381, maxVersions=1, timeRange=[0,9223372036854775807), families=ALL) from 69.147.127.248:48945: error: java.io.IOException: Server not running, aborting
	* The above might be happening because client 'XX' is still trying to write? Finally RS closes and tries to restart, but gets the following exception:
2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Runs every 10000000ms
2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
2009-06-28 02:14:27,032 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
java.lang.NullPointerException
        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
        at java.lang.Thread.run(Thread.java:619)
2009-06-28 02:14:27,110 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Unhandled exception. Aborting...
java.io.IOException: Region server startup failed
        at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:832)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:751)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
        ... 2 more
2009-06-28 02:14:27,122 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=9, stores=10, storefiles=20, storefileIndexSize=0, memcacheSize=52, usedHeap=170, maxHeap=1995, blockCacheSize=49971560, blockCacheFree=28440, blockCacheCount=765, blockCacheHitRatio=94
2009-06-28 02:14:27,131 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
2009-06-28 02:14:27,136 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: 0.0.0.0:60020

The region server dies after that. All 3 region servers die like this and I have to start them manually. But after 10-15 minutes, the cluster runs into the same state again. Please help me find the root cause of this.

Thanks,
Murali Krishna

Re: Region servers going down frequently (0.20 alpha)

Posted by Nitay <ni...@gmail.com>.

Hi Murali,

Sorry I'm late to join the conversation. Yes, running a full ZK
quorum (3 or 5 servers) is key for handling the sort of load you are
seeing.
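
A rough way to see why 3 or 5 servers is the usual recommendation: a ZooKeeper ensemble stays available only while a strict majority of its servers is up. The sketch below just does that arithmetic. The function names are made up for illustration and are not part of any ZooKeeper or HBase API.

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be lost while a majority is still up."""
    return ensemble_size - majority(ensemble_size)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} server(s): majority={majority(n)}, "
              f"can lose {tolerated_failures(n)}")
```

Under this arithmetic a single server tolerates no failures, and going from 3 to 4 servers does not increase the number of failures tolerated, which is why odd ensemble sizes are normally chosen.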

About the RegionServer not coming back up, can you provide us the full
logs (with DEBUG enabled) on those machines?

Thanks,
-n

> On Mon, Jun 29, 2009 at 8:51 AM, Murali Krishna. P <mu...@yahoo.com> wrote:
>
>> Hi,
>>  Just want to send the latest update. With 3 ZK peers, the cluster was much
>> more stable. The entire setup ran fine for more than 4 hours. But then the
>> same problem occurred and one of the region servers shut down. It looks like
>> a ZooKeeper session expiry event, after which the RS restart fails.
>>
>> Master log:
>> 2009-06-29 08:18:31,100 INFO org.apache.hadoop.hbase.master.ServerManager: 3 region servers, 0 dead, average load 2.0
>> 2009-06-29 08:18:34,544 INFO org.apache.hadoop.hbase.master.ServerManager: RSHOST,60020,1246272222149 znode expired
>> 2009-06-29 08:18:34,797 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server scrub02.image.search.re4.yahoo.com,60020,1246272222149: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
>>
>> RS log:
>> 2009-06-29 08:19:15,731 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 60203ms, ten times longer than scheduled: 3000
>> 2009-06-29 08:19:15,955 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
>> 2009-06-29 08:19:16,029 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 0)
>> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>>
>>
>> Maybe the machine gets busy and ZooKeeper has issues with that, but it
>> is surprising that the region server is not able to recover. Any clue?
>>
>> Thanks,
>> Murali Krishna
>>
>>
>>
>>
>> ________________________________
>> From: Andrew Purtell <ap...@apache.org>
>> To: hbase-user@hadoop.apache.org
>> Sent: Monday, 29 June, 2009 12:14:57 PM
>> Subject: Re: Region servers going down frequently (0.20 alpha)
>>
>> Hi,
>>
>> Configuring 'myid' files is part of the ZooKeeper setup process.
>> Are you aware of the instructions for how to set up ZooKeeper here:
>>
>>  http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html
>>
>> ?
>>
>> From:
>>
>>
>> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkMulitServerSetup
>>
>> "For reliable ZooKeeper service, you should deploy ZooKeeper in a cluster
>> known as an ensemble. [...] Here are the steps to setting a server that will
>> be part of an ensemble. These steps should be performed on every host in the
>> ensemble: ..."
>>
>>    - Andy
>>
>>
>>
>>
>> ________________________________
>> From: Murali Krishna. P <mu...@yahoo.com>
>> To: hbase-user@hadoop.apache.org
>> Sent: Sunday, June 28, 2009 10:12:02 PM
>> Subject: Re: Region servers going down frequently (0.20 alpha)
>>
>> Hi Andrew,
>> Thanks for looking into this.
>> I tried adding 3 nodes to the zoo.cfg but it threw an error saying the 'myid'
>> file is missing. Now even if I go back to my old config, it still throws the
>> error :(
>>
>> Thanks,
>> Murali Krishna
>>
>>
>>
>>
>> ________________________________
>> From: Andrew Purtell <ap...@apache.org>
>> To: hbase-user@hadoop.apache.org
>> Sent: Sunday, 28 June, 2009 10:47:12 PM
>> Subject: Re: Region servers going down frequently (0.20 alpha)
>>
>> Hello,
>>
>> As a first step, deploy Zookeeper quorum peers on all of your nodes and
>> list all peers in the zoo.cfg files of your Zookeeper install and HBase:
>>
>>  server.1=node1:2888:3888
>>  server.2=node2:2888:3888
>>  server.3=node3:2888:3888
>>
>> Are you running MapReduce tasks in addition to what you have described?
>>
>> Do you see any messages in the master or region server logs along the lines
>> of "we slept for NNNNNNms, wanted NNNNms"? How much RAM do these nodes have?
>> Do you have host-level metrics running? If not, consider watching this with
>> Ganglia or, since the cluster is so small, with three terminals running top
>> or atop. After 20 to 30 minutes, is all available RAM full and are the nodes
>> going into swap?
>>
>>   - Andy

Re: Region servers going down frequently (0.20 alpha)

Posted by Andrew Purtell <ap...@apache.org>.

Hi,

I suspect your cluster is too small for what you are trying to do
with it. This is why I first suggested setting up a stable ZK ensemble,
as this would be one of the first components to be affected by
out-of-tolerance latencies due to swapping.

How much RAM do these nodes have? How much heap are you allocating to
each JVM? (Have you made changes to the -Xmx settings in the launcher
script?) Do you have host-level metrics running? If not, consider
watching this with Ganglia or, since the cluster is so small, with
three terminals running top or atop. After 20 to 30 minutes, is
all available RAM full and are the nodes going into swap?

I have been debating if an all RAM system is best for Hadoop and HBase
installations. Because the latency of the JVM GC cycle will spike out
of tolerance if the heap goes into swap, running without swap keeps
everyone honest. For example, a node running the following services:

   datanode (1 GB heap)
   region server (2 GB heap)
   task tracker (1 GB heap)
   map / reduce tasks (200 MB heap each is the default) x 4

and ~1 GB for kernel and system daemons can run within 8 GB of RAM. 
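
The budget above can be added up explicitly. This is only a sketch of the arithmetic: the figures are the ones listed in this message, the names are illustrative, and a JVM's real footprint is somewhat larger than its configured heap (thread stacks, native memory and so on), so treat the headroom as an upper bound.

```python
# Sizes in MB, taken from the example node in the message above.
# The ~1 GB for kernel and system daemons is treated as a flat reservation.
BUDGET_MB = {
    "datanode": 1024,
    "region server": 2048,
    "task tracker": 1024,
    "map/reduce tasks (4 x 200 MB)": 4 * 200,
    "kernel and system daemons": 1024,
}


def total_mb(budget):
    """Sum of all budgeted memory, in MB."""
    return sum(budget.values())


def headroom_mb(budget, physical_ram_mb):
    """RAM left over after the budget; negative means the node would swap."""
    return physical_ram_mb - total_mb(budget)


if __name__ == "__main__":
    ram_mb = 8 * 1024
    print("budgeted:", total_mb(BUDGET_MB), "MB")
    print("headroom:", headroom_mb(BUDGET_MB, ram_mb), "MB")
```

With these numbers the services fit in 8 GB with a bit over 2 GB to spare; the same budget on a 4 GB node would come out negative.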

If your tables are going to be really large (or why else use Bigtable?)
and the workload can benefit from caching, up the RAM and allocate more
heap to the region servers for cache. 

   - Andy




Re: Region servers going down frequently (0.20 alpha)

Posted by stack <st...@duboce.net>.

Is this a recent TRUNK?

I see this line in your log excerpt above:

2009-06-29 08:19:15,731 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 60203ms, ten times longer than scheduled: 3000

GCs are taking so long that the lease expires and the session is lost. Have
you checked out this page: http://wiki.apache.org/hadoop/PerformanceTuning ?

Make sure you're using the CMS collector and roughly the right tunings for
your hardware.

What else is running on this machine?  Heavy-duty processes?

You could try upping the ZK timeout. It looks like it's 30 seconds by default:

  <property>
    <name>zookeeper.session.timeout</name>
    <value>30000</value>
    <description>ZooKeeper session timeout. This option is not used by HBase
      directly, it is for the internals of ZooKeeper. HBase merely passes it in
      whenever a connection is established to ZooKeeper. It is used by ZooKeeper
      for heartbeats. In milliseconds.
    </description>
  </property>

Going by the 60-second pause in the log line above, you would need to set it to 90 seconds or so.
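
To make the arithmetic explicit: the Sleeper warning reports a 60203 ms stall, which is longer than the 30000 ms default session timeout but shorter than a 90000 ms one. A small sketch of that check follows. The regular expression is written against the one log line quoted in this thread, and the helper names are illustrative, not HBase APIs. The comparison is a simplification, since the real expiry decision is made by the ZooKeeper server.

```python
import re

# Pattern for the Sleeper warning as it appears in the log excerpt above.
SLEEPER_RE = re.compile(
    r"We slept (\d+)ms, ten times longer than scheduled: (\d+)"
)

SAMPLE_LINE = (
    "2009-06-29 08:19:15,731 WARN org.apache.hadoop.hbase.util.Sleeper: "
    "We slept 60203ms, ten times longer than scheduled: 3000"
)


def parse_sleeper_line(line):
    """Return (slept_ms, scheduled_ms), or None if the line does not match."""
    match = SLEEPER_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pause_exceeds_timeout(slept_ms, session_timeout_ms):
    """True if the observed stall is longer than the session timeout."""
    return slept_ms > session_timeout_ms


if __name__ == "__main__":
    slept_ms, scheduled_ms = parse_sleeper_line(SAMPLE_LINE)
    for timeout_ms in (30000, 90000):
        print(timeout_ms, pause_exceeds_timeout(slept_ms, timeout_ms))
```

A longer timeout only hides the stall from ZooKeeper; the pause itself still needs to be addressed.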

Regarding the regionserver, please send full logs. It's supposed to recover
from a lease expiration. With full logs we'll be able to check it out. But
you should be running in DEBUG mode so there is more info on what's going on.
For how to set DEBUG, please see the FAQ.

Thanks for your patience,
St.Ack


On Mon, Jun 29, 2009 at 8:51 AM, Murali Krishna. P
<mu...@yahoo.com>wrote:

> Hi,
>  Just want to send the latest update...With 3 ZK peers, it was much more
> stable. The entire setup ran fine for more than 4 hours. But got the same
> problem after that and one of the region server was shutdown. Looks like
> zookeeper expire event and the RS restart is failing..
>
> Master log:
> 2009-06-29 08:18:31,100 INFO org.apache.hadoop.hbase.master.ServerManager:
> 3 region servers, 0 dead, average load 2.0
> 2009-06-29 08:18:34,544 INFO org.apache.hadoop.hbase.master.ServerManager:
> RSHOST,60020,1246272222149 znode expir
> ed
> 2009-06-29 08:18:34,797 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of
> server scrub02.image.search.re4.yahoo.
> com,60020,1246272222149: logSplit: false, rootRescanned: false,
> numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
>
> RS log:
> 2009-06-29 08:19:15,731 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 60203ms, ten times longer than scheduled: 3000
> 2009-06-29 08:19:15,955 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: Disconnected, type: None, path:
> null
> 2009-06-29 08:19:16,029 WARN
> org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message
> (Retry: 0)
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>
>
> May be the machine goes busy and the zookeeper has issues with that, but it
> is surprising why the region server is not able to recover. Any clue?
>
> Thanks,
> Murali Krishna
>
>
>
>
> ________________________________
> From: Andrew Purtell <ap...@apache.org>
> To: hbase-user@hadoop.apache.org
> Sent: Monday, 29 June, 2009 12:14:57 PM
> Subject: Re: Region servers going down frequently (0.20 alpha)
>
> Hi,
>
> Configuring 'myid' files are part of the Zookeeper set up process.
> Are you aware of the instructions for how to set up Zookeeper here:
>
>  http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html
>
> ?
>
> From:
>
>
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkMulitServerSetup
>
> "For reliable ZooKeeper service, you should deploy ZooKeeper in a cluster
> known as an ensemble. [...] Here are the steps to setting a server that will
> be part of an ensemble. These steps should be performed on every host in the
> ensemble: ..."
>
>    - Andy
>
>
>
>
> ________________________________
> From: Murali Krishna. P <mu...@yahoo.com>
> To: hbase-user@hadoop.apache.org
> Sent: Sunday, June 28, 2009 10:12:02 PM
> Subject: Re: Region servers going down frequently (0.20 alpha)
>
> Hi Andrew,
> Thanks for looking into this.
> I tried adding 3 nodes to the zoo.cfg but it threw an erros saying 'myid'
> file is missing. Now even if i go back to my old config, it still throws the
> error :(
>
> Thanks,
> Murali Krishna
>
>
>
>
> ________________________________
> From: Andrew Purtell <ap...@apache.org>
> To: hbase-user@hadoop.apache.org
> Sent: Sunday, 28 June, 2009 10:47:12 PM
> Subject: Re: Region servers going down frequently (0.20 alpha)
>
> Hello,
>
> As a first step, deploy Zookeeper quorum peers on all of your nodes and
> list all peers in the zoo.cfg files of your Zookeeper install and HBase:
>
>  server.1=node1:2888:3888
>  server.2=node2:2888:3888
>  server.3=node3:2888:3888
>
> Are you running mapreduce tasks as well as otherwise what you have
> described
> below?
>
> Do you see any messages in the master or region server logs along the lines
> of "we slept for NNNNNNms, wanted NNNNms"? How much RAM do these nodes
> have?
> Do you have host level metrics running? If not, consider watching this with
> Ganglia, or, in this case, since the cluster is so small three terminals
> running top or atop. After 20, 30 minutes, is all available RAM full and
> are
> the nodes going in to swap?
>
>   - Andy
>
>
>
>
> ________________________________
> From: Murali Krishna. P <mu...@yahoo.com>
> To: hbase-user@hadoop.apache.org
> Sent: Sunday, June 28, 2009 8:23:27 AM
> Subject: Region servers going down frequently (0.20 alpha)
>
> Hi,
>  I am repeatedly running into this issue where all the region servers tries
> to restart but fails to come up. All the region servers seems to be having
> same kind of exception which causes this state.
>
> My cluster is as follows:
> node1 : Master, NN, DN, RS, TT, XX
> node2: Zookeeper, JT, DN, RS, TT, XX
> node3: DN, RS, TT, XX
>
> where  XX is my own hbase client with around 150 threads writing to a
> common table.
>
> The setup works fine for some time and then goes down (after 20, 30 mins).
> Here is the sequence in the region server logs..
>
>    * RS gets a zookeeper event : Got ZooKeeper event, state: Disconnected,
> type: None, path:
> null
>    * RS retries 'processing image', gets LeaseStillHeldE: 2009-06-28
> 02:14:17,013 WARN org.apache.hadoop.hbase.regionserver.HRegionServer:
> Processing message (Retry: 1)
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>    * After 10 retries, gets another zoookeeper event : Got ZooKeeper event,
> state: Expired, type: None, path: null
> 2009-06-28 02:14:17,751 ERROR
> org..apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
> expired
> 2009-06-28 02:14:17,751 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Restarting Region Server
>    * Decides to restart region server, but logs of error like this:
> 2009-06-28 02:14:17,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
> handler 280 on 60020, call exists([B@75880048,
> row=724b330295375ad0ba68fa85325381, maxVersions=1,
> timeRange=[0,9223372036854775807), families=ALL) from 69.147.127.248:48945:
> error: java.io.IOException: Ser
> ver not running, aborting
>    * Above might be happening because client 'XX' still trying to write?
> Finally it closes the region server and tries to restart. But gets the
> following exception:2009-06-28 02:14:26,462 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown
> thread.
> 2009-06-28 02:14:26,462 INFO
> org.apache.hadoop.hbase..regionserver.HRegionServer: Runs every 10000000ms
> 2009-06-28 02:14:26,462 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
> 2009-06-28 02:14:27,032 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
> java.lang.NullPointerException
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
>        at java.lang.Thread.run(Thread.java:619)
> 2009-06-28 02:14:27,110 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: Unhandled exception.
> Aborting...
> java.io.IOException: Region server startup failed
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:832)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:751)
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
>        at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.NullPointerException
>        at
> org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
>        ... 2 more
> 2009-06-28 02:14:27,122 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=0.0, regions=9, stores=10, storefiles=20,
> storefileIndexSize=0, memcacheSize=52, usedHeap=170, maxHeap=1995,
> blockCacheSize=49971560, blockCacheFree=28440, blockCacheCount=765,
> blockCacheHitRatio=94
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> server on 60020
> 2009-06-28 02:14:27,131 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> 2009-06-28 02:14:27,131 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
> 2009-06-28 02:14:27,136 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at:
> 0.0.0.0:60020
>
> The region server dies after that. All 3 region servers die like this
> and I have to start the region servers manually. But after 10-15 minutes, it
> runs into the same state again. Please help me find the root
> cause of this.
>
> Thanks,
> Murali Krishna
> /
>

Re: Region servers going down frequently (0.20 alpha)

Posted by "Murali Krishna. P" <mu...@yahoo.com>.

Hi,
  Just want to send the latest update. With 3 ZK peers, it was much more stable: the entire setup ran fine for more than 4 hours. But then the same problem came back and one of the region servers shut down. It looks like a ZooKeeper session expiry event followed by a failed RS restart.

Master log:
2009-06-29 08:18:31,100 INFO org.apache.hadoop.hbase.master.ServerManager: 3 region servers, 0 dead, average load 2.0
2009-06-29 08:18:34,544 INFO org.apache.hadoop.hbase.master.ServerManager: RSHOST,60020,1246272222149 znode expired
2009-06-29 08:18:34,797 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server scrub02.image.search.re4.yahoo.com,60020,1246272222149: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1

RS log:
2009-06-29 08:19:15,731 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 60203ms, ten times longer than scheduled: 3000
2009-06-29 08:19:15,955 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
2009-06-29 08:19:16,029 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 0)
org.apache.hadoop.hbase.Leases$LeaseStillHeldException


Maybe the machine gets busy and ZooKeeper has issues with that, but it is surprising that the region server is not able to recover. Any clue?
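
The Sleeper warning in the RS log above can be read as a simple overshoot check. A minimal sketch in Python, not HBase's actual code: the function names are hypothetical, and the 10x threshold is an assumption taken only from the wording of the log message. It shows how far the reported sleep overshot its schedule:

```python
def overshoot_factor(scheduled_ms: int, slept_ms: int) -> float:
    """Return how many times longer than scheduled the thread actually slept."""
    if scheduled_ms <= 0:
        raise ValueError("scheduled_ms must be positive")
    return slept_ms / scheduled_ms


def is_suspicious_pause(scheduled_ms: int, slept_ms: int, factor: float = 10.0) -> bool:
    """True when the sleep overshot its schedule by more than `factor` times.

    The default factor of 10 is an assumption based on the log text, not a
    value read from the HBase source.
    """
    return overshoot_factor(scheduled_ms, slept_ms) > factor


if __name__ == "__main__":
    # Values from the RS log line:
    # "We slept 60203ms, ten times longer than scheduled: 3000"
    print(round(overshoot_factor(3000, 60203), 1))
    print(is_suspicious_pause(3000, 60203))
```

A stall of roughly 60 seconds would outlast a ZooKeeper session whose timeout is shorter than that, which would be consistent with the znode expiry the master logged at 08:18:34. The actual session timeout in this setup is not stated in the thread.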

Thanks,
Murali Krishna




________________________________
From: Andrew Purtell <ap...@apache.org>
To: hbase-user@hadoop.apache.org
Sent: Monday, 29 June, 2009 12:14:57 PM
Subject: Re: Region servers going down frequently (0.20 alpha)

Hi,

Configuring 'myid' files is part of the Zookeeper set up process.
Are you aware of the instructions for how to set up Zookeeper here: 

  http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html

?

From:

  http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkMulitServerSetup

"For reliable ZooKeeper service, you should deploy ZooKeeper in a cluster known as an ensemble. [...] Here are the steps to setting a server that will be part of an ensemble. These steps should be performed on every host in the ensemble: ..."

   - Andy




________________________________
From: Murali Krishna. P <mu...@yahoo.com>
To: hbase-user@hadoop.apache.org
Sent: Sunday, June 28, 2009 10:12:02 PM
Subject: Re: Region servers going down frequently (0.20 alpha)

Hi Andrew,
Thanks for looking into this.
I tried adding 3 nodes to the zoo.cfg but it threw an error saying the 'myid' file is missing. Now even if I go back to my old config, it still throws the error :(

Thanks,
Murali Krishna




________________________________
From: Andrew Purtell <ap...@apache.org>
To: hbase-user@hadoop.apache.org
Sent: Sunday, 28 June, 2009 10:47:12 PM
Subject: Re: Region servers going down frequently (0.20 alpha)

Hello,

As a first step, deploy Zookeeper quorum peers on all of your nodes and 
list all peers in the zoo.cfg files of your Zookeeper install and HBase:

  server.1=node1:2888:3888
  server.2=node2:2888:3888
  server.3=node3:2888:3888

Are you running mapreduce tasks in addition to what you have described
below? 

Do you see any messages in the master or region server logs along the lines
of "we slept for NNNNNNms, wanted NNNNms"? How much RAM do these nodes have?
Do you have host level metrics running? If not, consider watching this with
Ganglia, or, in this case, since the cluster is so small three terminals
running top or atop. After 20, 30 minutes, is all available RAM full and are
the nodes going in to swap? 

   - Andy




________________________________
From: Murali Krishna. P <mu...@yahoo.com>
To: hbase-user@hadoop.apache.org
Sent: Sunday, June 28, 2009 8:23:27 AM
Subject: Region servers going down frequently (0.20 alpha)

Hi,
  I am repeatedly running into this issue where all the region servers try to restart but fail to come up. All the region servers seem to be hitting the same kind of exception, which causes this state.

My cluster is as follows:
node1 : Master, NN, DN, RS, TT, XX
node2: Zookeeper, JT, DN, RS, TT, XX
node3: DN, RS, TT, XX

where  XX is my own hbase client with around 150 threads writing to a common table.

The setup works fine for some time and then goes down (after 20, 30 mins). Here is the sequence in the region server logs..

    * RS gets a zookeeper event : Got ZooKeeper event, state: Disconnected, type: None, path: null
    * RS retries 'Processing message', gets LeaseStillHeldException: 2009-06-28 02:14:17,013 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 1)
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
    * After 10 retries, gets another zookeeper event : Got ZooKeeper event, state: Expired, type: None, path: null
2009-06-28 02:14:17,751 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
2009-06-28 02:14:17,751 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Restarting Region Server
    * Decides to restart region server, but logs errors like this: 2009-06-28 02:14:17,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 280 on 60020, call exists([B@75880048, row=724b330295375ad0ba68fa85325381, maxVersions=1, timeRange=[0,9223372036854775807), families=ALL) from 69.147.127.248:48945: error: java.io.IOException: Server not running, aborting
    * Above might be happening because client 'XX' is still trying to write? Finally it closes the region server and tries to restart. But it gets the following exception: 2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Runs every 10000000ms
2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
2009-06-28 02:14:27,032 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
java.lang.NullPointerException
        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
        at java.lang.Thread.run(Thread.java:619)
2009-06-28 02:14:27,110 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Unhandled exception. Aborting...
java.io.IOException: Region server startup failed
        at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:832)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:751)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
        ... 2 more
2009-06-28 02:14:27,122 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=9, stores=10, storefiles=20, storefileIndexSize=0, memcacheSize=52, usedHeap=170, maxHeap=1995, blockCacheSize=49971560, blockCacheFree=28440, blockCacheCount=765, blockCacheHitRatio=94
2009-06-28 02:14:27,131 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
2009-06-28 02:14:27,136 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: 0.0.0.0:60020

The region server dies after that. All 3 region servers die like this and I have to start the region servers manually. But after 10-15 minutes, it runs into the same state again. Please help me find the root cause of this.

Thanks,
Murali Krishna
/

Re: Region servers going down frequently (0.20 alpha)

Posted by "Murali Krishna. P" <mu...@yahoo.com>.

Hi,
I restarted the cluster after clearing the old logs and reproduced the problem. One of the 3 region servers went down this time. I am attaching the RS logs. (Still with one ZK.)
The bad part is that the regions are not reassigned, and if I try to do a 'get', it still tries to connect to that region server. :(

 Thanks,
Murali Krishna





Re: Region servers going down frequently (0.20 alpha)

Posted by Andrew Purtell <ap...@apache.org>.

Hi,

Configuring 'myid' files is part of the Zookeeper set up process.
Are you aware of the instructions for how to set up Zookeeper here: 

  http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html

?

From:

  http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkMulitServerSetup

"For reliable ZooKeeper service, you should deploy ZooKeeper in a cluster known as an ensemble. [...] Here are the steps to setting a server that will be part of an ensemble. These steps should be performed on every host in the ensemble: ..."

   - Andy
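
For context, the 'myid' file that the error message refers to is a one-line file in each peer's ZooKeeper data directory (the dataDir setting in zoo.cfg) holding that peer's server number. A minimal Python sketch, assuming the three server.N lines suggested in this thread; the helper names are hypothetical and not part of ZooKeeper:

```python
import re

# server.<id>=<host>:<peer port>:<election port>, as in the zoo.cfg lines
# quoted in this thread.
SERVER_LINE = re.compile(r"^\s*server\.(\d+)\s*=\s*([^:\s]+):(\d+):(\d+)\s*$")


def parse_server_ids(zoo_cfg_text: str) -> dict:
    """Map each host named in a zoo.cfg text to its numeric server id."""
    ids = {}
    for line in zoo_cfg_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            ids[match.group(2)] = int(match.group(1))
    return ids


def myid_contents(zoo_cfg_text: str, host: str) -> str:
    """Return the text that the myid file on `host` should contain."""
    ids = parse_server_ids(zoo_cfg_text)
    if host not in ids:
        raise KeyError(f"{host} is not listed as a server in zoo.cfg")
    return f"{ids[host]}\n"


if __name__ == "__main__":
    cfg = """
    server.1=node1:2888:3888
    server.2=node2:2888:3888
    server.3=node3:2888:3888
    """
    for name in ("node1", "node2", "node3"):
        print(name, "->", myid_contents(cfg, name).strip())
```

Each node would then need that single number written to a file named myid inside its own dataDir before the quorum peer is started.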





Re: Region servers going down frequently (0.20 alpha)

Posted by "Murali Krishna. P" <mu...@yahoo.com>.

Hi,
 I managed to bring up the cluster after resetting zoo.cfg to have only one server. But only one of the region servers is running now. The others are not starting, saying address already in use '0.0.0.0/60020', but there is no old process running.

Attached is the region server log of the one that is running. It seems to be stable so far. I have 4G of memory and the usage is close to 4G now.
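
One way to check the 'address already in use' report independently of the process list is to try binding the port directly. A minimal Python sketch (the function name is hypothetical); it assumes it is run on the affected node while the region server is stopped:

```python
import errno
import socket


def can_bind(host: str, port: int) -> bool:
    """Try to bind a TCP socket to host:port; False means the address is in use."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return False
        raise  # e.g. permission denied: a different problem, so surface it
    finally:
        sock.close()


if __name__ == "__main__":
    # 60020 is the region server port from the log lines in this thread.
    print("port 60020 free:", can_bind("0.0.0.0", 60020))
```

If this reports the port as taken while no region server process is visible, something else on the host is holding it; tools such as netstat or lsof can then show which process.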

 Thanks,
Murali Krishna




________________________________
From: Ryan Rawson <ry...@gmail.com>
To: hbase-user@hadoop.apache.org
Sent: Monday, 29 June, 2009 10:52:25 AM
Subject: Re: Region servers going down frequently (0.20 alpha)

Can you post more of the regionserver logs prior to the crash?

you can use pastebin.com if you'd like...

-ryan


Re: Region servers going down frequently (0.20 alpha)

Posted by Ryan Rawson <ry...@gmail.com>.

Can you post more of the regionserver logs prior to the crash?

you can use pastebin.com if you'd like...

-ryan

On Sun, Jun 28, 2009 at 10:12 PM, Murali Krishna.
P<mu...@yahoo.com> wrote:
> Hi Andrew,
>  Thanks for looking into this.
> I tried adding 3 nodes to the zoo.cfg but it threw an error saying the 'myid' file is missing. Now even if I go back to my old config, it still throws the error :(
>
>  Thanks,
> Murali Krishna
>
>
>
>
> ________________________________
> From: Andrew Purtell <ap...@apache.org>
> To: hbase-user@hadoop.apache.org
> Sent: Sunday, 28 June, 2009 10:47:12 PM
> Subject: Re: Region servers going down frequently (0.20 alpha)
>
> Hello,
>
> As a first step, deploy Zookeeper quorum peers on all of your nodes and
> list all peers in the zoo.cfg files of your Zookeeper install and HBase:
>
>  server.1=node1:2888:3888
>  server.2=node2:2888:3888
>  server.3=node3:2888:3888
>
> Are you running mapreduce tasks in addition to what you have described
> below?
>
> Do you see any messages in the master or region server logs along the lines
> of "we slept for NNNNNNms, wanted NNNNms"? How much RAM do these nodes have?
> Do you have host level metrics running? If not, consider watching this with
> Ganglia, or, in this case, since the cluster is so small three terminals
> running top or atop. After 20, 30 minutes, is all available RAM full and are
> the nodes going in to swap?
>
>   - Andy
>
>
>
>
> ________________________________
> From: Murali Krishna. P <mu...@yahoo.com>
> To: hbase-user@hadoop.apache.org
> Sent: Sunday, June 28, 2009 8:23:27 AM
> Subject: Region servers going down frequently (0.20 alpha)
>
> Hi,
>  I am repeatedly running into this issue where all the region servers try to restart but fail to come up. All the region servers seem to be hitting the same kind of exception, which causes this state.
>
> My cluster is as follows:
> node1 : Master, NN, DN, RS, TT, XX
> node2: Zookeeper, JT, DN, RS, TT, XX
> node3: DN, RS, TT, XX
>
> where  XX is my own hbase client with around 150 threads writing to a common table.
>
> The setup works fine for some time and then goes down (after 20, 30 mins). Here is the sequence in the region server logs..
>
>    * RS gets a zookeeper event : Got ZooKeeper event, state: Disconnected, type: None, path:
> null
>    * RS retries 'processing image', gets LeaseStillHeldE: 2009-06-28 02:14:17,013 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 1)
> org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>    * After 10 retries, gets another zoookeeper event : Got ZooKeeper event, state: Expired, type: None, path: null
> 2009-06-28 02:14:17,751 ERROR org..apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
> 2009-06-28 02:14:17,751 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Restarting Region Server
>    * Decides to restart region server, but logs of error like this: 2009-06-28 02:14:17,997 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 280 on 60020, call exists([B@75880048, row=724b330295375ad0ba68fa85325381, maxVersions=1, timeRange=[0,9223372036854775807), families=ALL) from 69.147.127.248:48945: error: java.io.IOException: Ser
> ver not running, aborting
>    * Above might be happening because client 'XX' still trying to write? Finally it closes the region server and tries to restart. But gets the following exception:2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
> 2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase..regionserver.HRegionServer: Runs every 10000000ms
> 2009-06-28 02:14:26,462 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
> 2009-06-28 02:14:27,032 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Failed init
> java.lang.NullPointerException
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer..java:431)
>        at java.lang.Thread.run(Thread.java:619)
> 2009-06-28 02:14:27,110 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: Unhandled exception. Aborting...
> java.io.IOException: Region server startup failed
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.convertThrowableToIOE(HRegionServer.java:832)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:751)
>        at org..apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:431)
>        at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.NullPointerException
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.init(HRegionServer.java:713)
>        ... 2 more
> 2009-06-28 02:14:27,122 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=0.0, regions=9, stores=10, storefil
> es=20, storefileIndexSize=0, memcacheSize=52, usedHeap=170, maxHeap=1995, blockCacheSize=49971560, blockCacheFree=28440, blockCacheCount=765,
> blockCacheHitRatio=94
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer
> 2009-06-28 02:14:27,131 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: On abort, closed hlog
> 2009-06-28 02:14:27,136 INFO org.apache.hadoop.hbase..regionserver.HRegionServer: aborting server at: 0.0.0.0:60020
>
> There region server dies after that. All the 3 region servers die like this and I have to start the region server manually. But aftert 10-15 minutes, it runs into the same stage again. Please help me in finding what is the root cause of this?
>
> Thanks,
> Murali Krishna
> /

Re: Region servers going down frequently (0.20 alpha)

Posted by "Murali Krishna. P" <mu...@yahoo.com>.

Hi Andrew,
 Thanks for looking into this.
I tried adding the 3 nodes to zoo.cfg, but it threw an error saying the 'myid' file is missing. Now even if I go back to my old config, it still throws the error :(

 Thanks,
Murali Krishna
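
On the 'myid' error: a ZooKeeper peer that finds server.N entries in zoo.cfg expects a file named myid in its dataDir, holding only that peer's own N. Below is a minimal Python sketch of keeping the two in step, assuming the three hostnames used in this thread. The data directory and both helper names are made up for illustration and are not part of ZooKeeper or HBase.

```python
from pathlib import Path

# Hypothetical peer list, mirroring the server.N lines suggested in this thread.
PEERS = {1: "node1", 2: "node2", 3: "node3"}


def server_lines(peers, quorum_port=2888, election_port=3888):
    """Return the server.N entries for zoo.cfg, one per peer, ordered by id."""
    return [
        f"server.{sid}={host}:{quorum_port}:{election_port}"
        for sid, host in sorted(peers.items())
    ]


def write_myid(data_dir, peers, hostname):
    """Write the myid file for `hostname` into data_dir and return the id written."""
    ids = [sid for sid, host in peers.items() if host == hostname]
    if len(ids) != 1:
        raise ValueError(f"{hostname!r} must appear exactly once in the peer list")
    path = Path(data_dir)
    path.mkdir(parents=True, exist_ok=True)
    # The file holds nothing but the numeric id of this peer.
    (path / "myid").write_text(f"{ids[0]}\n")
    return ids[0]


if __name__ == "__main__":
    print("\n".join(server_lines(PEERS)))
    # "/tmp/zookeeper-example" is a placeholder; use the dataDir from your zoo.cfg.
    write_myid("/tmp/zookeeper-example", PEERS, "node2")
```

Run it once per node with that node's hostname, then compare the file contents against the matching server.N line.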





Re: Region servers going down frequently (0.20 alpha)

Posted by Andrew Purtell <ap...@apache.org>.

Hello,

As a first step, deploy Zookeeper quorum peers on all of your nodes and 
list all peers in the zoo.cfg files of your Zookeeper install and HBase:

  server.1=node1:2888:3888
  server.2=node2:2888:3888
  server.3=node3:2888:3888

Are you running MapReduce tasks in addition to what you have described?

Do you see any messages in the master or region server logs along the lines
of "we slept for NNNNNNms, wanted NNNNms"? How much RAM do these nodes have?
Do you have host-level metrics running? If not, consider watching this with
Ganglia or, since the cluster is so small, three terminals running top or
atop. After 20 to 30 minutes, is all available RAM full and are the nodes
going into swap?

   - Andy
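
One way to check for the long pauses asked about above is to scan the region server log for them alongside the ZooKeeper state changes. Here is a rough Python sketch, assuming only that the pause message contains the word "slept" followed by a millisecond count (the exact wording differs between HBase versions). scan_log is a hypothetical helper, not an HBase tool, and the pause line in the sample is invented for illustration.

```python
import re

# Loose pattern: only assumes "slept" followed by a number of milliseconds.
SLEPT = re.compile(r"slept\D*?(\d+)\s*ms", re.IGNORECASE)
# Matches the ZooKeeper event lines quoted in this thread.
ZK_STATE = re.compile(r"Got ZooKeeper event, state: (\w+)")


def scan_log(lines):
    """Collect pause durations (ms) and ZooKeeper state changes from log lines."""
    pauses, states = [], []
    for line in lines:
        match = SLEPT.search(line)
        if match:
            pauses.append(int(match.group(1)))
        match = ZK_STATE.search(line)
        if match:
            states.append(match.group(1))
    return pauses, states


if __name__ == "__main__":
    sample = [
        "we slept for 45000ms, wanted 1000ms",  # invented sample line
        "Got ZooKeeper event, state: Disconnected, type: None, path: null",
        "Got ZooKeeper event, state: Expired, type: None, path: null",
    ]
    print(scan_log(sample))
```

A pause close to or above the ZooKeeper session timeout shortly before an Expired event would point toward garbage collection or swapping rather than a network fault.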



