You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Vimal Jain <vk...@gmail.com> on 2013/06/05 08:28:56 UTC

HMaster and HRegionServer going down

Hi,
I have set up Hbase in pseudo-distributed mode.
It was working fine for 6 days , but suddenly today morning both HMaster
and Hregion process went down.
I checked in logs of both hadoop and hbase.
Please help here.
Here are the snippets :-

*Datanode logs:*
2013-06-05 05:12:51,436 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
for block blk_1597245478875608321_2818 java.io.EOFException: while trying
to read 2347 bytes
2013-06-05 05:12:51,442 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_1597245478875608321_2818 received exception java.io.EOFException: while
trying to read 2347 bytes
2013-06-05 05:12:51,442 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
192.168.20.30:50010,
storageID=DS-1816106352-192.168.20.30-50010-1369314076237, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 2347 bytes


*HRegion logs:*
2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
4694929ms instead of 3000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception  for block
blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000 millis
timeout while waiting for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333remote=/
192.168.20.30:50010]
2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
11695345ms instead of 10000000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_1597245478875608321_2818 bad datanode[0]
192.168.20.30:50010
2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error while
syncing
java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
2013-06-05 05:12:51,110 FATAL
org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting
close of hlog
java.io.IOException: Reflection
Caused by: java.lang.reflect.InvocationTargetException
Caused by: java.io.IOException: DFSOutputStream is closed
2013-06-05 05:12:51,180 FATAL
org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting
close of hlog
java.io.IOException: Reflection
Caused by: java.lang.reflect.InvocationTargetException
Caused by: java.io.IOException: DFSOutputStream is closed
2013-06-05 05:12:51,183 ERROR
org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
java.io.IOException: Reflection
Caused by: java.lang.reflect.InvocationTargetException
Caused by: java.io.IOException: DFSOutputStream is closed
2013-06-05 05:12:51,184 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
Riding over HLog close failure! error count=1
2013-06-05 05:12:52,557 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
hbase.rummycircle.com,60020,1369877672964:
regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001
received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
2013-06-05 05:12:52,557 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
loaded coprocessors are: []
2013-06-05 05:12:52,621 INFO
org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
interrupted while waiting for task, exiting: java.lang.InterruptedException
java.io.InterruptedIOException: Aborting compaction of store cfp_info in
region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
because user requested stop.
2013-06-05 05:12:53,425 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper exception:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
,60020,1369877672964
2013-06-05 05:12:55,426 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper exception:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
,60020,1369877672964
2013-06-05 05:12:59,427 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper exception:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
,60020,1369877672964
2013-06-05 05:13:07,427 WARN
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
ZooKeeper exception:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
,60020,1369877672964
2013-06-05 05:13:07,427 ERROR
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete
failed after 3 retries
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
,60020,1369877672964
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception
closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/
hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
    at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)


*HMaster logs:*
2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
4702394ms instead of 10000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
4988731ms instead of 300000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
4988726ms instead of 300000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
4698291ms instead of 10000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
4694502ms instead of 1000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
4694492ms instead of 1000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
4695589ms instead of 60000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster:
Master server abort: loaded coprocessors are: []
2013-06-05 05:12:52,465 INFO org.apache.hadoop.hbase.master.ServerManager:
Waiting for region servers count to settle; currently checked in 1, slept
for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
ms, interval of 1500 ms.
2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster:
Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal
error:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
2013-06-05 05:12:53,970 INFO org.apache.hadoop.hbase.master.ServerManager:
Waiting for region servers count to settle; currently checked in 1, slept
for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
ms, interval of 1500 ms.
2013-06-05 05:12:55,476 INFO org.apache.hadoop.hbase.master.ServerManager:
Waiting for region servers count to settle; currently checked in 1, slept
for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
ms, interval of 1500 ms.
2013-06-05 05:12:56,981 INFO org.apache.hadoop.hbase.master.ServerManager:
Finished waiting for region servers count to settle; checked in 1, slept
for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is
running.
2013-06-05 05:12:57,019 INFO
org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of
-ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
java.io.EOFException
2013-06-05 05:17:52,302 WARN
org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs
in [hdfs://
192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting]
installed = 19 but only 0 done
2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster:
master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received
expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
java.io.IOException: Giving up after tries=1
Caused by: java.lang.InterruptedException: sleep interrupted
2013-06-05 05:17:52,381 ERROR
org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
java.lang.RuntimeException: HMaster Aborted



-- 
Thanks and Regards,
Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Vimal Jain <vk...@gmail.com>.

Hi Azuryy ,
Currently i am not able to reproduce the problem.
Also i had checked namenode log ,i did not find any issue in it.


On Thu, Jun 6, 2013 at 10:53 AM, Azuryy Yu <az...@gmail.com> wrote:

> And, please check your namenode log.
>
>
> On Thu, Jun 6, 2013 at 1:20 PM, Azuryy Yu <az...@gmail.com> wrote:
>
> > Can you reproduce the problem? if yes,
> >
> > add the following in your hbase-env.sh
> >
> > export HBASE_MASTER_OPTS="-verbose:gc -XX:+PrintGCDateStamps
> > -XX:+PrintGCDetails -Xloggc:$HBASE_LOG_DIR/hmaster_gc.log
> > $HBASE_MASTER_OPTS"
> >
> > export HBASE_REGIONSERVER_OPTS="-verbose:gc -XX:+PrintGCDateStamps
> > -XX:+PrintGCDetails -Xloggc:$HBASE_LOG_DIR/hmaster_gc.log
> > $HBASE_REGIONSERVER_OPTS"
> >
> > then, you will got GC log, I just guess this problem was lead with GC.
> >
> >
> >
> > On Thu, Jun 6, 2013 at 10:53 AM, Vimal Jain <vk...@gmail.com> wrote:
> >
> >> Hi Azuryy/Ted,
> >> Can you please help here...
> >> On Jun 5, 2013 7:23 PM, "Kevin O'dell" <ke...@cloudera.com>
> wrote:
> >>
> >> > No!
> >> >
> >> > Just kidding, you can unsubscribe by going to the Apache site:
> >> >
> >> > http://hbase.apache.org/mail-lists.html
> >> >
> >> >
> >> > On Wed, Jun 5, 2013 at 9:34 AM, Joseph Coleman <
> >> > joe.coleman@infinitecampus.com> wrote:
> >> >
> >> > > Please remove me from this list
> >> > >
> >> > >
> >> > > On 6/5/13 8:32 AM, "Vimal Jain" <vk...@gmail.com> wrote:
> >> > >
> >> > > >Ok.
> >> > > >I dont have any batch read/write to hbase.
> >> > > >
> >> > > >
> >> > > >On Wed, Jun 5, 2013 at 6:08 PM, Azuryy Yu <az...@gmail.com>
> >> wrote:
> >> > > >
> >> > > >> gc log cannot get by default. need some configuration. do you
> have
> >> > some
> >> > > >> batch read or write to hbase?
> >> > > >>
> >> > > >> --Send from my Sony mobile.
> >> > > >> On Jun 5, 2013 8:25 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> >> > > >>
> >> > > >> > I dont have GC logs.Do you get it by default  or it has to be
> >> > > >>configured
> >> > > >> ?
> >> > > >> > After i came to know about crash , i checked which all
> processes
> >> are
> >> > > >> > running using "jps"
> >> > > >> > It displayed 4 processes ,
> >> "namenode","datanode","secondarynamenode"
> >> > > >>and
> >> > > >> > "HQuorumpeer".
> >> > > >> > So i stopped dfs by running $HADOOP_HOME/bin/stop-dfs.sh and
> >> then i
> >> > > >> stopped
> >> > > >> > hbase by running $HBASE_HOME/bin/stop-hbase.sh
> >> > > >> >
> >> > > >> >
> >> > > >> > On Wed, Jun 5, 2013 at 5:49 PM, Azuryy Yu <az...@gmail.com>
> >> > wrote:
> >> > > >> >
> >> > > >> > > do you have GC log? and what you did during crash? and whats
> >> your
> >> > gc
> >> > > >> > > options?
> >> > > >> > >
> >> > > >> > > for the dn error, thats net work issue generally, because dn
> >> > > >>received
> >> > > >> an
> >> > > >> > > incomplete packet.
> >> > > >> > >
> >> > > >> > > --Send from my Sony mobile.
> >> > > >> > > On Jun 5, 2013 8:10 PM, "Vimal Jain" <vk...@gmail.com>
> wrote:
> >> > > >> > >
> >> > > >> > > > Yes.
> >> > > >> > > > Thats true.
> >> > > >> > > > There are some errors in all 3 logs during same period ,
> i.e.
> >> > > >>data ,
> >> > > >> > > master
> >> > > >> > > > and region.
> >> > > >> > > > But i am unable to deduce the exact cause of error.
> >> > > >> > > > Can you please help in detecting the problem ?
> >> > > >> > > >
> >> > > >> > > > So far i am suspecting following :-
> >> > > >> > > > I have 1GB heap (default) allocated for all 3 processes ,
> >> i.e.
> >> > > >> > > > Master,Region,Zookeeper.
> >> > > >> > > > Both  Master and Region took more time for GC ( as inferred
> >> from
> >> > > >> lines
> >> > > >> > in
> >> > > >> > > > logs like "slept more time than configured one" etc ) .
> >> > > >> > > > Due to this there was  zookeeper connection time out for
> both
> >> > > >>Master
> >> > > >> > and
> >> > > >> > > > Region and hence both went down.
> >> > > >> > > >
> >> > > >> > > > I am newbie to Hbase and hence may be my findings are not
> >> > correct.
> >> > > >> > > > I want to be 100 % sure before increasing heap space for
> both
> >> > > >>Master
> >> > > >> > and
> >> > > >> > > > Region ( Both around 2GB) to solve this.
> >> > > >> > > > At present i have restarted the cluster with default heap
> >> space
> >> > > >>only
> >> > > >> (
> >> > > >> > > 1GB
> >> > > >> > > > ).
> >> > > >> > > >
> >> > > >> > > >
> >> > > >> > > >
> >> > > >> > > > On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <
> >> azuryyyu@gmail.com>
> >> > > >> wrote:
> >> > > >> > > >
> >> > > >> > > > > there have errors in your dats node log, and the error
> time
> >> > > >>match
> >> > > >> > with
> >> > > >> > > rs
> >> > > >> > > > > log error time.
> >> > > >> > > > >
> >> > > >> > > > > --Send from my Sony mobile.
> >> > > >> > > > > On Jun 5, 2013 5:06 PM, "Vimal Jain" <vk...@gmail.com>
> >> > wrote:
> >> > > >> > > > >
> >> > > >> > > > > > I don't think so , as i dont find any issues in data
> node
> >> > > >>logs.
> >> > > >> > > > > > Also there are lot of exceptions like "session
> expired" ,
> >> > > >>"slept
> >> > > >> > more
> >> > > >> > > > > than
> >> > > >> > > > > > configured time" . what are these ?
> >> > > >> > > > > >
> >> > > >> > > > > >
> >> > > >> > > > > > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <
> >> > azuryyyu@gmail.com
> >> > > >
> >> > > >> > > wrote:
> >> > > >> > > > > >
> >> > > >> > > > > > > Because your data node 192.168.20.30 broke down.
> which
> >> > > >>leads to
> >> > > >> > RS
> >> > > >> > > > > down.
> >> > > >> > > > > > >
> >> > > >> > > > > > >
> >> > > >> > > > > > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain
> >> > > >><vk...@gmail.com>
> >> > > >> > > wrote:
> >> > > >> > > > > > >
> >> > > >> > > > > > > > Here is the complete log:
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > http://bin.cakephp.org/saved/103001 - Hregion
> >> > > >> > > > > > > > http://bin.cakephp.org/saved/103000 - Hmaster
> >> > > >> > > > > > > > http://bin.cakephp.org/saved/103002 - Datanode
> >> > > >> > > > > > > >
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <
> >> > > >> vkjk89@gmail.com>
> >> > > >> > > > > wrote:
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > > Hi,
> >> > > >> > > > > > > > > I have set up Hbase in pseudo-distributed mode.
> >> > > >> > > > > > > > > It was working fine for 6 days , but suddenly
> today
> >> > > >>morning
> >> > > >> > > both
> >> > > >> > > > > > > HMaster
> >> > > >> > > > > > > > > and Hregion process went down.
> >> > > >> > > > > > > > > I checked in logs of both hadoop and hbase.
> >> > > >> > > > > > > > > Please help here.
> >> > > >> > > > > > > > > Here are the snippets :-
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > > > *Datanode logs:*
> >> > > >> > > > > > > > > 2013-06-05 05:12:51,436 INFO
> >> > > >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> > > >>Exception
> >> > > >> in
> >> > > >> > > > > > > > receiveBlock
> >> > > >> > > > > > > > > for block blk_1597245478875608321_2818
> >> > > >> java.io.EOFException:
> >> > > >> > > > while
> >> > > >> > > > > > > trying
> >> > > >> > > > > > > > > to read 2347 bytes
> >> > > >> > > > > > > > > 2013-06-05 05:12:51,442 INFO
> >> > > >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> > > >>writeBlock
> >> > > >> > > > > > > > > blk_1597245478875608321_2818 received exception
> >> > > >> > > > > java.io.EOFException:
> >> > > >> > > > > > > > while
> >> > > >> > > > > > > > > trying to read 2347 bytes
> >> > > >> > > > > > > > > 2013-06-05 05:12:51,442 ERROR
> >> > > >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> > > >> > > > > > DatanodeRegistration(
> >> > > >> > > > > > > > > 192.168.20.30:50010,
> >> > > >> > > > > > > > >
> >> > > >>storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> >> > > >> > > > > > > > infoPort=50075,
> >> > > >> > > > > > > > > ipcPort=50020):DataXceiver
> >> > > >> > > > > > > > > java.io.EOFException: while trying to read 2347
> >> bytes
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > > > *HRegion logs:*
> >> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> >> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > >> > > > > We
> >> > > >> > > > > > > > > slept 4694929ms instead of 3000ms, this is likely
> >> due
> >> > > >>to a
> >> > > >> > long
> >> > > >> > > > > > garbage
> >> > > >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > >> > > > > > > > >
> >> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > >> > > > > > > > > 2013-06-05 05:12:51,045 WARN
> >> > > >> > org.apache.hadoop.hdfs.DFSClient:
> >> > > >> > > > > > > > > DFSOutputStream ResponseProcessor exception  for
> >> block
> >> > > >> > > > > > > > >
> >> > > >> blk_1597245478875608321_2818java.net.SocketTimeoutException:
> >> > > >> > > > 63000
> >> > > >> > > > > > > millis
> >> > > >> > > > > > > > > timeout while waiting for channel to be ready for
> >> > read.
> >> > > >>ch
> >> > > >> :
> >> > > >> > > > > > > > > java.nio.channels.SocketChannel[connected local=/
> >> > > >> > > > > 192.168.20.30:44333
> >> > > >> > > > > > > > remote=/
> >> > > >> > > > > > > > > 192.168.20.30:50010]
> >> > > >> > > > > > > > > 2013-06-05 05:12:51,046 WARN
> >> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > >> > > > > We
> >> > > >> > > > > > > > > slept 11695345ms instead of 10000000ms, this is
> >> likely
> >> > > >>due
> >> > > >> > to a
> >> > > >> > > > > long
> >> > > >> > > > > > > > > garbage collecting pause and it's usually bad,
> see
> >> > > >> > > > > > > > >
> >> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > >> > > > > > > > > 2013-06-05 05:12:51,048 WARN
> >> > > >> > org.apache.hadoop.hdfs.DFSClient:
> >> > > >> > > > > Error
> >> > > >> > > > > > > > > Recovery for block blk_1597245478875608321_2818
> bad
> >> > > >> > datanode[0]
> >> > > >> > > > > > > > > 192.168.20.30:50010
> >> > > >> > > > > > > > > 2013-06-05 05:12:51,075 WARN
> >> > > >> > org.apache.hadoop.hdfs.DFSClient:
> >> > > >> > > > > Error
> >> > > >> > > > > > > > while
> >> > > >> > > > > > > > > syncing
> >> > > >> > > > > > > > > java.io.IOException: All datanodes
> >> > 192.168.20.30:50010
> >> > > >>are
> >> > > >> > > bad.
> >> > > >> > > > > > > > > Aborting...
> >> > > >> > > > > > > > >     at
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > >
> >> > > >> > > > > > >
> >> > > >> > > > > >
> >> > > >> > > > >
> >> > > >> > > >
> >> > > >> > >
> >> > > >> >
> >> > > >>
> >> > >
> >> >
> >>
> >>org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFS
> >> > > >>Client.java:3096)
> >> > > >> > > > > > > > > 2013-06-05 05:12:51,110 FATAL
> >> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog:
> >> Could
> >> > not
> >> > > >> > sync.
> >> > > >> > > > > > > Requesting
> >> > > >> > > > > > > > > close of hlog
> >> > > >> > > > > > > > > java.io.IOException: Reflection
> >> > > >> > > > > > > > > Caused by:
> >> java.lang.reflect.InvocationTargetException
> >> > > >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream
> is
> >> > > >>closed
> >> > > >> > > > > > > > > 2013-06-05 05:12:51,180 FATAL
> >> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog:
> >> Could
> >> > not
> >> > > >> > sync.
> >> > > >> > > > > > > Requesting
> >> > > >> > > > > > > > > close of hlog
> >> > > >> > > > > > > > > java.io.IOException: Reflection
> >> > > >> > > > > > > > > Caused by:
> >> java.lang.reflect.InvocationTargetException
> >> > > >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream
> is
> >> > > >>closed
> >> > > >> > > > > > > > > 2013-06-05 05:12:51,183 ERROR
> >> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog:
> >> Failed
> >> > > >>close
> >> > > >> > of
> >> > > >> > > > HLog
> >> > > >> > > > > > > > writer
> >> > > >> > > > > > > > > java.io.IOException: Reflection
> >> > > >> > > > > > > > > Caused by:
> >> java.lang.reflect.InvocationTargetException
> >> > > >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream
> is
> >> > > >>closed
> >> > > >> > > > > > > > > 2013-06-05 05:12:51,184 WARN
> >> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog:
> >> Riding
> >> > > >>over
> >> > > >> > HLog
> >> > > >> > > > > close
> >> > > >> > > > > > > > > failure! error count=1
> >> > > >> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
> >> > > >> > > > > > > > >
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> > > >> ABORTING
> >> > > >> > > > region
> >> > > >> > > > > > > > server
> >> > > >> > > > > > > > > hbase.rummycircle.com,60020,1369877672964:
> >> > > >> > > > > > > > > regionserver:60020-0x13ef31264d00001
> >> > > >> > > > > > > regionserver:60020-0x13ef31264d00001
> >> > > >> > > > > > > > > received expired from ZooKeeper, aborting
> >> > > >> > > > > > > > >
> >> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > >> > > > > > > > > KeeperErrorCode = Session expired
> >> > > >> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
> >> > > >> > > > > > > > >
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> > > >> > > RegionServer
> >> > > >> > > > > > abort:
> >> > > >> > > > > > > > > loaded coprocessors are: []
> >> > > >> > > > > > > > > 2013-06-05 05:12:52,621 INFO
> >> > > >> > > > > > > > >
> >> org.apache.hadoop.hbase.regionserver.SplitLogWorker:
> >> > > >> > > > SplitLogWorker
> >> > > >> > > > > > > > > interrupted while waiting for task, exiting:
> >> > > >> > > > > > > > java.lang.InterruptedException
> >> > > >> > > > > > > > > java.io.InterruptedIOException: Aborting
> >> compaction of
> >> > > >> store
> >> > > >> > > > > cfp_info
> >> > > >> > > > > > > in
> >> > > >> > > > > > > > > region
> >> > > >> > > > > > >
> >> > > >> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> >> > > >> > > > > > > > > because user requested stop.
> >> > > >> > > > > > > > > 2013-06-05 05:12:53,425 WARN
> >> > > >> > > > > > > > >
> >> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> >> > > >> > > Possibly
> >> > > >> > > > > > > > transient
> >> > > >> > > > > > > > > ZooKeeper exception:
> >> > > >> > > > > > > > >
> >> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> >> > > >> > > > > > hbase.rummycircle.com
> >> > > >> > > > > > > > > ,60020,1369877672964
> >> > > >> > > > > > > > > 2013-06-05 05:12:55,426 WARN
> >> > > >> > > > > > > > >
> >> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> >> > > >> > > Possibly
> >> > > >> > > > > > > > transient
> >> > > >> > > > > > > > > ZooKeeper exception:
> >> > > >> > > > > > > > >
> >> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> >> > > >> > > > > > hbase.rummycircle.com
> >> > > >> > > > > > > > > ,60020,1369877672964
> >> > > >> > > > > > > > > 2013-06-05 05:12:59,427 WARN
> >> > > >> > > > > > > > >
> >> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> >> > > >> > > Possibly
> >> > > >> > > > > > > > transient
> >> > > >> > > > > > > > > ZooKeeper exception:
> >> > > >> > > > > > > > >
> >> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> >> > > >> > > > > > hbase.rummycircle.com
> >> > > >> > > > > > > > > ,60020,1369877672964
> >> > > >> > > > > > > > > 2013-06-05 05:13:07,427 WARN
> >> > > >> > > > > > > > >
> >> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> >> > > >> > > Possibly
> >> > > >> > > > > > > > transient
> >> > > >> > > > > > > > > ZooKeeper exception:
> >> > > >> > > > > > > > >
> >> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> >> > > >> > > > > > hbase.rummycircle.com
> >> > > >> > > > > > > > > ,60020,1369877672964
> >> > > >> > > > > > > > > 2013-06-05 05:13:07,427 ERROR
> >> > > >> > > > > > > > >
> >> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> >> > > >> > > ZooKeeper
> >> > > >> > > > > > > delete
> >> > > >> > > > > > > > > failed after 3 retries
> >> > > >> > > > > > > > >
> >> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> >> > > >> > > > > > hbase.rummycircle.com
> >> > > >> > > > > > > > > ,60020,1369877672964
> >> > > >> > > > > > > > >     at
> >> > > >> > > > > > > > >
> >> > > >> > > > >
> >> > > >>
> >> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >> > > >> > > > > > > > >     at
> >> > > >> > > > > > > >
> >> > > >> > > >
> >> > >
> >>org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >> > > >> > > > > > > > > 2013-06-05 05:13:07,436 ERROR
> >> > > >> > org.apache.hadoop.hdfs.DFSClient:
> >> > > >> > > > > > > Exception
> >> > > >> > > > > > > > > closing file /hbase/.logs/hbase.rummycircle.com
> >> > > >> > > > > ,60020,1369877672964/
> >> > > >> > > > > > > > > hbase.rummycircle.com
> >> > > >> %2C60020%2C1369877672964.1370382721642
> >> > > >> > :
> >> > > >> > > > > > > > > java.io.IOException: All datanodes
> >> > 192.168.20.30:50010
> >> > > >>are
> >> > > >> > > bad.
> >> > > >> > > > > > > > > Aborting...
> >> > > >> > > > > > > > > java.io.IOException: All datanodes
> >> > 192.168.20.30:50010
> >> > > >>are
> >> > > >> > > bad.
> >> > > >> > > > > > > > > Aborting...
> >> > > >> > > > > > > > >     at
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > >
> >> > > >> > > > > > >
> >> > > >> > > > > >
> >> > > >> > > > >
> >> > > >> > > >
> >> > > >> > >
> >> > > >> >
> >> > > >>
> >> > >
> >> >
> >>
> >>org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFS
> >> > > >>Client.java:3096)
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > > > *HMaster logs:*
> >> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> >> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > >> > > > > We
> >> > > >> > > > > > > > > slept 4702394ms instead of 10000ms, this is
> likely
> >> due
> >> > > >>to a
> >> > > >> > > long
> >> > > >> > > > > > > garbage
> >> > > >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > >> > > > > > > > >
> >> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> >> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > >> > > > > We
> >> > > >> > > > > > > > > slept 4988731ms instead of 300000ms, this is
> likely
> >> > due
> >> > > >>to
> >> > > >> a
> >> > > >> > > long
> >> > > >> > > > > > > garbage
> >> > > >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > >> > > > > > > > >
> >> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> >> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > >> > > > > We
> >> > > >> > > > > > > > > slept 4988726ms instead of 300000ms, this is
> likely
> >> > due
> >> > > >>to
> >> > > >> a
> >> > > >> > > long
> >> > > >> > > > > > > garbage
> >> > > >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > >> > > > > > > > >
> >> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> >> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > >> > > > > We
> >> > > >> > > > > > > > > slept 4698291ms instead of 10000ms, this is
> likely
> >> due
> >> > > >>to a
> >> > > >> > > long
> >> > > >> > > > > > > garbage
> >> > > >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > >> > > > > > > > >
> >> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > >> > > > > > > > > 2013-06-05 05:12:50,711 WARN
> >> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > >> > > > > We
> >> > > >> > > > > > > > > slept 4694502ms instead of 1000ms, this is likely
> >> due
> >> > > >>to a
> >> > > >> > long
> >> > > >> > > > > > garbage
> >> > > >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > >> > > > > > > > >
> >> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > >> > > > > > > > > 2013-06-05 05:12:50,714 WARN
> >> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > >> > > > > We
> >> > > >> > > > > > > > > slept 4694492ms instead of 1000ms, this is likely
> >> due
> >> > > >>to a
> >> > > >> > long
> >> > > >> > > > > > garbage
> >> > > >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > >> > > > > > > > >
> >> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > >> > > > > > > > > 2013-06-05 05:12:50,715 WARN
> >> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > >> > > > > We
> >> > > >> > > > > > > > > slept 4695589ms instead of 60000ms, this is
> likely
> >> due
> >> > > >>to a
> >> > > >> > > long
> >> > > >> > > > > > > garbage
> >> > > >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > >> > > > > > > > >
> >> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > >> > > > > > > > > 2013-06-05 05:12:52,263 FATAL
> >> > > >> > > > > org.apache.hadoop.hbase.master.HMaster:
> >> > > >> > > > > > > > > Master server abort: loaded coprocessors are: []
> >> > > >> > > > > > > > > 2013-06-05 05:12:52,465 INFO
> >> > > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> >> > > >> > > > > > > > > Waiting for region servers count to settle;
> >> currently
> >> > > >> checked
> >> > > >> > > in
> >> > > >> > > > 1,
> >> > > >> > > > > > > slept
> >> > > >> > > > > > > > > for 0 ms, expecting minimum of 1, maximum of
> >> > 2147483647,
> >> > > >> > > timeout
> >> > > >> > > > of
> >> > > >> > > > > > > 4500
> >> > > >> > > > > > > > > ms, interval of 1500 ms.
> >> > > >> > > > > > > > > 2013-06-05 05:12:52,561 ERROR
> >> > > >> > > > > org.apache.hadoop.hbase.master.HMaster:
> >> > > >> > > > > > > > > Region server hbase.rummycircle.com
> >> > ,60020,1369877672964
> >> > > >> > > > reported a
> >> > > >> > > > > > > fatal
> >> > > >> > > > > > > > > error:
> >> > > >> > > > > > > > >
> >> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > >> > > > > > > > > KeeperErrorCode = Session expired
> >> > > >> > > > > > > > > 2013-06-05 05:12:53,970 INFO
> >> > > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> >> > > >> > > > > > > > > Waiting for region servers count to settle;
> >> currently
> >> > > >> checked
> >> > > >> > > in
> >> > > >> > > > 1,
> >> > > >> > > > > > > slept
> >> > > >> > > > > > > > > for 1506 ms, expecting minimum of 1, maximum of
> >> > > >>2147483647,
> >> > > >> > > > timeout
> >> > > >> > > > > > of
> >> > > >> > > > > > > > 4500
> >> > > >> > > > > > > > > ms, interval of 1500 ms.
> >> > > >> > > > > > > > > 2013-06-05 05:12:55,476 INFO
> >> > > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> >> > > >> > > > > > > > > Waiting for region servers count to settle;
> >> currently
> >> > > >> checked
> >> > > >> > > in
> >> > > >> > > > 1,
> >> > > >> > > > > > > slept
> >> > > >> > > > > > > > > for 3012 ms, expecting minimum of 1, maximum of
> >> > > >>2147483647,
> >> > > >> > > > timeout
> >> > > >> > > > > > of
> >> > > >> > > > > > > > 4500
> >> > > >> > > > > > > > > ms, interval of 1500 ms.
> >> > > >> > > > > > > > > 2013-06-05 05:12:56,981 INFO
> >> > > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> >> > > >> > > > > > > > > Finished waiting for region servers count to
> >> settle;
> >> > > >> checked
> >> > > >> > in
> >> > > >> > > > 1,
> >> > > >> > > > > > > slept
> >> > > >> > > > > > > > > for 4517 ms, expecting minimum of 1, maximum of
> >> > > >>2147483647,
> >> > > >> > > > master
> >> > > >> > > > > is
> >> > > >> > > > > > > > > running.
> >> > > >> > > > > > > > > 2013-06-05 05:12:57,019 INFO
> >> > > >> > > > > > > > > org.apache.hadoop.hbase.catalog.CatalogTracker:
> >> Failed
> >> > > >> > > > verification
> >> > > >> > > > > > of
> >> > > >> > > > > > > > > -ROOT-,,0 at address=hbase.rummycircle.com
> >> > > >> > > ,60020,1369877672964;
> >> > > >> > > > > > > > > java.io.EOFException
> >> > > >> > > > > > > > > 2013-06-05 05:17:52,302 WARN
> >> > > >> > > > > > > > > org.apache.hadoop.hbase.master.SplitLogManager:
> >> error
> >> > > >>while
> >> > > >> > > > > splitting
> >> > > >> > > > > > > > logs
> >> > > >> > > > > > > > > in [hdfs://
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > >
> >> > > >> > > > > > >
> >> > > >> > > > > >
> >> > > >> > > > >
> >> > > >> > > >
> >> > > >> > >
> >> > > >> >
> >> > > >>
> >> > > >>
> >> > >
> >> >
> >>
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-
> >> > > >>splitting
> >> > > >> > > > > > > > ]
> >> > > >> > > > > > > > > installed = 19 but only 0 done
> >> > > >> > > > > > > > > 2013-06-05 05:17:52,321 FATAL
> >> > > >> > > > > org.apache.hadoop.hbase.master.HMaster:
> >> > > >> > > > > > > > > master:60000-0x13ef31264d00000
> >> > > >> master:60000-0x13ef31264d00000
> >> > > >> > > > > > received
> >> > > >> > > > > > > > > expired from ZooKeeper, aborting
> >> > > >> > > > > > > > >
> >> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > >> > > > > > > > > KeeperErrorCode = Session expired
> >> > > >> > > > > > > > > java.io.IOException: Giving up after tries=1
> >> > > >> > > > > > > > > Caused by: java.lang.InterruptedException: sleep
> >> > > >> interrupted
> >> > > >> > > > > > > > > 2013-06-05 05:17:52,381 ERROR
> >> > > >> > > > > > > > >
> org.apache.hadoop.hbase.master.HMasterCommandLine:
> >> > > >>Failed
> >> > > >> to
> >> > > >> > > > start
> >> > > >> > > > > > > master
> >> > > >> > > > > > > > > java.lang.RuntimeException: HMaster Aborted
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > > > --
> >> > > >> > > > > > > > > Thanks and Regards,
> >> > > >> > > > > > > > > Vimal Jain
> >> > > >> > > > > > > > >
> >> > > >> > > > > > > >
> >> > > >> > > > > > > >
> >> > > >> > > > > > > >
> >> > > >> > > > > > > > --
> >> > > >> > > > > > > > Thanks and Regards,
> >> > > >> > > > > > > > Vimal Jain
> >> > > >> > > > > > > >
> >> > > >> > > > > > >
> >> > > >> > > > > >
> >> > > >> > > > > >
> >> > > >> > > > > >
> >> > > >> > > > > > --
> >> > > >> > > > > > Thanks and Regards,
> >> > > >> > > > > > Vimal Jain
> >> > > >> > > > > >
> >> > > >> > > > >
> >> > > >> > > >
> >> > > >> > > >
> >> > > >> > > >
> >> > > >> > > > --
> >> > > >> > > > Thanks and Regards,
> >> > > >> > > > Vimal Jain
> >> > > >> > > >
> >> > > >> > >
> >> > > >> >
> >> > > >> >
> >> > > >> >
> >> > > >> > --
> >> > > >> > Thanks and Regards,
> >> > > >> > Vimal Jain
> >> > > >> >
> >> > > >>
> >> > > >
> >> > > >
> >> > > >
> >> > > >--
> >> > > >Thanks and Regards,
> >> > > >Vimal Jain
> >> > >
> >> > >
> >> >
> >> >
> >> > --
> >> > Kevin O'Dell
> >> > Systems Engineer, Cloudera
> >> >
> >>
> >
> >
>



-- 
Thanks and Regards,
Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Azuryy Yu <az...@gmail.com>.

And, please check your namenode log.


On Thu, Jun 6, 2013 at 1:20 PM, Azuryy Yu <az...@gmail.com> wrote:

> Can you reproduce the problem? if yes,
>
> add the following in your hbase-env.sh
>
> export HBASE_MASTER_OPTS="-verbose:gc -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails -Xloggc:$HBASE_LOG_DIR/hmaster_gc.log
> $HBASE_MASTER_OPTS"
>
> export HBASE_REGIONSERVER_OPTS="-verbose:gc -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails -Xloggc:$HBASE_LOG_DIR/hmaster_gc.log
> $HBASE_REGIONSERVER_OPTS"
>
> then, you will got GC log, I just guess this problem was lead with GC.
>
>
>
> On Thu, Jun 6, 2013 at 10:53 AM, Vimal Jain <vk...@gmail.com> wrote:
>
>> Hi Azuryy/Ted,
>> Can you please help here...
>> On Jun 5, 2013 7:23 PM, "Kevin O'dell" <ke...@cloudera.com> wrote:
>>
>> > No!
>> >
>> > Just kidding, you can unsubscribe by going to the Apache site:
>> >
>> > http://hbase.apache.org/mail-lists.html
>> >
>> >
>> > On Wed, Jun 5, 2013 at 9:34 AM, Joseph Coleman <
>> > joe.coleman@infinitecampus.com> wrote:
>> >
>> > > Please remove me from this list
>> > >
>> > >
>> > > On 6/5/13 8:32 AM, "Vimal Jain" <vk...@gmail.com> wrote:
>> > >
>> > > >Ok.
>> > > >I dont have any batch read/write to hbase.
>> > > >
>> > > >
>> > > >On Wed, Jun 5, 2013 at 6:08 PM, Azuryy Yu <az...@gmail.com>
>> wrote:
>> > > >
>> > > >> gc log cannot get by default. need some configuration. do you have
>> > some
>> > > >> batch read or write to hbase?
>> > > >>
>> > > >> --Send from my Sony mobile.
>> > > >> On Jun 5, 2013 8:25 PM, "Vimal Jain" <vk...@gmail.com> wrote:
>> > > >>
>> > > >> > I dont have GC logs.Do you get it by default  or it has to be
>> > > >>configured
>> > > >> ?
>> > > >> > After i came to know about crash , i checked which all processes
>> are
>> > > >> > running using "jps"
>> > > >> > It displayed 4 processes ,
>> "namenode","datanode","secondarynamenode"
>> > > >>and
>> > > >> > "HQuorumpeer".
>> > > >> > So i stopped dfs by running $HADOOP_HOME/bin/stop-dfs.sh and
>> then i
>> > > >> stopped
>> > > >> > hbase by running $HBASE_HOME/bin/stop-hbase.sh
>> > > >> >
>> > > >> >
>> > > >> > On Wed, Jun 5, 2013 at 5:49 PM, Azuryy Yu <az...@gmail.com>
>> > wrote:
>> > > >> >
>> > > >> > > do you have GC log? and what you did during crash? and whats
>> your
>> > gc
>> > > >> > > options?
>> > > >> > >
>> > > >> > > for the dn error, thats net work issue generally, because dn
>> > > >>received
>> > > >> an
>> > > >> > > incomplete packet.
>> > > >> > >
>> > > >> > > --Send from my Sony mobile.
>> > > >> > > On Jun 5, 2013 8:10 PM, "Vimal Jain" <vk...@gmail.com> wrote:
>> > > >> > >
>> > > >> > > > Yes.
>> > > >> > > > Thats true.
>> > > >> > > > There are some errors in all 3 logs during same period , i.e.
>> > > >>data ,
>> > > >> > > master
>> > > >> > > > and region.
>> > > >> > > > But i am unable to deduce the exact cause of error.
>> > > >> > > > Can you please help in detecting the problem ?
>> > > >> > > >
>> > > >> > > > So far i am suspecting following :-
>> > > >> > > > I have 1GB heap (default) allocated for all 3 processes ,
>> i.e.
>> > > >> > > > Master,Region,Zookeeper.
>> > > >> > > > Both  Master and Region took more time for GC ( as inferred
>> from
>> > > >> lines
>> > > >> > in
>> > > >> > > > logs like "slept more time than configured one" etc ) .
>> > > >> > > > Due to this there was  zookeeper connection time out for both
>> > > >>Master
>> > > >> > and
>> > > >> > > > Region and hence both went down.
>> > > >> > > >
>> > > >> > > > I am newbie to Hbase and hence may be my findings are not
>> > correct.
>> > > >> > > > I want to be 100 % sure before increasing heap space for both
>> > > >>Master
>> > > >> > and
>> > > >> > > > Region ( Both around 2GB) to solve this.
>> > > >> > > > At present i have restarted the cluster with default heap
>> space
>> > > >>only
>> > > >> (
>> > > >> > > 1GB
>> > > >> > > > ).
>> > > >> > > >
>> > > >> > > >
>> > > >> > > >
>> > > >> > > > On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <
>> azuryyyu@gmail.com>
>> > > >> wrote:
>> > > >> > > >
>> > > >> > > > > there have errors in your dats node log, and the error time
>> > > >>match
>> > > >> > with
>> > > >> > > rs
>> > > >> > > > > log error time.
>> > > >> > > > >
>> > > >> > > > > --Send from my Sony mobile.
>> > > >> > > > > On Jun 5, 2013 5:06 PM, "Vimal Jain" <vk...@gmail.com>
>> > wrote:
>> > > >> > > > >
>> > > >> > > > > > I don't think so , as i dont find any issues in data node
>> > > >>logs.
>> > > >> > > > > > Also there are lot of exceptions like "session expired" ,
>> > > >>"slept
>> > > >> > more
>> > > >> > > > > than
>> > > >> > > > > > configured time" . what are these ?
>> > > >> > > > > >
>> > > >> > > > > >
>> > > >> > > > > > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <
>> > azuryyyu@gmail.com
>> > > >
>> > > >> > > wrote:
>> > > >> > > > > >
>> > > >> > > > > > > Because your data node 192.168.20.30 broke down. which
>> > > >>leads to
>> > > >> > RS
>> > > >> > > > > down.
>> > > >> > > > > > >
>> > > >> > > > > > >
>> > > >> > > > > > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain
>> > > >><vk...@gmail.com>
>> > > >> > > wrote:
>> > > >> > > > > > >
>> > > >> > > > > > > > Here is the complete log:
>> > > >> > > > > > > >
>> > > >> > > > > > > > http://bin.cakephp.org/saved/103001 - Hregion
>> > > >> > > > > > > > http://bin.cakephp.org/saved/103000 - Hmaster
>> > > >> > > > > > > > http://bin.cakephp.org/saved/103002 - Datanode
>> > > >> > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <
>> > > >> vkjk89@gmail.com>
>> > > >> > > > > wrote:
>> > > >> > > > > > > >
>> > > >> > > > > > > > > Hi,
>> > > >> > > > > > > > > I have set up Hbase in pseudo-distributed mode.
>> > > >> > > > > > > > > It was working fine for 6 days , but suddenly today
>> > > >>morning
>> > > >> > > both
>> > > >> > > > > > > HMaster
>> > > >> > > > > > > > > and Hregion process went down.
>> > > >> > > > > > > > > I checked in logs of both hadoop and hbase.
>> > > >> > > > > > > > > Please help here.
>> > > >> > > > > > > > > Here are the snippets :-
>> > > >> > > > > > > > >
>> > > >> > > > > > > > > *Datanode logs:*
>> > > >> > > > > > > > > 2013-06-05 05:12:51,436 INFO
>> > > >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
>> > > >>Exception
>> > > >> in
>> > > >> > > > > > > > receiveBlock
>> > > >> > > > > > > > > for block blk_1597245478875608321_2818
>> > > >> java.io.EOFException:
>> > > >> > > > while
>> > > >> > > > > > > trying
>> > > >> > > > > > > > > to read 2347 bytes
>> > > >> > > > > > > > > 2013-06-05 05:12:51,442 INFO
>> > > >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
>> > > >>writeBlock
>> > > >> > > > > > > > > blk_1597245478875608321_2818 received exception
>> > > >> > > > > java.io.EOFException:
>> > > >> > > > > > > > while
>> > > >> > > > > > > > > trying to read 2347 bytes
>> > > >> > > > > > > > > 2013-06-05 05:12:51,442 ERROR
>> > > >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
>> > > >> > > > > > DatanodeRegistration(
>> > > >> > > > > > > > > 192.168.20.30:50010,
>> > > >> > > > > > > > >
>> > > >>storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
>> > > >> > > > > > > > infoPort=50075,
>> > > >> > > > > > > > > ipcPort=50020):DataXceiver
>> > > >> > > > > > > > > java.io.EOFException: while trying to read 2347
>> bytes
>> > > >> > > > > > > > >
>> > > >> > > > > > > > >
>> > > >> > > > > > > > > *HRegion logs:*
>> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
>> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > >> > > > > We
>> > > >> > > > > > > > > slept 4694929ms instead of 3000ms, this is likely
>> due
>> > > >>to a
>> > > >> > long
>> > > >> > > > > > garbage
>> > > >> > > > > > > > > collecting pause and it's usually bad, see
>> > > >> > > > > > > > >
>> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > >> > > > > > > > > 2013-06-05 05:12:51,045 WARN
>> > > >> > org.apache.hadoop.hdfs.DFSClient:
>> > > >> > > > > > > > > DFSOutputStream ResponseProcessor exception  for
>> block
>> > > >> > > > > > > > >
>> > > >> blk_1597245478875608321_2818java.net.SocketTimeoutException:
>> > > >> > > > 63000
>> > > >> > > > > > > millis
>> > > >> > > > > > > > > timeout while waiting for channel to be ready for
>> > read.
>> > > >>ch
>> > > >> :
>> > > >> > > > > > > > > java.nio.channels.SocketChannel[connected local=/
>> > > >> > > > > 192.168.20.30:44333
>> > > >> > > > > > > > remote=/
>> > > >> > > > > > > > > 192.168.20.30:50010]
>> > > >> > > > > > > > > 2013-06-05 05:12:51,046 WARN
>> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > >> > > > > We
>> > > >> > > > > > > > > slept 11695345ms instead of 10000000ms, this is
>> likely
>> > > >>due
>> > > >> > to a
>> > > >> > > > > long
>> > > >> > > > > > > > > garbage collecting pause and it's usually bad, see
>> > > >> > > > > > > > >
>> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > >> > > > > > > > > 2013-06-05 05:12:51,048 WARN
>> > > >> > org.apache.hadoop.hdfs.DFSClient:
>> > > >> > > > > Error
>> > > >> > > > > > > > > Recovery for block blk_1597245478875608321_2818 bad
>> > > >> > datanode[0]
>> > > >> > > > > > > > > 192.168.20.30:50010
>> > > >> > > > > > > > > 2013-06-05 05:12:51,075 WARN
>> > > >> > org.apache.hadoop.hdfs.DFSClient:
>> > > >> > > > > Error
>> > > >> > > > > > > > while
>> > > >> > > > > > > > > syncing
>> > > >> > > > > > > > > java.io.IOException: All datanodes
>> > 192.168.20.30:50010
>> > > >>are
>> > > >> > > bad.
>> > > >> > > > > > > > > Aborting...
>> > > >> > > > > > > > >     at
>> > > >> > > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > >
>> > > >> > > > > >
>> > > >> > > > >
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>> >>org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFS
>> > > >>Client.java:3096)
>> > > >> > > > > > > > > 2013-06-05 05:12:51,110 FATAL
>> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog:
>> Could
>> > not
>> > > >> > sync.
>> > > >> > > > > > > Requesting
>> > > >> > > > > > > > > close of hlog
>> > > >> > > > > > > > > java.io.IOException: Reflection
>> > > >> > > > > > > > > Caused by:
>> java.lang.reflect.InvocationTargetException
>> > > >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
>> > > >>closed
>> > > >> > > > > > > > > 2013-06-05 05:12:51,180 FATAL
>> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog:
>> Could
>> > not
>> > > >> > sync.
>> > > >> > > > > > > Requesting
>> > > >> > > > > > > > > close of hlog
>> > > >> > > > > > > > > java.io.IOException: Reflection
>> > > >> > > > > > > > > Caused by:
>> java.lang.reflect.InvocationTargetException
>> > > >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
>> > > >>closed
>> > > >> > > > > > > > > 2013-06-05 05:12:51,183 ERROR
>> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog:
>> Failed
>> > > >>close
>> > > >> > of
>> > > >> > > > HLog
>> > > >> > > > > > > > writer
>> > > >> > > > > > > > > java.io.IOException: Reflection
>> > > >> > > > > > > > > Caused by:
>> java.lang.reflect.InvocationTargetException
>> > > >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
>> > > >>closed
>> > > >> > > > > > > > > 2013-06-05 05:12:51,184 WARN
>> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog:
>> Riding
>> > > >>over
>> > > >> > HLog
>> > > >> > > > > close
>> > > >> > > > > > > > > failure! error count=1
>> > > >> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
>> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> > > >> ABORTING
>> > > >> > > > region
>> > > >> > > > > > > > server
>> > > >> > > > > > > > > hbase.rummycircle.com,60020,1369877672964:
>> > > >> > > > > > > > > regionserver:60020-0x13ef31264d00001
>> > > >> > > > > > > regionserver:60020-0x13ef31264d00001
>> > > >> > > > > > > > > received expired from ZooKeeper, aborting
>> > > >> > > > > > > > >
>> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > >> > > > > > > > > KeeperErrorCode = Session expired
>> > > >> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
>> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> > > >> > > RegionServer
>> > > >> > > > > > abort:
>> > > >> > > > > > > > > loaded coprocessors are: []
>> > > >> > > > > > > > > 2013-06-05 05:12:52,621 INFO
>> > > >> > > > > > > > >
>> org.apache.hadoop.hbase.regionserver.SplitLogWorker:
>> > > >> > > > SplitLogWorker
>> > > >> > > > > > > > > interrupted while waiting for task, exiting:
>> > > >> > > > > > > > java.lang.InterruptedException
>> > > >> > > > > > > > > java.io.InterruptedIOException: Aborting
>> compaction of
>> > > >> store
>> > > >> > > > > cfp_info
>> > > >> > > > > > > in
>> > > >> > > > > > > > > region
>> > > >> > > > > > >
>> > > >> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
>> > > >> > > > > > > > > because user requested stop.
>> > > >> > > > > > > > > 2013-06-05 05:12:53,425 WARN
>> > > >> > > > > > > > >
>> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>> > > >> > > Possibly
>> > > >> > > > > > > > transient
>> > > >> > > > > > > > > ZooKeeper exception:
>> > > >> > > > > > > > >
>> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
>> > > >> > > > > > hbase.rummycircle.com
>> > > >> > > > > > > > > ,60020,1369877672964
>> > > >> > > > > > > > > 2013-06-05 05:12:55,426 WARN
>> > > >> > > > > > > > >
>> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>> > > >> > > Possibly
>> > > >> > > > > > > > transient
>> > > >> > > > > > > > > ZooKeeper exception:
>> > > >> > > > > > > > >
>> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
>> > > >> > > > > > hbase.rummycircle.com
>> > > >> > > > > > > > > ,60020,1369877672964
>> > > >> > > > > > > > > 2013-06-05 05:12:59,427 WARN
>> > > >> > > > > > > > >
>> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>> > > >> > > Possibly
>> > > >> > > > > > > > transient
>> > > >> > > > > > > > > ZooKeeper exception:
>> > > >> > > > > > > > >
>> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
>> > > >> > > > > > hbase.rummycircle.com
>> > > >> > > > > > > > > ,60020,1369877672964
>> > > >> > > > > > > > > 2013-06-05 05:13:07,427 WARN
>> > > >> > > > > > > > >
>> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>> > > >> > > Possibly
>> > > >> > > > > > > > transient
>> > > >> > > > > > > > > ZooKeeper exception:
>> > > >> > > > > > > > >
>> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
>> > > >> > > > > > hbase.rummycircle.com
>> > > >> > > > > > > > > ,60020,1369877672964
>> > > >> > > > > > > > > 2013-06-05 05:13:07,427 ERROR
>> > > >> > > > > > > > >
>> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>> > > >> > > ZooKeeper
>> > > >> > > > > > > delete
>> > > >> > > > > > > > > failed after 3 retries
>> > > >> > > > > > > > >
>> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
>> > > >> > > > > > hbase.rummycircle.com
>> > > >> > > > > > > > > ,60020,1369877672964
>> > > >> > > > > > > > >     at
>> > > >> > > > > > > > >
>> > > >> > > > >
>> > > >>
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>> > > >> > > > > > > > >     at
>> > > >> > > > > > > >
>> > > >> > > >
>> > > >>org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>> > > >> > > > > > > > > 2013-06-05 05:13:07,436 ERROR
>> > > >> > org.apache.hadoop.hdfs.DFSClient:
>> > > >> > > > > > > Exception
>> > > >> > > > > > > > > closing file /hbase/.logs/hbase.rummycircle.com
>> > > >> > > > > ,60020,1369877672964/
>> > > >> > > > > > > > > hbase.rummycircle.com
>> > > >> %2C60020%2C1369877672964.1370382721642
>> > > >> > :
>> > > >> > > > > > > > > java.io.IOException: All datanodes
>> > 192.168.20.30:50010
>> > > >>are
>> > > >> > > bad.
>> > > >> > > > > > > > > Aborting...
>> > > >> > > > > > > > > java.io.IOException: All datanodes
>> > 192.168.20.30:50010
>> > > >>are
>> > > >> > > bad.
>> > > >> > > > > > > > > Aborting...
>> > > >> > > > > > > > >     at
>> > > >> > > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > >
>> > > >> > > > > >
>> > > >> > > > >
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > >
>> >
>> >>org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFS
>> > > >>Client.java:3096)
>> > > >> > > > > > > > >
>> > > >> > > > > > > > >
>> > > >> > > > > > > > > *HMaster logs:*
>> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
>> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > >> > > > > We
>> > > >> > > > > > > > > slept 4702394ms instead of 10000ms, this is likely
>> due
>> > > >>to a
>> > > >> > > long
>> > > >> > > > > > > garbage
>> > > >> > > > > > > > > collecting pause and it's usually bad, see
>> > > >> > > > > > > > >
>> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
>> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > >> > > > > We
>> > > >> > > > > > > > > slept 4988731ms instead of 300000ms, this is likely
>> > due
>> > > >>to
>> > > >> a
>> > > >> > > long
>> > > >> > > > > > > garbage
>> > > >> > > > > > > > > collecting pause and it's usually bad, see
>> > > >> > > > > > > > >
>> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
>> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > >> > > > > We
>> > > >> > > > > > > > > slept 4988726ms instead of 300000ms, this is likely
>> > due
>> > > >>to
>> > > >> a
>> > > >> > > long
>> > > >> > > > > > > garbage
>> > > >> > > > > > > > > collecting pause and it's usually bad, see
>> > > >> > > > > > > > >
>> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
>> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > >> > > > > We
>> > > >> > > > > > > > > slept 4698291ms instead of 10000ms, this is likely
>> due
>> > > >>to a
>> > > >> > > long
>> > > >> > > > > > > garbage
>> > > >> > > > > > > > > collecting pause and it's usually bad, see
>> > > >> > > > > > > > >
>> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > >> > > > > > > > > 2013-06-05 05:12:50,711 WARN
>> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > >> > > > > We
>> > > >> > > > > > > > > slept 4694502ms instead of 1000ms, this is likely
>> due
>> > > >>to a
>> > > >> > long
>> > > >> > > > > > garbage
>> > > >> > > > > > > > > collecting pause and it's usually bad, see
>> > > >> > > > > > > > >
>> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > >> > > > > > > > > 2013-06-05 05:12:50,714 WARN
>> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > >> > > > > We
>> > > >> > > > > > > > > slept 4694492ms instead of 1000ms, this is likely
>> due
>> > > >>to a
>> > > >> > long
>> > > >> > > > > > garbage
>> > > >> > > > > > > > > collecting pause and it's usually bad, see
>> > > >> > > > > > > > >
>> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > >> > > > > > > > > 2013-06-05 05:12:50,715 WARN
>> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > >> > > > > We
>> > > >> > > > > > > > > slept 4695589ms instead of 60000ms, this is likely
>> due
>> > > >>to a
>> > > >> > > long
>> > > >> > > > > > > garbage
>> > > >> > > > > > > > > collecting pause and it's usually bad, see
>> > > >> > > > > > > > >
>> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > >> > > > > > > > > 2013-06-05 05:12:52,263 FATAL
>> > > >> > > > > org.apache.hadoop.hbase.master.HMaster:
>> > > >> > > > > > > > > Master server abort: loaded coprocessors are: []
>> > > >> > > > > > > > > 2013-06-05 05:12:52,465 INFO
>> > > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
>> > > >> > > > > > > > > Waiting for region servers count to settle;
>> currently
>> > > >> checked
>> > > >> > > in
>> > > >> > > > 1,
>> > > >> > > > > > > slept
>> > > >> > > > > > > > > for 0 ms, expecting minimum of 1, maximum of
>> > 2147483647,
>> > > >> > > timeout
>> > > >> > > > of
>> > > >> > > > > > > 4500
>> > > >> > > > > > > > > ms, interval of 1500 ms.
>> > > >> > > > > > > > > 2013-06-05 05:12:52,561 ERROR
>> > > >> > > > > org.apache.hadoop.hbase.master.HMaster:
>> > > >> > > > > > > > > Region server hbase.rummycircle.com
>> > ,60020,1369877672964
>> > > >> > > > reported a
>> > > >> > > > > > > fatal
>> > > >> > > > > > > > > error:
>> > > >> > > > > > > > >
>> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > >> > > > > > > > > KeeperErrorCode = Session expired
>> > > >> > > > > > > > > 2013-06-05 05:12:53,970 INFO
>> > > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
>> > > >> > > > > > > > > Waiting for region servers count to settle;
>> currently
>> > > >> checked
>> > > >> > > in
>> > > >> > > > 1,
>> > > >> > > > > > > slept
>> > > >> > > > > > > > > for 1506 ms, expecting minimum of 1, maximum of
>> > > >>2147483647,
>> > > >> > > > timeout
>> > > >> > > > > > of
>> > > >> > > > > > > > 4500
>> > > >> > > > > > > > > ms, interval of 1500 ms.
>> > > >> > > > > > > > > 2013-06-05 05:12:55,476 INFO
>> > > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
>> > > >> > > > > > > > > Waiting for region servers count to settle;
>> currently
>> > > >> checked
>> > > >> > > in
>> > > >> > > > 1,
>> > > >> > > > > > > slept
>> > > >> > > > > > > > > for 3012 ms, expecting minimum of 1, maximum of
>> > > >>2147483647,
>> > > >> > > > timeout
>> > > >> > > > > > of
>> > > >> > > > > > > > 4500
>> > > >> > > > > > > > > ms, interval of 1500 ms.
>> > > >> > > > > > > > > 2013-06-05 05:12:56,981 INFO
>> > > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
>> > > >> > > > > > > > > Finished waiting for region servers count to
>> settle;
>> > > >> checked
>> > > >> > in
>> > > >> > > > 1,
>> > > >> > > > > > > slept
>> > > >> > > > > > > > > for 4517 ms, expecting minimum of 1, maximum of
>> > > >>2147483647,
>> > > >> > > > master
>> > > >> > > > > is
>> > > >> > > > > > > > > running.
>> > > >> > > > > > > > > 2013-06-05 05:12:57,019 INFO
>> > > >> > > > > > > > > org.apache.hadoop.hbase.catalog.CatalogTracker:
>> Failed
>> > > >> > > > verification
>> > > >> > > > > > of
>> > > >> > > > > > > > > -ROOT-,,0 at address=hbase.rummycircle.com
>> > > >> > > ,60020,1369877672964;
>> > > >> > > > > > > > > java.io.EOFException
>> > > >> > > > > > > > > 2013-06-05 05:17:52,302 WARN
>> > > >> > > > > > > > > org.apache.hadoop.hbase.master.SplitLogManager:
>> error
>> > > >>while
>> > > >> > > > > splitting
>> > > >> > > > > > > > logs
>> > > >> > > > > > > > > in [hdfs://
>> > > >> > > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > >
>> > > >> > > > > >
>> > > >> > > > >
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >>
>> > >
>> >
>> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-
>> > > >>splitting
>> > > >> > > > > > > > ]
>> > > >> > > > > > > > > installed = 19 but only 0 done
>> > > >> > > > > > > > > 2013-06-05 05:17:52,321 FATAL
>> > > >> > > > > org.apache.hadoop.hbase.master.HMaster:
>> > > >> > > > > > > > > master:60000-0x13ef31264d00000
>> > > >> master:60000-0x13ef31264d00000
>> > > >> > > > > > received
>> > > >> > > > > > > > > expired from ZooKeeper, aborting
>> > > >> > > > > > > > >
>> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > >> > > > > > > > > KeeperErrorCode = Session expired
>> > > >> > > > > > > > > java.io.IOException: Giving up after tries=1
>> > > >> > > > > > > > > Caused by: java.lang.InterruptedException: sleep
>> > > >> interrupted
>> > > >> > > > > > > > > 2013-06-05 05:17:52,381 ERROR
>> > > >> > > > > > > > > org.apache.hadoop.hbase.master.HMasterCommandLine:
>> > > >>Failed
>> > > >> to
>> > > >> > > > start
>> > > >> > > > > > > master
>> > > >> > > > > > > > > java.lang.RuntimeException: HMaster Aborted
>> > > >> > > > > > > > >
>> > > >> > > > > > > > >
>> > > >> > > > > > > > >
>> > > >> > > > > > > > > --
>> > > >> > > > > > > > > Thanks and Regards,
>> > > >> > > > > > > > > Vimal Jain
>> > > >> > > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > > >
>> > > >> > > > > > > > --
>> > > >> > > > > > > > Thanks and Regards,
>> > > >> > > > > > > > Vimal Jain
>> > > >> > > > > > > >
>> > > >> > > > > > >
>> > > >> > > > > >
>> > > >> > > > > >
>> > > >> > > > > >
>> > > >> > > > > > --
>> > > >> > > > > > Thanks and Regards,
>> > > >> > > > > > Vimal Jain
>> > > >> > > > > >
>> > > >> > > > >
>> > > >> > > >
>> > > >> > > >
>> > > >> > > >
>> > > >> > > > --
>> > > >> > > > Thanks and Regards,
>> > > >> > > > Vimal Jain
>> > > >> > > >
>> > > >> > >
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> > --
>> > > >> > Thanks and Regards,
>> > > >> > Vimal Jain
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > > >
>> > > >--
>> > > >Thanks and Regards,
>> > > >Vimal Jain
>> > >
>> > >
>> >
>> >
>> > --
>> > Kevin O'Dell
>> > Systems Engineer, Cloudera
>> >
>>
>
>

Re: HMaster and HRegionServer going down

Posted by Azuryy Yu <az...@gmail.com>.

Can you reproduce the problem? if yes,

add the following in your hbase-env.sh

export HBASE_MASTER_OPTS="-verbose:gc -XX:+PrintGCDateStamps
-XX:+PrintGCDetails -Xloggc:$HBASE_LOG_DIR/hmaster_gc.log
$HBASE_MASTER_OPTS"

export HBASE_REGIONSERVER_OPTS="-verbose:gc -XX:+PrintGCDateStamps
-XX:+PrintGCDetails -Xloggc:$HBASE_LOG_DIR/hmaster_gc.log
$HBASE_REGIONSERVER_OPTS"

then, you will got GC log, I just guess this problem was lead with GC.



On Thu, Jun 6, 2013 at 10:53 AM, Vimal Jain <vk...@gmail.com> wrote:

> Hi Azuryy/Ted,
> Can you please help here...
> On Jun 5, 2013 7:23 PM, "Kevin O'dell" <ke...@cloudera.com> wrote:
>
> > No!
> >
> > Just kidding, you can unsubscribe by going to the Apache site:
> >
> > http://hbase.apache.org/mail-lists.html
> >
> >
> > On Wed, Jun 5, 2013 at 9:34 AM, Joseph Coleman <
> > joe.coleman@infinitecampus.com> wrote:
> >
> > > Please remove me from this list
> > >
> > >
> > > On 6/5/13 8:32 AM, "Vimal Jain" <vk...@gmail.com> wrote:
> > >
> > > >Ok.
> > > >I dont have any batch read/write to hbase.
> > > >
> > > >
> > > >On Wed, Jun 5, 2013 at 6:08 PM, Azuryy Yu <az...@gmail.com> wrote:
> > > >
> > > >> gc log cannot get by default. need some configuration. do you have
> > some
> > > >> batch read or write to hbase?
> > > >>
> > > >> --Send from my Sony mobile.
> > > >> On Jun 5, 2013 8:25 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> > > >>
> > > >> > I dont have GC logs.Do you get it by default  or it has to be
> > > >>configured
> > > >> ?
> > > >> > After i came to know about crash , i checked which all processes
> are
> > > >> > running using "jps"
> > > >> > It displayed 4 processes ,
> "namenode","datanode","secondarynamenode"
> > > >>and
> > > >> > "HQuorumpeer".
> > > >> > So i stopped dfs by running $HADOOP_HOME/bin/stop-dfs.sh and then
> i
> > > >> stopped
> > > >> > hbase by running $HBASE_HOME/bin/stop-hbase.sh
> > > >> >
> > > >> >
> > > >> > On Wed, Jun 5, 2013 at 5:49 PM, Azuryy Yu <az...@gmail.com>
> > wrote:
> > > >> >
> > > >> > > do you have GC log? and what you did during crash? and whats
> your
> > gc
> > > >> > > options?
> > > >> > >
> > > >> > > for the dn error, thats net work issue generally, because dn
> > > >>received
> > > >> an
> > > >> > > incomplete packet.
> > > >> > >
> > > >> > > --Send from my Sony mobile.
> > > >> > > On Jun 5, 2013 8:10 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> > > >> > >
> > > >> > > > Yes.
> > > >> > > > Thats true.
> > > >> > > > There are some errors in all 3 logs during same period , i.e.
> > > >>data ,
> > > >> > > master
> > > >> > > > and region.
> > > >> > > > But i am unable to deduce the exact cause of error.
> > > >> > > > Can you please help in detecting the problem ?
> > > >> > > >
> > > >> > > > So far i am suspecting following :-
> > > >> > > > I have 1GB heap (default) allocated for all 3 processes , i.e.
> > > >> > > > Master,Region,Zookeeper.
> > > >> > > > Both  Master and Region took more time for GC ( as inferred
> from
> > > >> lines
> > > >> > in
> > > >> > > > logs like "slept more time than configured one" etc ) .
> > > >> > > > Due to this there was  zookeeper connection time out for both
> > > >>Master
> > > >> > and
> > > >> > > > Region and hence both went down.
> > > >> > > >
> > > >> > > > I am newbie to Hbase and hence may be my findings are not
> > correct.
> > > >> > > > I want to be 100 % sure before increasing heap space for both
> > > >>Master
> > > >> > and
> > > >> > > > Region ( Both around 2GB) to solve this.
> > > >> > > > At present i have restarted the cluster with default heap
> space
> > > >>only
> > > >> (
> > > >> > > 1GB
> > > >> > > > ).
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <azuryyyu@gmail.com
> >
> > > >> wrote:
> > > >> > > >
> > > >> > > > > there have errors in your dats node log, and the error time
> > > >>match
> > > >> > with
> > > >> > > rs
> > > >> > > > > log error time.
> > > >> > > > >
> > > >> > > > > --Send from my Sony mobile.
> > > >> > > > > On Jun 5, 2013 5:06 PM, "Vimal Jain" <vk...@gmail.com>
> > wrote:
> > > >> > > > >
> > > >> > > > > > I don't think so , as i dont find any issues in data node
> > > >>logs.
> > > >> > > > > > Also there are lot of exceptions like "session expired" ,
> > > >>"slept
> > > >> > more
> > > >> > > > > than
> > > >> > > > > > configured time" . what are these ?
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <
> > azuryyyu@gmail.com
> > > >
> > > >> > > wrote:
> > > >> > > > > >
> > > >> > > > > > > Because your data node 192.168.20.30 broke down. which
> > > >>leads to
> > > >> > RS
> > > >> > > > > down.
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain
> > > >><vk...@gmail.com>
> > > >> > > wrote:
> > > >> > > > > > >
> > > >> > > > > > > > Here is the complete log:
> > > >> > > > > > > >
> > > >> > > > > > > > http://bin.cakephp.org/saved/103001 - Hregion
> > > >> > > > > > > > http://bin.cakephp.org/saved/103000 - Hmaster
> > > >> > > > > > > > http://bin.cakephp.org/saved/103002 - Datanode
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <
> > > >> vkjk89@gmail.com>
> > > >> > > > > wrote:
> > > >> > > > > > > >
> > > >> > > > > > > > > Hi,
> > > >> > > > > > > > > I have set up Hbase in pseudo-distributed mode.
> > > >> > > > > > > > > It was working fine for 6 days , but suddenly today
> > > >>morning
> > > >> > > both
> > > >> > > > > > > HMaster
> > > >> > > > > > > > > and Hregion process went down.
> > > >> > > > > > > > > I checked in logs of both hadoop and hbase.
> > > >> > > > > > > > > Please help here.
> > > >> > > > > > > > > Here are the snippets :-
> > > >> > > > > > > > >
> > > >> > > > > > > > > *Datanode logs:*
> > > >> > > > > > > > > 2013-06-05 05:12:51,436 INFO
> > > >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > >>Exception
> > > >> in
> > > >> > > > > > > > receiveBlock
> > > >> > > > > > > > > for block blk_1597245478875608321_2818
> > > >> java.io.EOFException:
> > > >> > > > while
> > > >> > > > > > > trying
> > > >> > > > > > > > > to read 2347 bytes
> > > >> > > > > > > > > 2013-06-05 05:12:51,442 INFO
> > > >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > >>writeBlock
> > > >> > > > > > > > > blk_1597245478875608321_2818 received exception
> > > >> > > > > java.io.EOFException:
> > > >> > > > > > > > while
> > > >> > > > > > > > > trying to read 2347 bytes
> > > >> > > > > > > > > 2013-06-05 05:12:51,442 ERROR
> > > >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > >> > > > > > DatanodeRegistration(
> > > >> > > > > > > > > 192.168.20.30:50010,
> > > >> > > > > > > > >
> > > >>storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> > > >> > > > > > > > infoPort=50075,
> > > >> > > > > > > > > ipcPort=50020):DataXceiver
> > > >> > > > > > > > > java.io.EOFException: while trying to read 2347
> bytes
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > *HRegion logs:*
> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > >> > > > > We
> > > >> > > > > > > > > slept 4694929ms instead of 3000ms, this is likely
> due
> > > >>to a
> > > >> > long
> > > >> > > > > > garbage
> > > >> > > > > > > > > collecting pause and it's usually bad, see
> > > >> > > > > > > > >
> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > >> > > > > > > > > 2013-06-05 05:12:51,045 WARN
> > > >> > org.apache.hadoop.hdfs.DFSClient:
> > > >> > > > > > > > > DFSOutputStream ResponseProcessor exception  for
> block
> > > >> > > > > > > > >
> > > >> blk_1597245478875608321_2818java.net.SocketTimeoutException:
> > > >> > > > 63000
> > > >> > > > > > > millis
> > > >> > > > > > > > > timeout while waiting for channel to be ready for
> > read.
> > > >>ch
> > > >> :
> > > >> > > > > > > > > java.nio.channels.SocketChannel[connected local=/
> > > >> > > > > 192.168.20.30:44333
> > > >> > > > > > > > remote=/
> > > >> > > > > > > > > 192.168.20.30:50010]
> > > >> > > > > > > > > 2013-06-05 05:12:51,046 WARN
> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > >> > > > > We
> > > >> > > > > > > > > slept 11695345ms instead of 10000000ms, this is
> likely
> > > >>due
> > > >> > to a
> > > >> > > > > long
> > > >> > > > > > > > > garbage collecting pause and it's usually bad, see
> > > >> > > > > > > > >
> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > >> > > > > > > > > 2013-06-05 05:12:51,048 WARN
> > > >> > org.apache.hadoop.hdfs.DFSClient:
> > > >> > > > > Error
> > > >> > > > > > > > > Recovery for block blk_1597245478875608321_2818 bad
> > > >> > datanode[0]
> > > >> > > > > > > > > 192.168.20.30:50010
> > > >> > > > > > > > > 2013-06-05 05:12:51,075 WARN
> > > >> > org.apache.hadoop.hdfs.DFSClient:
> > > >> > > > > Error
> > > >> > > > > > > > while
> > > >> > > > > > > > > syncing
> > > >> > > > > > > > > java.io.IOException: All datanodes
> > 192.168.20.30:50010
> > > >>are
> > > >> > > bad.
> > > >> > > > > > > > > Aborting...
> > > >> > > > > > > > >     at
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> >>org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFS
> > > >>Client.java:3096)
> > > >> > > > > > > > > 2013-06-05 05:12:51,110 FATAL
> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could
> > not
> > > >> > sync.
> > > >> > > > > > > Requesting
> > > >> > > > > > > > > close of hlog
> > > >> > > > > > > > > java.io.IOException: Reflection
> > > >> > > > > > > > > Caused by:
> java.lang.reflect.InvocationTargetException
> > > >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
> > > >>closed
> > > >> > > > > > > > > 2013-06-05 05:12:51,180 FATAL
> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could
> > not
> > > >> > sync.
> > > >> > > > > > > Requesting
> > > >> > > > > > > > > close of hlog
> > > >> > > > > > > > > java.io.IOException: Reflection
> > > >> > > > > > > > > Caused by:
> java.lang.reflect.InvocationTargetException
> > > >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
> > > >>closed
> > > >> > > > > > > > > 2013-06-05 05:12:51,183 ERROR
> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog:
> Failed
> > > >>close
> > > >> > of
> > > >> > > > HLog
> > > >> > > > > > > > writer
> > > >> > > > > > > > > java.io.IOException: Reflection
> > > >> > > > > > > > > Caused by:
> java.lang.reflect.InvocationTargetException
> > > >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
> > > >>closed
> > > >> > > > > > > > > 2013-06-05 05:12:51,184 WARN
> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog:
> Riding
> > > >>over
> > > >> > HLog
> > > >> > > > > close
> > > >> > > > > > > > > failure! error count=1
> > > >> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > >> ABORTING
> > > >> > > > region
> > > >> > > > > > > > server
> > > >> > > > > > > > > hbase.rummycircle.com,60020,1369877672964:
> > > >> > > > > > > > > regionserver:60020-0x13ef31264d00001
> > > >> > > > > > > regionserver:60020-0x13ef31264d00001
> > > >> > > > > > > > > received expired from ZooKeeper, aborting
> > > >> > > > > > > > >
> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > >> > > > > > > > > KeeperErrorCode = Session expired
> > > >> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > >> > > RegionServer
> > > >> > > > > > abort:
> > > >> > > > > > > > > loaded coprocessors are: []
> > > >> > > > > > > > > 2013-06-05 05:12:52,621 INFO
> > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.SplitLogWorker:
> > > >> > > > SplitLogWorker
> > > >> > > > > > > > > interrupted while waiting for task, exiting:
> > > >> > > > > > > > java.lang.InterruptedException
> > > >> > > > > > > > > java.io.InterruptedIOException: Aborting compaction
> of
> > > >> store
> > > >> > > > > cfp_info
> > > >> > > > > > > in
> > > >> > > > > > > > > region
> > > >> > > > > > >
> > > >> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> > > >> > > > > > > > > because user requested stop.
> > > >> > > > > > > > > 2013-06-05 05:12:53,425 WARN
> > > >> > > > > > > > >
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > > >> > > Possibly
> > > >> > > > > > > > transient
> > > >> > > > > > > > > ZooKeeper exception:
> > > >> > > > > > > > >
> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > >> > > > > > hbase.rummycircle.com
> > > >> > > > > > > > > ,60020,1369877672964
> > > >> > > > > > > > > 2013-06-05 05:12:55,426 WARN
> > > >> > > > > > > > >
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > > >> > > Possibly
> > > >> > > > > > > > transient
> > > >> > > > > > > > > ZooKeeper exception:
> > > >> > > > > > > > >
> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > >> > > > > > hbase.rummycircle.com
> > > >> > > > > > > > > ,60020,1369877672964
> > > >> > > > > > > > > 2013-06-05 05:12:59,427 WARN
> > > >> > > > > > > > >
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > > >> > > Possibly
> > > >> > > > > > > > transient
> > > >> > > > > > > > > ZooKeeper exception:
> > > >> > > > > > > > >
> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > >> > > > > > hbase.rummycircle.com
> > > >> > > > > > > > > ,60020,1369877672964
> > > >> > > > > > > > > 2013-06-05 05:13:07,427 WARN
> > > >> > > > > > > > >
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > > >> > > Possibly
> > > >> > > > > > > > transient
> > > >> > > > > > > > > ZooKeeper exception:
> > > >> > > > > > > > >
> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > >> > > > > > hbase.rummycircle.com
> > > >> > > > > > > > > ,60020,1369877672964
> > > >> > > > > > > > > 2013-06-05 05:13:07,427 ERROR
> > > >> > > > > > > > >
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > > >> > > ZooKeeper
> > > >> > > > > > > delete
> > > >> > > > > > > > > failed after 3 retries
> > > >> > > > > > > > >
> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > >> > > > > > hbase.rummycircle.com
> > > >> > > > > > > > > ,60020,1369877672964
> > > >> > > > > > > > >     at
> > > >> > > > > > > > >
> > > >> > > > >
> > > >>
> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> > > >> > > > > > > > >     at
> > > >> > > > > > > >
> > > >> > > >
> > > >>org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > >> > > > > > > > > 2013-06-05 05:13:07,436 ERROR
> > > >> > org.apache.hadoop.hdfs.DFSClient:
> > > >> > > > > > > Exception
> > > >> > > > > > > > > closing file /hbase/.logs/hbase.rummycircle.com
> > > >> > > > > ,60020,1369877672964/
> > > >> > > > > > > > > hbase.rummycircle.com
> > > >> %2C60020%2C1369877672964.1370382721642
> > > >> > :
> > > >> > > > > > > > > java.io.IOException: All datanodes
> > 192.168.20.30:50010
> > > >>are
> > > >> > > bad.
> > > >> > > > > > > > > Aborting...
> > > >> > > > > > > > > java.io.IOException: All datanodes
> > 192.168.20.30:50010
> > > >>are
> > > >> > > bad.
> > > >> > > > > > > > > Aborting...
> > > >> > > > > > > > >     at
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> >>org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFS
> > > >>Client.java:3096)
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > *HMaster logs:*
> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > >> > > > > We
> > > >> > > > > > > > > slept 4702394ms instead of 10000ms, this is likely
> due
> > > >>to a
> > > >> > > long
> > > >> > > > > > > garbage
> > > >> > > > > > > > > collecting pause and it's usually bad, see
> > > >> > > > > > > > >
> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > >> > > > > We
> > > >> > > > > > > > > slept 4988731ms instead of 300000ms, this is likely
> > due
> > > >>to
> > > >> a
> > > >> > > long
> > > >> > > > > > > garbage
> > > >> > > > > > > > > collecting pause and it's usually bad, see
> > > >> > > > > > > > >
> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > >> > > > > We
> > > >> > > > > > > > > slept 4988726ms instead of 300000ms, this is likely
> > due
> > > >>to
> > > >> a
> > > >> > > long
> > > >> > > > > > > garbage
> > > >> > > > > > > > > collecting pause and it's usually bad, see
> > > >> > > > > > > > >
> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > >> > > > > We
> > > >> > > > > > > > > slept 4698291ms instead of 10000ms, this is likely
> due
> > > >>to a
> > > >> > > long
> > > >> > > > > > > garbage
> > > >> > > > > > > > > collecting pause and it's usually bad, see
> > > >> > > > > > > > >
> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > >> > > > > > > > > 2013-06-05 05:12:50,711 WARN
> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > >> > > > > We
> > > >> > > > > > > > > slept 4694502ms instead of 1000ms, this is likely
> due
> > > >>to a
> > > >> > long
> > > >> > > > > > garbage
> > > >> > > > > > > > > collecting pause and it's usually bad, see
> > > >> > > > > > > > >
> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > >> > > > > > > > > 2013-06-05 05:12:50,714 WARN
> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > >> > > > > We
> > > >> > > > > > > > > slept 4694492ms instead of 1000ms, this is likely
> due
> > > >>to a
> > > >> > long
> > > >> > > > > > garbage
> > > >> > > > > > > > > collecting pause and it's usually bad, see
> > > >> > > > > > > > >
> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > >> > > > > > > > > 2013-06-05 05:12:50,715 WARN
> > > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > >> > > > > We
> > > >> > > > > > > > > slept 4695589ms instead of 60000ms, this is likely
> due
> > > >>to a
> > > >> > > long
> > > >> > > > > > > garbage
> > > >> > > > > > > > > collecting pause and it's usually bad, see
> > > >> > > > > > > > >
> > > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > >> > > > > > > > > 2013-06-05 05:12:52,263 FATAL
> > > >> > > > > org.apache.hadoop.hbase.master.HMaster:
> > > >> > > > > > > > > Master server abort: loaded coprocessors are: []
> > > >> > > > > > > > > 2013-06-05 05:12:52,465 INFO
> > > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > >> > > > > > > > > Waiting for region servers count to settle;
> currently
> > > >> checked
> > > >> > > in
> > > >> > > > 1,
> > > >> > > > > > > slept
> > > >> > > > > > > > > for 0 ms, expecting minimum of 1, maximum of
> > 2147483647,
> > > >> > > timeout
> > > >> > > > of
> > > >> > > > > > > 4500
> > > >> > > > > > > > > ms, interval of 1500 ms.
> > > >> > > > > > > > > 2013-06-05 05:12:52,561 ERROR
> > > >> > > > > org.apache.hadoop.hbase.master.HMaster:
> > > >> > > > > > > > > Region server hbase.rummycircle.com
> > ,60020,1369877672964
> > > >> > > > reported a
> > > >> > > > > > > fatal
> > > >> > > > > > > > > error:
> > > >> > > > > > > > >
> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > >> > > > > > > > > KeeperErrorCode = Session expired
> > > >> > > > > > > > > 2013-06-05 05:12:53,970 INFO
> > > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > >> > > > > > > > > Waiting for region servers count to settle;
> currently
> > > >> checked
> > > >> > > in
> > > >> > > > 1,
> > > >> > > > > > > slept
> > > >> > > > > > > > > for 1506 ms, expecting minimum of 1, maximum of
> > > >>2147483647,
> > > >> > > > timeout
> > > >> > > > > > of
> > > >> > > > > > > > 4500
> > > >> > > > > > > > > ms, interval of 1500 ms.
> > > >> > > > > > > > > 2013-06-05 05:12:55,476 INFO
> > > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > >> > > > > > > > > Waiting for region servers count to settle;
> currently
> > > >> checked
> > > >> > > in
> > > >> > > > 1,
> > > >> > > > > > > slept
> > > >> > > > > > > > > for 3012 ms, expecting minimum of 1, maximum of
> > > >>2147483647,
> > > >> > > > timeout
> > > >> > > > > > of
> > > >> > > > > > > > 4500
> > > >> > > > > > > > > ms, interval of 1500 ms.
> > > >> > > > > > > > > 2013-06-05 05:12:56,981 INFO
> > > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > >> > > > > > > > > Finished waiting for region servers count to settle;
> > > >> checked
> > > >> > in
> > > >> > > > 1,
> > > >> > > > > > > slept
> > > >> > > > > > > > > for 4517 ms, expecting minimum of 1, maximum of
> > > >>2147483647,
> > > >> > > > master
> > > >> > > > > is
> > > >> > > > > > > > > running.
> > > >> > > > > > > > > 2013-06-05 05:12:57,019 INFO
> > > >> > > > > > > > > org.apache.hadoop.hbase.catalog.CatalogTracker:
> Failed
> > > >> > > > verification
> > > >> > > > > > of
> > > >> > > > > > > > > -ROOT-,,0 at address=hbase.rummycircle.com
> > > >> > > ,60020,1369877672964;
> > > >> > > > > > > > > java.io.EOFException
> > > >> > > > > > > > > 2013-06-05 05:17:52,302 WARN
> > > >> > > > > > > > > org.apache.hadoop.hbase.master.SplitLogManager:
> error
> > > >>while
> > > >> > > > > splitting
> > > >> > > > > > > > logs
> > > >> > > > > > > > > in [hdfs://
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > >
> >
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-
> > > >>splitting
> > > >> > > > > > > > ]
> > > >> > > > > > > > > installed = 19 but only 0 done
> > > >> > > > > > > > > 2013-06-05 05:17:52,321 FATAL
> > > >> > > > > org.apache.hadoop.hbase.master.HMaster:
> > > >> > > > > > > > > master:60000-0x13ef31264d00000
> > > >> master:60000-0x13ef31264d00000
> > > >> > > > > > received
> > > >> > > > > > > > > expired from ZooKeeper, aborting
> > > >> > > > > > > > >
> > > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > >> > > > > > > > > KeeperErrorCode = Session expired
> > > >> > > > > > > > > java.io.IOException: Giving up after tries=1
> > > >> > > > > > > > > Caused by: java.lang.InterruptedException: sleep
> > > >> interrupted
> > > >> > > > > > > > > 2013-06-05 05:17:52,381 ERROR
> > > >> > > > > > > > > org.apache.hadoop.hbase.master.HMasterCommandLine:
> > > >>Failed
> > > >> to
> > > >> > > > start
> > > >> > > > > > > master
> > > >> > > > > > > > > java.lang.RuntimeException: HMaster Aborted
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > --
> > > >> > > > > > > > > Thanks and Regards,
> > > >> > > > > > > > > Vimal Jain
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > --
> > > >> > > > > > > > Thanks and Regards,
> > > >> > > > > > > > Vimal Jain
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > --
> > > >> > > > > > Thanks and Regards,
> > > >> > > > > > Vimal Jain
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Thanks and Regards,
> > > >> > > > Vimal Jain
> > > >> > > >
> > > >> > >
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Thanks and Regards,
> > > >> > Vimal Jain
> > > >> >
> > > >>
> > > >
> > > >
> > > >
> > > >--
> > > >Thanks and Regards,
> > > >Vimal Jain
> > >
> > >
> >
> >
> > --
> > Kevin O'Dell
> > Systems Engineer, Cloudera
> >
>

Re: HMaster and HRegionServer going down

Posted by Vimal Jain <vk...@gmail.com>.

Hi Azuryy/Ted,
Can you please help here...
On Jun 5, 2013 7:23 PM, "Kevin O'dell" <ke...@cloudera.com> wrote:

> No!
>
> Just kidding, you can unsubscribe by going to the Apache site:
>
> http://hbase.apache.org/mail-lists.html
>
>
> On Wed, Jun 5, 2013 at 9:34 AM, Joseph Coleman <
> joe.coleman@infinitecampus.com> wrote:
>
> > Please remove me from this list
> >
> >
> > On 6/5/13 8:32 AM, "Vimal Jain" <vk...@gmail.com> wrote:
> >
> > >Ok.
> > >I dont have any batch read/write to hbase.
> > >
> > >
> > >On Wed, Jun 5, 2013 at 6:08 PM, Azuryy Yu <az...@gmail.com> wrote:
> > >
> > >> gc log cannot get by default. need some configuration. do you have
> some
> > >> batch read or write to hbase?
> > >>
> > >> --Send from my Sony mobile.
> > >> On Jun 5, 2013 8:25 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> > >>
> > >> > I dont have GC logs.Do you get it by default  or it has to be
> > >>configured
> > >> ?
> > >> > After i came to know about crash , i checked which all processes are
> > >> > running using "jps"
> > >> > It displayed 4 processes , "namenode","datanode","secondarynamenode"
> > >>and
> > >> > "HQuorumpeer".
> > >> > So i stopped dfs by running $HADOOP_HOME/bin/stop-dfs.sh and then i
> > >> stopped
> > >> > hbase by running $HBASE_HOME/bin/stop-hbase.sh
> > >> >
> > >> >
> > >> > On Wed, Jun 5, 2013 at 5:49 PM, Azuryy Yu <az...@gmail.com>
> wrote:
> > >> >
> > >> > > do you have GC log? and what you did during crash? and whats your
> gc
> > >> > > options?
> > >> > >
> > >> > > for the dn error, thats net work issue generally, because dn
> > >>received
> > >> an
> > >> > > incomplete packet.
> > >> > >
> > >> > > --Send from my Sony mobile.
> > >> > > On Jun 5, 2013 8:10 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> > >> > >
> > >> > > > Yes.
> > >> > > > Thats true.
> > >> > > > There are some errors in all 3 logs during same period , i.e.
> > >>data ,
> > >> > > master
> > >> > > > and region.
> > >> > > > But i am unable to deduce the exact cause of error.
> > >> > > > Can you please help in detecting the problem ?
> > >> > > >
> > >> > > > So far i am suspecting following :-
> > >> > > > I have 1GB heap (default) allocated for all 3 processes , i.e.
> > >> > > > Master,Region,Zookeeper.
> > >> > > > Both  Master and Region took more time for GC ( as inferred from
> > >> lines
> > >> > in
> > >> > > > logs like "slept more time than configured one" etc ) .
> > >> > > > Due to this there was  zookeeper connection time out for both
> > >>Master
> > >> > and
> > >> > > > Region and hence both went down.
> > >> > > >
> > >> > > > I am newbie to Hbase and hence may be my findings are not
> correct.
> > >> > > > I want to be 100 % sure before increasing heap space for both
> > >>Master
> > >> > and
> > >> > > > Region ( Both around 2GB) to solve this.
> > >> > > > At present i have restarted the cluster with default heap space
> > >>only
> > >> (
> > >> > > 1GB
> > >> > > > ).
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <az...@gmail.com>
> > >> wrote:
> > >> > > >
> > >> > > > > there have errors in your dats node log, and the error time
> > >>match
> > >> > with
> > >> > > rs
> > >> > > > > log error time.
> > >> > > > >
> > >> > > > > --Send from my Sony mobile.
> > >> > > > > On Jun 5, 2013 5:06 PM, "Vimal Jain" <vk...@gmail.com>
> wrote:
> > >> > > > >
> > >> > > > > > I don't think so , as i dont find any issues in data node
> > >>logs.
> > >> > > > > > Also there are lot of exceptions like "session expired" ,
> > >>"slept
> > >> > more
> > >> > > > > than
> > >> > > > > > configured time" . what are these ?
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <
> azuryyyu@gmail.com
> > >
> > >> > > wrote:
> > >> > > > > >
> > >> > > > > > > Because your data node 192.168.20.30 broke down. which
> > >>leads to
> > >> > RS
> > >> > > > > down.
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain
> > >><vk...@gmail.com>
> > >> > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Here is the complete log:
> > >> > > > > > > >
> > >> > > > > > > > http://bin.cakephp.org/saved/103001 - Hregion
> > >> > > > > > > > http://bin.cakephp.org/saved/103000 - Hmaster
> > >> > > > > > > > http://bin.cakephp.org/saved/103002 - Datanode
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <
> > >> vkjk89@gmail.com>
> > >> > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Hi,
> > >> > > > > > > > > I have set up Hbase in pseudo-distributed mode.
> > >> > > > > > > > > It was working fine for 6 days , but suddenly today
> > >>morning
> > >> > > both
> > >> > > > > > > HMaster
> > >> > > > > > > > > and Hregion process went down.
> > >> > > > > > > > > I checked in logs of both hadoop and hbase.
> > >> > > > > > > > > Please help here.
> > >> > > > > > > > > Here are the snippets :-
> > >> > > > > > > > >
> > >> > > > > > > > > *Datanode logs:*
> > >> > > > > > > > > 2013-06-05 05:12:51,436 INFO
> > >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >>Exception
> > >> in
> > >> > > > > > > > receiveBlock
> > >> > > > > > > > > for block blk_1597245478875608321_2818
> > >> java.io.EOFException:
> > >> > > > while
> > >> > > > > > > trying
> > >> > > > > > > > > to read 2347 bytes
> > >> > > > > > > > > 2013-06-05 05:12:51,442 INFO
> > >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >>writeBlock
> > >> > > > > > > > > blk_1597245478875608321_2818 received exception
> > >> > > > > java.io.EOFException:
> > >> > > > > > > > while
> > >> > > > > > > > > trying to read 2347 bytes
> > >> > > > > > > > > 2013-06-05 05:12:51,442 ERROR
> > >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >> > > > > > DatanodeRegistration(
> > >> > > > > > > > > 192.168.20.30:50010,
> > >> > > > > > > > >
> > >>storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> > >> > > > > > > > infoPort=50075,
> > >> > > > > > > > > ipcPort=50020):DataXceiver
> > >> > > > > > > > > java.io.EOFException: while trying to read 2347 bytes
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > *HRegion logs:*
> > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > >> > > > > We
> > >> > > > > > > > > slept 4694929ms instead of 3000ms, this is likely due
> > >>to a
> > >> > long
> > >> > > > > > garbage
> > >> > > > > > > > > collecting pause and it's usually bad, see
> > >> > > > > > > > >
> > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >> > > > > > > > > 2013-06-05 05:12:51,045 WARN
> > >> > org.apache.hadoop.hdfs.DFSClient:
> > >> > > > > > > > > DFSOutputStream ResponseProcessor exception  for block
> > >> > > > > > > > >
> > >> blk_1597245478875608321_2818java.net.SocketTimeoutException:
> > >> > > > 63000
> > >> > > > > > > millis
> > >> > > > > > > > > timeout while waiting for channel to be ready for
> read.
> > >>ch
> > >> :
> > >> > > > > > > > > java.nio.channels.SocketChannel[connected local=/
> > >> > > > > 192.168.20.30:44333
> > >> > > > > > > > remote=/
> > >> > > > > > > > > 192.168.20.30:50010]
> > >> > > > > > > > > 2013-06-05 05:12:51,046 WARN
> > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > >> > > > > We
> > >> > > > > > > > > slept 11695345ms instead of 10000000ms, this is likely
> > >>due
> > >> > to a
> > >> > > > > long
> > >> > > > > > > > > garbage collecting pause and it's usually bad, see
> > >> > > > > > > > >
> > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >> > > > > > > > > 2013-06-05 05:12:51,048 WARN
> > >> > org.apache.hadoop.hdfs.DFSClient:
> > >> > > > > Error
> > >> > > > > > > > > Recovery for block blk_1597245478875608321_2818 bad
> > >> > datanode[0]
> > >> > > > > > > > > 192.168.20.30:50010
> > >> > > > > > > > > 2013-06-05 05:12:51,075 WARN
> > >> > org.apache.hadoop.hdfs.DFSClient:
> > >> > > > > Error
> > >> > > > > > > > while
> > >> > > > > > > > > syncing
> > >> > > > > > > > > java.io.IOException: All datanodes
> 192.168.20.30:50010
> > >>are
> > >> > > bad.
> > >> > > > > > > > > Aborting...
> > >> > > > > > > > >     at
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> >>org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFS
> > >>Client.java:3096)
> > >> > > > > > > > > 2013-06-05 05:12:51,110 FATAL
> > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could
> not
> > >> > sync.
> > >> > > > > > > Requesting
> > >> > > > > > > > > close of hlog
> > >> > > > > > > > > java.io.IOException: Reflection
> > >> > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
> > >>closed
> > >> > > > > > > > > 2013-06-05 05:12:51,180 FATAL
> > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could
> not
> > >> > sync.
> > >> > > > > > > Requesting
> > >> > > > > > > > > close of hlog
> > >> > > > > > > > > java.io.IOException: Reflection
> > >> > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
> > >>closed
> > >> > > > > > > > > 2013-06-05 05:12:51,183 ERROR
> > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Failed
> > >>close
> > >> > of
> > >> > > > HLog
> > >> > > > > > > > writer
> > >> > > > > > > > > java.io.IOException: Reflection
> > >> > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
> > >>closed
> > >> > > > > > > > > 2013-06-05 05:12:51,184 WARN
> > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Riding
> > >>over
> > >> > HLog
> > >> > > > > close
> > >> > > > > > > > > failure! error count=1
> > >> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
> > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > >> ABORTING
> > >> > > > region
> > >> > > > > > > > server
> > >> > > > > > > > > hbase.rummycircle.com,60020,1369877672964:
> > >> > > > > > > > > regionserver:60020-0x13ef31264d00001
> > >> > > > > > > regionserver:60020-0x13ef31264d00001
> > >> > > > > > > > > received expired from ZooKeeper, aborting
> > >> > > > > > > > >
> > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > >> > > > > > > > > KeeperErrorCode = Session expired
> > >> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
> > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > >> > > RegionServer
> > >> > > > > > abort:
> > >> > > > > > > > > loaded coprocessors are: []
> > >> > > > > > > > > 2013-06-05 05:12:52,621 INFO
> > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.SplitLogWorker:
> > >> > > > SplitLogWorker
> > >> > > > > > > > > interrupted while waiting for task, exiting:
> > >> > > > > > > > java.lang.InterruptedException
> > >> > > > > > > > > java.io.InterruptedIOException: Aborting compaction of
> > >> store
> > >> > > > > cfp_info
> > >> > > > > > > in
> > >> > > > > > > > > region
> > >> > > > > > >
> > >> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> > >> > > > > > > > > because user requested stop.
> > >> > > > > > > > > 2013-06-05 05:12:53,425 WARN
> > >> > > > > > > > >
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > >> > > Possibly
> > >> > > > > > > > transient
> > >> > > > > > > > > ZooKeeper exception:
> > >> > > > > > > > >
> > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > >> > > > > > hbase.rummycircle.com
> > >> > > > > > > > > ,60020,1369877672964
> > >> > > > > > > > > 2013-06-05 05:12:55,426 WARN
> > >> > > > > > > > >
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > >> > > Possibly
> > >> > > > > > > > transient
> > >> > > > > > > > > ZooKeeper exception:
> > >> > > > > > > > >
> > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > >> > > > > > hbase.rummycircle.com
> > >> > > > > > > > > ,60020,1369877672964
> > >> > > > > > > > > 2013-06-05 05:12:59,427 WARN
> > >> > > > > > > > >
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > >> > > Possibly
> > >> > > > > > > > transient
> > >> > > > > > > > > ZooKeeper exception:
> > >> > > > > > > > >
> > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > >> > > > > > hbase.rummycircle.com
> > >> > > > > > > > > ,60020,1369877672964
> > >> > > > > > > > > 2013-06-05 05:13:07,427 WARN
> > >> > > > > > > > >
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > >> > > Possibly
> > >> > > > > > > > transient
> > >> > > > > > > > > ZooKeeper exception:
> > >> > > > > > > > >
> > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > >> > > > > > hbase.rummycircle.com
> > >> > > > > > > > > ,60020,1369877672964
> > >> > > > > > > > > 2013-06-05 05:13:07,427 ERROR
> > >> > > > > > > > >
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > >> > > ZooKeeper
> > >> > > > > > > delete
> > >> > > > > > > > > failed after 3 retries
> > >> > > > > > > > >
> > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > >> > > > > > hbase.rummycircle.com
> > >> > > > > > > > > ,60020,1369877672964
> > >> > > > > > > > >     at
> > >> > > > > > > > >
> > >> > > > >
> > >> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> > >> > > > > > > > >     at
> > >> > > > > > > >
> > >> > > >
> > >>org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > >> > > > > > > > > 2013-06-05 05:13:07,436 ERROR
> > >> > org.apache.hadoop.hdfs.DFSClient:
> > >> > > > > > > Exception
> > >> > > > > > > > > closing file /hbase/.logs/hbase.rummycircle.com
> > >> > > > > ,60020,1369877672964/
> > >> > > > > > > > > hbase.rummycircle.com
> > >> %2C60020%2C1369877672964.1370382721642
> > >> > :
> > >> > > > > > > > > java.io.IOException: All datanodes
> 192.168.20.30:50010
> > >>are
> > >> > > bad.
> > >> > > > > > > > > Aborting...
> > >> > > > > > > > > java.io.IOException: All datanodes
> 192.168.20.30:50010
> > >>are
> > >> > > bad.
> > >> > > > > > > > > Aborting...
> > >> > > > > > > > >     at
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> >>org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFS
> > >>Client.java:3096)
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > *HMaster logs:*
> > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > >> > > > > We
> > >> > > > > > > > > slept 4702394ms instead of 10000ms, this is likely due
> > >>to a
> > >> > > long
> > >> > > > > > > garbage
> > >> > > > > > > > > collecting pause and it's usually bad, see
> > >> > > > > > > > >
> > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > >> > > > > We
> > >> > > > > > > > > slept 4988731ms instead of 300000ms, this is likely
> due
> > >>to
> > >> a
> > >> > > long
> > >> > > > > > > garbage
> > >> > > > > > > > > collecting pause and it's usually bad, see
> > >> > > > > > > > >
> > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > >> > > > > We
> > >> > > > > > > > > slept 4988726ms instead of 300000ms, this is likely
> due
> > >>to
> > >> a
> > >> > > long
> > >> > > > > > > garbage
> > >> > > > > > > > > collecting pause and it's usually bad, see
> > >> > > > > > > > >
> > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > >> > > > > We
> > >> > > > > > > > > slept 4698291ms instead of 10000ms, this is likely due
> > >>to a
> > >> > > long
> > >> > > > > > > garbage
> > >> > > > > > > > > collecting pause and it's usually bad, see
> > >> > > > > > > > >
> > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >> > > > > > > > > 2013-06-05 05:12:50,711 WARN
> > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > >> > > > > We
> > >> > > > > > > > > slept 4694502ms instead of 1000ms, this is likely due
> > >>to a
> > >> > long
> > >> > > > > > garbage
> > >> > > > > > > > > collecting pause and it's usually bad, see
> > >> > > > > > > > >
> > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >> > > > > > > > > 2013-06-05 05:12:50,714 WARN
> > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > >> > > > > We
> > >> > > > > > > > > slept 4694492ms instead of 1000ms, this is likely due
> > >>to a
> > >> > long
> > >> > > > > > garbage
> > >> > > > > > > > > collecting pause and it's usually bad, see
> > >> > > > > > > > >
> > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >> > > > > > > > > 2013-06-05 05:12:50,715 WARN
> > >> > > > org.apache.hadoop.hbase.util.Sleeper:
> > >> > > > > We
> > >> > > > > > > > > slept 4695589ms instead of 60000ms, this is likely due
> > >>to a
> > >> > > long
> > >> > > > > > > garbage
> > >> > > > > > > > > collecting pause and it's usually bad, see
> > >> > > > > > > > >
> > >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > >> > > > > > > > > 2013-06-05 05:12:52,263 FATAL
> > >> > > > > org.apache.hadoop.hbase.master.HMaster:
> > >> > > > > > > > > Master server abort: loaded coprocessors are: []
> > >> > > > > > > > > 2013-06-05 05:12:52,465 INFO
> > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > >> > > > > > > > > Waiting for region servers count to settle; currently
> > >> checked
> > >> > > in
> > >> > > > 1,
> > >> > > > > > > slept
> > >> > > > > > > > > for 0 ms, expecting minimum of 1, maximum of
> 2147483647,
> > >> > > timeout
> > >> > > > of
> > >> > > > > > > 4500
> > >> > > > > > > > > ms, interval of 1500 ms.
> > >> > > > > > > > > 2013-06-05 05:12:52,561 ERROR
> > >> > > > > org.apache.hadoop.hbase.master.HMaster:
> > >> > > > > > > > > Region server hbase.rummycircle.com
> ,60020,1369877672964
> > >> > > > reported a
> > >> > > > > > > fatal
> > >> > > > > > > > > error:
> > >> > > > > > > > >
> > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > >> > > > > > > > > KeeperErrorCode = Session expired
> > >> > > > > > > > > 2013-06-05 05:12:53,970 INFO
> > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > >> > > > > > > > > Waiting for region servers count to settle; currently
> > >> checked
> > >> > > in
> > >> > > > 1,
> > >> > > > > > > slept
> > >> > > > > > > > > for 1506 ms, expecting minimum of 1, maximum of
> > >>2147483647,
> > >> > > > timeout
> > >> > > > > > of
> > >> > > > > > > > 4500
> > >> > > > > > > > > ms, interval of 1500 ms.
> > >> > > > > > > > > 2013-06-05 05:12:55,476 INFO
> > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > >> > > > > > > > > Waiting for region servers count to settle; currently
> > >> checked
> > >> > > in
> > >> > > > 1,
> > >> > > > > > > slept
> > >> > > > > > > > > for 3012 ms, expecting minimum of 1, maximum of
> > >>2147483647,
> > >> > > > timeout
> > >> > > > > > of
> > >> > > > > > > > 4500
> > >> > > > > > > > > ms, interval of 1500 ms.
> > >> > > > > > > > > 2013-06-05 05:12:56,981 INFO
> > >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > >> > > > > > > > > Finished waiting for region servers count to settle;
> > >> checked
> > >> > in
> > >> > > > 1,
> > >> > > > > > > slept
> > >> > > > > > > > > for 4517 ms, expecting minimum of 1, maximum of
> > >>2147483647,
> > >> > > > master
> > >> > > > > is
> > >> > > > > > > > > running.
> > >> > > > > > > > > 2013-06-05 05:12:57,019 INFO
> > >> > > > > > > > > org.apache.hadoop.hbase.catalog.CatalogTracker: Failed
> > >> > > > verification
> > >> > > > > > of
> > >> > > > > > > > > -ROOT-,,0 at address=hbase.rummycircle.com
> > >> > > ,60020,1369877672964;
> > >> > > > > > > > > java.io.EOFException
> > >> > > > > > > > > 2013-06-05 05:17:52,302 WARN
> > >> > > > > > > > > org.apache.hadoop.hbase.master.SplitLogManager: error
> > >>while
> > >> > > > > splitting
> > >> > > > > > > > logs
> > >> > > > > > > > > in [hdfs://
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >>
> >
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-
> > >>splitting
> > >> > > > > > > > ]
> > >> > > > > > > > > installed = 19 but only 0 done
> > >> > > > > > > > > 2013-06-05 05:17:52,321 FATAL
> > >> > > > > org.apache.hadoop.hbase.master.HMaster:
> > >> > > > > > > > > master:60000-0x13ef31264d00000
> > >> master:60000-0x13ef31264d00000
> > >> > > > > > received
> > >> > > > > > > > > expired from ZooKeeper, aborting
> > >> > > > > > > > >
> > >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > >> > > > > > > > > KeeperErrorCode = Session expired
> > >> > > > > > > > > java.io.IOException: Giving up after tries=1
> > >> > > > > > > > > Caused by: java.lang.InterruptedException: sleep
> > >> interrupted
> > >> > > > > > > > > 2013-06-05 05:17:52,381 ERROR
> > >> > > > > > > > > org.apache.hadoop.hbase.master.HMasterCommandLine:
> > >>Failed
> > >> to
> > >> > > > start
> > >> > > > > > > master
> > >> > > > > > > > > java.lang.RuntimeException: HMaster Aborted
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > --
> > >> > > > > > > > > Thanks and Regards,
> > >> > > > > > > > > Vimal Jain
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > --
> > >> > > > > > > > Thanks and Regards,
> > >> > > > > > > > Vimal Jain
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Thanks and Regards,
> > >> > > > > > Vimal Jain
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Thanks and Regards,
> > >> > > > Vimal Jain
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Thanks and Regards,
> > >> > Vimal Jain
> > >> >
> > >>
> > >
> > >
> > >
> > >--
> > >Thanks and Regards,
> > >Vimal Jain
> >
> >
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera
>

Re: HMaster and HRegionServer going down

Posted by Kevin O'dell <ke...@cloudera.com>.

No!

Just kidding, you can unsubscribe by going to the Apache site:

http://hbase.apache.org/mail-lists.html


On Wed, Jun 5, 2013 at 9:34 AM, Joseph Coleman <
joe.coleman@infinitecampus.com> wrote:

> Please remove me from this list
>
>
> On 6/5/13 8:32 AM, "Vimal Jain" <vk...@gmail.com> wrote:
>
> >Ok.
> >I dont have any batch read/write to hbase.
> >
> >
> >On Wed, Jun 5, 2013 at 6:08 PM, Azuryy Yu <az...@gmail.com> wrote:
> >
> >> gc log cannot get by default. need some configuration. do you have some
> >> batch read or write to hbase?
> >>
> >> --Send from my Sony mobile.
> >> On Jun 5, 2013 8:25 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> >>
> >> > I dont have GC logs.Do you get it by default  or it has to be
> >>configured
> >> ?
> >> > After i came to know about crash , i checked which all processes are
> >> > running using "jps"
> >> > It displayed 4 processes , "namenode","datanode","secondarynamenode"
> >>and
> >> > "HQuorumpeer".
> >> > So i stopped dfs by running $HADOOP_HOME/bin/stop-dfs.sh and then i
> >> stopped
> >> > hbase by running $HBASE_HOME/bin/stop-hbase.sh
> >> >
> >> >
> >> > On Wed, Jun 5, 2013 at 5:49 PM, Azuryy Yu <az...@gmail.com> wrote:
> >> >
> >> > > do you have GC log? and what you did during crash? and whats your gc
> >> > > options?
> >> > >
> >> > > for the dn error, thats net work issue generally, because dn
> >>received
> >> an
> >> > > incomplete packet.
> >> > >
> >> > > --Send from my Sony mobile.
> >> > > On Jun 5, 2013 8:10 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> >> > >
> >> > > > Yes.
> >> > > > Thats true.
> >> > > > There are some errors in all 3 logs during same period , i.e.
> >>data ,
> >> > > master
> >> > > > and region.
> >> > > > But i am unable to deduce the exact cause of error.
> >> > > > Can you please help in detecting the problem ?
> >> > > >
> >> > > > So far i am suspecting following :-
> >> > > > I have 1GB heap (default) allocated for all 3 processes , i.e.
> >> > > > Master,Region,Zookeeper.
> >> > > > Both  Master and Region took more time for GC ( as inferred from
> >> lines
> >> > in
> >> > > > logs like "slept more time than configured one" etc ) .
> >> > > > Due to this there was  zookeeper connection time out for both
> >>Master
> >> > and
> >> > > > Region and hence both went down.
> >> > > >
> >> > > > I am newbie to Hbase and hence may be my findings are not correct.
> >> > > > I want to be 100 % sure before increasing heap space for both
> >>Master
> >> > and
> >> > > > Region ( Both around 2GB) to solve this.
> >> > > > At present i have restarted the cluster with default heap space
> >>only
> >> (
> >> > > 1GB
> >> > > > ).
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <az...@gmail.com>
> >> wrote:
> >> > > >
> >> > > > > there have errors in your dats node log, and the error time
> >>match
> >> > with
> >> > > rs
> >> > > > > log error time.
> >> > > > >
> >> > > > > --Send from my Sony mobile.
> >> > > > > On Jun 5, 2013 5:06 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> >> > > > >
> >> > > > > > I don't think so , as i dont find any issues in data node
> >>logs.
> >> > > > > > Also there are lot of exceptions like "session expired" ,
> >>"slept
> >> > more
> >> > > > > than
> >> > > > > > configured time" . what are these ?
> >> > > > > >
> >> > > > > >
> >> > > > > > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <azuryyyu@gmail.com
> >
> >> > > wrote:
> >> > > > > >
> >> > > > > > > Because your data node 192.168.20.30 broke down. which
> >>leads to
> >> > RS
> >> > > > > down.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain
> >><vk...@gmail.com>
> >> > > wrote:
> >> > > > > > >
> >> > > > > > > > Here is the complete log:
> >> > > > > > > >
> >> > > > > > > > http://bin.cakephp.org/saved/103001 - Hregion
> >> > > > > > > > http://bin.cakephp.org/saved/103000 - Hmaster
> >> > > > > > > > http://bin.cakephp.org/saved/103002 - Datanode
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <
> >> vkjk89@gmail.com>
> >> > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Hi,
> >> > > > > > > > > I have set up Hbase in pseudo-distributed mode.
> >> > > > > > > > > It was working fine for 6 days , but suddenly today
> >>morning
> >> > > both
> >> > > > > > > HMaster
> >> > > > > > > > > and Hregion process went down.
> >> > > > > > > > > I checked in logs of both hadoop and hbase.
> >> > > > > > > > > Please help here.
> >> > > > > > > > > Here are the snippets :-
> >> > > > > > > > >
> >> > > > > > > > > *Datanode logs:*
> >> > > > > > > > > 2013-06-05 05:12:51,436 INFO
> >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> >>Exception
> >> in
> >> > > > > > > > receiveBlock
> >> > > > > > > > > for block blk_1597245478875608321_2818
> >> java.io.EOFException:
> >> > > > while
> >> > > > > > > trying
> >> > > > > > > > > to read 2347 bytes
> >> > > > > > > > > 2013-06-05 05:12:51,442 INFO
> >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> >>writeBlock
> >> > > > > > > > > blk_1597245478875608321_2818 received exception
> >> > > > > java.io.EOFException:
> >> > > > > > > > while
> >> > > > > > > > > trying to read 2347 bytes
> >> > > > > > > > > 2013-06-05 05:12:51,442 ERROR
> >> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> > > > > > DatanodeRegistration(
> >> > > > > > > > > 192.168.20.30:50010,
> >> > > > > > > > >
> >>storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> >> > > > > > > > infoPort=50075,
> >> > > > > > > > > ipcPort=50020):DataXceiver
> >> > > > > > > > > java.io.EOFException: while trying to read 2347 bytes
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > *HRegion logs:*
> >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > > > We
> >> > > > > > > > > slept 4694929ms instead of 3000ms, this is likely due
> >>to a
> >> > long
> >> > > > > > garbage
> >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > > > > > > >
> >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > > > > > > > 2013-06-05 05:12:51,045 WARN
> >> > org.apache.hadoop.hdfs.DFSClient:
> >> > > > > > > > > DFSOutputStream ResponseProcessor exception  for block
> >> > > > > > > > >
> >> blk_1597245478875608321_2818java.net.SocketTimeoutException:
> >> > > > 63000
> >> > > > > > > millis
> >> > > > > > > > > timeout while waiting for channel to be ready for read.
> >>ch
> >> :
> >> > > > > > > > > java.nio.channels.SocketChannel[connected local=/
> >> > > > > 192.168.20.30:44333
> >> > > > > > > > remote=/
> >> > > > > > > > > 192.168.20.30:50010]
> >> > > > > > > > > 2013-06-05 05:12:51,046 WARN
> >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > > > We
> >> > > > > > > > > slept 11695345ms instead of 10000000ms, this is likely
> >>due
> >> > to a
> >> > > > > long
> >> > > > > > > > > garbage collecting pause and it's usually bad, see
> >> > > > > > > > >
> >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > > > > > > > 2013-06-05 05:12:51,048 WARN
> >> > org.apache.hadoop.hdfs.DFSClient:
> >> > > > > Error
> >> > > > > > > > > Recovery for block blk_1597245478875608321_2818 bad
> >> > datanode[0]
> >> > > > > > > > > 192.168.20.30:50010
> >> > > > > > > > > 2013-06-05 05:12:51,075 WARN
> >> > org.apache.hadoop.hdfs.DFSClient:
> >> > > > > Error
> >> > > > > > > > while
> >> > > > > > > > > syncing
> >> > > > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010
> >>are
> >> > > bad.
> >> > > > > > > > > Aborting...
> >> > > > > > > > >     at
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >>org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFS
> >>Client.java:3096)
> >> > > > > > > > > 2013-06-05 05:12:51,110 FATAL
> >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not
> >> > sync.
> >> > > > > > > Requesting
> >> > > > > > > > > close of hlog
> >> > > > > > > > > java.io.IOException: Reflection
> >> > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
> >>closed
> >> > > > > > > > > 2013-06-05 05:12:51,180 FATAL
> >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not
> >> > sync.
> >> > > > > > > Requesting
> >> > > > > > > > > close of hlog
> >> > > > > > > > > java.io.IOException: Reflection
> >> > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
> >>closed
> >> > > > > > > > > 2013-06-05 05:12:51,183 ERROR
> >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Failed
> >>close
> >> > of
> >> > > > HLog
> >> > > > > > > > writer
> >> > > > > > > > > java.io.IOException: Reflection
> >> > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> >> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
> >>closed
> >> > > > > > > > > 2013-06-05 05:12:51,184 WARN
> >> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Riding
> >>over
> >> > HLog
> >> > > > > close
> >> > > > > > > > > failure! error count=1
> >> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
> >> > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> ABORTING
> >> > > > region
> >> > > > > > > > server
> >> > > > > > > > > hbase.rummycircle.com,60020,1369877672964:
> >> > > > > > > > > regionserver:60020-0x13ef31264d00001
> >> > > > > > > regionserver:60020-0x13ef31264d00001
> >> > > > > > > > > received expired from ZooKeeper, aborting
> >> > > > > > > > >
> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > > > > > > > KeeperErrorCode = Session expired
> >> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
> >> > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> > > RegionServer
> >> > > > > > abort:
> >> > > > > > > > > loaded coprocessors are: []
> >> > > > > > > > > 2013-06-05 05:12:52,621 INFO
> >> > > > > > > > > org.apache.hadoop.hbase.regionserver.SplitLogWorker:
> >> > > > SplitLogWorker
> >> > > > > > > > > interrupted while waiting for task, exiting:
> >> > > > > > > > java.lang.InterruptedException
> >> > > > > > > > > java.io.InterruptedIOException: Aborting compaction of
> >> store
> >> > > > > cfp_info
> >> > > > > > > in
> >> > > > > > > > > region
> >> > > > > > >
> >> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> >> > > > > > > > > because user requested stop.
> >> > > > > > > > > 2013-06-05 05:12:53,425 WARN
> >> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> >> > > Possibly
> >> > > > > > > > transient
> >> > > > > > > > > ZooKeeper exception:
> >> > > > > > > > >
> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> >> > > > > > hbase.rummycircle.com
> >> > > > > > > > > ,60020,1369877672964
> >> > > > > > > > > 2013-06-05 05:12:55,426 WARN
> >> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> >> > > Possibly
> >> > > > > > > > transient
> >> > > > > > > > > ZooKeeper exception:
> >> > > > > > > > >
> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> >> > > > > > hbase.rummycircle.com
> >> > > > > > > > > ,60020,1369877672964
> >> > > > > > > > > 2013-06-05 05:12:59,427 WARN
> >> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> >> > > Possibly
> >> > > > > > > > transient
> >> > > > > > > > > ZooKeeper exception:
> >> > > > > > > > >
> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> >> > > > > > hbase.rummycircle.com
> >> > > > > > > > > ,60020,1369877672964
> >> > > > > > > > > 2013-06-05 05:13:07,427 WARN
> >> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> >> > > Possibly
> >> > > > > > > > transient
> >> > > > > > > > > ZooKeeper exception:
> >> > > > > > > > >
> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> >> > > > > > hbase.rummycircle.com
> >> > > > > > > > > ,60020,1369877672964
> >> > > > > > > > > 2013-06-05 05:13:07,427 ERROR
> >> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> >> > > ZooKeeper
> >> > > > > > > delete
> >> > > > > > > > > failed after 3 retries
> >> > > > > > > > >
> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> >> > > > > > hbase.rummycircle.com
> >> > > > > > > > > ,60020,1369877672964
> >> > > > > > > > >     at
> >> > > > > > > > >
> >> > > > >
> >> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >> > > > > > > > >     at
> >> > > > > > > >
> >> > > >
> >>org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >> > > > > > > > > 2013-06-05 05:13:07,436 ERROR
> >> > org.apache.hadoop.hdfs.DFSClient:
> >> > > > > > > Exception
> >> > > > > > > > > closing file /hbase/.logs/hbase.rummycircle.com
> >> > > > > ,60020,1369877672964/
> >> > > > > > > > > hbase.rummycircle.com
> >> %2C60020%2C1369877672964.1370382721642
> >> > :
> >> > > > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010
> >>are
> >> > > bad.
> >> > > > > > > > > Aborting...
> >> > > > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010
> >>are
> >> > > bad.
> >> > > > > > > > > Aborting...
> >> > > > > > > > >     at
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >>org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFS
> >>Client.java:3096)
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > *HMaster logs:*
> >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > > > We
> >> > > > > > > > > slept 4702394ms instead of 10000ms, this is likely due
> >>to a
> >> > > long
> >> > > > > > > garbage
> >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > > > > > > >
> >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > > > We
> >> > > > > > > > > slept 4988731ms instead of 300000ms, this is likely due
> >>to
> >> a
> >> > > long
> >> > > > > > > garbage
> >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > > > > > > >
> >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > > > We
> >> > > > > > > > > slept 4988726ms instead of 300000ms, this is likely due
> >>to
> >> a
> >> > > long
> >> > > > > > > garbage
> >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > > > > > > >
> >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > > > We
> >> > > > > > > > > slept 4698291ms instead of 10000ms, this is likely due
> >>to a
> >> > > long
> >> > > > > > > garbage
> >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > > > > > > >
> >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > > > > > > > 2013-06-05 05:12:50,711 WARN
> >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > > > We
> >> > > > > > > > > slept 4694502ms instead of 1000ms, this is likely due
> >>to a
> >> > long
> >> > > > > > garbage
> >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > > > > > > >
> >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > > > > > > > 2013-06-05 05:12:50,714 WARN
> >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > > > We
> >> > > > > > > > > slept 4694492ms instead of 1000ms, this is likely due
> >>to a
> >> > long
> >> > > > > > garbage
> >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > > > > > > >
> >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > > > > > > > 2013-06-05 05:12:50,715 WARN
> >> > > > org.apache.hadoop.hbase.util.Sleeper:
> >> > > > > We
> >> > > > > > > > > slept 4695589ms instead of 60000ms, this is likely due
> >>to a
> >> > > long
> >> > > > > > > garbage
> >> > > > > > > > > collecting pause and it's usually bad, see
> >> > > > > > > > >
> >> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >> > > > > > > > > 2013-06-05 05:12:52,263 FATAL
> >> > > > > org.apache.hadoop.hbase.master.HMaster:
> >> > > > > > > > > Master server abort: loaded coprocessors are: []
> >> > > > > > > > > 2013-06-05 05:12:52,465 INFO
> >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> >> > > > > > > > > Waiting for region servers count to settle; currently
> >> checked
> >> > > in
> >> > > > 1,
> >> > > > > > > slept
> >> > > > > > > > > for 0 ms, expecting minimum of 1, maximum of 2147483647,
> >> > > timeout
> >> > > > of
> >> > > > > > > 4500
> >> > > > > > > > > ms, interval of 1500 ms.
> >> > > > > > > > > 2013-06-05 05:12:52,561 ERROR
> >> > > > > org.apache.hadoop.hbase.master.HMaster:
> >> > > > > > > > > Region server hbase.rummycircle.com,60020,1369877672964
> >> > > > reported a
> >> > > > > > > fatal
> >> > > > > > > > > error:
> >> > > > > > > > >
> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > > > > > > > KeeperErrorCode = Session expired
> >> > > > > > > > > 2013-06-05 05:12:53,970 INFO
> >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> >> > > > > > > > > Waiting for region servers count to settle; currently
> >> checked
> >> > > in
> >> > > > 1,
> >> > > > > > > slept
> >> > > > > > > > > for 1506 ms, expecting minimum of 1, maximum of
> >>2147483647,
> >> > > > timeout
> >> > > > > > of
> >> > > > > > > > 4500
> >> > > > > > > > > ms, interval of 1500 ms.
> >> > > > > > > > > 2013-06-05 05:12:55,476 INFO
> >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> >> > > > > > > > > Waiting for region servers count to settle; currently
> >> checked
> >> > > in
> >> > > > 1,
> >> > > > > > > slept
> >> > > > > > > > > for 3012 ms, expecting minimum of 1, maximum of
> >>2147483647,
> >> > > > timeout
> >> > > > > > of
> >> > > > > > > > 4500
> >> > > > > > > > > ms, interval of 1500 ms.
> >> > > > > > > > > 2013-06-05 05:12:56,981 INFO
> >> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> >> > > > > > > > > Finished waiting for region servers count to settle;
> >> checked
> >> > in
> >> > > > 1,
> >> > > > > > > slept
> >> > > > > > > > > for 4517 ms, expecting minimum of 1, maximum of
> >>2147483647,
> >> > > > master
> >> > > > > is
> >> > > > > > > > > running.
> >> > > > > > > > > 2013-06-05 05:12:57,019 INFO
> >> > > > > > > > > org.apache.hadoop.hbase.catalog.CatalogTracker: Failed
> >> > > > verification
> >> > > > > > of
> >> > > > > > > > > -ROOT-,,0 at address=hbase.rummycircle.com
> >> > > ,60020,1369877672964;
> >> > > > > > > > > java.io.EOFException
> >> > > > > > > > > 2013-06-05 05:17:52,302 WARN
> >> > > > > > > > > org.apache.hadoop.hbase.master.SplitLogManager: error
> >>while
> >> > > > > splitting
> >> > > > > > > > logs
> >> > > > > > > > > in [hdfs://
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >>
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-
> >>splitting
> >> > > > > > > > ]
> >> > > > > > > > > installed = 19 but only 0 done
> >> > > > > > > > > 2013-06-05 05:17:52,321 FATAL
> >> > > > > org.apache.hadoop.hbase.master.HMaster:
> >> > > > > > > > > master:60000-0x13ef31264d00000
> >> master:60000-0x13ef31264d00000
> >> > > > > > received
> >> > > > > > > > > expired from ZooKeeper, aborting
> >> > > > > > > > >
> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> > > > > > > > > KeeperErrorCode = Session expired
> >> > > > > > > > > java.io.IOException: Giving up after tries=1
> >> > > > > > > > > Caused by: java.lang.InterruptedException: sleep
> >> interrupted
> >> > > > > > > > > 2013-06-05 05:17:52,381 ERROR
> >> > > > > > > > > org.apache.hadoop.hbase.master.HMasterCommandLine:
> >>Failed
> >> to
> >> > > > start
> >> > > > > > > master
> >> > > > > > > > > java.lang.RuntimeException: HMaster Aborted
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > --
> >> > > > > > > > > Thanks and Regards,
> >> > > > > > > > > Vimal Jain
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Thanks and Regards,
> >> > > > > > > > Vimal Jain
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Thanks and Regards,
> >> > > > > > Vimal Jain
> >> > > > > >
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Thanks and Regards,
> >> > > > Vimal Jain
> >> > > >
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Thanks and Regards,
> >> > Vimal Jain
> >> >
> >>
> >
> >
> >
> >--
> >Thanks and Regards,
> >Vimal Jain
>
>


-- 
Kevin O'Dell
Systems Engineer, Cloudera

Re: HMaster and HRegionServer going down

Posted by Joseph Coleman <jo...@infinitecampus.com>.

Please remove me from this list


On 6/5/13 8:32 AM, "Vimal Jain" <vk...@gmail.com> wrote:

>Ok.
>I dont have any batch read/write to hbase.
>
>
>On Wed, Jun 5, 2013 at 6:08 PM, Azuryy Yu <az...@gmail.com> wrote:
>
>> gc log cannot get by default. need some configuration. do you have some
>> batch read or write to hbase?
>>
>> --Send from my Sony mobile.
>> On Jun 5, 2013 8:25 PM, "Vimal Jain" <vk...@gmail.com> wrote:
>>
>> > I dont have GC logs.Do you get it by default  or it has to be
>>configured
>> ?
>> > After i came to know about crash , i checked which all processes are
>> > running using "jps"
>> > It displayed 4 processes , "namenode","datanode","secondarynamenode"
>>and
>> > "HQuorumpeer".
>> > So i stopped dfs by running $HADOOP_HOME/bin/stop-dfs.sh and then i
>> stopped
>> > hbase by running $HBASE_HOME/bin/stop-hbase.sh
>> >
>> >
>> > On Wed, Jun 5, 2013 at 5:49 PM, Azuryy Yu <az...@gmail.com> wrote:
>> >
>> > > do you have GC log? and what you did during crash? and whats your gc
>> > > options?
>> > >
>> > > for the dn error, thats net work issue generally, because dn
>>received
>> an
>> > > incomplete packet.
>> > >
>> > > --Send from my Sony mobile.
>> > > On Jun 5, 2013 8:10 PM, "Vimal Jain" <vk...@gmail.com> wrote:
>> > >
>> > > > Yes.
>> > > > Thats true.
>> > > > There are some errors in all 3 logs during same period , i.e.
>>data ,
>> > > master
>> > > > and region.
>> > > > But i am unable to deduce the exact cause of error.
>> > > > Can you please help in detecting the problem ?
>> > > >
>> > > > So far i am suspecting following :-
>> > > > I have 1GB heap (default) allocated for all 3 processes , i.e.
>> > > > Master,Region,Zookeeper.
>> > > > Both  Master and Region took more time for GC ( as inferred from
>> lines
>> > in
>> > > > logs like "slept more time than configured one" etc ) .
>> > > > Due to this there was  zookeeper connection time out for both
>>Master
>> > and
>> > > > Region and hence both went down.
>> > > >
>> > > > I am newbie to Hbase and hence may be my findings are not correct.
>> > > > I want to be 100 % sure before increasing heap space for both
>>Master
>> > and
>> > > > Region ( Both around 2GB) to solve this.
>> > > > At present i have restarted the cluster with default heap space
>>only
>> (
>> > > 1GB
>> > > > ).
>> > > >
>> > > >
>> > > >
>> > > > On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <az...@gmail.com>
>> wrote:
>> > > >
>> > > > > there have errors in your dats node log, and the error time
>>match
>> > with
>> > > rs
>> > > > > log error time.
>> > > > >
>> > > > > --Send from my Sony mobile.
>> > > > > On Jun 5, 2013 5:06 PM, "Vimal Jain" <vk...@gmail.com> wrote:
>> > > > >
>> > > > > > I don't think so , as i dont find any issues in data node
>>logs.
>> > > > > > Also there are lot of exceptions like "session expired" ,
>>"slept
>> > more
>> > > > > than
>> > > > > > configured time" . what are these ?
>> > > > > >
>> > > > > >
>> > > > > > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com>
>> > > wrote:
>> > > > > >
>> > > > > > > Because your data node 192.168.20.30 broke down. which
>>leads to
>> > RS
>> > > > > down.
>> > > > > > >
>> > > > > > >
>> > > > > > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain
>><vk...@gmail.com>
>> > > wrote:
>> > > > > > >
>> > > > > > > > Here is the complete log:
>> > > > > > > >
>> > > > > > > > http://bin.cakephp.org/saved/103001 - Hregion
>> > > > > > > > http://bin.cakephp.org/saved/103000 - Hmaster
>> > > > > > > > http://bin.cakephp.org/saved/103002 - Datanode
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <
>> vkjk89@gmail.com>
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi,
>> > > > > > > > > I have set up Hbase in pseudo-distributed mode.
>> > > > > > > > > It was working fine for 6 days , but suddenly today
>>morning
>> > > both
>> > > > > > > HMaster
>> > > > > > > > > and Hregion process went down.
>> > > > > > > > > I checked in logs of both hadoop and hbase.
>> > > > > > > > > Please help here.
>> > > > > > > > > Here are the snippets :-
>> > > > > > > > >
>> > > > > > > > > *Datanode logs:*
>> > > > > > > > > 2013-06-05 05:12:51,436 INFO
>> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
>>Exception
>> in
>> > > > > > > > receiveBlock
>> > > > > > > > > for block blk_1597245478875608321_2818
>> java.io.EOFException:
>> > > > while
>> > > > > > > trying
>> > > > > > > > > to read 2347 bytes
>> > > > > > > > > 2013-06-05 05:12:51,442 INFO
>> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
>>writeBlock
>> > > > > > > > > blk_1597245478875608321_2818 received exception
>> > > > > java.io.EOFException:
>> > > > > > > > while
>> > > > > > > > > trying to read 2347 bytes
>> > > > > > > > > 2013-06-05 05:12:51,442 ERROR
>> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
>> > > > > > DatanodeRegistration(
>> > > > > > > > > 192.168.20.30:50010,
>> > > > > > > > > 
>>storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
>> > > > > > > > infoPort=50075,
>> > > > > > > > > ipcPort=50020):DataXceiver
>> > > > > > > > > java.io.EOFException: while trying to read 2347 bytes
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > *HRegion logs:*
>> > > > > > > > > 2013-06-05 05:12:50,701 WARN
>> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > > > We
>> > > > > > > > > slept 4694929ms instead of 3000ms, this is likely due
>>to a
>> > long
>> > > > > > garbage
>> > > > > > > > > collecting pause and it's usually bad, see
>> > > > > > > > >
>> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > > > > > > > 2013-06-05 05:12:51,045 WARN
>> > org.apache.hadoop.hdfs.DFSClient:
>> > > > > > > > > DFSOutputStream ResponseProcessor exception  for block
>> > > > > > > > >
>> blk_1597245478875608321_2818java.net.SocketTimeoutException:
>> > > > 63000
>> > > > > > > millis
>> > > > > > > > > timeout while waiting for channel to be ready for read.
>>ch
>> :
>> > > > > > > > > java.nio.channels.SocketChannel[connected local=/
>> > > > > 192.168.20.30:44333
>> > > > > > > > remote=/
>> > > > > > > > > 192.168.20.30:50010]
>> > > > > > > > > 2013-06-05 05:12:51,046 WARN
>> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > > > We
>> > > > > > > > > slept 11695345ms instead of 10000000ms, this is likely
>>due
>> > to a
>> > > > > long
>> > > > > > > > > garbage collecting pause and it's usually bad, see
>> > > > > > > > >
>> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > > > > > > > 2013-06-05 05:12:51,048 WARN
>> > org.apache.hadoop.hdfs.DFSClient:
>> > > > > Error
>> > > > > > > > > Recovery for block blk_1597245478875608321_2818 bad
>> > datanode[0]
>> > > > > > > > > 192.168.20.30:50010
>> > > > > > > > > 2013-06-05 05:12:51,075 WARN
>> > org.apache.hadoop.hdfs.DFSClient:
>> > > > > Error
>> > > > > > > > while
>> > > > > > > > > syncing
>> > > > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010
>>are
>> > > bad.
>> > > > > > > > > Aborting...
>> > > > > > > > >     at
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> 
>>org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFS
>>Client.java:3096)
>> > > > > > > > > 2013-06-05 05:12:51,110 FATAL
>> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not
>> > sync.
>> > > > > > > Requesting
>> > > > > > > > > close of hlog
>> > > > > > > > > java.io.IOException: Reflection
>> > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
>> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
>>closed
>> > > > > > > > > 2013-06-05 05:12:51,180 FATAL
>> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not
>> > sync.
>> > > > > > > Requesting
>> > > > > > > > > close of hlog
>> > > > > > > > > java.io.IOException: Reflection
>> > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
>> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
>>closed
>> > > > > > > > > 2013-06-05 05:12:51,183 ERROR
>> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Failed
>>close
>> > of
>> > > > HLog
>> > > > > > > > writer
>> > > > > > > > > java.io.IOException: Reflection
>> > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
>> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is
>>closed
>> > > > > > > > > 2013-06-05 05:12:51,184 WARN
>> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Riding
>>over
>> > HLog
>> > > > > close
>> > > > > > > > > failure! error count=1
>> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
>> > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> ABORTING
>> > > > region
>> > > > > > > > server
>> > > > > > > > > hbase.rummycircle.com,60020,1369877672964:
>> > > > > > > > > regionserver:60020-0x13ef31264d00001
>> > > > > > > regionserver:60020-0x13ef31264d00001
>> > > > > > > > > received expired from ZooKeeper, aborting
>> > > > > > > > >
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > > > > > > > KeeperErrorCode = Session expired
>> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
>> > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
>> > > RegionServer
>> > > > > > abort:
>> > > > > > > > > loaded coprocessors are: []
>> > > > > > > > > 2013-06-05 05:12:52,621 INFO
>> > > > > > > > > org.apache.hadoop.hbase.regionserver.SplitLogWorker:
>> > > > SplitLogWorker
>> > > > > > > > > interrupted while waiting for task, exiting:
>> > > > > > > > java.lang.InterruptedException
>> > > > > > > > > java.io.InterruptedIOException: Aborting compaction of
>> store
>> > > > > cfp_info
>> > > > > > > in
>> > > > > > > > > region
>> > > > > > >
>> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
>> > > > > > > > > because user requested stop.
>> > > > > > > > > 2013-06-05 05:12:53,425 WARN
>> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>> > > Possibly
>> > > > > > > > transient
>> > > > > > > > > ZooKeeper exception:
>> > > > > > > > >
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
>> > > > > > hbase.rummycircle.com
>> > > > > > > > > ,60020,1369877672964
>> > > > > > > > > 2013-06-05 05:12:55,426 WARN
>> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>> > > Possibly
>> > > > > > > > transient
>> > > > > > > > > ZooKeeper exception:
>> > > > > > > > >
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
>> > > > > > hbase.rummycircle.com
>> > > > > > > > > ,60020,1369877672964
>> > > > > > > > > 2013-06-05 05:12:59,427 WARN
>> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>> > > Possibly
>> > > > > > > > transient
>> > > > > > > > > ZooKeeper exception:
>> > > > > > > > >
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
>> > > > > > hbase.rummycircle.com
>> > > > > > > > > ,60020,1369877672964
>> > > > > > > > > 2013-06-05 05:13:07,427 WARN
>> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>> > > Possibly
>> > > > > > > > transient
>> > > > > > > > > ZooKeeper exception:
>> > > > > > > > >
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
>> > > > > > hbase.rummycircle.com
>> > > > > > > > > ,60020,1369877672964
>> > > > > > > > > 2013-06-05 05:13:07,427 ERROR
>> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
>> > > ZooKeeper
>> > > > > > > delete
>> > > > > > > > > failed after 3 retries
>> > > > > > > > >
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
>> > > > > > hbase.rummycircle.com
>> > > > > > > > > ,60020,1369877672964
>> > > > > > > > >     at
>> > > > > > > > >
>> > > > >
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>> > > > > > > > >     at
>> > > > > > > >
>> > > > 
>>org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>> > > > > > > > > 2013-06-05 05:13:07,436 ERROR
>> > org.apache.hadoop.hdfs.DFSClient:
>> > > > > > > Exception
>> > > > > > > > > closing file /hbase/.logs/hbase.rummycircle.com
>> > > > > ,60020,1369877672964/
>> > > > > > > > > hbase.rummycircle.com
>> %2C60020%2C1369877672964.1370382721642
>> > :
>> > > > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010
>>are
>> > > bad.
>> > > > > > > > > Aborting...
>> > > > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010
>>are
>> > > bad.
>> > > > > > > > > Aborting...
>> > > > > > > > >     at
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> 
>>org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFS
>>Client.java:3096)
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > *HMaster logs:*
>> > > > > > > > > 2013-06-05 05:12:50,701 WARN
>> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > > > We
>> > > > > > > > > slept 4702394ms instead of 10000ms, this is likely due
>>to a
>> > > long
>> > > > > > > garbage
>> > > > > > > > > collecting pause and it's usually bad, see
>> > > > > > > > >
>> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > > > > > > > 2013-06-05 05:12:50,701 WARN
>> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > > > We
>> > > > > > > > > slept 4988731ms instead of 300000ms, this is likely due
>>to
>> a
>> > > long
>> > > > > > > garbage
>> > > > > > > > > collecting pause and it's usually bad, see
>> > > > > > > > >
>> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > > > > > > > 2013-06-05 05:12:50,701 WARN
>> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > > > We
>> > > > > > > > > slept 4988726ms instead of 300000ms, this is likely due
>>to
>> a
>> > > long
>> > > > > > > garbage
>> > > > > > > > > collecting pause and it's usually bad, see
>> > > > > > > > >
>> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > > > > > > > 2013-06-05 05:12:50,701 WARN
>> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > > > We
>> > > > > > > > > slept 4698291ms instead of 10000ms, this is likely due
>>to a
>> > > long
>> > > > > > > garbage
>> > > > > > > > > collecting pause and it's usually bad, see
>> > > > > > > > >
>> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > > > > > > > 2013-06-05 05:12:50,711 WARN
>> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > > > We
>> > > > > > > > > slept 4694502ms instead of 1000ms, this is likely due
>>to a
>> > long
>> > > > > > garbage
>> > > > > > > > > collecting pause and it's usually bad, see
>> > > > > > > > >
>> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > > > > > > > 2013-06-05 05:12:50,714 WARN
>> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > > > We
>> > > > > > > > > slept 4694492ms instead of 1000ms, this is likely due
>>to a
>> > long
>> > > > > > garbage
>> > > > > > > > > collecting pause and it's usually bad, see
>> > > > > > > > >
>> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > > > > > > > 2013-06-05 05:12:50,715 WARN
>> > > > org.apache.hadoop.hbase.util.Sleeper:
>> > > > > We
>> > > > > > > > > slept 4695589ms instead of 60000ms, this is likely due
>>to a
>> > > long
>> > > > > > > garbage
>> > > > > > > > > collecting pause and it's usually bad, see
>> > > > > > > > >
>> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> > > > > > > > > 2013-06-05 05:12:52,263 FATAL
>> > > > > org.apache.hadoop.hbase.master.HMaster:
>> > > > > > > > > Master server abort: loaded coprocessors are: []
>> > > > > > > > > 2013-06-05 05:12:52,465 INFO
>> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
>> > > > > > > > > Waiting for region servers count to settle; currently
>> checked
>> > > in
>> > > > 1,
>> > > > > > > slept
>> > > > > > > > > for 0 ms, expecting minimum of 1, maximum of 2147483647,
>> > > timeout
>> > > > of
>> > > > > > > 4500
>> > > > > > > > > ms, interval of 1500 ms.
>> > > > > > > > > 2013-06-05 05:12:52,561 ERROR
>> > > > > org.apache.hadoop.hbase.master.HMaster:
>> > > > > > > > > Region server hbase.rummycircle.com,60020,1369877672964
>> > > > reported a
>> > > > > > > fatal
>> > > > > > > > > error:
>> > > > > > > > >
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > > > > > > > KeeperErrorCode = Session expired
>> > > > > > > > > 2013-06-05 05:12:53,970 INFO
>> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
>> > > > > > > > > Waiting for region servers count to settle; currently
>> checked
>> > > in
>> > > > 1,
>> > > > > > > slept
>> > > > > > > > > for 1506 ms, expecting minimum of 1, maximum of
>>2147483647,
>> > > > timeout
>> > > > > > of
>> > > > > > > > 4500
>> > > > > > > > > ms, interval of 1500 ms.
>> > > > > > > > > 2013-06-05 05:12:55,476 INFO
>> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
>> > > > > > > > > Waiting for region servers count to settle; currently
>> checked
>> > > in
>> > > > 1,
>> > > > > > > slept
>> > > > > > > > > for 3012 ms, expecting minimum of 1, maximum of
>>2147483647,
>> > > > timeout
>> > > > > > of
>> > > > > > > > 4500
>> > > > > > > > > ms, interval of 1500 ms.
>> > > > > > > > > 2013-06-05 05:12:56,981 INFO
>> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
>> > > > > > > > > Finished waiting for region servers count to settle;
>> checked
>> > in
>> > > > 1,
>> > > > > > > slept
>> > > > > > > > > for 4517 ms, expecting minimum of 1, maximum of
>>2147483647,
>> > > > master
>> > > > > is
>> > > > > > > > > running.
>> > > > > > > > > 2013-06-05 05:12:57,019 INFO
>> > > > > > > > > org.apache.hadoop.hbase.catalog.CatalogTracker: Failed
>> > > > verification
>> > > > > > of
>> > > > > > > > > -ROOT-,,0 at address=hbase.rummycircle.com
>> > > ,60020,1369877672964;
>> > > > > > > > > java.io.EOFException
>> > > > > > > > > 2013-06-05 05:17:52,302 WARN
>> > > > > > > > > org.apache.hadoop.hbase.master.SplitLogManager: error
>>while
>> > > > > splitting
>> > > > > > > > logs
>> > > > > > > > > in [hdfs://
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> 
>>192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-
>>splitting
>> > > > > > > > ]
>> > > > > > > > > installed = 19 but only 0 done
>> > > > > > > > > 2013-06-05 05:17:52,321 FATAL
>> > > > > org.apache.hadoop.hbase.master.HMaster:
>> > > > > > > > > master:60000-0x13ef31264d00000
>> master:60000-0x13ef31264d00000
>> > > > > > received
>> > > > > > > > > expired from ZooKeeper, aborting
>> > > > > > > > >
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > > > > > > > > KeeperErrorCode = Session expired
>> > > > > > > > > java.io.IOException: Giving up after tries=1
>> > > > > > > > > Caused by: java.lang.InterruptedException: sleep
>> interrupted
>> > > > > > > > > 2013-06-05 05:17:52,381 ERROR
>> > > > > > > > > org.apache.hadoop.hbase.master.HMasterCommandLine:
>>Failed
>> to
>> > > > start
>> > > > > > > master
>> > > > > > > > > java.lang.RuntimeException: HMaster Aborted
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Thanks and Regards,
>> > > > > > > > > Vimal Jain
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > > Thanks and Regards,
>> > > > > > > > Vimal Jain
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Thanks and Regards,
>> > > > > > Vimal Jain
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Thanks and Regards,
>> > > > Vimal Jain
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Thanks and Regards,
>> > Vimal Jain
>> >
>>
>
>
>
>-- 
>Thanks and Regards,
>Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Vimal Jain <vk...@gmail.com>.

Ok.
I dont have any batch read/write to hbase.


On Wed, Jun 5, 2013 at 6:08 PM, Azuryy Yu <az...@gmail.com> wrote:

> gc log cannot get by default. need some configuration. do you have some
> batch read or write to hbase?
>
> --Send from my Sony mobile.
> On Jun 5, 2013 8:25 PM, "Vimal Jain" <vk...@gmail.com> wrote:
>
> > I dont have GC logs.Do you get it by default  or it has to be configured
> ?
> > After i came to know about crash , i checked which all processes are
> > running using "jps"
> > It displayed 4 processes , "namenode","datanode","secondarynamenode" and
> > "HQuorumpeer".
> > So i stopped dfs by running $HADOOP_HOME/bin/stop-dfs.sh and then i
> stopped
> > hbase by running $HBASE_HOME/bin/stop-hbase.sh
> >
> >
> > On Wed, Jun 5, 2013 at 5:49 PM, Azuryy Yu <az...@gmail.com> wrote:
> >
> > > do you have GC log? and what you did during crash? and whats your gc
> > > options?
> > >
> > > for the dn error, thats net work issue generally, because dn received
> an
> > > incomplete packet.
> > >
> > > --Send from my Sony mobile.
> > > On Jun 5, 2013 8:10 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> > >
> > > > Yes.
> > > > Thats true.
> > > > There are some errors in all 3 logs during same period , i.e. data ,
> > > master
> > > > and region.
> > > > But i am unable to deduce the exact cause of error.
> > > > Can you please help in detecting the problem ?
> > > >
> > > > So far i am suspecting following :-
> > > > I have 1GB heap (default) allocated for all 3 processes , i.e.
> > > > Master,Region,Zookeeper.
> > > > Both  Master and Region took more time for GC ( as inferred from
> lines
> > in
> > > > logs like "slept more time than configured one" etc ) .
> > > > Due to this there was  zookeeper connection time out for both Master
> > and
> > > > Region and hence both went down.
> > > >
> > > > I am newbie to Hbase and hence may be my findings are not correct.
> > > > I want to be 100 % sure before increasing heap space for both Master
> > and
> > > > Region ( Both around 2GB) to solve this.
> > > > At present i have restarted the cluster with default heap space only
> (
> > > 1GB
> > > > ).
> > > >
> > > >
> > > >
> > > > On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <az...@gmail.com>
> wrote:
> > > >
> > > > > there have errors in your dats node log, and the error time match
> > with
> > > rs
> > > > > log error time.
> > > > >
> > > > > --Send from my Sony mobile.
> > > > > On Jun 5, 2013 5:06 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> > > > >
> > > > > > I don't think so , as i dont find any issues in data node logs.
> > > > > > Also there are lot of exceptions like "session expired" , "slept
> > more
> > > > > than
> > > > > > configured time" . what are these ?
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Because your data node 192.168.20.30 broke down. which leads to
> > RS
> > > > > down.
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > > Here is the complete log:
> > > > > > > >
> > > > > > > > http://bin.cakephp.org/saved/103001 - Hregion
> > > > > > > > http://bin.cakephp.org/saved/103000 - Hmaster
> > > > > > > > http://bin.cakephp.org/saved/103002 - Datanode
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <
> vkjk89@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > > I have set up Hbase in pseudo-distributed mode.
> > > > > > > > > It was working fine for 6 days , but suddenly today morning
> > > both
> > > > > > > HMaster
> > > > > > > > > and Hregion process went down.
> > > > > > > > > I checked in logs of both hadoop and hbase.
> > > > > > > > > Please help here.
> > > > > > > > > Here are the snippets :-
> > > > > > > > >
> > > > > > > > > *Datanode logs:*
> > > > > > > > > 2013-06-05 05:12:51,436 INFO
> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception
> in
> > > > > > > > receiveBlock
> > > > > > > > > for block blk_1597245478875608321_2818
> java.io.EOFException:
> > > > while
> > > > > > > trying
> > > > > > > > > to read 2347 bytes
> > > > > > > > > 2013-06-05 05:12:51,442 INFO
> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > > > > > > > > blk_1597245478875608321_2818 received exception
> > > > > java.io.EOFException:
> > > > > > > > while
> > > > > > > > > trying to read 2347 bytes
> > > > > > > > > 2013-06-05 05:12:51,442 ERROR
> > > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > > > DatanodeRegistration(
> > > > > > > > > 192.168.20.30:50010,
> > > > > > > > > storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> > > > > > > > infoPort=50075,
> > > > > > > > > ipcPort=50020):DataXceiver
> > > > > > > > > java.io.EOFException: while trying to read 2347 bytes
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *HRegion logs:*
> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > > > We
> > > > > > > > > slept 4694929ms instead of 3000ms, this is likely due to a
> > long
> > > > > > garbage
> > > > > > > > > collecting pause and it's usually bad, see
> > > > > > > > >
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > > 2013-06-05 05:12:51,045 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > > > > > > > > DFSOutputStream ResponseProcessor exception  for block
> > > > > > > > >
> blk_1597245478875608321_2818java.net.SocketTimeoutException:
> > > > 63000
> > > > > > > millis
> > > > > > > > > timeout while waiting for channel to be ready for read. ch
> :
> > > > > > > > > java.nio.channels.SocketChannel[connected local=/
> > > > > 192.168.20.30:44333
> > > > > > > > remote=/
> > > > > > > > > 192.168.20.30:50010]
> > > > > > > > > 2013-06-05 05:12:51,046 WARN
> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > > > We
> > > > > > > > > slept 11695345ms instead of 10000000ms, this is likely due
> > to a
> > > > > long
> > > > > > > > > garbage collecting pause and it's usually bad, see
> > > > > > > > >
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > > 2013-06-05 05:12:51,048 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > > > > Error
> > > > > > > > > Recovery for block blk_1597245478875608321_2818 bad
> > datanode[0]
> > > > > > > > > 192.168.20.30:50010
> > > > > > > > > 2013-06-05 05:12:51,075 WARN
> > org.apache.hadoop.hdfs.DFSClient:
> > > > > Error
> > > > > > > > while
> > > > > > > > > syncing
> > > > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are
> > > bad.
> > > > > > > > > Aborting...
> > > > > > > > >     at
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > > > > > > 2013-06-05 05:12:51,110 FATAL
> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not
> > sync.
> > > > > > > Requesting
> > > > > > > > > close of hlog
> > > > > > > > > java.io.IOException: Reflection
> > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > > > > 2013-06-05 05:12:51,180 FATAL
> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not
> > sync.
> > > > > > > Requesting
> > > > > > > > > close of hlog
> > > > > > > > > java.io.IOException: Reflection
> > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > > > > 2013-06-05 05:12:51,183 ERROR
> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close
> > of
> > > > HLog
> > > > > > > > writer
> > > > > > > > > java.io.IOException: Reflection
> > > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > > > > 2013-06-05 05:12:51,184 WARN
> > > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over
> > HLog
> > > > > close
> > > > > > > > > failure! error count=1
> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
> > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> ABORTING
> > > > region
> > > > > > > > server
> > > > > > > > > hbase.rummycircle.com,60020,1369877672964:
> > > > > > > > > regionserver:60020-0x13ef31264d00001
> > > > > > > regionserver:60020-0x13ef31264d00001
> > > > > > > > > received expired from ZooKeeper, aborting
> > > > > > > > >
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > > KeeperErrorCode = Session expired
> > > > > > > > > 2013-06-05 05:12:52,557 FATAL
> > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > RegionServer
> > > > > > abort:
> > > > > > > > > loaded coprocessors are: []
> > > > > > > > > 2013-06-05 05:12:52,621 INFO
> > > > > > > > > org.apache.hadoop.hbase.regionserver.SplitLogWorker:
> > > > SplitLogWorker
> > > > > > > > > interrupted while waiting for task, exiting:
> > > > > > > > java.lang.InterruptedException
> > > > > > > > > java.io.InterruptedIOException: Aborting compaction of
> store
> > > > > cfp_info
> > > > > > > in
> > > > > > > > > region
> > > > > > >
> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> > > > > > > > > because user requested stop.
> > > > > > > > > 2013-06-05 05:12:53,425 WARN
> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > > Possibly
> > > > > > > > transient
> > > > > > > > > ZooKeeper exception:
> > > > > > > > >
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > > > hbase.rummycircle.com
> > > > > > > > > ,60020,1369877672964
> > > > > > > > > 2013-06-05 05:12:55,426 WARN
> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > > Possibly
> > > > > > > > transient
> > > > > > > > > ZooKeeper exception:
> > > > > > > > >
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > > > hbase.rummycircle.com
> > > > > > > > > ,60020,1369877672964
> > > > > > > > > 2013-06-05 05:12:59,427 WARN
> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > > Possibly
> > > > > > > > transient
> > > > > > > > > ZooKeeper exception:
> > > > > > > > >
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > > > hbase.rummycircle.com
> > > > > > > > > ,60020,1369877672964
> > > > > > > > > 2013-06-05 05:13:07,427 WARN
> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > > Possibly
> > > > > > > > transient
> > > > > > > > > ZooKeeper exception:
> > > > > > > > >
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > > > hbase.rummycircle.com
> > > > > > > > > ,60020,1369877672964
> > > > > > > > > 2013-06-05 05:13:07,427 ERROR
> > > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > > ZooKeeper
> > > > > > > delete
> > > > > > > > > failed after 3 retries
> > > > > > > > >
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > > > hbase.rummycircle.com
> > > > > > > > > ,60020,1369877672964
> > > > > > > > >     at
> > > > > > > > >
> > > > >
> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> > > > > > > > >     at
> > > > > > > >
> > > > org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > > > > 2013-06-05 05:13:07,436 ERROR
> > org.apache.hadoop.hdfs.DFSClient:
> > > > > > > Exception
> > > > > > > > > closing file /hbase/.logs/hbase.rummycircle.com
> > > > > ,60020,1369877672964/
> > > > > > > > > hbase.rummycircle.com
> %2C60020%2C1369877672964.1370382721642
> > :
> > > > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are
> > > bad.
> > > > > > > > > Aborting...
> > > > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are
> > > bad.
> > > > > > > > > Aborting...
> > > > > > > > >     at
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *HMaster logs:*
> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > > > We
> > > > > > > > > slept 4702394ms instead of 10000ms, this is likely due to a
> > > long
> > > > > > > garbage
> > > > > > > > > collecting pause and it's usually bad, see
> > > > > > > > >
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > > > We
> > > > > > > > > slept 4988731ms instead of 300000ms, this is likely due to
> a
> > > long
> > > > > > > garbage
> > > > > > > > > collecting pause and it's usually bad, see
> > > > > > > > >
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > > > We
> > > > > > > > > slept 4988726ms instead of 300000ms, this is likely due to
> a
> > > long
> > > > > > > garbage
> > > > > > > > > collecting pause and it's usually bad, see
> > > > > > > > >
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > > > We
> > > > > > > > > slept 4698291ms instead of 10000ms, this is likely due to a
> > > long
> > > > > > > garbage
> > > > > > > > > collecting pause and it's usually bad, see
> > > > > > > > >
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > > 2013-06-05 05:12:50,711 WARN
> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > > > We
> > > > > > > > > slept 4694502ms instead of 1000ms, this is likely due to a
> > long
> > > > > > garbage
> > > > > > > > > collecting pause and it's usually bad, see
> > > > > > > > >
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > > 2013-06-05 05:12:50,714 WARN
> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > > > We
> > > > > > > > > slept 4694492ms instead of 1000ms, this is likely due to a
> > long
> > > > > > garbage
> > > > > > > > > collecting pause and it's usually bad, see
> > > > > > > > >
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > > 2013-06-05 05:12:50,715 WARN
> > > > org.apache.hadoop.hbase.util.Sleeper:
> > > > > We
> > > > > > > > > slept 4695589ms instead of 60000ms, this is likely due to a
> > > long
> > > > > > > garbage
> > > > > > > > > collecting pause and it's usually bad, see
> > > > > > > > >
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > > 2013-06-05 05:12:52,263 FATAL
> > > > > org.apache.hadoop.hbase.master.HMaster:
> > > > > > > > > Master server abort: loaded coprocessors are: []
> > > > > > > > > 2013-06-05 05:12:52,465 INFO
> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > > > > Waiting for region servers count to settle; currently
> checked
> > > in
> > > > 1,
> > > > > > > slept
> > > > > > > > > for 0 ms, expecting minimum of 1, maximum of 2147483647,
> > > timeout
> > > > of
> > > > > > > 4500
> > > > > > > > > ms, interval of 1500 ms.
> > > > > > > > > 2013-06-05 05:12:52,561 ERROR
> > > > > org.apache.hadoop.hbase.master.HMaster:
> > > > > > > > > Region server hbase.rummycircle.com,60020,1369877672964
> > > > reported a
> > > > > > > fatal
> > > > > > > > > error:
> > > > > > > > >
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > > KeeperErrorCode = Session expired
> > > > > > > > > 2013-06-05 05:12:53,970 INFO
> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > > > > Waiting for region servers count to settle; currently
> checked
> > > in
> > > > 1,
> > > > > > > slept
> > > > > > > > > for 1506 ms, expecting minimum of 1, maximum of 2147483647,
> > > > timeout
> > > > > > of
> > > > > > > > 4500
> > > > > > > > > ms, interval of 1500 ms.
> > > > > > > > > 2013-06-05 05:12:55,476 INFO
> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > > > > Waiting for region servers count to settle; currently
> checked
> > > in
> > > > 1,
> > > > > > > slept
> > > > > > > > > for 3012 ms, expecting minimum of 1, maximum of 2147483647,
> > > > timeout
> > > > > > of
> > > > > > > > 4500
> > > > > > > > > ms, interval of 1500 ms.
> > > > > > > > > 2013-06-05 05:12:56,981 INFO
> > > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > > > > Finished waiting for region servers count to settle;
> checked
> > in
> > > > 1,
> > > > > > > slept
> > > > > > > > > for 4517 ms, expecting minimum of 1, maximum of 2147483647,
> > > > master
> > > > > is
> > > > > > > > > running.
> > > > > > > > > 2013-06-05 05:12:57,019 INFO
> > > > > > > > > org.apache.hadoop.hbase.catalog.CatalogTracker: Failed
> > > > verification
> > > > > > of
> > > > > > > > > -ROOT-,,0 at address=hbase.rummycircle.com
> > > ,60020,1369877672964;
> > > > > > > > > java.io.EOFException
> > > > > > > > > 2013-06-05 05:17:52,302 WARN
> > > > > > > > > org.apache.hadoop.hbase.master.SplitLogManager: error while
> > > > > splitting
> > > > > > > > logs
> > > > > > > > > in [hdfs://
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
> > > > > > > > ]
> > > > > > > > > installed = 19 but only 0 done
> > > > > > > > > 2013-06-05 05:17:52,321 FATAL
> > > > > org.apache.hadoop.hbase.master.HMaster:
> > > > > > > > > master:60000-0x13ef31264d00000
> master:60000-0x13ef31264d00000
> > > > > > received
> > > > > > > > > expired from ZooKeeper, aborting
> > > > > > > > >
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > > KeeperErrorCode = Session expired
> > > > > > > > > java.io.IOException: Giving up after tries=1
> > > > > > > > > Caused by: java.lang.InterruptedException: sleep
> interrupted
> > > > > > > > > 2013-06-05 05:17:52,381 ERROR
> > > > > > > > > org.apache.hadoop.hbase.master.HMasterCommandLine: Failed
> to
> > > > start
> > > > > > > master
> > > > > > > > > java.lang.RuntimeException: HMaster Aborted
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Thanks and Regards,
> > > > > > > > > Vimal Jain
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Thanks and Regards,
> > > > > > > > Vimal Jain
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks and Regards,
> > > > > > Vimal Jain
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks and Regards,
> > > > Vimal Jain
> > > >
> > >
> >
> >
> >
> > --
> > Thanks and Regards,
> > Vimal Jain
> >
>



-- 
Thanks and Regards,
Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Azuryy Yu <az...@gmail.com>.

gc log cannot get by default. need some configuration. do you have some
batch read or write to hbase?

--Send from my Sony mobile.
On Jun 5, 2013 8:25 PM, "Vimal Jain" <vk...@gmail.com> wrote:

> I dont have GC logs.Do you get it by default  or it has to be configured ?
> After i came to know about crash , i checked which all processes are
> running using "jps"
> It displayed 4 processes , "namenode","datanode","secondarynamenode" and
> "HQuorumpeer".
> So i stopped dfs by running $HADOOP_HOME/bin/stop-dfs.sh and then i stopped
> hbase by running $HBASE_HOME/bin/stop-hbase.sh
>
>
> On Wed, Jun 5, 2013 at 5:49 PM, Azuryy Yu <az...@gmail.com> wrote:
>
> > do you have GC log? and what you did during crash? and whats your gc
> > options?
> >
> > for the dn error, thats net work issue generally, because dn received an
> > incomplete packet.
> >
> > --Send from my Sony mobile.
> > On Jun 5, 2013 8:10 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> >
> > > Yes.
> > > Thats true.
> > > There are some errors in all 3 logs during same period , i.e. data ,
> > master
> > > and region.
> > > But i am unable to deduce the exact cause of error.
> > > Can you please help in detecting the problem ?
> > >
> > > So far i am suspecting following :-
> > > I have 1GB heap (default) allocated for all 3 processes , i.e.
> > > Master,Region,Zookeeper.
> > > Both  Master and Region took more time for GC ( as inferred from lines
> in
> > > logs like "slept more time than configured one" etc ) .
> > > Due to this there was  zookeeper connection time out for both Master
> and
> > > Region and hence both went down.
> > >
> > > I am newbie to Hbase and hence may be my findings are not correct.
> > > I want to be 100 % sure before increasing heap space for both Master
> and
> > > Region ( Both around 2GB) to solve this.
> > > At present i have restarted the cluster with default heap space only (
> > 1GB
> > > ).
> > >
> > >
> > >
> > > On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <az...@gmail.com> wrote:
> > >
> > > > there have errors in your dats node log, and the error time match
> with
> > rs
> > > > log error time.
> > > >
> > > > --Send from my Sony mobile.
> > > > On Jun 5, 2013 5:06 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> > > >
> > > > > I don't think so , as i dont find any issues in data node logs.
> > > > > Also there are lot of exceptions like "session expired" , "slept
> more
> > > > than
> > > > > configured time" . what are these ?
> > > > >
> > > > >
> > > > > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com>
> > wrote:
> > > > >
> > > > > > Because your data node 192.168.20.30 broke down. which leads to
> RS
> > > > down.
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com>
> > wrote:
> > > > > >
> > > > > > > Here is the complete log:
> > > > > > >
> > > > > > > http://bin.cakephp.org/saved/103001 - Hregion
> > > > > > > http://bin.cakephp.org/saved/103000 - Hmaster
> > > > > > > http://bin.cakephp.org/saved/103002 - Datanode
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > > I have set up Hbase in pseudo-distributed mode.
> > > > > > > > It was working fine for 6 days , but suddenly today morning
> > both
> > > > > > HMaster
> > > > > > > > and Hregion process went down.
> > > > > > > > I checked in logs of both hadoop and hbase.
> > > > > > > > Please help here.
> > > > > > > > Here are the snippets :-
> > > > > > > >
> > > > > > > > *Datanode logs:*
> > > > > > > > 2013-06-05 05:12:51,436 INFO
> > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> > > > > > > receiveBlock
> > > > > > > > for block blk_1597245478875608321_2818 java.io.EOFException:
> > > while
> > > > > > trying
> > > > > > > > to read 2347 bytes
> > > > > > > > 2013-06-05 05:12:51,442 INFO
> > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > > > > > > > blk_1597245478875608321_2818 received exception
> > > > java.io.EOFException:
> > > > > > > while
> > > > > > > > trying to read 2347 bytes
> > > > > > > > 2013-06-05 05:12:51,442 ERROR
> > > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > > DatanodeRegistration(
> > > > > > > > 192.168.20.30:50010,
> > > > > > > > storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> > > > > > > infoPort=50075,
> > > > > > > > ipcPort=50020):DataXceiver
> > > > > > > > java.io.EOFException: while trying to read 2347 bytes
> > > > > > > >
> > > > > > > >
> > > > > > > > *HRegion logs:*
> > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > org.apache.hadoop.hbase.util.Sleeper:
> > > > We
> > > > > > > > slept 4694929ms instead of 3000ms, this is likely due to a
> long
> > > > > garbage
> > > > > > > > collecting pause and it's usually bad, see
> > > > > > > >
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > 2013-06-05 05:12:51,045 WARN
> org.apache.hadoop.hdfs.DFSClient:
> > > > > > > > DFSOutputStream ResponseProcessor exception  for block
> > > > > > > > blk_1597245478875608321_2818java.net.SocketTimeoutException:
> > > 63000
> > > > > > millis
> > > > > > > > timeout while waiting for channel to be ready for read. ch :
> > > > > > > > java.nio.channels.SocketChannel[connected local=/
> > > > 192.168.20.30:44333
> > > > > > > remote=/
> > > > > > > > 192.168.20.30:50010]
> > > > > > > > 2013-06-05 05:12:51,046 WARN
> > > org.apache.hadoop.hbase.util.Sleeper:
> > > > We
> > > > > > > > slept 11695345ms instead of 10000000ms, this is likely due
> to a
> > > > long
> > > > > > > > garbage collecting pause and it's usually bad, see
> > > > > > > >
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > 2013-06-05 05:12:51,048 WARN
> org.apache.hadoop.hdfs.DFSClient:
> > > > Error
> > > > > > > > Recovery for block blk_1597245478875608321_2818 bad
> datanode[0]
> > > > > > > > 192.168.20.30:50010
> > > > > > > > 2013-06-05 05:12:51,075 WARN
> org.apache.hadoop.hdfs.DFSClient:
> > > > Error
> > > > > > > while
> > > > > > > > syncing
> > > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are
> > bad.
> > > > > > > > Aborting...
> > > > > > > >     at
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > > > > > 2013-06-05 05:12:51,110 FATAL
> > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not
> sync.
> > > > > > Requesting
> > > > > > > > close of hlog
> > > > > > > > java.io.IOException: Reflection
> > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > > > 2013-06-05 05:12:51,180 FATAL
> > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not
> sync.
> > > > > > Requesting
> > > > > > > > close of hlog
> > > > > > > > java.io.IOException: Reflection
> > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > > > 2013-06-05 05:12:51,183 ERROR
> > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close
> of
> > > HLog
> > > > > > > writer
> > > > > > > > java.io.IOException: Reflection
> > > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > > > 2013-06-05 05:12:51,184 WARN
> > > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over
> HLog
> > > > close
> > > > > > > > failure! error count=1
> > > > > > > > 2013-06-05 05:12:52,557 FATAL
> > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING
> > > region
> > > > > > > server
> > > > > > > > hbase.rummycircle.com,60020,1369877672964:
> > > > > > > > regionserver:60020-0x13ef31264d00001
> > > > > > regionserver:60020-0x13ef31264d00001
> > > > > > > > received expired from ZooKeeper, aborting
> > > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > KeeperErrorCode = Session expired
> > > > > > > > 2013-06-05 05:12:52,557 FATAL
> > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> > RegionServer
> > > > > abort:
> > > > > > > > loaded coprocessors are: []
> > > > > > > > 2013-06-05 05:12:52,621 INFO
> > > > > > > > org.apache.hadoop.hbase.regionserver.SplitLogWorker:
> > > SplitLogWorker
> > > > > > > > interrupted while waiting for task, exiting:
> > > > > > > java.lang.InterruptedException
> > > > > > > > java.io.InterruptedIOException: Aborting compaction of store
> > > > cfp_info
> > > > > > in
> > > > > > > > region
> > > > > > event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> > > > > > > > because user requested stop.
> > > > > > > > 2013-06-05 05:12:53,425 WARN
> > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > Possibly
> > > > > > > transient
> > > > > > > > ZooKeeper exception:
> > > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > > hbase.rummycircle.com
> > > > > > > > ,60020,1369877672964
> > > > > > > > 2013-06-05 05:12:55,426 WARN
> > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > Possibly
> > > > > > > transient
> > > > > > > > ZooKeeper exception:
> > > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > > hbase.rummycircle.com
> > > > > > > > ,60020,1369877672964
> > > > > > > > 2013-06-05 05:12:59,427 WARN
> > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > Possibly
> > > > > > > transient
> > > > > > > > ZooKeeper exception:
> > > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > > hbase.rummycircle.com
> > > > > > > > ,60020,1369877672964
> > > > > > > > 2013-06-05 05:13:07,427 WARN
> > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > Possibly
> > > > > > > transient
> > > > > > > > ZooKeeper exception:
> > > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > > hbase.rummycircle.com
> > > > > > > > ,60020,1369877672964
> > > > > > > > 2013-06-05 05:13:07,427 ERROR
> > > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> > ZooKeeper
> > > > > > delete
> > > > > > > > failed after 3 retries
> > > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > > hbase.rummycircle.com
> > > > > > > > ,60020,1369877672964
> > > > > > > >     at
> > > > > > > >
> > > > org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> > > > > > > >     at
> > > > > > >
> > > org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > > > 2013-06-05 05:13:07,436 ERROR
> org.apache.hadoop.hdfs.DFSClient:
> > > > > > Exception
> > > > > > > > closing file /hbase/.logs/hbase.rummycircle.com
> > > > ,60020,1369877672964/
> > > > > > > > hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642
> :
> > > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are
> > bad.
> > > > > > > > Aborting...
> > > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are
> > bad.
> > > > > > > > Aborting...
> > > > > > > >     at
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > > > > >
> > > > > > > >
> > > > > > > > *HMaster logs:*
> > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > org.apache.hadoop.hbase.util.Sleeper:
> > > > We
> > > > > > > > slept 4702394ms instead of 10000ms, this is likely due to a
> > long
> > > > > > garbage
> > > > > > > > collecting pause and it's usually bad, see
> > > > > > > >
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > org.apache.hadoop.hbase.util.Sleeper:
> > > > We
> > > > > > > > slept 4988731ms instead of 300000ms, this is likely due to a
> > long
> > > > > > garbage
> > > > > > > > collecting pause and it's usually bad, see
> > > > > > > >
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > org.apache.hadoop.hbase.util.Sleeper:
> > > > We
> > > > > > > > slept 4988726ms instead of 300000ms, this is likely due to a
> > long
> > > > > > garbage
> > > > > > > > collecting pause and it's usually bad, see
> > > > > > > >
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > 2013-06-05 05:12:50,701 WARN
> > > org.apache.hadoop.hbase.util.Sleeper:
> > > > We
> > > > > > > > slept 4698291ms instead of 10000ms, this is likely due to a
> > long
> > > > > > garbage
> > > > > > > > collecting pause and it's usually bad, see
> > > > > > > >
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > 2013-06-05 05:12:50,711 WARN
> > > org.apache.hadoop.hbase.util.Sleeper:
> > > > We
> > > > > > > > slept 4694502ms instead of 1000ms, this is likely due to a
> long
> > > > > garbage
> > > > > > > > collecting pause and it's usually bad, see
> > > > > > > >
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > 2013-06-05 05:12:50,714 WARN
> > > org.apache.hadoop.hbase.util.Sleeper:
> > > > We
> > > > > > > > slept 4694492ms instead of 1000ms, this is likely due to a
> long
> > > > > garbage
> > > > > > > > collecting pause and it's usually bad, see
> > > > > > > >
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > 2013-06-05 05:12:50,715 WARN
> > > org.apache.hadoop.hbase.util.Sleeper:
> > > > We
> > > > > > > > slept 4695589ms instead of 60000ms, this is likely due to a
> > long
> > > > > > garbage
> > > > > > > > collecting pause and it's usually bad, see
> > > > > > > >
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > > 2013-06-05 05:12:52,263 FATAL
> > > > org.apache.hadoop.hbase.master.HMaster:
> > > > > > > > Master server abort: loaded coprocessors are: []
> > > > > > > > 2013-06-05 05:12:52,465 INFO
> > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > > > Waiting for region servers count to settle; currently checked
> > in
> > > 1,
> > > > > > slept
> > > > > > > > for 0 ms, expecting minimum of 1, maximum of 2147483647,
> > timeout
> > > of
> > > > > > 4500
> > > > > > > > ms, interval of 1500 ms.
> > > > > > > > 2013-06-05 05:12:52,561 ERROR
> > > > org.apache.hadoop.hbase.master.HMaster:
> > > > > > > > Region server hbase.rummycircle.com,60020,1369877672964
> > > reported a
> > > > > > fatal
> > > > > > > > error:
> > > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > KeeperErrorCode = Session expired
> > > > > > > > 2013-06-05 05:12:53,970 INFO
> > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > > > Waiting for region servers count to settle; currently checked
> > in
> > > 1,
> > > > > > slept
> > > > > > > > for 1506 ms, expecting minimum of 1, maximum of 2147483647,
> > > timeout
> > > > > of
> > > > > > > 4500
> > > > > > > > ms, interval of 1500 ms.
> > > > > > > > 2013-06-05 05:12:55,476 INFO
> > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > > > Waiting for region servers count to settle; currently checked
> > in
> > > 1,
> > > > > > slept
> > > > > > > > for 3012 ms, expecting minimum of 1, maximum of 2147483647,
> > > timeout
> > > > > of
> > > > > > > 4500
> > > > > > > > ms, interval of 1500 ms.
> > > > > > > > 2013-06-05 05:12:56,981 INFO
> > > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > > > Finished waiting for region servers count to settle; checked
> in
> > > 1,
> > > > > > slept
> > > > > > > > for 4517 ms, expecting minimum of 1, maximum of 2147483647,
> > > master
> > > > is
> > > > > > > > running.
> > > > > > > > 2013-06-05 05:12:57,019 INFO
> > > > > > > > org.apache.hadoop.hbase.catalog.CatalogTracker: Failed
> > > verification
> > > > > of
> > > > > > > > -ROOT-,,0 at address=hbase.rummycircle.com
> > ,60020,1369877672964;
> > > > > > > > java.io.EOFException
> > > > > > > > 2013-06-05 05:17:52,302 WARN
> > > > > > > > org.apache.hadoop.hbase.master.SplitLogManager: error while
> > > > splitting
> > > > > > > logs
> > > > > > > > in [hdfs://
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
> > > > > > > ]
> > > > > > > > installed = 19 but only 0 done
> > > > > > > > 2013-06-05 05:17:52,321 FATAL
> > > > org.apache.hadoop.hbase.master.HMaster:
> > > > > > > > master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000
> > > > > received
> > > > > > > > expired from ZooKeeper, aborting
> > > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > > KeeperErrorCode = Session expired
> > > > > > > > java.io.IOException: Giving up after tries=1
> > > > > > > > Caused by: java.lang.InterruptedException: sleep interrupted
> > > > > > > > 2013-06-05 05:17:52,381 ERROR
> > > > > > > > org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to
> > > start
> > > > > > master
> > > > > > > > java.lang.RuntimeException: HMaster Aborted
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Thanks and Regards,
> > > > > > > > Vimal Jain
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thanks and Regards,
> > > > > > > Vimal Jain
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks and Regards,
> > > > > Vimal Jain
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks and Regards,
> > > Vimal Jain
> > >
> >
>
>
>
> --
> Thanks and Regards,
> Vimal Jain
>

Re: HMaster and HRegionServer going down

Posted by Vimal Jain <vk...@gmail.com>.

I dont have GC logs.Do you get it by default  or it has to be configured ?
After i came to know about crash , i checked which all processes are
running using "jps"
It displayed 4 processes , "namenode","datanode","secondarynamenode" and
"HQuorumpeer".
So i stopped dfs by running $HADOOP_HOME/bin/stop-dfs.sh and then i stopped
hbase by running $HBASE_HOME/bin/stop-hbase.sh


On Wed, Jun 5, 2013 at 5:49 PM, Azuryy Yu <az...@gmail.com> wrote:

> do you have GC log? and what you did during crash? and whats your gc
> options?
>
> for the dn error, thats net work issue generally, because dn received an
> incomplete packet.
>
> --Send from my Sony mobile.
> On Jun 5, 2013 8:10 PM, "Vimal Jain" <vk...@gmail.com> wrote:
>
> > Yes.
> > Thats true.
> > There are some errors in all 3 logs during same period , i.e. data ,
> master
> > and region.
> > But i am unable to deduce the exact cause of error.
> > Can you please help in detecting the problem ?
> >
> > So far i am suspecting following :-
> > I have 1GB heap (default) allocated for all 3 processes , i.e.
> > Master,Region,Zookeeper.
> > Both  Master and Region took more time for GC ( as inferred from lines in
> > logs like "slept more time than configured one" etc ) .
> > Due to this there was  zookeeper connection time out for both Master and
> > Region and hence both went down.
> >
> > I am newbie to Hbase and hence may be my findings are not correct.
> > I want to be 100 % sure before increasing heap space for both Master and
> > Region ( Both around 2GB) to solve this.
> > At present i have restarted the cluster with default heap space only (
> 1GB
> > ).
> >
> >
> >
> > On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <az...@gmail.com> wrote:
> >
> > > there have errors in your dats node log, and the error time match with
> rs
> > > log error time.
> > >
> > > --Send from my Sony mobile.
> > > On Jun 5, 2013 5:06 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> > >
> > > > I don't think so , as i dont find any issues in data node logs.
> > > > Also there are lot of exceptions like "session expired" , "slept more
> > > than
> > > > configured time" . what are these ?
> > > >
> > > >
> > > > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com>
> wrote:
> > > >
> > > > > Because your data node 192.168.20.30 broke down. which leads to RS
> > > down.
> > > > >
> > > > >
> > > > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com>
> wrote:
> > > > >
> > > > > > Here is the complete log:
> > > > > >
> > > > > > http://bin.cakephp.org/saved/103001 - Hregion
> > > > > > http://bin.cakephp.org/saved/103000 - Hmaster
> > > > > > http://bin.cakephp.org/saved/103002 - Datanode
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > > I have set up Hbase in pseudo-distributed mode.
> > > > > > > It was working fine for 6 days , but suddenly today morning
> both
> > > > > HMaster
> > > > > > > and Hregion process went down.
> > > > > > > I checked in logs of both hadoop and hbase.
> > > > > > > Please help here.
> > > > > > > Here are the snippets :-
> > > > > > >
> > > > > > > *Datanode logs:*
> > > > > > > 2013-06-05 05:12:51,436 INFO
> > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> > > > > > receiveBlock
> > > > > > > for block blk_1597245478875608321_2818 java.io.EOFException:
> > while
> > > > > trying
> > > > > > > to read 2347 bytes
> > > > > > > 2013-06-05 05:12:51,442 INFO
> > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > > > > > > blk_1597245478875608321_2818 received exception
> > > java.io.EOFException:
> > > > > > while
> > > > > > > trying to read 2347 bytes
> > > > > > > 2013-06-05 05:12:51,442 ERROR
> > > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > DatanodeRegistration(
> > > > > > > 192.168.20.30:50010,
> > > > > > > storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> > > > > > infoPort=50075,
> > > > > > > ipcPort=50020):DataXceiver
> > > > > > > java.io.EOFException: while trying to read 2347 bytes
> > > > > > >
> > > > > > >
> > > > > > > *HRegion logs:*
> > > > > > > 2013-06-05 05:12:50,701 WARN
> > org.apache.hadoop.hbase.util.Sleeper:
> > > We
> > > > > > > slept 4694929ms instead of 3000ms, this is likely due to a long
> > > > garbage
> > > > > > > collecting pause and it's usually bad, see
> > > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > > > > DFSOutputStream ResponseProcessor exception  for block
> > > > > > > blk_1597245478875608321_2818java.net.SocketTimeoutException:
> > 63000
> > > > > millis
> > > > > > > timeout while waiting for channel to be ready for read. ch :
> > > > > > > java.nio.channels.SocketChannel[connected local=/
> > > 192.168.20.30:44333
> > > > > > remote=/
> > > > > > > 192.168.20.30:50010]
> > > > > > > 2013-06-05 05:12:51,046 WARN
> > org.apache.hadoop.hbase.util.Sleeper:
> > > We
> > > > > > > slept 11695345ms instead of 10000000ms, this is likely due to a
> > > long
> > > > > > > garbage collecting pause and it's usually bad, see
> > > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient:
> > > Error
> > > > > > > Recovery for block blk_1597245478875608321_2818 bad datanode[0]
> > > > > > > 192.168.20.30:50010
> > > > > > > 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient:
> > > Error
> > > > > > while
> > > > > > > syncing
> > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are
> bad.
> > > > > > > Aborting...
> > > > > > >     at
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > > > > 2013-06-05 05:12:51,110 FATAL
> > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> > > > > Requesting
> > > > > > > close of hlog
> > > > > > > java.io.IOException: Reflection
> > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > > 2013-06-05 05:12:51,180 FATAL
> > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> > > > > Requesting
> > > > > > > close of hlog
> > > > > > > java.io.IOException: Reflection
> > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > > 2013-06-05 05:12:51,183 ERROR
> > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of
> > HLog
> > > > > > writer
> > > > > > > java.io.IOException: Reflection
> > > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > > 2013-06-05 05:12:51,184 WARN
> > > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog
> > > close
> > > > > > > failure! error count=1
> > > > > > > 2013-06-05 05:12:52,557 FATAL
> > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING
> > region
> > > > > > server
> > > > > > > hbase.rummycircle.com,60020,1369877672964:
> > > > > > > regionserver:60020-0x13ef31264d00001
> > > > > regionserver:60020-0x13ef31264d00001
> > > > > > > received expired from ZooKeeper, aborting
> > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > KeeperErrorCode = Session expired
> > > > > > > 2013-06-05 05:12:52,557 FATAL
> > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer:
> RegionServer
> > > > abort:
> > > > > > > loaded coprocessors are: []
> > > > > > > 2013-06-05 05:12:52,621 INFO
> > > > > > > org.apache.hadoop.hbase.regionserver.SplitLogWorker:
> > SplitLogWorker
> > > > > > > interrupted while waiting for task, exiting:
> > > > > > java.lang.InterruptedException
> > > > > > > java.io.InterruptedIOException: Aborting compaction of store
> > > cfp_info
> > > > > in
> > > > > > > region
> > > > > event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> > > > > > > because user requested stop.
> > > > > > > 2013-06-05 05:12:53,425 WARN
> > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> Possibly
> > > > > > transient
> > > > > > > ZooKeeper exception:
> > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > hbase.rummycircle.com
> > > > > > > ,60020,1369877672964
> > > > > > > 2013-06-05 05:12:55,426 WARN
> > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> Possibly
> > > > > > transient
> > > > > > > ZooKeeper exception:
> > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > hbase.rummycircle.com
> > > > > > > ,60020,1369877672964
> > > > > > > 2013-06-05 05:12:59,427 WARN
> > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> Possibly
> > > > > > transient
> > > > > > > ZooKeeper exception:
> > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > hbase.rummycircle.com
> > > > > > > ,60020,1369877672964
> > > > > > > 2013-06-05 05:13:07,427 WARN
> > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> Possibly
> > > > > > transient
> > > > > > > ZooKeeper exception:
> > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > hbase.rummycircle.com
> > > > > > > ,60020,1369877672964
> > > > > > > 2013-06-05 05:13:07,427 ERROR
> > > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper:
> ZooKeeper
> > > > > delete
> > > > > > > failed after 3 retries
> > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > > hbase.rummycircle.com
> > > > > > > ,60020,1369877672964
> > > > > > >     at
> > > > > > >
> > > org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> > > > > > >     at
> > > > > >
> > org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > > 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient:
> > > > > Exception
> > > > > > > closing file /hbase/.logs/hbase.rummycircle.com
> > > ,60020,1369877672964/
> > > > > > > hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
> > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are
> bad.
> > > > > > > Aborting...
> > > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are
> bad.
> > > > > > > Aborting...
> > > > > > >     at
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > > > >
> > > > > > >
> > > > > > > *HMaster logs:*
> > > > > > > 2013-06-05 05:12:50,701 WARN
> > org.apache.hadoop.hbase.util.Sleeper:
> > > We
> > > > > > > slept 4702394ms instead of 10000ms, this is likely due to a
> long
> > > > > garbage
> > > > > > > collecting pause and it's usually bad, see
> > > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > 2013-06-05 05:12:50,701 WARN
> > org.apache.hadoop.hbase.util.Sleeper:
> > > We
> > > > > > > slept 4988731ms instead of 300000ms, this is likely due to a
> long
> > > > > garbage
> > > > > > > collecting pause and it's usually bad, see
> > > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > 2013-06-05 05:12:50,701 WARN
> > org.apache.hadoop.hbase.util.Sleeper:
> > > We
> > > > > > > slept 4988726ms instead of 300000ms, this is likely due to a
> long
> > > > > garbage
> > > > > > > collecting pause and it's usually bad, see
> > > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > 2013-06-05 05:12:50,701 WARN
> > org.apache.hadoop.hbase.util.Sleeper:
> > > We
> > > > > > > slept 4698291ms instead of 10000ms, this is likely due to a
> long
> > > > > garbage
> > > > > > > collecting pause and it's usually bad, see
> > > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > 2013-06-05 05:12:50,711 WARN
> > org.apache.hadoop.hbase.util.Sleeper:
> > > We
> > > > > > > slept 4694502ms instead of 1000ms, this is likely due to a long
> > > > garbage
> > > > > > > collecting pause and it's usually bad, see
> > > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > 2013-06-05 05:12:50,714 WARN
> > org.apache.hadoop.hbase.util.Sleeper:
> > > We
> > > > > > > slept 4694492ms instead of 1000ms, this is likely due to a long
> > > > garbage
> > > > > > > collecting pause and it's usually bad, see
> > > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > 2013-06-05 05:12:50,715 WARN
> > org.apache.hadoop.hbase.util.Sleeper:
> > > We
> > > > > > > slept 4695589ms instead of 60000ms, this is likely due to a
> long
> > > > > garbage
> > > > > > > collecting pause and it's usually bad, see
> > > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > > 2013-06-05 05:12:52,263 FATAL
> > > org.apache.hadoop.hbase.master.HMaster:
> > > > > > > Master server abort: loaded coprocessors are: []
> > > > > > > 2013-06-05 05:12:52,465 INFO
> > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > > Waiting for region servers count to settle; currently checked
> in
> > 1,
> > > > > slept
> > > > > > > for 0 ms, expecting minimum of 1, maximum of 2147483647,
> timeout
> > of
> > > > > 4500
> > > > > > > ms, interval of 1500 ms.
> > > > > > > 2013-06-05 05:12:52,561 ERROR
> > > org.apache.hadoop.hbase.master.HMaster:
> > > > > > > Region server hbase.rummycircle.com,60020,1369877672964
> > reported a
> > > > > fatal
> > > > > > > error:
> > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > KeeperErrorCode = Session expired
> > > > > > > 2013-06-05 05:12:53,970 INFO
> > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > > Waiting for region servers count to settle; currently checked
> in
> > 1,
> > > > > slept
> > > > > > > for 1506 ms, expecting minimum of 1, maximum of 2147483647,
> > timeout
> > > > of
> > > > > > 4500
> > > > > > > ms, interval of 1500 ms.
> > > > > > > 2013-06-05 05:12:55,476 INFO
> > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > > Waiting for region servers count to settle; currently checked
> in
> > 1,
> > > > > slept
> > > > > > > for 3012 ms, expecting minimum of 1, maximum of 2147483647,
> > timeout
> > > > of
> > > > > > 4500
> > > > > > > ms, interval of 1500 ms.
> > > > > > > 2013-06-05 05:12:56,981 INFO
> > > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > > Finished waiting for region servers count to settle; checked in
> > 1,
> > > > > slept
> > > > > > > for 4517 ms, expecting minimum of 1, maximum of 2147483647,
> > master
> > > is
> > > > > > > running.
> > > > > > > 2013-06-05 05:12:57,019 INFO
> > > > > > > org.apache.hadoop.hbase.catalog.CatalogTracker: Failed
> > verification
> > > > of
> > > > > > > -ROOT-,,0 at address=hbase.rummycircle.com
> ,60020,1369877672964;
> > > > > > > java.io.EOFException
> > > > > > > 2013-06-05 05:17:52,302 WARN
> > > > > > > org.apache.hadoop.hbase.master.SplitLogManager: error while
> > > splitting
> > > > > > logs
> > > > > > > in [hdfs://
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
> > > > > > ]
> > > > > > > installed = 19 but only 0 done
> > > > > > > 2013-06-05 05:17:52,321 FATAL
> > > org.apache.hadoop.hbase.master.HMaster:
> > > > > > > master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000
> > > > received
> > > > > > > expired from ZooKeeper, aborting
> > > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > > KeeperErrorCode = Session expired
> > > > > > > java.io.IOException: Giving up after tries=1
> > > > > > > Caused by: java.lang.InterruptedException: sleep interrupted
> > > > > > > 2013-06-05 05:17:52,381 ERROR
> > > > > > > org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to
> > start
> > > > > master
> > > > > > > java.lang.RuntimeException: HMaster Aborted
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thanks and Regards,
> > > > > > > Vimal Jain
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks and Regards,
> > > > > > Vimal Jain
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks and Regards,
> > > > Vimal Jain
> > > >
> > >
> >
> >
> >
> > --
> > Thanks and Regards,
> > Vimal Jain
> >
>



-- 
Thanks and Regards,
Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Azuryy Yu <az...@gmail.com>.

do you have GC log? and what you did during crash? and whats your gc
options?

for the dn error, thats net work issue generally, because dn received an
incomplete packet.

--Send from my Sony mobile.
On Jun 5, 2013 8:10 PM, "Vimal Jain" <vk...@gmail.com> wrote:

> Yes.
> Thats true.
> There are some errors in all 3 logs during same period , i.e. data , master
> and region.
> But i am unable to deduce the exact cause of error.
> Can you please help in detecting the problem ?
>
> So far i am suspecting following :-
> I have 1GB heap (default) allocated for all 3 processes , i.e.
> Master,Region,Zookeeper.
> Both  Master and Region took more time for GC ( as inferred from lines in
> logs like "slept more time than configured one" etc ) .
> Due to this there was  zookeeper connection time out for both Master and
> Region and hence both went down.
>
> I am newbie to Hbase and hence may be my findings are not correct.
> I want to be 100 % sure before increasing heap space for both Master and
> Region ( Both around 2GB) to solve this.
> At present i have restarted the cluster with default heap space only ( 1GB
> ).
>
>
>
> On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <az...@gmail.com> wrote:
>
> > there have errors in your dats node log, and the error time match with rs
> > log error time.
> >
> > --Send from my Sony mobile.
> > On Jun 5, 2013 5:06 PM, "Vimal Jain" <vk...@gmail.com> wrote:
> >
> > > I don't think so , as i dont find any issues in data node logs.
> > > Also there are lot of exceptions like "session expired" , "slept more
> > than
> > > configured time" . what are these ?
> > >
> > >
> > > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com> wrote:
> > >
> > > > Because your data node 192.168.20.30 broke down. which leads to RS
> > down.
> > > >
> > > >
> > > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com> wrote:
> > > >
> > > > > Here is the complete log:
> > > > >
> > > > > http://bin.cakephp.org/saved/103001 - Hregion
> > > > > http://bin.cakephp.org/saved/103000 - Hmaster
> > > > > http://bin.cakephp.org/saved/103002 - Datanode
> > > > >
> > > > >
> > > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi,
> > > > > > I have set up Hbase in pseudo-distributed mode.
> > > > > > It was working fine for 6 days , but suddenly today morning both
> > > > HMaster
> > > > > > and Hregion process went down.
> > > > > > I checked in logs of both hadoop and hbase.
> > > > > > Please help here.
> > > > > > Here are the snippets :-
> > > > > >
> > > > > > *Datanode logs:*
> > > > > > 2013-06-05 05:12:51,436 INFO
> > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> > > > > receiveBlock
> > > > > > for block blk_1597245478875608321_2818 java.io.EOFException:
> while
> > > > trying
> > > > > > to read 2347 bytes
> > > > > > 2013-06-05 05:12:51,442 INFO
> > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > > > > > blk_1597245478875608321_2818 received exception
> > java.io.EOFException:
> > > > > while
> > > > > > trying to read 2347 bytes
> > > > > > 2013-06-05 05:12:51,442 ERROR
> > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > DatanodeRegistration(
> > > > > > 192.168.20.30:50010,
> > > > > > storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> > > > > infoPort=50075,
> > > > > > ipcPort=50020):DataXceiver
> > > > > > java.io.EOFException: while trying to read 2347 bytes
> > > > > >
> > > > > >
> > > > > > *HRegion logs:*
> > > > > > 2013-06-05 05:12:50,701 WARN
> org.apache.hadoop.hbase.util.Sleeper:
> > We
> > > > > > slept 4694929ms instead of 3000ms, this is likely due to a long
> > > garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > > > DFSOutputStream ResponseProcessor exception  for block
> > > > > > blk_1597245478875608321_2818java.net.SocketTimeoutException:
> 63000
> > > > millis
> > > > > > timeout while waiting for channel to be ready for read. ch :
> > > > > > java.nio.channels.SocketChannel[connected local=/
> > 192.168.20.30:44333
> > > > > remote=/
> > > > > > 192.168.20.30:50010]
> > > > > > 2013-06-05 05:12:51,046 WARN
> org.apache.hadoop.hbase.util.Sleeper:
> > We
> > > > > > slept 11695345ms instead of 10000000ms, this is likely due to a
> > long
> > > > > > garbage collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient:
> > Error
> > > > > > Recovery for block blk_1597245478875608321_2818 bad datanode[0]
> > > > > > 192.168.20.30:50010
> > > > > > 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient:
> > Error
> > > > > while
> > > > > > syncing
> > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > > > > Aborting...
> > > > > >     at
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > > > 2013-06-05 05:12:51,110 FATAL
> > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> > > > Requesting
> > > > > > close of hlog
> > > > > > java.io.IOException: Reflection
> > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > 2013-06-05 05:12:51,180 FATAL
> > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> > > > Requesting
> > > > > > close of hlog
> > > > > > java.io.IOException: Reflection
> > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > 2013-06-05 05:12:51,183 ERROR
> > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of
> HLog
> > > > > writer
> > > > > > java.io.IOException: Reflection
> > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > 2013-06-05 05:12:51,184 WARN
> > > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog
> > close
> > > > > > failure! error count=1
> > > > > > 2013-06-05 05:12:52,557 FATAL
> > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING
> region
> > > > > server
> > > > > > hbase.rummycircle.com,60020,1369877672964:
> > > > > > regionserver:60020-0x13ef31264d00001
> > > > regionserver:60020-0x13ef31264d00001
> > > > > > received expired from ZooKeeper, aborting
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired
> > > > > > 2013-06-05 05:12:52,557 FATAL
> > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer
> > > abort:
> > > > > > loaded coprocessors are: []
> > > > > > 2013-06-05 05:12:52,621 INFO
> > > > > > org.apache.hadoop.hbase.regionserver.SplitLogWorker:
> SplitLogWorker
> > > > > > interrupted while waiting for task, exiting:
> > > > > java.lang.InterruptedException
> > > > > > java.io.InterruptedIOException: Aborting compaction of store
> > cfp_info
> > > > in
> > > > > > region
> > > > event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> > > > > > because user requested stop.
> > > > > > 2013-06-05 05:12:53,425 WARN
> > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > > > > transient
> > > > > > ZooKeeper exception:
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > hbase.rummycircle.com
> > > > > > ,60020,1369877672964
> > > > > > 2013-06-05 05:12:55,426 WARN
> > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > > > > transient
> > > > > > ZooKeeper exception:
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > hbase.rummycircle.com
> > > > > > ,60020,1369877672964
> > > > > > 2013-06-05 05:12:59,427 WARN
> > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > > > > transient
> > > > > > ZooKeeper exception:
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > hbase.rummycircle.com
> > > > > > ,60020,1369877672964
> > > > > > 2013-06-05 05:13:07,427 WARN
> > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > > > > transient
> > > > > > ZooKeeper exception:
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > hbase.rummycircle.com
> > > > > > ,60020,1369877672964
> > > > > > 2013-06-05 05:13:07,427 ERROR
> > > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper
> > > > delete
> > > > > > failed after 3 retries
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > > hbase.rummycircle.com
> > > > > > ,60020,1369877672964
> > > > > >     at
> > > > > >
> > org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> > > > > >     at
> > > > >
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient:
> > > > Exception
> > > > > > closing file /hbase/.logs/hbase.rummycircle.com
> > ,60020,1369877672964/
> > > > > > hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
> > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > > > > Aborting...
> > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > > > > Aborting...
> > > > > >     at
> > > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > > >
> > > > > >
> > > > > > *HMaster logs:*
> > > > > > 2013-06-05 05:12:50,701 WARN
> org.apache.hadoop.hbase.util.Sleeper:
> > We
> > > > > > slept 4702394ms instead of 10000ms, this is likely due to a long
> > > > garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,701 WARN
> org.apache.hadoop.hbase.util.Sleeper:
> > We
> > > > > > slept 4988731ms instead of 300000ms, this is likely due to a long
> > > > garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,701 WARN
> org.apache.hadoop.hbase.util.Sleeper:
> > We
> > > > > > slept 4988726ms instead of 300000ms, this is likely due to a long
> > > > garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,701 WARN
> org.apache.hadoop.hbase.util.Sleeper:
> > We
> > > > > > slept 4698291ms instead of 10000ms, this is likely due to a long
> > > > garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,711 WARN
> org.apache.hadoop.hbase.util.Sleeper:
> > We
> > > > > > slept 4694502ms instead of 1000ms, this is likely due to a long
> > > garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,714 WARN
> org.apache.hadoop.hbase.util.Sleeper:
> > We
> > > > > > slept 4694492ms instead of 1000ms, this is likely due to a long
> > > garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,715 WARN
> org.apache.hadoop.hbase.util.Sleeper:
> > We
> > > > > > slept 4695589ms instead of 60000ms, this is likely due to a long
> > > > garbage
> > > > > > collecting pause and it's usually bad, see
> > > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:52,263 FATAL
> > org.apache.hadoop.hbase.master.HMaster:
> > > > > > Master server abort: loaded coprocessors are: []
> > > > > > 2013-06-05 05:12:52,465 INFO
> > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > Waiting for region servers count to settle; currently checked in
> 1,
> > > > slept
> > > > > > for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout
> of
> > > > 4500
> > > > > > ms, interval of 1500 ms.
> > > > > > 2013-06-05 05:12:52,561 ERROR
> > org.apache.hadoop.hbase.master.HMaster:
> > > > > > Region server hbase.rummycircle.com,60020,1369877672964
> reported a
> > > > fatal
> > > > > > error:
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired
> > > > > > 2013-06-05 05:12:53,970 INFO
> > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > Waiting for region servers count to settle; currently checked in
> 1,
> > > > slept
> > > > > > for 1506 ms, expecting minimum of 1, maximum of 2147483647,
> timeout
> > > of
> > > > > 4500
> > > > > > ms, interval of 1500 ms.
> > > > > > 2013-06-05 05:12:55,476 INFO
> > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > Waiting for region servers count to settle; currently checked in
> 1,
> > > > slept
> > > > > > for 3012 ms, expecting minimum of 1, maximum of 2147483647,
> timeout
> > > of
> > > > > 4500
> > > > > > ms, interval of 1500 ms.
> > > > > > 2013-06-05 05:12:56,981 INFO
> > > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > > Finished waiting for region servers count to settle; checked in
> 1,
> > > > slept
> > > > > > for 4517 ms, expecting minimum of 1, maximum of 2147483647,
> master
> > is
> > > > > > running.
> > > > > > 2013-06-05 05:12:57,019 INFO
> > > > > > org.apache.hadoop.hbase.catalog.CatalogTracker: Failed
> verification
> > > of
> > > > > > -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
> > > > > > java.io.EOFException
> > > > > > 2013-06-05 05:17:52,302 WARN
> > > > > > org.apache.hadoop.hbase.master.SplitLogManager: error while
> > splitting
> > > > > logs
> > > > > > in [hdfs://
> > > > > >
> > > > >
> > > >
> > >
> >
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
> > > > > ]
> > > > > > installed = 19 but only 0 done
> > > > > > 2013-06-05 05:17:52,321 FATAL
> > org.apache.hadoop.hbase.master.HMaster:
> > > > > > master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000
> > > received
> > > > > > expired from ZooKeeper, aborting
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > > KeeperErrorCode = Session expired
> > > > > > java.io.IOException: Giving up after tries=1
> > > > > > Caused by: java.lang.InterruptedException: sleep interrupted
> > > > > > 2013-06-05 05:17:52,381 ERROR
> > > > > > org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to
> start
> > > > master
> > > > > > java.lang.RuntimeException: HMaster Aborted
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks and Regards,
> > > > > > Vimal Jain
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks and Regards,
> > > > > Vimal Jain
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks and Regards,
> > > Vimal Jain
> > >
> >
>
>
>
> --
> Thanks and Regards,
> Vimal Jain
>

Re: HMaster and HRegionServer going down

Posted by Vimal Jain <vk...@gmail.com>.

Yes.
Thats true.
There are some errors in all 3 logs during same period , i.e. data , master
and region.
But i am unable to deduce the exact cause of error.
Can you please help in detecting the problem ?

So far i am suspecting following :-
I have 1GB heap (default) allocated for all 3 processes , i.e.
Master,Region,Zookeeper.
Both  Master and Region took more time for GC ( as inferred from lines in
logs like "slept more time than configured one" etc ) .
Due to this there was  zookeeper connection time out for both Master and
Region and hence both went down.

I am newbie to Hbase and hence may be my findings are not correct.
I want to be 100 % sure before increasing heap space for both Master and
Region ( Both around 2GB) to solve this.
At present i have restarted the cluster with default heap space only ( 1GB
).



On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <az...@gmail.com> wrote:

> there have errors in your dats node log, and the error time match with rs
> log error time.
>
> --Send from my Sony mobile.
> On Jun 5, 2013 5:06 PM, "Vimal Jain" <vk...@gmail.com> wrote:
>
> > I don't think so , as i dont find any issues in data node logs.
> > Also there are lot of exceptions like "session expired" , "slept more
> than
> > configured time" . what are these ?
> >
> >
> > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com> wrote:
> >
> > > Because your data node 192.168.20.30 broke down. which leads to RS
> down.
> > >
> > >
> > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com> wrote:
> > >
> > > > Here is the complete log:
> > > >
> > > > http://bin.cakephp.org/saved/103001 - Hregion
> > > > http://bin.cakephp.org/saved/103000 - Hmaster
> > > > http://bin.cakephp.org/saved/103002 - Datanode
> > > >
> > > >
> > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com>
> wrote:
> > > >
> > > > > Hi,
> > > > > I have set up Hbase in pseudo-distributed mode.
> > > > > It was working fine for 6 days , but suddenly today morning both
> > > HMaster
> > > > > and Hregion process went down.
> > > > > I checked in logs of both hadoop and hbase.
> > > > > Please help here.
> > > > > Here are the snippets :-
> > > > >
> > > > > *Datanode logs:*
> > > > > 2013-06-05 05:12:51,436 INFO
> > > > > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> > > > receiveBlock
> > > > > for block blk_1597245478875608321_2818 java.io.EOFException: while
> > > trying
> > > > > to read 2347 bytes
> > > > > 2013-06-05 05:12:51,442 INFO
> > > > > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > > > > blk_1597245478875608321_2818 received exception
> java.io.EOFException:
> > > > while
> > > > > trying to read 2347 bytes
> > > > > 2013-06-05 05:12:51,442 ERROR
> > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(
> > > > > 192.168.20.30:50010,
> > > > > storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> > > > infoPort=50075,
> > > > > ipcPort=50020):DataXceiver
> > > > > java.io.EOFException: while trying to read 2347 bytes
> > > > >
> > > > >
> > > > > *HRegion logs:*
> > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> > > > > slept 4694929ms instead of 3000ms, this is likely due to a long
> > garbage
> > > > > collecting pause and it's usually bad, see
> > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > > DFSOutputStream ResponseProcessor exception  for block
> > > > > blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000
> > > millis
> > > > > timeout while waiting for channel to be ready for read. ch :
> > > > > java.nio.channels.SocketChannel[connected local=/
> 192.168.20.30:44333
> > > > remote=/
> > > > > 192.168.20.30:50010]
> > > > > 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> > > > > slept 11695345ms instead of 10000000ms, this is likely due to a
> long
> > > > > garbage collecting pause and it's usually bad, see
> > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient:
> Error
> > > > > Recovery for block blk_1597245478875608321_2818 bad datanode[0]
> > > > > 192.168.20.30:50010
> > > > > 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient:
> Error
> > > > while
> > > > > syncing
> > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > > > Aborting...
> > > > >     at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > > 2013-06-05 05:12:51,110 FATAL
> > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> > > Requesting
> > > > > close of hlog
> > > > > java.io.IOException: Reflection
> > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > 2013-06-05 05:12:51,180 FATAL
> > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> > > Requesting
> > > > > close of hlog
> > > > > java.io.IOException: Reflection
> > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > 2013-06-05 05:12:51,183 ERROR
> > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog
> > > > writer
> > > > > java.io.IOException: Reflection
> > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > 2013-06-05 05:12:51,184 WARN
> > > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog
> close
> > > > > failure! error count=1
> > > > > 2013-06-05 05:12:52,557 FATAL
> > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> > > > server
> > > > > hbase.rummycircle.com,60020,1369877672964:
> > > > > regionserver:60020-0x13ef31264d00001
> > > regionserver:60020-0x13ef31264d00001
> > > > > received expired from ZooKeeper, aborting
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > KeeperErrorCode = Session expired
> > > > > 2013-06-05 05:12:52,557 FATAL
> > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer
> > abort:
> > > > > loaded coprocessors are: []
> > > > > 2013-06-05 05:12:52,621 INFO
> > > > > org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
> > > > > interrupted while waiting for task, exiting:
> > > > java.lang.InterruptedException
> > > > > java.io.InterruptedIOException: Aborting compaction of store
> cfp_info
> > > in
> > > > > region
> > > event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> > > > > because user requested stop.
> > > > > 2013-06-05 05:12:53,425 WARN
> > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > > > transient
> > > > > ZooKeeper exception:
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > hbase.rummycircle.com
> > > > > ,60020,1369877672964
> > > > > 2013-06-05 05:12:55,426 WARN
> > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > > > transient
> > > > > ZooKeeper exception:
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > hbase.rummycircle.com
> > > > > ,60020,1369877672964
> > > > > 2013-06-05 05:12:59,427 WARN
> > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > > > transient
> > > > > ZooKeeper exception:
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > hbase.rummycircle.com
> > > > > ,60020,1369877672964
> > > > > 2013-06-05 05:13:07,427 WARN
> > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > > > transient
> > > > > ZooKeeper exception:
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > hbase.rummycircle.com
> > > > > ,60020,1369877672964
> > > > > 2013-06-05 05:13:07,427 ERROR
> > > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper
> > > delete
> > > > > failed after 3 retries
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > KeeperErrorCode = Session expired for /hbase/rs/
> > hbase.rummycircle.com
> > > > > ,60020,1369877672964
> > > > >     at
> > > > >
> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> > > > >     at
> > > > org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient:
> > > Exception
> > > > > closing file /hbase/.logs/hbase.rummycircle.com
> ,60020,1369877672964/
> > > > > hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
> > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > > > Aborting...
> > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > > > Aborting...
> > > > >     at
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > >
> > > > >
> > > > > *HMaster logs:*
> > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> > > > > slept 4702394ms instead of 10000ms, this is likely due to a long
> > > garbage
> > > > > collecting pause and it's usually bad, see
> > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> > > > > slept 4988731ms instead of 300000ms, this is likely due to a long
> > > garbage
> > > > > collecting pause and it's usually bad, see
> > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> > > > > slept 4988726ms instead of 300000ms, this is likely due to a long
> > > garbage
> > > > > collecting pause and it's usually bad, see
> > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> > > > > slept 4698291ms instead of 10000ms, this is likely due to a long
> > > garbage
> > > > > collecting pause and it's usually bad, see
> > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> > > > > slept 4694502ms instead of 1000ms, this is likely due to a long
> > garbage
> > > > > collecting pause and it's usually bad, see
> > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> > > > > slept 4694492ms instead of 1000ms, this is likely due to a long
> > garbage
> > > > > collecting pause and it's usually bad, see
> > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> > > > > slept 4695589ms instead of 60000ms, this is likely due to a long
> > > garbage
> > > > > collecting pause and it's usually bad, see
> > > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > 2013-06-05 05:12:52,263 FATAL
> org.apache.hadoop.hbase.master.HMaster:
> > > > > Master server abort: loaded coprocessors are: []
> > > > > 2013-06-05 05:12:52,465 INFO
> > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > Waiting for region servers count to settle; currently checked in 1,
> > > slept
> > > > > for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of
> > > 4500
> > > > > ms, interval of 1500 ms.
> > > > > 2013-06-05 05:12:52,561 ERROR
> org.apache.hadoop.hbase.master.HMaster:
> > > > > Region server hbase.rummycircle.com,60020,1369877672964 reported a
> > > fatal
> > > > > error:
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > KeeperErrorCode = Session expired
> > > > > 2013-06-05 05:12:53,970 INFO
> > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > Waiting for region servers count to settle; currently checked in 1,
> > > slept
> > > > > for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout
> > of
> > > > 4500
> > > > > ms, interval of 1500 ms.
> > > > > 2013-06-05 05:12:55,476 INFO
> > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > Waiting for region servers count to settle; currently checked in 1,
> > > slept
> > > > > for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout
> > of
> > > > 4500
> > > > > ms, interval of 1500 ms.
> > > > > 2013-06-05 05:12:56,981 INFO
> > > > org.apache.hadoop.hbase.master.ServerManager:
> > > > > Finished waiting for region servers count to settle; checked in 1,
> > > slept
> > > > > for 4517 ms, expecting minimum of 1, maximum of 2147483647, master
> is
> > > > > running.
> > > > > 2013-06-05 05:12:57,019 INFO
> > > > > org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification
> > of
> > > > > -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
> > > > > java.io.EOFException
> > > > > 2013-06-05 05:17:52,302 WARN
> > > > > org.apache.hadoop.hbase.master.SplitLogManager: error while
> splitting
> > > > logs
> > > > > in [hdfs://
> > > > >
> > > >
> > >
> >
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
> > > > ]
> > > > > installed = 19 but only 0 done
> > > > > 2013-06-05 05:17:52,321 FATAL
> org.apache.hadoop.hbase.master.HMaster:
> > > > > master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000
> > received
> > > > > expired from ZooKeeper, aborting
> > > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > > KeeperErrorCode = Session expired
> > > > > java.io.IOException: Giving up after tries=1
> > > > > Caused by: java.lang.InterruptedException: sleep interrupted
> > > > > 2013-06-05 05:17:52,381 ERROR
> > > > > org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start
> > > master
> > > > > java.lang.RuntimeException: HMaster Aborted
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks and Regards,
> > > > > Vimal Jain
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks and Regards,
> > > > Vimal Jain
> > > >
> > >
> >
> >
> >
> > --
> > Thanks and Regards,
> > Vimal Jain
> >
>



-- 
Thanks and Regards,
Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Azuryy Yu <az...@gmail.com>.

there have errors in your dats node log, and the error time match with rs
log error time.

--Send from my Sony mobile.
On Jun 5, 2013 5:06 PM, "Vimal Jain" <vk...@gmail.com> wrote:

> I don't think so , as i dont find any issues in data node logs.
> Also there are lot of exceptions like "session expired" , "slept more than
> configured time" . what are these ?
>
>
> On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com> wrote:
>
> > Because your data node 192.168.20.30 broke down. which leads to RS down.
> >
> >
> > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com> wrote:
> >
> > > Here is the complete log:
> > >
> > > http://bin.cakephp.org/saved/103001 - Hregion
> > > http://bin.cakephp.org/saved/103000 - Hmaster
> > > http://bin.cakephp.org/saved/103002 - Datanode
> > >
> > >
> > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com> wrote:
> > >
> > > > Hi,
> > > > I have set up Hbase in pseudo-distributed mode.
> > > > It was working fine for 6 days , but suddenly today morning both
> > HMaster
> > > > and Hregion process went down.
> > > > I checked in logs of both hadoop and hbase.
> > > > Please help here.
> > > > Here are the snippets :-
> > > >
> > > > *Datanode logs:*
> > > > 2013-06-05 05:12:51,436 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> > > receiveBlock
> > > > for block blk_1597245478875608321_2818 java.io.EOFException: while
> > trying
> > > > to read 2347 bytes
> > > > 2013-06-05 05:12:51,442 INFO
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > > > blk_1597245478875608321_2818 received exception java.io.EOFException:
> > > while
> > > > trying to read 2347 bytes
> > > > 2013-06-05 05:12:51,442 ERROR
> > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(
> > > > 192.168.20.30:50010,
> > > > storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> > > infoPort=50075,
> > > > ipcPort=50020):DataXceiver
> > > > java.io.EOFException: while trying to read 2347 bytes
> > > >
> > > >
> > > > *HRegion logs:*
> > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > slept 4694929ms instead of 3000ms, this is likely due to a long
> garbage
> > > > collecting pause and it's usually bad, see
> > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
> > > > DFSOutputStream ResponseProcessor exception  for block
> > > > blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000
> > millis
> > > > timeout while waiting for channel to be ready for read. ch :
> > > > java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333
> > > remote=/
> > > > 192.168.20.30:50010]
> > > > 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > slept 11695345ms instead of 10000000ms, this is likely due to a long
> > > > garbage collecting pause and it's usually bad, see
> > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > > > Recovery for block blk_1597245478875608321_2818 bad datanode[0]
> > > > 192.168.20.30:50010
> > > > 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > > while
> > > > syncing
> > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > > Aborting...
> > > >     at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > 2013-06-05 05:12:51,110 FATAL
> > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> > Requesting
> > > > close of hlog
> > > > java.io.IOException: Reflection
> > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > 2013-06-05 05:12:51,180 FATAL
> > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> > Requesting
> > > > close of hlog
> > > > java.io.IOException: Reflection
> > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > 2013-06-05 05:12:51,183 ERROR
> > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog
> > > writer
> > > > java.io.IOException: Reflection
> > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > 2013-06-05 05:12:51,184 WARN
> > > > org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close
> > > > failure! error count=1
> > > > 2013-06-05 05:12:52,557 FATAL
> > > > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> > > server
> > > > hbase.rummycircle.com,60020,1369877672964:
> > > > regionserver:60020-0x13ef31264d00001
> > regionserver:60020-0x13ef31264d00001
> > > > received expired from ZooKeeper, aborting
> > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > KeeperErrorCode = Session expired
> > > > 2013-06-05 05:12:52,557 FATAL
> > > > org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer
> abort:
> > > > loaded coprocessors are: []
> > > > 2013-06-05 05:12:52,621 INFO
> > > > org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
> > > > interrupted while waiting for task, exiting:
> > > java.lang.InterruptedException
> > > > java.io.InterruptedIOException: Aborting compaction of store cfp_info
> > in
> > > > region
> > event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> > > > because user requested stop.
> > > > 2013-06-05 05:12:53,425 WARN
> > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > > transient
> > > > ZooKeeper exception:
> > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > KeeperErrorCode = Session expired for /hbase/rs/
> hbase.rummycircle.com
> > > > ,60020,1369877672964
> > > > 2013-06-05 05:12:55,426 WARN
> > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > > transient
> > > > ZooKeeper exception:
> > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > KeeperErrorCode = Session expired for /hbase/rs/
> hbase.rummycircle.com
> > > > ,60020,1369877672964
> > > > 2013-06-05 05:12:59,427 WARN
> > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > > transient
> > > > ZooKeeper exception:
> > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > KeeperErrorCode = Session expired for /hbase/rs/
> hbase.rummycircle.com
> > > > ,60020,1369877672964
> > > > 2013-06-05 05:13:07,427 WARN
> > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > > transient
> > > > ZooKeeper exception:
> > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > KeeperErrorCode = Session expired for /hbase/rs/
> hbase.rummycircle.com
> > > > ,60020,1369877672964
> > > > 2013-06-05 05:13:07,427 ERROR
> > > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper
> > delete
> > > > failed after 3 retries
> > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > KeeperErrorCode = Session expired for /hbase/rs/
> hbase.rummycircle.com
> > > > ,60020,1369877672964
> > > >     at
> > > > org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> > > >     at
> > > org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient:
> > Exception
> > > > closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/
> > > > hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
> > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > > Aborting...
> > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > > Aborting...
> > > >     at
> > > >
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > >
> > > >
> > > > *HMaster logs:*
> > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > slept 4702394ms instead of 10000ms, this is likely due to a long
> > garbage
> > > > collecting pause and it's usually bad, see
> > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > slept 4988731ms instead of 300000ms, this is likely due to a long
> > garbage
> > > > collecting pause and it's usually bad, see
> > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > slept 4988726ms instead of 300000ms, this is likely due to a long
> > garbage
> > > > collecting pause and it's usually bad, see
> > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > slept 4698291ms instead of 10000ms, this is likely due to a long
> > garbage
> > > > collecting pause and it's usually bad, see
> > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > slept 4694502ms instead of 1000ms, this is likely due to a long
> garbage
> > > > collecting pause and it's usually bad, see
> > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > slept 4694492ms instead of 1000ms, this is likely due to a long
> garbage
> > > > collecting pause and it's usually bad, see
> > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > > slept 4695589ms instead of 60000ms, this is likely due to a long
> > garbage
> > > > collecting pause and it's usually bad, see
> > > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster:
> > > > Master server abort: loaded coprocessors are: []
> > > > 2013-06-05 05:12:52,465 INFO
> > > org.apache.hadoop.hbase.master.ServerManager:
> > > > Waiting for region servers count to settle; currently checked in 1,
> > slept
> > > > for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of
> > 4500
> > > > ms, interval of 1500 ms.
> > > > 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster:
> > > > Region server hbase.rummycircle.com,60020,1369877672964 reported a
> > fatal
> > > > error:
> > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > KeeperErrorCode = Session expired
> > > > 2013-06-05 05:12:53,970 INFO
> > > org.apache.hadoop.hbase.master.ServerManager:
> > > > Waiting for region servers count to settle; currently checked in 1,
> > slept
> > > > for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout
> of
> > > 4500
> > > > ms, interval of 1500 ms.
> > > > 2013-06-05 05:12:55,476 INFO
> > > org.apache.hadoop.hbase.master.ServerManager:
> > > > Waiting for region servers count to settle; currently checked in 1,
> > slept
> > > > for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout
> of
> > > 4500
> > > > ms, interval of 1500 ms.
> > > > 2013-06-05 05:12:56,981 INFO
> > > org.apache.hadoop.hbase.master.ServerManager:
> > > > Finished waiting for region servers count to settle; checked in 1,
> > slept
> > > > for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is
> > > > running.
> > > > 2013-06-05 05:12:57,019 INFO
> > > > org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification
> of
> > > > -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
> > > > java.io.EOFException
> > > > 2013-06-05 05:17:52,302 WARN
> > > > org.apache.hadoop.hbase.master.SplitLogManager: error while splitting
> > > logs
> > > > in [hdfs://
> > > >
> > >
> >
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
> > > ]
> > > > installed = 19 but only 0 done
> > > > 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster:
> > > > master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000
> received
> > > > expired from ZooKeeper, aborting
> > > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > > KeeperErrorCode = Session expired
> > > > java.io.IOException: Giving up after tries=1
> > > > Caused by: java.lang.InterruptedException: sleep interrupted
> > > > 2013-06-05 05:17:52,381 ERROR
> > > > org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start
> > master
> > > > java.lang.RuntimeException: HMaster Aborted
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks and Regards,
> > > > Vimal Jain
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks and Regards,
> > > Vimal Jain
> > >
> >
>
>
>
> --
> Thanks and Regards,
> Vimal Jain
>

Re: HMaster and HRegionServer going down

Posted by Vimal Jain <vk...@gmail.com>.

I could not find any issue in namenode.
Here is the namenode log at time of the issue.

http://bin.cakephp.org/saved/103003


On Wed, Jun 5, 2013 at 5:00 PM, Ted Yu <yu...@gmail.com> wrote:

> Have you looked at NameNode log ?
>
> The snippet you posted seems to imply issue with data block placement.
>
> Cheers
>
> On Jun 5, 2013, at 4:12 AM, Vimal Jain <vk...@gmail.com> wrote:
>
> > I am running Hbase in pseudo distributed mode . So there is only one
> > machine involved.
> > I am using  Hadoop version - 1.1.2 , Hbase version - 0.94.7
> >
> >
> > On Wed, Jun 5, 2013 at 4:38 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> How many region servers / data nodes do you have ?
> >>
> >> What Hadoop / HBase version are you using ?
> >>
> >> Thanks
> >>
> >> On Jun 5, 2013, at 3:54 AM, Vimal Jain <vk...@gmail.com> wrote:
> >>
> >>> Yes.I did check those.
> >>> But i am not sure if those parameter setting is the issue  , as there
> are
> >>> some other exceptions in logs ( "DFSOutputStream ResponseProcessor
> >>> exception " etc . )
> >>>
> >>>
> >>> On Wed, Jun 5, 2013 at 4:19 PM, Ted Yu <yu...@gmail.com> wrote:
> >>>
> >>>> There are a few tips under :
> >>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>
> >>>> Can you check ?
> >>>>
> >>>> Thanks
> >>>>
> >>>> On Jun 5, 2013, at 2:05 AM, Vimal Jain <vk...@gmail.com> wrote:
> >>>>
> >>>>> I don't think so , as i dont find any issues in data node logs.
> >>>>> Also there are lot of exceptions like "session expired" , "slept more
> >>>> than
> >>>>> configured time" . what are these ?
> >>>>>
> >>>>>
> >>>>> On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Because your data node 192.168.20.30 broke down. which leads to RS
> >> down.
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com>
> wrote:
> >>>>>>
> >>>>>>> Here is the complete log:
> >>>>>>>
> >>>>>>> http://bin.cakephp.org/saved/103001 - Hregion
> >>>>>>> http://bin.cakephp.org/saved/103000 - Hmaster
> >>>>>>> http://bin.cakephp.org/saved/103002 - Datanode
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com>
> >> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>> I have set up Hbase in pseudo-distributed mode.
> >>>>>>>> It was working fine for 6 days , but suddenly today morning both
> >>>>>> HMaster
> >>>>>>>> and Hregion process went down.
> >>>>>>>> I checked in logs of both hadoop and hbase.
> >>>>>>>> Please help here.
> >>>>>>>> Here are the snippets :-
> >>>>>>>>
> >>>>>>>> *Datanode logs:*
> >>>>>>>> 2013-06-05 05:12:51,436 INFO
> >>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> >>>>>>> receiveBlock
> >>>>>>>> for block blk_1597245478875608321_2818 java.io.EOFException: while
> >>>>>> trying
> >>>>>>>> to read 2347 bytes
> >>>>>>>> 2013-06-05 05:12:51,442 INFO
> >>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> >>>>>>>> blk_1597245478875608321_2818 received exception
> >> java.io.EOFException:
> >>>>>>> while
> >>>>>>>> trying to read 2347 bytes
> >>>>>>>> 2013-06-05 05:12:51,442 ERROR
> >>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >> DatanodeRegistration(
> >>>>>>>> 192.168.20.30:50010,
> >>>>>>>> storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> >>>>>>> infoPort=50075,
> >>>>>>>> ipcPort=50020):DataXceiver
> >>>>>>>> java.io.EOFException: while trying to read 2347 bytes
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> *HRegion logs:*
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> >> We
> >>>>>>>> slept 4694929ms instead of 3000ms, this is likely due to a long
> >>>> garbage
> >>>>>>>> collecting pause and it's usually bad, see
> >>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
> >>>>>>>> DFSOutputStream ResponseProcessor exception  for block
> >>>>>>>> blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000
> >>>>>> millis
> >>>>>>>> timeout while waiting for channel to be ready for read. ch :
> >>>>>>>> java.nio.channels.SocketChannel[connected local=/
> >> 192.168.20.30:44333
> >>>>>>> remote=/
> >>>>>>>> 192.168.20.30:50010]
> >>>>>>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper:
> >> We
> >>>>>>>> slept 11695345ms instead of 10000000ms, this is likely due to a
> long
> >>>>>>>> garbage collecting pause and it's usually bad, see
> >>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient:
> Error
> >>>>>>>> Recovery for block blk_1597245478875608321_2818 bad datanode[0]
> >>>>>>>> 192.168.20.30:50010
> >>>>>>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient:
> Error
> >>>>>>> while
> >>>>>>>> syncing
> >>>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> >>>>>>>> Aborting...
> >>>>>>>>  at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>>>>>> 2013-06-05 05:12:51,110 FATAL
> >>>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> >>>>>> Requesting
> >>>>>>>> close of hlog
> >>>>>>>> java.io.IOException: Reflection
> >>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>>>>>> 2013-06-05 05:12:51,180 FATAL
> >>>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> >>>>>> Requesting
> >>>>>>>> close of hlog
> >>>>>>>> java.io.IOException: Reflection
> >>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>>>>>> 2013-06-05 05:12:51,183 ERROR
> >>>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of
> HLog
> >>>>>>> writer
> >>>>>>>> java.io.IOException: Reflection
> >>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>>>>>> 2013-06-05 05:12:51,184 WARN
> >>>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog
> >> close
> >>>>>>>> failure! error count=1
> >>>>>>>> 2013-06-05 05:12:52,557 FATAL
> >>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING
> region
> >>>>>>> server
> >>>>>>>> hbase.rummycircle.com,60020,1369877672964:
> >>>>>>>> regionserver:60020-0x13ef31264d00001
> >>>>>> regionserver:60020-0x13ef31264d00001
> >>>>>>>> received expired from ZooKeeper, aborting
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>>>> KeeperErrorCode = Session expired
> >>>>>>>> 2013-06-05 05:12:52,557 FATAL
> >>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer
> >>>> abort:
> >>>>>>>> loaded coprocessors are: []
> >>>>>>>> 2013-06-05 05:12:52,621 INFO
> >>>>>>>> org.apache.hadoop.hbase.regionserver.SplitLogWorker:
> SplitLogWorker
> >>>>>>>> interrupted while waiting for task, exiting:
> >>>>>>> java.lang.InterruptedException
> >>>>>>>> java.io.InterruptedIOException: Aborting compaction of store
> >> cfp_info
> >>>>>> in
> >>>>>>>> region
> >>>>>> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> >>>>>>>> because user requested stop.
> >>>>>>>> 2013-06-05 05:12:53,425 WARN
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> >>>>>>> transient
> >>>>>>>> ZooKeeper exception:
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>>>> KeeperErrorCode = Session expired for /hbase/rs/
> >> hbase.rummycircle.com
> >>>>>>>> ,60020,1369877672964
> >>>>>>>> 2013-06-05 05:12:55,426 WARN
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> >>>>>>> transient
> >>>>>>>> ZooKeeper exception:
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>>>> KeeperErrorCode = Session expired for /hbase/rs/
> >> hbase.rummycircle.com
> >>>>>>>> ,60020,1369877672964
> >>>>>>>> 2013-06-05 05:12:59,427 WARN
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> >>>>>>> transient
> >>>>>>>> ZooKeeper exception:
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>>>> KeeperErrorCode = Session expired for /hbase/rs/
> >> hbase.rummycircle.com
> >>>>>>>> ,60020,1369877672964
> >>>>>>>> 2013-06-05 05:13:07,427 WARN
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> >>>>>>> transient
> >>>>>>>> ZooKeeper exception:
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>>>> KeeperErrorCode = Session expired for /hbase/rs/
> >> hbase.rummycircle.com
> >>>>>>>> ,60020,1369877672964
> >>>>>>>> 2013-06-05 05:13:07,427 ERROR
> >>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper
> >>>>>> delete
> >>>>>>>> failed after 3 retries
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>>>> KeeperErrorCode = Session expired for /hbase/rs/
> >> hbase.rummycircle.com
> >>>>>>>> ,60020,1369877672964
> >>>>>>>>  at
> >> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >>>>>>>>  at
> >>>>>>>
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >>>>>>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient:
> >>>>>> Exception
> >>>>>>>> closing file /hbase/.logs/hbase.rummycircle.com
> >> ,60020,1369877672964/
> >>>>>>>> hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
> >>>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> >>>>>>>> Aborting...
> >>>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> >>>>>>>> Aborting...
> >>>>>>>>  at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> *HMaster logs:*
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> >> We
> >>>>>>>> slept 4702394ms instead of 10000ms, this is likely due to a long
> >>>>>> garbage
> >>>>>>>> collecting pause and it's usually bad, see
> >>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> >> We
> >>>>>>>> slept 4988731ms instead of 300000ms, this is likely due to a long
> >>>>>> garbage
> >>>>>>>> collecting pause and it's usually bad, see
> >>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> >> We
> >>>>>>>> slept 4988726ms instead of 300000ms, this is likely due to a long
> >>>>>> garbage
> >>>>>>>> collecting pause and it's usually bad, see
> >>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> >> We
> >>>>>>>> slept 4698291ms instead of 10000ms, this is likely due to a long
> >>>>>> garbage
> >>>>>>>> collecting pause and it's usually bad, see
> >>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper:
> >> We
> >>>>>>>> slept 4694502ms instead of 1000ms, this is likely due to a long
> >>>> garbage
> >>>>>>>> collecting pause and it's usually bad, see
> >>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper:
> >> We
> >>>>>>>> slept 4694492ms instead of 1000ms, this is likely due to a long
> >>>> garbage
> >>>>>>>> collecting pause and it's usually bad, see
> >>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper:
> >> We
> >>>>>>>> slept 4695589ms instead of 60000ms, this is likely due to a long
> >>>>>> garbage
> >>>>>>>> collecting pause and it's usually bad, see
> >>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:52,263 FATAL
> >> org.apache.hadoop.hbase.master.HMaster:
> >>>>>>>> Master server abort: loaded coprocessors are: []
> >>>>>>>> 2013-06-05 05:12:52,465 INFO
> >>>>>>> org.apache.hadoop.hbase.master.ServerManager:
> >>>>>>>> Waiting for region servers count to settle; currently checked in
> 1,
> >>>>>> slept
> >>>>>>>> for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout
> of
> >>>>>> 4500
> >>>>>>>> ms, interval of 1500 ms.
> >>>>>>>> 2013-06-05 05:12:52,561 ERROR
> >> org.apache.hadoop.hbase.master.HMaster:
> >>>>>>>> Region server hbase.rummycircle.com,60020,1369877672964 reported
> a
> >>>>>> fatal
> >>>>>>>> error:
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>>>> KeeperErrorCode = Session expired
> >>>>>>>> 2013-06-05 05:12:53,970 INFO
> >>>>>>> org.apache.hadoop.hbase.master.ServerManager:
> >>>>>>>> Waiting for region servers count to settle; currently checked in
> 1,
> >>>>>> slept
> >>>>>>>> for 1506 ms, expecting minimum of 1, maximum of 2147483647,
> timeout
> >> of
> >>>>>>> 4500
> >>>>>>>> ms, interval of 1500 ms.
> >>>>>>>> 2013-06-05 05:12:55,476 INFO
> >>>>>>> org.apache.hadoop.hbase.master.ServerManager:
> >>>>>>>> Waiting for region servers count to settle; currently checked in
> 1,
> >>>>>> slept
> >>>>>>>> for 3012 ms, expecting minimum of 1, maximum of 2147483647,
> timeout
> >> of
> >>>>>>> 4500
> >>>>>>>> ms, interval of 1500 ms.
> >>>>>>>> 2013-06-05 05:12:56,981 INFO
> >>>>>>> org.apache.hadoop.hbase.master.ServerManager:
> >>>>>>>> Finished waiting for region servers count to settle; checked in 1,
> >>>>>> slept
> >>>>>>>> for 4517 ms, expecting minimum of 1, maximum of 2147483647, master
> >> is
> >>>>>>>> running.
> >>>>>>>> 2013-06-05 05:12:57,019 INFO
> >>>>>>>> org.apache.hadoop.hbase.catalog.CatalogTracker: Failed
> verification
> >> of
> >>>>>>>> -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
> >>>>>>>> java.io.EOFException
> >>>>>>>> 2013-06-05 05:17:52,302 WARN
> >>>>>>>> org.apache.hadoop.hbase.master.SplitLogManager: error while
> >> splitting
> >>>>>>> logs
> >>>>>>>> in [hdfs://
> >>
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
> >>>>>>> ]
> >>>>>>>> installed = 19 but only 0 done
> >>>>>>>> 2013-06-05 05:17:52,321 FATAL
> >> org.apache.hadoop.hbase.master.HMaster:
> >>>>>>>> master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000
> >> received
> >>>>>>>> expired from ZooKeeper, aborting
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>>>> KeeperErrorCode = Session expired
> >>>>>>>> java.io.IOException: Giving up after tries=1
> >>>>>>>> Caused by: java.lang.InterruptedException: sleep interrupted
> >>>>>>>> 2013-06-05 05:17:52,381 ERROR
> >>>>>>>> org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start
> >>>>>> master
> >>>>>>>> java.lang.RuntimeException: HMaster Aborted
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Thanks and Regards,
> >>>>>>>> Vimal Jain
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Thanks and Regards,
> >>>>>>> Vimal Jain
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Thanks and Regards,
> >>>>> Vimal Jain
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks and Regards,
> >>> Vimal Jain
> >
> >
> >
> > --
> > Thanks and Regards,
> > Vimal Jain
>



-- 
Thanks and Regards,
Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Ted Yu <yu...@gmail.com>.

Have you looked at NameNode log ?

The snippet you posted seems to imply issue with data block placement. 

Cheers

On Jun 5, 2013, at 4:12 AM, Vimal Jain <vk...@gmail.com> wrote:

> I am running Hbase in pseudo distributed mode . So there is only one
> machine involved.
> I am using  Hadoop version - 1.1.2 , Hbase version - 0.94.7
> 
> 
> On Wed, Jun 5, 2013 at 4:38 PM, Ted Yu <yu...@gmail.com> wrote:
> 
>> How many region servers / data nodes do you have ?
>> 
>> What Hadoop / HBase version are you using ?
>> 
>> Thanks
>> 
>> On Jun 5, 2013, at 3:54 AM, Vimal Jain <vk...@gmail.com> wrote:
>> 
>>> Yes.I did check those.
>>> But i am not sure if those parameter setting is the issue  , as there are
>>> some other exceptions in logs ( "DFSOutputStream ResponseProcessor
>>> exception " etc . )
>>> 
>>> 
>>> On Wed, Jun 5, 2013 at 4:19 PM, Ted Yu <yu...@gmail.com> wrote:
>>> 
>>>> There are a few tips under :
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 
>>>> Can you check ?
>>>> 
>>>> Thanks
>>>> 
>>>> On Jun 5, 2013, at 2:05 AM, Vimal Jain <vk...@gmail.com> wrote:
>>>> 
>>>>> I don't think so , as i dont find any issues in data node logs.
>>>>> Also there are lot of exceptions like "session expired" , "slept more
>>>> than
>>>>> configured time" . what are these ?
>>>>> 
>>>>> 
>>>>> On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com> wrote:
>>>>> 
>>>>>> Because your data node 192.168.20.30 broke down. which leads to RS
>> down.
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com> wrote:
>>>>>> 
>>>>>>> Here is the complete log:
>>>>>>> 
>>>>>>> http://bin.cakephp.org/saved/103001 - Hregion
>>>>>>> http://bin.cakephp.org/saved/103000 - Hmaster
>>>>>>> http://bin.cakephp.org/saved/103002 - Datanode
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com>
>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> I have set up Hbase in pseudo-distributed mode.
>>>>>>>> It was working fine for 6 days , but suddenly today morning both
>>>>>> HMaster
>>>>>>>> and Hregion process went down.
>>>>>>>> I checked in logs of both hadoop and hbase.
>>>>>>>> Please help here.
>>>>>>>> Here are the snippets :-
>>>>>>>> 
>>>>>>>> *Datanode logs:*
>>>>>>>> 2013-06-05 05:12:51,436 INFO
>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
>>>>>>> receiveBlock
>>>>>>>> for block blk_1597245478875608321_2818 java.io.EOFException: while
>>>>>> trying
>>>>>>>> to read 2347 bytes
>>>>>>>> 2013-06-05 05:12:51,442 INFO
>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>>>>>>>> blk_1597245478875608321_2818 received exception
>> java.io.EOFException:
>>>>>>> while
>>>>>>>> trying to read 2347 bytes
>>>>>>>> 2013-06-05 05:12:51,442 ERROR
>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(
>>>>>>>> 192.168.20.30:50010,
>>>>>>>> storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
>>>>>>> infoPort=50075,
>>>>>>>> ipcPort=50020):DataXceiver
>>>>>>>> java.io.EOFException: while trying to read 2347 bytes
>>>>>>>> 
>>>>>>>> 
>>>>>>>> *HRegion logs:*
>>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
>> We
>>>>>>>> slept 4694929ms instead of 3000ms, this is likely due to a long
>>>> garbage
>>>>>>>> collecting pause and it's usually bad, see
>>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
>>>>>>>> DFSOutputStream ResponseProcessor exception  for block
>>>>>>>> blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000
>>>>>> millis
>>>>>>>> timeout while waiting for channel to be ready for read. ch :
>>>>>>>> java.nio.channels.SocketChannel[connected local=/
>> 192.168.20.30:44333
>>>>>>> remote=/
>>>>>>>> 192.168.20.30:50010]
>>>>>>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper:
>> We
>>>>>>>> slept 11695345ms instead of 10000000ms, this is likely due to a long
>>>>>>>> garbage collecting pause and it's usually bad, see
>>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>>>>>> Recovery for block blk_1597245478875608321_2818 bad datanode[0]
>>>>>>>> 192.168.20.30:50010
>>>>>>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>>>>> while
>>>>>>>> syncing
>>>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
>>>>>>>> Aborting...
>>>>>>>>  at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>>>>>>>> 2013-06-05 05:12:51,110 FATAL
>>>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
>>>>>> Requesting
>>>>>>>> close of hlog
>>>>>>>> java.io.IOException: Reflection
>>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>>>>>> 2013-06-05 05:12:51,180 FATAL
>>>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
>>>>>> Requesting
>>>>>>>> close of hlog
>>>>>>>> java.io.IOException: Reflection
>>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>>>>>> 2013-06-05 05:12:51,183 ERROR
>>>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog
>>>>>>> writer
>>>>>>>> java.io.IOException: Reflection
>>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>>>>>> 2013-06-05 05:12:51,184 WARN
>>>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog
>> close
>>>>>>>> failure! error count=1
>>>>>>>> 2013-06-05 05:12:52,557 FATAL
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>>>>>>> server
>>>>>>>> hbase.rummycircle.com,60020,1369877672964:
>>>>>>>> regionserver:60020-0x13ef31264d00001
>>>>>> regionserver:60020-0x13ef31264d00001
>>>>>>>> received expired from ZooKeeper, aborting
>>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>>>> KeeperErrorCode = Session expired
>>>>>>>> 2013-06-05 05:12:52,557 FATAL
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer
>>>> abort:
>>>>>>>> loaded coprocessors are: []
>>>>>>>> 2013-06-05 05:12:52,621 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
>>>>>>>> interrupted while waiting for task, exiting:
>>>>>>> java.lang.InterruptedException
>>>>>>>> java.io.InterruptedIOException: Aborting compaction of store
>> cfp_info
>>>>>> in
>>>>>>>> region
>>>>>> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
>>>>>>>> because user requested stop.
>>>>>>>> 2013-06-05 05:12:53,425 WARN
>>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>>>>>> transient
>>>>>>>> ZooKeeper exception:
>>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>>>> KeeperErrorCode = Session expired for /hbase/rs/
>> hbase.rummycircle.com
>>>>>>>> ,60020,1369877672964
>>>>>>>> 2013-06-05 05:12:55,426 WARN
>>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>>>>>> transient
>>>>>>>> ZooKeeper exception:
>>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>>>> KeeperErrorCode = Session expired for /hbase/rs/
>> hbase.rummycircle.com
>>>>>>>> ,60020,1369877672964
>>>>>>>> 2013-06-05 05:12:59,427 WARN
>>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>>>>>> transient
>>>>>>>> ZooKeeper exception:
>>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>>>> KeeperErrorCode = Session expired for /hbase/rs/
>> hbase.rummycircle.com
>>>>>>>> ,60020,1369877672964
>>>>>>>> 2013-06-05 05:13:07,427 WARN
>>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>>>>>> transient
>>>>>>>> ZooKeeper exception:
>>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>>>> KeeperErrorCode = Session expired for /hbase/rs/
>> hbase.rummycircle.com
>>>>>>>> ,60020,1369877672964
>>>>>>>> 2013-06-05 05:13:07,427 ERROR
>>>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper
>>>>>> delete
>>>>>>>> failed after 3 retries
>>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>>>> KeeperErrorCode = Session expired for /hbase/rs/
>> hbase.rummycircle.com
>>>>>>>> ,60020,1369877672964
>>>>>>>>  at
>> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>>>>>>>  at
>>>>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>>>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient:
>>>>>> Exception
>>>>>>>> closing file /hbase/.logs/hbase.rummycircle.com
>> ,60020,1369877672964/
>>>>>>>> hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
>>>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
>>>>>>>> Aborting...
>>>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
>>>>>>>> Aborting...
>>>>>>>>  at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> *HMaster logs:*
>>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
>> We
>>>>>>>> slept 4702394ms instead of 10000ms, this is likely due to a long
>>>>>> garbage
>>>>>>>> collecting pause and it's usually bad, see
>>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
>> We
>>>>>>>> slept 4988731ms instead of 300000ms, this is likely due to a long
>>>>>> garbage
>>>>>>>> collecting pause and it's usually bad, see
>>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
>> We
>>>>>>>> slept 4988726ms instead of 300000ms, this is likely due to a long
>>>>>> garbage
>>>>>>>> collecting pause and it's usually bad, see
>>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
>> We
>>>>>>>> slept 4698291ms instead of 10000ms, this is likely due to a long
>>>>>> garbage
>>>>>>>> collecting pause and it's usually bad, see
>>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper:
>> We
>>>>>>>> slept 4694502ms instead of 1000ms, this is likely due to a long
>>>> garbage
>>>>>>>> collecting pause and it's usually bad, see
>>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper:
>> We
>>>>>>>> slept 4694492ms instead of 1000ms, this is likely due to a long
>>>> garbage
>>>>>>>> collecting pause and it's usually bad, see
>>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper:
>> We
>>>>>>>> slept 4695589ms instead of 60000ms, this is likely due to a long
>>>>>> garbage
>>>>>>>> collecting pause and it's usually bad, see
>>>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:52,263 FATAL
>> org.apache.hadoop.hbase.master.HMaster:
>>>>>>>> Master server abort: loaded coprocessors are: []
>>>>>>>> 2013-06-05 05:12:52,465 INFO
>>>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>> Waiting for region servers count to settle; currently checked in 1,
>>>>>> slept
>>>>>>>> for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of
>>>>>> 4500
>>>>>>>> ms, interval of 1500 ms.
>>>>>>>> 2013-06-05 05:12:52,561 ERROR
>> org.apache.hadoop.hbase.master.HMaster:
>>>>>>>> Region server hbase.rummycircle.com,60020,1369877672964 reported a
>>>>>> fatal
>>>>>>>> error:
>>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>>>> KeeperErrorCode = Session expired
>>>>>>>> 2013-06-05 05:12:53,970 INFO
>>>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>> Waiting for region servers count to settle; currently checked in 1,
>>>>>> slept
>>>>>>>> for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout
>> of
>>>>>>> 4500
>>>>>>>> ms, interval of 1500 ms.
>>>>>>>> 2013-06-05 05:12:55,476 INFO
>>>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>> Waiting for region servers count to settle; currently checked in 1,
>>>>>> slept
>>>>>>>> for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout
>> of
>>>>>>> 4500
>>>>>>>> ms, interval of 1500 ms.
>>>>>>>> 2013-06-05 05:12:56,981 INFO
>>>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>>>> Finished waiting for region servers count to settle; checked in 1,
>>>>>> slept
>>>>>>>> for 4517 ms, expecting minimum of 1, maximum of 2147483647, master
>> is
>>>>>>>> running.
>>>>>>>> 2013-06-05 05:12:57,019 INFO
>>>>>>>> org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification
>> of
>>>>>>>> -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
>>>>>>>> java.io.EOFException
>>>>>>>> 2013-06-05 05:17:52,302 WARN
>>>>>>>> org.apache.hadoop.hbase.master.SplitLogManager: error while
>> splitting
>>>>>>> logs
>>>>>>>> in [hdfs://
>> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
>>>>>>> ]
>>>>>>>> installed = 19 but only 0 done
>>>>>>>> 2013-06-05 05:17:52,321 FATAL
>> org.apache.hadoop.hbase.master.HMaster:
>>>>>>>> master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000
>> received
>>>>>>>> expired from ZooKeeper, aborting
>>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>>>> KeeperErrorCode = Session expired
>>>>>>>> java.io.IOException: Giving up after tries=1
>>>>>>>> Caused by: java.lang.InterruptedException: sleep interrupted
>>>>>>>> 2013-06-05 05:17:52,381 ERROR
>>>>>>>> org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start
>>>>>> master
>>>>>>>> java.lang.RuntimeException: HMaster Aborted
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Thanks and Regards,
>>>>>>>> Vimal Jain
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Thanks and Regards,
>>>>>>> Vimal Jain
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Thanks and Regards,
>>>>> Vimal Jain
>>> 
>>> 
>>> 
>>> --
>>> Thanks and Regards,
>>> Vimal Jain
> 
> 
> 
> -- 
> Thanks and Regards,
> Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Vimal Jain <vk...@gmail.com>.

I am running Hbase in pseudo distributed mode . So there is only one
machine involved.
I am using  Hadoop version - 1.1.2 , Hbase version - 0.94.7


On Wed, Jun 5, 2013 at 4:38 PM, Ted Yu <yu...@gmail.com> wrote:

> How many region servers / data nodes do you have ?
>
> What Hadoop / HBase version are you using ?
>
> Thanks
>
> On Jun 5, 2013, at 3:54 AM, Vimal Jain <vk...@gmail.com> wrote:
>
> > Yes.I did check those.
> > But i am not sure if those parameter setting is the issue  , as there are
> > some other exceptions in logs ( "DFSOutputStream ResponseProcessor
> > exception " etc . )
> >
> >
> > On Wed, Jun 5, 2013 at 4:19 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> >> There are a few tips under :
> >> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>
> >> Can you check ?
> >>
> >> Thanks
> >>
> >> On Jun 5, 2013, at 2:05 AM, Vimal Jain <vk...@gmail.com> wrote:
> >>
> >>> I don't think so , as i dont find any issues in data node logs.
> >>> Also there are lot of exceptions like "session expired" , "slept more
> >> than
> >>> configured time" . what are these ?
> >>>
> >>>
> >>> On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com> wrote:
> >>>
> >>>> Because your data node 192.168.20.30 broke down. which leads to RS
> down.
> >>>>
> >>>>
> >>>> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com> wrote:
> >>>>
> >>>>> Here is the complete log:
> >>>>>
> >>>>> http://bin.cakephp.org/saved/103001 - Hregion
> >>>>> http://bin.cakephp.org/saved/103000 - Hmaster
> >>>>> http://bin.cakephp.org/saved/103002 - Datanode
> >>>>>
> >>>>>
> >>>>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com>
> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>> I have set up Hbase in pseudo-distributed mode.
> >>>>>> It was working fine for 6 days , but suddenly today morning both
> >>>> HMaster
> >>>>>> and Hregion process went down.
> >>>>>> I checked in logs of both hadoop and hbase.
> >>>>>> Please help here.
> >>>>>> Here are the snippets :-
> >>>>>>
> >>>>>> *Datanode logs:*
> >>>>>> 2013-06-05 05:12:51,436 INFO
> >>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> >>>>> receiveBlock
> >>>>>> for block blk_1597245478875608321_2818 java.io.EOFException: while
> >>>> trying
> >>>>>> to read 2347 bytes
> >>>>>> 2013-06-05 05:12:51,442 INFO
> >>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> >>>>>> blk_1597245478875608321_2818 received exception
> java.io.EOFException:
> >>>>> while
> >>>>>> trying to read 2347 bytes
> >>>>>> 2013-06-05 05:12:51,442 ERROR
> >>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(
> >>>>>> 192.168.20.30:50010,
> >>>>>> storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> >>>>> infoPort=50075,
> >>>>>> ipcPort=50020):DataXceiver
> >>>>>> java.io.EOFException: while trying to read 2347 bytes
> >>>>>>
> >>>>>>
> >>>>>> *HRegion logs:*
> >>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> >>>>>> slept 4694929ms instead of 3000ms, this is likely due to a long
> >> garbage
> >>>>>> collecting pause and it's usually bad, see
> >>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
> >>>>>> DFSOutputStream ResponseProcessor exception  for block
> >>>>>> blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000
> >>>> millis
> >>>>>> timeout while waiting for channel to be ready for read. ch :
> >>>>>> java.nio.channels.SocketChannel[connected local=/
> 192.168.20.30:44333
> >>>>> remote=/
> >>>>>> 192.168.20.30:50010]
> >>>>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> >>>>>> slept 11695345ms instead of 10000000ms, this is likely due to a long
> >>>>>> garbage collecting pause and it's usually bad, see
> >>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
> >>>>>> Recovery for block blk_1597245478875608321_2818 bad datanode[0]
> >>>>>> 192.168.20.30:50010
> >>>>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error
> >>>>> while
> >>>>>> syncing
> >>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> >>>>>> Aborting...
> >>>>>>   at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>>>> 2013-06-05 05:12:51,110 FATAL
> >>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> >>>> Requesting
> >>>>>> close of hlog
> >>>>>> java.io.IOException: Reflection
> >>>>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>>>> 2013-06-05 05:12:51,180 FATAL
> >>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> >>>> Requesting
> >>>>>> close of hlog
> >>>>>> java.io.IOException: Reflection
> >>>>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>>>> 2013-06-05 05:12:51,183 ERROR
> >>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog
> >>>>> writer
> >>>>>> java.io.IOException: Reflection
> >>>>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>>>> 2013-06-05 05:12:51,184 WARN
> >>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog
> close
> >>>>>> failure! error count=1
> >>>>>> 2013-06-05 05:12:52,557 FATAL
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> >>>>> server
> >>>>>> hbase.rummycircle.com,60020,1369877672964:
> >>>>>> regionserver:60020-0x13ef31264d00001
> >>>> regionserver:60020-0x13ef31264d00001
> >>>>>> received expired from ZooKeeper, aborting
> >>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>> KeeperErrorCode = Session expired
> >>>>>> 2013-06-05 05:12:52,557 FATAL
> >>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer
> >> abort:
> >>>>>> loaded coprocessors are: []
> >>>>>> 2013-06-05 05:12:52,621 INFO
> >>>>>> org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
> >>>>>> interrupted while waiting for task, exiting:
> >>>>> java.lang.InterruptedException
> >>>>>> java.io.InterruptedIOException: Aborting compaction of store
> cfp_info
> >>>> in
> >>>>>> region
> >>>> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> >>>>>> because user requested stop.
> >>>>>> 2013-06-05 05:12:53,425 WARN
> >>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> >>>>> transient
> >>>>>> ZooKeeper exception:
> >>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>> KeeperErrorCode = Session expired for /hbase/rs/
> hbase.rummycircle.com
> >>>>>> ,60020,1369877672964
> >>>>>> 2013-06-05 05:12:55,426 WARN
> >>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> >>>>> transient
> >>>>>> ZooKeeper exception:
> >>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>> KeeperErrorCode = Session expired for /hbase/rs/
> hbase.rummycircle.com
> >>>>>> ,60020,1369877672964
> >>>>>> 2013-06-05 05:12:59,427 WARN
> >>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> >>>>> transient
> >>>>>> ZooKeeper exception:
> >>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>> KeeperErrorCode = Session expired for /hbase/rs/
> hbase.rummycircle.com
> >>>>>> ,60020,1369877672964
> >>>>>> 2013-06-05 05:13:07,427 WARN
> >>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> >>>>> transient
> >>>>>> ZooKeeper exception:
> >>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>> KeeperErrorCode = Session expired for /hbase/rs/
> hbase.rummycircle.com
> >>>>>> ,60020,1369877672964
> >>>>>> 2013-06-05 05:13:07,427 ERROR
> >>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper
> >>>> delete
> >>>>>> failed after 3 retries
> >>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>> KeeperErrorCode = Session expired for /hbase/rs/
> hbase.rummycircle.com
> >>>>>> ,60020,1369877672964
> >>>>>>   at
> >>>>>>
> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >>>>>>   at
> >>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >>>>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient:
> >>>> Exception
> >>>>>> closing file /hbase/.logs/hbase.rummycircle.com
> ,60020,1369877672964/
> >>>>>> hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
> >>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> >>>>>> Aborting...
> >>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> >>>>>> Aborting...
> >>>>>>   at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>>>>
> >>>>>>
> >>>>>> *HMaster logs:*
> >>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> >>>>>> slept 4702394ms instead of 10000ms, this is likely due to a long
> >>>> garbage
> >>>>>> collecting pause and it's usually bad, see
> >>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> >>>>>> slept 4988731ms instead of 300000ms, this is likely due to a long
> >>>> garbage
> >>>>>> collecting pause and it's usually bad, see
> >>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> >>>>>> slept 4988726ms instead of 300000ms, this is likely due to a long
> >>>> garbage
> >>>>>> collecting pause and it's usually bad, see
> >>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> >>>>>> slept 4698291ms instead of 10000ms, this is likely due to a long
> >>>> garbage
> >>>>>> collecting pause and it's usually bad, see
> >>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> >>>>>> slept 4694502ms instead of 1000ms, this is likely due to a long
> >> garbage
> >>>>>> collecting pause and it's usually bad, see
> >>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> >>>>>> slept 4694492ms instead of 1000ms, this is likely due to a long
> >> garbage
> >>>>>> collecting pause and it's usually bad, see
> >>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper:
> We
> >>>>>> slept 4695589ms instead of 60000ms, this is likely due to a long
> >>>> garbage
> >>>>>> collecting pause and it's usually bad, see
> >>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>> 2013-06-05 05:12:52,263 FATAL
> org.apache.hadoop.hbase.master.HMaster:
> >>>>>> Master server abort: loaded coprocessors are: []
> >>>>>> 2013-06-05 05:12:52,465 INFO
> >>>>> org.apache.hadoop.hbase.master.ServerManager:
> >>>>>> Waiting for region servers count to settle; currently checked in 1,
> >>>> slept
> >>>>>> for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of
> >>>> 4500
> >>>>>> ms, interval of 1500 ms.
> >>>>>> 2013-06-05 05:12:52,561 ERROR
> org.apache.hadoop.hbase.master.HMaster:
> >>>>>> Region server hbase.rummycircle.com,60020,1369877672964 reported a
> >>>> fatal
> >>>>>> error:
> >>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>> KeeperErrorCode = Session expired
> >>>>>> 2013-06-05 05:12:53,970 INFO
> >>>>> org.apache.hadoop.hbase.master.ServerManager:
> >>>>>> Waiting for region servers count to settle; currently checked in 1,
> >>>> slept
> >>>>>> for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout
> of
> >>>>> 4500
> >>>>>> ms, interval of 1500 ms.
> >>>>>> 2013-06-05 05:12:55,476 INFO
> >>>>> org.apache.hadoop.hbase.master.ServerManager:
> >>>>>> Waiting for region servers count to settle; currently checked in 1,
> >>>> slept
> >>>>>> for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout
> of
> >>>>> 4500
> >>>>>> ms, interval of 1500 ms.
> >>>>>> 2013-06-05 05:12:56,981 INFO
> >>>>> org.apache.hadoop.hbase.master.ServerManager:
> >>>>>> Finished waiting for region servers count to settle; checked in 1,
> >>>> slept
> >>>>>> for 4517 ms, expecting minimum of 1, maximum of 2147483647, master
> is
> >>>>>> running.
> >>>>>> 2013-06-05 05:12:57,019 INFO
> >>>>>> org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification
> of
> >>>>>> -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
> >>>>>> java.io.EOFException
> >>>>>> 2013-06-05 05:17:52,302 WARN
> >>>>>> org.apache.hadoop.hbase.master.SplitLogManager: error while
> splitting
> >>>>> logs
> >>>>>> in [hdfs://
> >>
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
> >>>>> ]
> >>>>>> installed = 19 but only 0 done
> >>>>>> 2013-06-05 05:17:52,321 FATAL
> org.apache.hadoop.hbase.master.HMaster:
> >>>>>> master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000
> received
> >>>>>> expired from ZooKeeper, aborting
> >>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>>>> KeeperErrorCode = Session expired
> >>>>>> java.io.IOException: Giving up after tries=1
> >>>>>> Caused by: java.lang.InterruptedException: sleep interrupted
> >>>>>> 2013-06-05 05:17:52,381 ERROR
> >>>>>> org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start
> >>>> master
> >>>>>> java.lang.RuntimeException: HMaster Aborted
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Thanks and Regards,
> >>>>>> Vimal Jain
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Thanks and Regards,
> >>>>> Vimal Jain
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks and Regards,
> >>> Vimal Jain
> >
> >
> >
> > --
> > Thanks and Regards,
> > Vimal Jain
>



-- 
Thanks and Regards,
Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Ted Yu <yu...@gmail.com>.

How many region servers / data nodes do you have ?

What Hadoop / HBase version are you using ?

Thanks

On Jun 5, 2013, at 3:54 AM, Vimal Jain <vk...@gmail.com> wrote:

> Yes.I did check those.
> But i am not sure if those parameter setting is the issue  , as there are
> some other exceptions in logs ( "DFSOutputStream ResponseProcessor
> exception " etc . )
> 
> 
> On Wed, Jun 5, 2013 at 4:19 PM, Ted Yu <yu...@gmail.com> wrote:
> 
>> There are a few tips under :
>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> 
>> Can you check ?
>> 
>> Thanks
>> 
>> On Jun 5, 2013, at 2:05 AM, Vimal Jain <vk...@gmail.com> wrote:
>> 
>>> I don't think so , as i dont find any issues in data node logs.
>>> Also there are lot of exceptions like "session expired" , "slept more
>> than
>>> configured time" . what are these ?
>>> 
>>> 
>>> On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com> wrote:
>>> 
>>>> Because your data node 192.168.20.30 broke down. which leads to RS down.
>>>> 
>>>> 
>>>> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com> wrote:
>>>> 
>>>>> Here is the complete log:
>>>>> 
>>>>> http://bin.cakephp.org/saved/103001 - Hregion
>>>>> http://bin.cakephp.org/saved/103000 - Hmaster
>>>>> http://bin.cakephp.org/saved/103002 - Datanode
>>>>> 
>>>>> 
>>>>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> I have set up Hbase in pseudo-distributed mode.
>>>>>> It was working fine for 6 days , but suddenly today morning both
>>>> HMaster
>>>>>> and Hregion process went down.
>>>>>> I checked in logs of both hadoop and hbase.
>>>>>> Please help here.
>>>>>> Here are the snippets :-
>>>>>> 
>>>>>> *Datanode logs:*
>>>>>> 2013-06-05 05:12:51,436 INFO
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
>>>>> receiveBlock
>>>>>> for block blk_1597245478875608321_2818 java.io.EOFException: while
>>>> trying
>>>>>> to read 2347 bytes
>>>>>> 2013-06-05 05:12:51,442 INFO
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>>>>>> blk_1597245478875608321_2818 received exception java.io.EOFException:
>>>>> while
>>>>>> trying to read 2347 bytes
>>>>>> 2013-06-05 05:12:51,442 ERROR
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>>>>>> 192.168.20.30:50010,
>>>>>> storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
>>>>> infoPort=50075,
>>>>>> ipcPort=50020):DataXceiver
>>>>>> java.io.EOFException: while trying to read 2347 bytes
>>>>>> 
>>>>>> 
>>>>>> *HRegion logs:*
>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>>>> slept 4694929ms instead of 3000ms, this is likely due to a long
>> garbage
>>>>>> collecting pause and it's usually bad, see
>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
>>>>>> DFSOutputStream ResponseProcessor exception  for block
>>>>>> blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000
>>>> millis
>>>>>> timeout while waiting for channel to be ready for read. ch :
>>>>>> java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333
>>>>> remote=/
>>>>>> 192.168.20.30:50010]
>>>>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>>>> slept 11695345ms instead of 10000000ms, this is likely due to a long
>>>>>> garbage collecting pause and it's usually bad, see
>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>>>> Recovery for block blk_1597245478875608321_2818 bad datanode[0]
>>>>>> 192.168.20.30:50010
>>>>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>>> while
>>>>>> syncing
>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
>>>>>> Aborting...
>>>>>>   at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>>>>>> 2013-06-05 05:12:51,110 FATAL
>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
>>>> Requesting
>>>>>> close of hlog
>>>>>> java.io.IOException: Reflection
>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>>>> 2013-06-05 05:12:51,180 FATAL
>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
>>>> Requesting
>>>>>> close of hlog
>>>>>> java.io.IOException: Reflection
>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>>>> 2013-06-05 05:12:51,183 ERROR
>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog
>>>>> writer
>>>>>> java.io.IOException: Reflection
>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>>>> 2013-06-05 05:12:51,184 WARN
>>>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close
>>>>>> failure! error count=1
>>>>>> 2013-06-05 05:12:52,557 FATAL
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>>>>> server
>>>>>> hbase.rummycircle.com,60020,1369877672964:
>>>>>> regionserver:60020-0x13ef31264d00001
>>>> regionserver:60020-0x13ef31264d00001
>>>>>> received expired from ZooKeeper, aborting
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>> KeeperErrorCode = Session expired
>>>>>> 2013-06-05 05:12:52,557 FATAL
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer
>> abort:
>>>>>> loaded coprocessors are: []
>>>>>> 2013-06-05 05:12:52,621 INFO
>>>>>> org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
>>>>>> interrupted while waiting for task, exiting:
>>>>> java.lang.InterruptedException
>>>>>> java.io.InterruptedIOException: Aborting compaction of store cfp_info
>>>> in
>>>>>> region
>>>> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
>>>>>> because user requested stop.
>>>>>> 2013-06-05 05:12:53,425 WARN
>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>>>> transient
>>>>>> ZooKeeper exception:
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
>>>>>> ,60020,1369877672964
>>>>>> 2013-06-05 05:12:55,426 WARN
>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>>>> transient
>>>>>> ZooKeeper exception:
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
>>>>>> ,60020,1369877672964
>>>>>> 2013-06-05 05:12:59,427 WARN
>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>>>> transient
>>>>>> ZooKeeper exception:
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
>>>>>> ,60020,1369877672964
>>>>>> 2013-06-05 05:13:07,427 WARN
>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>>>> transient
>>>>>> ZooKeeper exception:
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
>>>>>> ,60020,1369877672964
>>>>>> 2013-06-05 05:13:07,427 ERROR
>>>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper
>>>> delete
>>>>>> failed after 3 retries
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
>>>>>> ,60020,1369877672964
>>>>>>   at
>>>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>>>>>   at
>>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient:
>>>> Exception
>>>>>> closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/
>>>>>> hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
>>>>>> Aborting...
>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
>>>>>> Aborting...
>>>>>>   at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>>>>>> 
>>>>>> 
>>>>>> *HMaster logs:*
>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>>>> slept 4702394ms instead of 10000ms, this is likely due to a long
>>>> garbage
>>>>>> collecting pause and it's usually bad, see
>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>>>> slept 4988731ms instead of 300000ms, this is likely due to a long
>>>> garbage
>>>>>> collecting pause and it's usually bad, see
>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>>>> slept 4988726ms instead of 300000ms, this is likely due to a long
>>>> garbage
>>>>>> collecting pause and it's usually bad, see
>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>>>> slept 4698291ms instead of 10000ms, this is likely due to a long
>>>> garbage
>>>>>> collecting pause and it's usually bad, see
>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>>>> slept 4694502ms instead of 1000ms, this is likely due to a long
>> garbage
>>>>>> collecting pause and it's usually bad, see
>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>>>> slept 4694492ms instead of 1000ms, this is likely due to a long
>> garbage
>>>>>> collecting pause and it's usually bad, see
>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>>>> slept 4695589ms instead of 60000ms, this is likely due to a long
>>>> garbage
>>>>>> collecting pause and it's usually bad, see
>>>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster:
>>>>>> Master server abort: loaded coprocessors are: []
>>>>>> 2013-06-05 05:12:52,465 INFO
>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>> Waiting for region servers count to settle; currently checked in 1,
>>>> slept
>>>>>> for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of
>>>> 4500
>>>>>> ms, interval of 1500 ms.
>>>>>> 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster:
>>>>>> Region server hbase.rummycircle.com,60020,1369877672964 reported a
>>>> fatal
>>>>>> error:
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>> KeeperErrorCode = Session expired
>>>>>> 2013-06-05 05:12:53,970 INFO
>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>> Waiting for region servers count to settle; currently checked in 1,
>>>> slept
>>>>>> for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of
>>>>> 4500
>>>>>> ms, interval of 1500 ms.
>>>>>> 2013-06-05 05:12:55,476 INFO
>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>> Waiting for region servers count to settle; currently checked in 1,
>>>> slept
>>>>>> for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of
>>>>> 4500
>>>>>> ms, interval of 1500 ms.
>>>>>> 2013-06-05 05:12:56,981 INFO
>>>>> org.apache.hadoop.hbase.master.ServerManager:
>>>>>> Finished waiting for region servers count to settle; checked in 1,
>>>> slept
>>>>>> for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is
>>>>>> running.
>>>>>> 2013-06-05 05:12:57,019 INFO
>>>>>> org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of
>>>>>> -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
>>>>>> java.io.EOFException
>>>>>> 2013-06-05 05:17:52,302 WARN
>>>>>> org.apache.hadoop.hbase.master.SplitLogManager: error while splitting
>>>>> logs
>>>>>> in [hdfs://
>> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
>>>>> ]
>>>>>> installed = 19 but only 0 done
>>>>>> 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster:
>>>>>> master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received
>>>>>> expired from ZooKeeper, aborting
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>>>> KeeperErrorCode = Session expired
>>>>>> java.io.IOException: Giving up after tries=1
>>>>>> Caused by: java.lang.InterruptedException: sleep interrupted
>>>>>> 2013-06-05 05:17:52,381 ERROR
>>>>>> org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start
>>>> master
>>>>>> java.lang.RuntimeException: HMaster Aborted
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Thanks and Regards,
>>>>>> Vimal Jain
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Thanks and Regards,
>>>>> Vimal Jain
>>> 
>>> 
>>> 
>>> --
>>> Thanks and Regards,
>>> Vimal Jain
> 
> 
> 
> -- 
> Thanks and Regards,
> Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Vimal Jain <vk...@gmail.com>.

Yes.I did check those.
But i am not sure if those parameter setting is the issue  , as there are
some other exceptions in logs ( "DFSOutputStream ResponseProcessor
exception " etc . )


On Wed, Jun 5, 2013 at 4:19 PM, Ted Yu <yu...@gmail.com> wrote:

> There are a few tips under :
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>
> Can you check ?
>
> Thanks
>
> On Jun 5, 2013, at 2:05 AM, Vimal Jain <vk...@gmail.com> wrote:
>
> > I don't think so , as i dont find any issues in data node logs.
> > Also there are lot of exceptions like "session expired" , "slept more
> than
> > configured time" . what are these ?
> >
> >
> > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com> wrote:
> >
> >> Because your data node 192.168.20.30 broke down. which leads to RS down.
> >>
> >>
> >> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com> wrote:
> >>
> >>> Here is the complete log:
> >>>
> >>> http://bin.cakephp.org/saved/103001 - Hregion
> >>> http://bin.cakephp.org/saved/103000 - Hmaster
> >>> http://bin.cakephp.org/saved/103002 - Datanode
> >>>
> >>>
> >>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com> wrote:
> >>>
> >>>> Hi,
> >>>> I have set up Hbase in pseudo-distributed mode.
> >>>> It was working fine for 6 days , but suddenly today morning both
> >> HMaster
> >>>> and Hregion process went down.
> >>>> I checked in logs of both hadoop and hbase.
> >>>> Please help here.
> >>>> Here are the snippets :-
> >>>>
> >>>> *Datanode logs:*
> >>>> 2013-06-05 05:12:51,436 INFO
> >>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> >>> receiveBlock
> >>>> for block blk_1597245478875608321_2818 java.io.EOFException: while
> >> trying
> >>>> to read 2347 bytes
> >>>> 2013-06-05 05:12:51,442 INFO
> >>>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> >>>> blk_1597245478875608321_2818 received exception java.io.EOFException:
> >>> while
> >>>> trying to read 2347 bytes
> >>>> 2013-06-05 05:12:51,442 ERROR
> >>>> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> >>>> 192.168.20.30:50010,
> >>>> storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> >>> infoPort=50075,
> >>>> ipcPort=50020):DataXceiver
> >>>> java.io.EOFException: while trying to read 2347 bytes
> >>>>
> >>>>
> >>>> *HRegion logs:*
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >>>> slept 4694929ms instead of 3000ms, this is likely due to a long
> garbage
> >>>> collecting pause and it's usually bad, see
> >>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
> >>>> DFSOutputStream ResponseProcessor exception  for block
> >>>> blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000
> >> millis
> >>>> timeout while waiting for channel to be ready for read. ch :
> >>>> java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333
> >>> remote=/
> >>>> 192.168.20.30:50010]
> >>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >>>> slept 11695345ms instead of 10000000ms, this is likely due to a long
> >>>> garbage collecting pause and it's usually bad, see
> >>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
> >>>> Recovery for block blk_1597245478875608321_2818 bad datanode[0]
> >>>> 192.168.20.30:50010
> >>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error
> >>> while
> >>>> syncing
> >>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> >>>> Aborting...
> >>>>    at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>> 2013-06-05 05:12:51,110 FATAL
> >>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> >> Requesting
> >>>> close of hlog
> >>>> java.io.IOException: Reflection
> >>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>> 2013-06-05 05:12:51,180 FATAL
> >>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> >> Requesting
> >>>> close of hlog
> >>>> java.io.IOException: Reflection
> >>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>> 2013-06-05 05:12:51,183 ERROR
> >>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog
> >>> writer
> >>>> java.io.IOException: Reflection
> >>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>> 2013-06-05 05:12:51,184 WARN
> >>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close
> >>>> failure! error count=1
> >>>> 2013-06-05 05:12:52,557 FATAL
> >>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> >>> server
> >>>> hbase.rummycircle.com,60020,1369877672964:
> >>>> regionserver:60020-0x13ef31264d00001
> >> regionserver:60020-0x13ef31264d00001
> >>>> received expired from ZooKeeper, aborting
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>> KeeperErrorCode = Session expired
> >>>> 2013-06-05 05:12:52,557 FATAL
> >>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer
> abort:
> >>>> loaded coprocessors are: []
> >>>> 2013-06-05 05:12:52,621 INFO
> >>>> org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
> >>>> interrupted while waiting for task, exiting:
> >>> java.lang.InterruptedException
> >>>> java.io.InterruptedIOException: Aborting compaction of store cfp_info
> >> in
> >>>> region
> >> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> >>>> because user requested stop.
> >>>> 2013-06-05 05:12:53,425 WARN
> >>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> >>> transient
> >>>> ZooKeeper exception:
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> >>>> ,60020,1369877672964
> >>>> 2013-06-05 05:12:55,426 WARN
> >>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> >>> transient
> >>>> ZooKeeper exception:
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> >>>> ,60020,1369877672964
> >>>> 2013-06-05 05:12:59,427 WARN
> >>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> >>> transient
> >>>> ZooKeeper exception:
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> >>>> ,60020,1369877672964
> >>>> 2013-06-05 05:13:07,427 WARN
> >>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> >>> transient
> >>>> ZooKeeper exception:
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> >>>> ,60020,1369877672964
> >>>> 2013-06-05 05:13:07,427 ERROR
> >>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper
> >> delete
> >>>> failed after 3 retries
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> >>>> ,60020,1369877672964
> >>>>    at
> >>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >>>>    at
> >>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient:
> >> Exception
> >>>> closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/
> >>>> hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
> >>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> >>>> Aborting...
> >>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> >>>> Aborting...
> >>>>    at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>>
> >>>>
> >>>> *HMaster logs:*
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >>>> slept 4702394ms instead of 10000ms, this is likely due to a long
> >> garbage
> >>>> collecting pause and it's usually bad, see
> >>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >>>> slept 4988731ms instead of 300000ms, this is likely due to a long
> >> garbage
> >>>> collecting pause and it's usually bad, see
> >>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >>>> slept 4988726ms instead of 300000ms, this is likely due to a long
> >> garbage
> >>>> collecting pause and it's usually bad, see
> >>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >>>> slept 4698291ms instead of 10000ms, this is likely due to a long
> >> garbage
> >>>> collecting pause and it's usually bad, see
> >>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >>>> slept 4694502ms instead of 1000ms, this is likely due to a long
> garbage
> >>>> collecting pause and it's usually bad, see
> >>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >>>> slept 4694492ms instead of 1000ms, this is likely due to a long
> garbage
> >>>> collecting pause and it's usually bad, see
> >>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >>>> slept 4695589ms instead of 60000ms, this is likely due to a long
> >> garbage
> >>>> collecting pause and it's usually bad, see
> >>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>> 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster:
> >>>> Master server abort: loaded coprocessors are: []
> >>>> 2013-06-05 05:12:52,465 INFO
> >>> org.apache.hadoop.hbase.master.ServerManager:
> >>>> Waiting for region servers count to settle; currently checked in 1,
> >> slept
> >>>> for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of
> >> 4500
> >>>> ms, interval of 1500 ms.
> >>>> 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster:
> >>>> Region server hbase.rummycircle.com,60020,1369877672964 reported a
> >> fatal
> >>>> error:
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>> KeeperErrorCode = Session expired
> >>>> 2013-06-05 05:12:53,970 INFO
> >>> org.apache.hadoop.hbase.master.ServerManager:
> >>>> Waiting for region servers count to settle; currently checked in 1,
> >> slept
> >>>> for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of
> >>> 4500
> >>>> ms, interval of 1500 ms.
> >>>> 2013-06-05 05:12:55,476 INFO
> >>> org.apache.hadoop.hbase.master.ServerManager:
> >>>> Waiting for region servers count to settle; currently checked in 1,
> >> slept
> >>>> for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of
> >>> 4500
> >>>> ms, interval of 1500 ms.
> >>>> 2013-06-05 05:12:56,981 INFO
> >>> org.apache.hadoop.hbase.master.ServerManager:
> >>>> Finished waiting for region servers count to settle; checked in 1,
> >> slept
> >>>> for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is
> >>>> running.
> >>>> 2013-06-05 05:12:57,019 INFO
> >>>> org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of
> >>>> -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
> >>>> java.io.EOFException
> >>>> 2013-06-05 05:17:52,302 WARN
> >>>> org.apache.hadoop.hbase.master.SplitLogManager: error while splitting
> >>> logs
> >>>> in [hdfs://
> >>
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
> >>> ]
> >>>> installed = 19 but only 0 done
> >>>> 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster:
> >>>> master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received
> >>>> expired from ZooKeeper, aborting
> >>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >>>> KeeperErrorCode = Session expired
> >>>> java.io.IOException: Giving up after tries=1
> >>>> Caused by: java.lang.InterruptedException: sleep interrupted
> >>>> 2013-06-05 05:17:52,381 ERROR
> >>>> org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start
> >> master
> >>>> java.lang.RuntimeException: HMaster Aborted
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Thanks and Regards,
> >>>> Vimal Jain
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks and Regards,
> >>> Vimal Jain
> >
> >
> >
> > --
> > Thanks and Regards,
> > Vimal Jain
>



-- 
Thanks and Regards,
Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Ted Yu <yu...@gmail.com>.

There are a few tips under :
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

Can you check ?

Thanks

On Jun 5, 2013, at 2:05 AM, Vimal Jain <vk...@gmail.com> wrote:

> I don't think so , as i dont find any issues in data node logs.
> Also there are lot of exceptions like "session expired" , "slept more than
> configured time" . what are these ?
> 
> 
> On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com> wrote:
> 
>> Because your data node 192.168.20.30 broke down. which leads to RS down.
>> 
>> 
>> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com> wrote:
>> 
>>> Here is the complete log:
>>> 
>>> http://bin.cakephp.org/saved/103001 - Hregion
>>> http://bin.cakephp.org/saved/103000 - Hmaster
>>> http://bin.cakephp.org/saved/103002 - Datanode
>>> 
>>> 
>>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com> wrote:
>>> 
>>>> Hi,
>>>> I have set up Hbase in pseudo-distributed mode.
>>>> It was working fine for 6 days , but suddenly today morning both
>> HMaster
>>>> and Hregion process went down.
>>>> I checked in logs of both hadoop and hbase.
>>>> Please help here.
>>>> Here are the snippets :-
>>>> 
>>>> *Datanode logs:*
>>>> 2013-06-05 05:12:51,436 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
>>> receiveBlock
>>>> for block blk_1597245478875608321_2818 java.io.EOFException: while
>> trying
>>>> to read 2347 bytes
>>>> 2013-06-05 05:12:51,442 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>>>> blk_1597245478875608321_2818 received exception java.io.EOFException:
>>> while
>>>> trying to read 2347 bytes
>>>> 2013-06-05 05:12:51,442 ERROR
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>>>> 192.168.20.30:50010,
>>>> storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
>>> infoPort=50075,
>>>> ipcPort=50020):DataXceiver
>>>> java.io.EOFException: while trying to read 2347 bytes
>>>> 
>>>> 
>>>> *HRegion logs:*
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4694929ms instead of 3000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
>>>> DFSOutputStream ResponseProcessor exception  for block
>>>> blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000
>> millis
>>>> timeout while waiting for channel to be ready for read. ch :
>>>> java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333
>>> remote=/
>>>> 192.168.20.30:50010]
>>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 11695345ms instead of 10000000ms, this is likely due to a long
>>>> garbage collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>>> Recovery for block blk_1597245478875608321_2818 bad datanode[0]
>>>> 192.168.20.30:50010
>>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>> while
>>>> syncing
>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
>>>> Aborting...
>>>>    at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>>>> 2013-06-05 05:12:51,110 FATAL
>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
>> Requesting
>>>> close of hlog
>>>> java.io.IOException: Reflection
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>> 2013-06-05 05:12:51,180 FATAL
>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
>> Requesting
>>>> close of hlog
>>>> java.io.IOException: Reflection
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>> 2013-06-05 05:12:51,183 ERROR
>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog
>>> writer
>>>> java.io.IOException: Reflection
>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>> 2013-06-05 05:12:51,184 WARN
>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close
>>>> failure! error count=1
>>>> 2013-06-05 05:12:52,557 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>>> server
>>>> hbase.rummycircle.com,60020,1369877672964:
>>>> regionserver:60020-0x13ef31264d00001
>> regionserver:60020-0x13ef31264d00001
>>>> received expired from ZooKeeper, aborting
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>> 2013-06-05 05:12:52,557 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
>>>> loaded coprocessors are: []
>>>> 2013-06-05 05:12:52,621 INFO
>>>> org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
>>>> interrupted while waiting for task, exiting:
>>> java.lang.InterruptedException
>>>> java.io.InterruptedIOException: Aborting compaction of store cfp_info
>> in
>>>> region
>> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
>>>> because user requested stop.
>>>> 2013-06-05 05:12:53,425 WARN
>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>> transient
>>>> ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
>>>> ,60020,1369877672964
>>>> 2013-06-05 05:12:55,426 WARN
>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>> transient
>>>> ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
>>>> ,60020,1369877672964
>>>> 2013-06-05 05:12:59,427 WARN
>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>> transient
>>>> ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
>>>> ,60020,1369877672964
>>>> 2013-06-05 05:13:07,427 WARN
>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
>>> transient
>>>> ZooKeeper exception:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
>>>> ,60020,1369877672964
>>>> 2013-06-05 05:13:07,427 ERROR
>>>> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper
>> delete
>>>> failed after 3 retries
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
>>>> ,60020,1369877672964
>>>>    at
>>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>>>    at
>>> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient:
>> Exception
>>>> closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/
>>>> hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
>>>> Aborting...
>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
>>>> Aborting...
>>>>    at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>>>> 
>>>> 
>>>> *HMaster logs:*
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4702394ms instead of 10000ms, this is likely due to a long
>> garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4988731ms instead of 300000ms, this is likely due to a long
>> garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4988726ms instead of 300000ms, this is likely due to a long
>> garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4698291ms instead of 10000ms, this is likely due to a long
>> garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4694502ms instead of 1000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4694492ms instead of 1000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 4695589ms instead of 60000ms, this is likely due to a long
>> garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster:
>>>> Master server abort: loaded coprocessors are: []
>>>> 2013-06-05 05:12:52,465 INFO
>>> org.apache.hadoop.hbase.master.ServerManager:
>>>> Waiting for region servers count to settle; currently checked in 1,
>> slept
>>>> for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of
>> 4500
>>>> ms, interval of 1500 ms.
>>>> 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster:
>>>> Region server hbase.rummycircle.com,60020,1369877672964 reported a
>> fatal
>>>> error:
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>> 2013-06-05 05:12:53,970 INFO
>>> org.apache.hadoop.hbase.master.ServerManager:
>>>> Waiting for region servers count to settle; currently checked in 1,
>> slept
>>>> for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of
>>> 4500
>>>> ms, interval of 1500 ms.
>>>> 2013-06-05 05:12:55,476 INFO
>>> org.apache.hadoop.hbase.master.ServerManager:
>>>> Waiting for region servers count to settle; currently checked in 1,
>> slept
>>>> for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of
>>> 4500
>>>> ms, interval of 1500 ms.
>>>> 2013-06-05 05:12:56,981 INFO
>>> org.apache.hadoop.hbase.master.ServerManager:
>>>> Finished waiting for region servers count to settle; checked in 1,
>> slept
>>>> for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is
>>>> running.
>>>> 2013-06-05 05:12:57,019 INFO
>>>> org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of
>>>> -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
>>>> java.io.EOFException
>>>> 2013-06-05 05:17:52,302 WARN
>>>> org.apache.hadoop.hbase.master.SplitLogManager: error while splitting
>>> logs
>>>> in [hdfs://
>> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
>>> ]
>>>> installed = 19 but only 0 done
>>>> 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster:
>>>> master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received
>>>> expired from ZooKeeper, aborting
>>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>>> KeeperErrorCode = Session expired
>>>> java.io.IOException: Giving up after tries=1
>>>> Caused by: java.lang.InterruptedException: sleep interrupted
>>>> 2013-06-05 05:17:52,381 ERROR
>>>> org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start
>> master
>>>> java.lang.RuntimeException: HMaster Aborted
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Thanks and Regards,
>>>> Vimal Jain
>>> 
>>> 
>>> 
>>> --
>>> Thanks and Regards,
>>> Vimal Jain
> 
> 
> 
> -- 
> Thanks and Regards,
> Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Vimal Jain <vk...@gmail.com>.

I don't think so , as i dont find any issues in data node logs.
Also there are lot of exceptions like "session expired" , "slept more than
configured time" . what are these ?


On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <az...@gmail.com> wrote:

> Because your data node 192.168.20.30 broke down. which leads to RS down.
>
>
> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com> wrote:
>
> > Here is the complete log:
> >
> > http://bin.cakephp.org/saved/103001 - Hregion
> > http://bin.cakephp.org/saved/103000 - Hmaster
> > http://bin.cakephp.org/saved/103002 - Datanode
> >
> >
> > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com> wrote:
> >
> > > Hi,
> > > I have set up Hbase in pseudo-distributed mode.
> > > It was working fine for 6 days , but suddenly today morning both
> HMaster
> > > and Hregion process went down.
> > > I checked in logs of both hadoop and hbase.
> > > Please help here.
> > > Here are the snippets :-
> > >
> > > *Datanode logs:*
> > > 2013-06-05 05:12:51,436 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> > receiveBlock
> > > for block blk_1597245478875608321_2818 java.io.EOFException: while
> trying
> > > to read 2347 bytes
> > > 2013-06-05 05:12:51,442 INFO
> > > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > > blk_1597245478875608321_2818 received exception java.io.EOFException:
> > while
> > > trying to read 2347 bytes
> > > 2013-06-05 05:12:51,442 ERROR
> > > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > > 192.168.20.30:50010,
> > > storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> > infoPort=50075,
> > > ipcPort=50020):DataXceiver
> > > java.io.EOFException: while trying to read 2347 bytes
> > >
> > >
> > > *HRegion logs:*
> > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > slept 4694929ms instead of 3000ms, this is likely due to a long garbage
> > > collecting pause and it's usually bad, see
> > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
> > > DFSOutputStream ResponseProcessor exception  for block
> > > blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000
> millis
> > > timeout while waiting for channel to be ready for read. ch :
> > > java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333
> > remote=/
> > > 192.168.20.30:50010]
> > > 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > slept 11695345ms instead of 10000000ms, this is likely due to a long
> > > garbage collecting pause and it's usually bad, see
> > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > > Recovery for block blk_1597245478875608321_2818 bad datanode[0]
> > > 192.168.20.30:50010
> > > 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > while
> > > syncing
> > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > Aborting...
> > >     at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > 2013-06-05 05:12:51,110 FATAL
> > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> Requesting
> > > close of hlog
> > > java.io.IOException: Reflection
> > > Caused by: java.lang.reflect.InvocationTargetException
> > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > 2013-06-05 05:12:51,180 FATAL
> > > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync.
> Requesting
> > > close of hlog
> > > java.io.IOException: Reflection
> > > Caused by: java.lang.reflect.InvocationTargetException
> > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > 2013-06-05 05:12:51,183 ERROR
> > > org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog
> > writer
> > > java.io.IOException: Reflection
> > > Caused by: java.lang.reflect.InvocationTargetException
> > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > 2013-06-05 05:12:51,184 WARN
> > > org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close
> > > failure! error count=1
> > > 2013-06-05 05:12:52,557 FATAL
> > > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> > server
> > > hbase.rummycircle.com,60020,1369877672964:
> > > regionserver:60020-0x13ef31264d00001
> regionserver:60020-0x13ef31264d00001
> > > received expired from ZooKeeper, aborting
> > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > KeeperErrorCode = Session expired
> > > 2013-06-05 05:12:52,557 FATAL
> > > org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
> > > loaded coprocessors are: []
> > > 2013-06-05 05:12:52,621 INFO
> > > org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
> > > interrupted while waiting for task, exiting:
> > java.lang.InterruptedException
> > > java.io.InterruptedIOException: Aborting compaction of store cfp_info
> in
> > > region
> event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> > > because user requested stop.
> > > 2013-06-05 05:12:53,425 WARN
> > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > transient
> > > ZooKeeper exception:
> > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> > > ,60020,1369877672964
> > > 2013-06-05 05:12:55,426 WARN
> > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > transient
> > > ZooKeeper exception:
> > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> > > ,60020,1369877672964
> > > 2013-06-05 05:12:59,427 WARN
> > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > transient
> > > ZooKeeper exception:
> > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> > > ,60020,1369877672964
> > > 2013-06-05 05:13:07,427 WARN
> > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> > transient
> > > ZooKeeper exception:
> > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> > > ,60020,1369877672964
> > > 2013-06-05 05:13:07,427 ERROR
> > > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper
> delete
> > > failed after 3 retries
> > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> > > ,60020,1369877672964
> > >     at
> > > org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> > >     at
> > org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient:
> Exception
> > > closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/
> > > hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
> > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > Aborting...
> > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > > Aborting...
> > >     at
> > >
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > >
> > >
> > > *HMaster logs:*
> > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > slept 4702394ms instead of 10000ms, this is likely due to a long
> garbage
> > > collecting pause and it's usually bad, see
> > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > slept 4988731ms instead of 300000ms, this is likely due to a long
> garbage
> > > collecting pause and it's usually bad, see
> > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > slept 4988726ms instead of 300000ms, this is likely due to a long
> garbage
> > > collecting pause and it's usually bad, see
> > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > slept 4698291ms instead of 10000ms, this is likely due to a long
> garbage
> > > collecting pause and it's usually bad, see
> > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > slept 4694502ms instead of 1000ms, this is likely due to a long garbage
> > > collecting pause and it's usually bad, see
> > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > slept 4694492ms instead of 1000ms, this is likely due to a long garbage
> > > collecting pause and it's usually bad, see
> > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > > slept 4695589ms instead of 60000ms, this is likely due to a long
> garbage
> > > collecting pause and it's usually bad, see
> > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster:
> > > Master server abort: loaded coprocessors are: []
> > > 2013-06-05 05:12:52,465 INFO
> > org.apache.hadoop.hbase.master.ServerManager:
> > > Waiting for region servers count to settle; currently checked in 1,
> slept
> > > for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of
> 4500
> > > ms, interval of 1500 ms.
> > > 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster:
> > > Region server hbase.rummycircle.com,60020,1369877672964 reported a
> fatal
> > > error:
> > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > KeeperErrorCode = Session expired
> > > 2013-06-05 05:12:53,970 INFO
> > org.apache.hadoop.hbase.master.ServerManager:
> > > Waiting for region servers count to settle; currently checked in 1,
> slept
> > > for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of
> > 4500
> > > ms, interval of 1500 ms.
> > > 2013-06-05 05:12:55,476 INFO
> > org.apache.hadoop.hbase.master.ServerManager:
> > > Waiting for region servers count to settle; currently checked in 1,
> slept
> > > for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of
> > 4500
> > > ms, interval of 1500 ms.
> > > 2013-06-05 05:12:56,981 INFO
> > org.apache.hadoop.hbase.master.ServerManager:
> > > Finished waiting for region servers count to settle; checked in 1,
> slept
> > > for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is
> > > running.
> > > 2013-06-05 05:12:57,019 INFO
> > > org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of
> > > -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
> > > java.io.EOFException
> > > 2013-06-05 05:17:52,302 WARN
> > > org.apache.hadoop.hbase.master.SplitLogManager: error while splitting
> > logs
> > > in [hdfs://
> > >
> >
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
> > ]
> > > installed = 19 but only 0 done
> > > 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster:
> > > master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received
> > > expired from ZooKeeper, aborting
> > > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > > KeeperErrorCode = Session expired
> > > java.io.IOException: Giving up after tries=1
> > > Caused by: java.lang.InterruptedException: sleep interrupted
> > > 2013-06-05 05:17:52,381 ERROR
> > > org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start
> master
> > > java.lang.RuntimeException: HMaster Aborted
> > >
> > >
> > >
> > > --
> > > Thanks and Regards,
> > > Vimal Jain
> > >
> >
> >
> >
> > --
> > Thanks and Regards,
> > Vimal Jain
> >
>



-- 
Thanks and Regards,
Vimal Jain

Re: HMaster and HRegionServer going down

Posted by Azuryy Yu <az...@gmail.com>.

Because your data node 192.168.20.30 broke down. which leads to RS down.


On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vk...@gmail.com> wrote:

> Here is the complete log:
>
> http://bin.cakephp.org/saved/103001 - Hregion
> http://bin.cakephp.org/saved/103000 - Hmaster
> http://bin.cakephp.org/saved/103002 - Datanode
>
>
> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com> wrote:
>
> > Hi,
> > I have set up Hbase in pseudo-distributed mode.
> > It was working fine for 6 days , but suddenly today morning both HMaster
> > and Hregion process went down.
> > I checked in logs of both hadoop and hbase.
> > Please help here.
> > Here are the snippets :-
> >
> > *Datanode logs:*
> > 2013-06-05 05:12:51,436 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> receiveBlock
> > for block blk_1597245478875608321_2818 java.io.EOFException: while trying
> > to read 2347 bytes
> > 2013-06-05 05:12:51,442 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> > blk_1597245478875608321_2818 received exception java.io.EOFException:
> while
> > trying to read 2347 bytes
> > 2013-06-05 05:12:51,442 ERROR
> > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > 192.168.20.30:50010,
> > storageID=DS-1816106352-192.168.20.30-50010-1369314076237,
> infoPort=50075,
> > ipcPort=50020):DataXceiver
> > java.io.EOFException: while trying to read 2347 bytes
> >
> >
> > *HRegion logs:*
> > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > slept 4694929ms instead of 3000ms, this is likely due to a long garbage
> > collecting pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception  for block
> > blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000 millis
> > timeout while waiting for channel to be ready for read. ch :
> > java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333
> remote=/
> > 192.168.20.30:50010]
> > 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > slept 11695345ms instead of 10000000ms, this is likely due to a long
> > garbage collecting pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_1597245478875608321_2818 bad datanode[0]
> > 192.168.20.30:50010
> > 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error
> while
> > syncing
> > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > Aborting...
> >     at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > 2013-06-05 05:12:51,110 FATAL
> > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting
> > close of hlog
> > java.io.IOException: Reflection
> > Caused by: java.lang.reflect.InvocationTargetException
> > Caused by: java.io.IOException: DFSOutputStream is closed
> > 2013-06-05 05:12:51,180 FATAL
> > org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting
> > close of hlog
> > java.io.IOException: Reflection
> > Caused by: java.lang.reflect.InvocationTargetException
> > Caused by: java.io.IOException: DFSOutputStream is closed
> > 2013-06-05 05:12:51,183 ERROR
> > org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog
> writer
> > java.io.IOException: Reflection
> > Caused by: java.lang.reflect.InvocationTargetException
> > Caused by: java.io.IOException: DFSOutputStream is closed
> > 2013-06-05 05:12:51,184 WARN
> > org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close
> > failure! error count=1
> > 2013-06-05 05:12:52,557 FATAL
> > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> server
> > hbase.rummycircle.com,60020,1369877672964:
> > regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001
> > received expired from ZooKeeper, aborting
> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > KeeperErrorCode = Session expired
> > 2013-06-05 05:12:52,557 FATAL
> > org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
> > loaded coprocessors are: []
> > 2013-06-05 05:12:52,621 INFO
> > org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
> > interrupted while waiting for task, exiting:
> java.lang.InterruptedException
> > java.io.InterruptedIOException: Aborting compaction of store cfp_info in
> > region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> > because user requested stop.
> > 2013-06-05 05:12:53,425 WARN
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> transient
> > ZooKeeper exception:
> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> > ,60020,1369877672964
> > 2013-06-05 05:12:55,426 WARN
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> transient
> > ZooKeeper exception:
> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> > ,60020,1369877672964
> > 2013-06-05 05:12:59,427 WARN
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> transient
> > ZooKeeper exception:
> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> > ,60020,1369877672964
> > 2013-06-05 05:13:07,427 WARN
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly
> transient
> > ZooKeeper exception:
> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> > ,60020,1369877672964
> > 2013-06-05 05:13:07,427 ERROR
> > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete
> > failed after 3 retries
> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> > ,60020,1369877672964
> >     at
> > org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >     at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception
> > closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/
> > hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
> > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > Aborting...
> > java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> > Aborting...
> >     at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >
> >
> > *HMaster logs:*
> > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > slept 4702394ms instead of 10000ms, this is likely due to a long garbage
> > collecting pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > slept 4988731ms instead of 300000ms, this is likely due to a long garbage
> > collecting pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > slept 4988726ms instead of 300000ms, this is likely due to a long garbage
> > collecting pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > slept 4698291ms instead of 10000ms, this is likely due to a long garbage
> > collecting pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > slept 4694502ms instead of 1000ms, this is likely due to a long garbage
> > collecting pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > slept 4694492ms instead of 1000ms, this is likely due to a long garbage
> > collecting pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We
> > slept 4695589ms instead of 60000ms, this is likely due to a long garbage
> > collecting pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster:
> > Master server abort: loaded coprocessors are: []
> > 2013-06-05 05:12:52,465 INFO
> org.apache.hadoop.hbase.master.ServerManager:
> > Waiting for region servers count to settle; currently checked in 1, slept
> > for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
> > ms, interval of 1500 ms.
> > 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster:
> > Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal
> > error:
> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > KeeperErrorCode = Session expired
> > 2013-06-05 05:12:53,970 INFO
> org.apache.hadoop.hbase.master.ServerManager:
> > Waiting for region servers count to settle; currently checked in 1, slept
> > for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of
> 4500
> > ms, interval of 1500 ms.
> > 2013-06-05 05:12:55,476 INFO
> org.apache.hadoop.hbase.master.ServerManager:
> > Waiting for region servers count to settle; currently checked in 1, slept
> > for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of
> 4500
> > ms, interval of 1500 ms.
> > 2013-06-05 05:12:56,981 INFO
> org.apache.hadoop.hbase.master.ServerManager:
> > Finished waiting for region servers count to settle; checked in 1, slept
> > for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is
> > running.
> > 2013-06-05 05:12:57,019 INFO
> > org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of
> > -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
> > java.io.EOFException
> > 2013-06-05 05:17:52,302 WARN
> > org.apache.hadoop.hbase.master.SplitLogManager: error while splitting
> logs
> > in [hdfs://
> >
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting
> ]
> > installed = 19 but only 0 done
> > 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster:
> > master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received
> > expired from ZooKeeper, aborting
> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> > KeeperErrorCode = Session expired
> > java.io.IOException: Giving up after tries=1
> > Caused by: java.lang.InterruptedException: sleep interrupted
> > 2013-06-05 05:17:52,381 ERROR
> > org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> > java.lang.RuntimeException: HMaster Aborted
> >
> >
> >
> > --
> > Thanks and Regards,
> > Vimal Jain
> >
>
>
>
> --
> Thanks and Regards,
> Vimal Jain
>

Re: HMaster and HRegionServer going down

Posted by Vimal Jain <vk...@gmail.com>.

Here is the complete log:

http://bin.cakephp.org/saved/103001 - Hregion
http://bin.cakephp.org/saved/103000 - Hmaster
http://bin.cakephp.org/saved/103002 - Datanode


On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vk...@gmail.com> wrote:

> Hi,
> I have set up Hbase in pseudo-distributed mode.
> It was working fine for 6 days , but suddenly today morning both HMaster
> and Hregion process went down.
> I checked in logs of both hadoop and hbase.
> Please help here.
> Here are the snippets :-
>
> *Datanode logs:*
> 2013-06-05 05:12:51,436 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock
> for block blk_1597245478875608321_2818 java.io.EOFException: while trying
> to read 2347 bytes
> 2013-06-05 05:12:51,442 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
> blk_1597245478875608321_2818 received exception java.io.EOFException: while
> trying to read 2347 bytes
> 2013-06-05 05:12:51,442 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 192.168.20.30:50010,
> storageID=DS-1816106352-192.168.20.30-50010-1369314076237, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.EOFException: while trying to read 2347 bytes
>
>
> *HRegion logs:*
> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 4694929ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_1597245478875608321_2818java.net.SocketTimeoutException: 63000 millis
> timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333remote=/
> 192.168.20.30:50010]
> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 11695345ms instead of 10000000ms, this is likely due to a long
> garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_1597245478875608321_2818 bad datanode[0]
> 192.168.20.30:50010
> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error while
> syncing
> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> Aborting...
>     at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> 2013-06-05 05:12:51,110 FATAL
> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting
> close of hlog
> java.io.IOException: Reflection
> Caused by: java.lang.reflect.InvocationTargetException
> Caused by: java.io.IOException: DFSOutputStream is closed
> 2013-06-05 05:12:51,180 FATAL
> org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting
> close of hlog
> java.io.IOException: Reflection
> Caused by: java.lang.reflect.InvocationTargetException
> Caused by: java.io.IOException: DFSOutputStream is closed
> 2013-06-05 05:12:51,183 ERROR
> org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
> java.io.IOException: Reflection
> Caused by: java.lang.reflect.InvocationTargetException
> Caused by: java.io.IOException: DFSOutputStream is closed
> 2013-06-05 05:12:51,184 WARN
> org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close
> failure! error count=1
> 2013-06-05 05:12:52,557 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> hbase.rummycircle.com,60020,1369877672964:
> regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001
> received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> 2013-06-05 05:12:52,557 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort:
> loaded coprocessors are: []
> 2013-06-05 05:12:52,621 INFO
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker
> interrupted while waiting for task, exiting: java.lang.InterruptedException
> java.io.InterruptedIOException: Aborting compaction of store cfp_info in
> region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e.
> because user requested stop.
> 2013-06-05 05:12:53,425 WARN
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> ZooKeeper exception:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> ,60020,1369877672964
> 2013-06-05 05:12:55,426 WARN
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> ZooKeeper exception:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> ,60020,1369877672964
> 2013-06-05 05:12:59,427 WARN
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> ZooKeeper exception:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> ,60020,1369877672964
> 2013-06-05 05:13:07,427 WARN
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient
> ZooKeeper exception:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> ,60020,1369877672964
> 2013-06-05 05:13:07,427 ERROR
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete
> failed after 3 retries
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com
> ,60020,1369877672964
>     at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception
> closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/
> hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 :
> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> Aborting...
> java.io.IOException: All datanodes 192.168.20.30:50010 are bad.
> Aborting...
>     at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>
>
> *HMaster logs:*
> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 4702394ms instead of 10000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 4988731ms instead of 300000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 4988726ms instead of 300000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 4698291ms instead of 10000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 4694502ms instead of 1000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 4694492ms instead of 1000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 4695589ms instead of 60000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster:
> Master server abort: loaded coprocessors are: []
> 2013-06-05 05:12:52,465 INFO org.apache.hadoop.hbase.master.ServerManager:
> Waiting for region servers count to settle; currently checked in 1, slept
> for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
> ms, interval of 1500 ms.
> 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster:
> Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal
> error:
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> 2013-06-05 05:12:53,970 INFO org.apache.hadoop.hbase.master.ServerManager:
> Waiting for region servers count to settle; currently checked in 1, slept
> for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
> ms, interval of 1500 ms.
> 2013-06-05 05:12:55,476 INFO org.apache.hadoop.hbase.master.ServerManager:
> Waiting for region servers count to settle; currently checked in 1, slept
> for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500
> ms, interval of 1500 ms.
> 2013-06-05 05:12:56,981 INFO org.apache.hadoop.hbase.master.ServerManager:
> Finished waiting for region servers count to settle; checked in 1, slept
> for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is
> running.
> 2013-06-05 05:12:57,019 INFO
> org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of
> -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964;
> java.io.EOFException
> 2013-06-05 05:17:52,302 WARN
> org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs
> in [hdfs://
> 192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting]
> installed = 19 but only 0 done
> 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster:
> master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received
> expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
> java.io.IOException: Giving up after tries=1
> Caused by: java.lang.InterruptedException: sleep interrupted
> 2013-06-05 05:17:52,381 ERROR
> org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> java.lang.RuntimeException: HMaster Aborted
>
>
>
> --
> Thanks and Regards,
> Vimal Jain
>



-- 
Thanks and Regards,
Vimal Jain