Posted to user@hbase.apache.org by Sandy Pratt <pr...@adobe.com> on 2012/01/03 22:37:22 UTC

RE: RegionServer dying every two or three days

11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect

It looks like the process has been unresponsive for some time, so ZK has terminated the session.  Did you experience a long GC pause right before this?  If you don't have GC logging enabled for the RS, you can sometimes tell by noticing a gap in the timestamps of the log statements leading up to the crash.
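
A quick way to spot such a gap is to diff consecutive timestamps in the RS log; a rough sketch (the log path and the 30-second threshold are placeholders for your own setup):

  # assumes RS log lines start with "yy/mm/dd HH:MM:SS"
  awk '$2 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ {
         split($2, t, ":"); s = t[1]*3600 + t[2]*60 + t[3];
         if (prev != "" && s - prev > 30) print (s - prev) "s gap before: " $0;
         prev = s }' /usr/lib/hbase/logs/hbase-regionserver.log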

If it turns out to be GC, you might want to look at your kernel swappiness setting (set it to 0) and your JVM params.
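
Concretely, that tuning looks something like the following (paths and the exact flag set are only a sketch to adapt, not a drop-in config):

  # kernel: keep the RegionServer heap out of swap
  sysctl -w vm.swappiness=0
  echo "vm.swappiness = 0" >> /etc/sysctl.conf   # persist across reboots

  # hbase-env.sh: turn on GC logging for the RegionServer JVM
  export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
    -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log"

With the GC log in place you can line pauses up against the ZK session expiry directly instead of inferring them from gaps.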

Sandy


> -----Original Message-----
> From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> Sent: Thursday, December 29, 2011 07:44
> To: user@hbase.apache.org
> Subject: RegionServer dying every two or three days
> 
> Hi,
> 
> I have an HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 Slaves),
> running on Amazon EC2. The master is a High-Memory Extra Large Instance
> (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The
> slaves are Extra Large Instances (m1.xlarge) running Datanode, TaskTracker,
> RegionServer and Zookeeper.
> 
> From time to time, every two or three days, one of the RegionServer
> processes goes down, but the other processes (DataNode, TaskTracker,
> Zookeeper) continue normally.
> 
> Reading the logs:
> 
> The connection with Zookeeper timed out:
> 
> ---------------------------
> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
> ---------------------------
> 
> And the Handlers start to fail:
> 
> ---------------------------
> 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
> 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
>         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> 
> 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
> 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
>         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> ---------------------------
> 
> And finally the server throws a YouAreDeadException :( :
> 
> ---------------------------
> 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
> 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
> 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
>         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
>         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> 
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>         at $Proxy6.regionServerReport(Unknown Source)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
>         ... 2 more
> 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: requests=66, regions=206, stores=2078, storefiles=970, storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, flushQueueSize=0, usedHeap=1672, maxHeap=4083, blockCacheSize=705907552, blockCacheFree=150412064, blockCacheCount=10648, blockCacheHitCount=79578618, blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> ---------------------------
> 
> Then I restart the RegionServer and everything goes back to normal.
> Reading the DataNode, Zookeeper and TaskTracker logs, I don't see any
> abnormality in the same time window.
> I think it was caused by the loss of the connection to Zookeeper. Is it
> advisable to run Zookeeper on the same machines?
> If the RegionServer loses its connection to Zookeeper, is there a way (a
> configuration perhaps) for it to re-join the cluster instead of just dying?
> 
> Any idea what is causing this, or how to prevent it from happening?
> 
> Any help is appreciated.
> 
> Best Regards,
> 
> --
> 
> *Leonardo Gamas*
> Software Engineer
> +557134943514
> +557581347440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br

Re: RegionServer dying every two or three days

Posted by Leonardo Gamas <le...@jusbrasil.com.br>.
Thanks Matt for pointing out the problems that can happen, I will look into
it. And thanks Neil for sharing more details about your infrastructure, it
has been a great help.
I will run some tests with these instances and make my choice according to
our needs. Thank you all for your time. :)
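
For the record, here is roughly how I read Matt's 4G suggestion (quoted below) in config terms; the values are only a sketch of the 1G block cache / 1G memstore split, not something we have benchmarked:

  # hbase-env.sh (assuming a 4G RegionServer heap)
  export HBASE_HEAPSIZE=4096

  <!-- hbase-site.xml: roughly 1G of the 4G heap for the block cache -->
  <property>
    <name>hfile.block.cache.size</name>
    <value>0.25</value>
  </property>
  <!-- roughly 1G as the upper bound for the memstores -->
  <property>
    <name>hbase.regionserver.global.memstore.upperLimit</name>
    <value>0.25</value>
  </property>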

2012/1/24 Neil Yalowitz <ne...@gmail.com>

> Hi Leonardo, excuse the late response.
>
> I read the link that Matt sent below RE: instance type and hardware
> isolation a while ago and struggled with the same problem with c1.xlarge
> and memory.  Another issue there, as Matt mentions network throughput, is
> the type of network connection.  We decided to go with cluster compute
> instances (cc1.4xlarge) instead since the larger memory and fatter pipe
> (10Gbit) suited our needs (more MR daemons/children and large DB rows,
> respectively).  The c1.xlarge also seemed like a bad match as the best
> trait of that instance, the CPU units, aren't really our bottleneck (it's
> more an issue of RAM and I/O).
>
> While the cluster compute instances improved performance and stability
> somewhat, the pain hasn't stopped there.  Creating/terminating instances
> seems to be a lottery, possibly due to bad neighbors on the physical host
> or network.  Some cluster instances are rock solid for days and weeks while
> we run our tests, others are problematic within hours of creation despite
> having an identical setup.  Even with the cluster compute instances, we
> have test clusters where we will run benchmarks, wipe the data, and rerun
> the benchmarks with wildly different performance (off by 400%).
>  Occasionally, an instance will become unresponsive to pings and SSH and
> will completely fall out of the cluster.
>
> It seems the strategy for EC2 deployment is to expect everything to fail
> and plan accordingly.  It hasn't been a good experience.
>
>
>
> Neil Yalowitz
>
> On Mon, Jan 23, 2012 at 1:37 PM, Matt Corgan <mc...@hotpads.com> wrote:
>
> > You could always try going with a little smaller heap and see how it
> works
> > for your particular workload, maybe 4G.  1G block cache, 1G memstores,
> ~1G
> > GC overhead(?), leaving 1G for active program data.
> >
> > If trying to squeeze memory, you should be aware there is a limitation in
> > 0.90 where storefile indexes come out of that remaining 1G as opposed to
> > being stored in the block cache.  If you have big indexes, you would need
> > to shrink block cache and memstore limits to compensate.
> >
> >
> http://search-hadoop.com/m/OH4cT1LiN4Q1/corgan&subj=Re+a+question+storefileIndexSize
> >
> >
> > On Mon, Jan 23, 2012 at 4:32 AM, Leonardo Gamas
> > <le...@jusbrasil.com.br>wrote:
> >
> > > Thanks again Matt! I will try out this instance type, but i'm concerned
> > > about the MapReduce cluster running apart from HBase in my case, since
> we
> > > have some MapReduces running and planning to run more. Feels like
> losing
> > > the great strength of MapReduce, by running it far from data.
> > >
> > > 2012/1/21 Matt Corgan <mc...@hotpads.com>
> > >
> > > > We actually don't run map/reduce on the same machines (most of our
> jobs
> > > are
> > > > on an old message based system), so don't have much experience there.
> >  We
> > > > run only HDFS (1G heap) and HBase (5.5G heap) with 12 * 100GB EBS
> > volumes
> > > > per regionserver, and ~350 regions/server at the moment.  5.5G is
> > > already a
> > > > small heap in the hbase world, so I wouldn't recommend decreasing it
> to
> > > fit
> > > > M/R,  You could always run map/reduce on separate servers, adding or
> > > > removing servers as needed (more at night?), or use Amazon's Elastic
> > M/R.
> > > >
> > > >
> > > > On Sat, Jan 21, 2012 at 5:04 AM, Leonardo Gamas
> > > > <le...@jusbrasil.com.br>wrote:
> > > >
> > > > > Thanks Matt for this insightful article, I will run my cluster with
> > > > > c1.xlarge to test it's performance. But i'm concerned with this
> > > machine,
> > > > > because the amount of RAM available, only 7GB. How many map/reduce
> > > slots
> > > > do
> > > > > you configure? And the amount of Heap for HBase? How many regions
> per
> > > > > RegionServer could my cluster support?
> > > > >
> > > > > 2012/1/20 Matt Corgan <mc...@hotpads.com>
> > > > >
> > > > > > I run c1.xlarge servers and have found them very stable.  I see
> 100
> > > > > Mbit/s
> > > > > > sustained bi-directional network throughput (200Mbit/s total),
> > > > sometimes
> > > > > up
> > > > > > to 150 * 2 Mbit/s.
> > > > > >
> > > > > > Here's a pretty thorough examination of the underlying hardware:
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> http://huanliu.wordpress.com/2010/06/14/amazons-physical-hardware-and-ec2-compute-unit/
> > > > > >
> > > > > >
> > > > > > *High-CPU instances*
> > > > > >
> > > > > > The high-CPU instances (c1.medium, c1.xlarge) run on systems with
> > > > > > dual-socket Intel Xeon E5410 2.33GHz processors. It is
> dual-socket
> > > > > because
> > > > > > we see APIC IDs 0 to 7, and E5410 only has 4 cores. A c1.xlarge
> > > > instance
> > > > > > almost takes up the whole physical machine. However, we
> frequently
> > > > > observe
> > > > > > steal cycle on a c1.xlarge instance ranging from 0% to 25% with
> an
> > > > > average
> > > > > > of about 10%. The amount of steal cycle is not enough to host
> > another
> > > > > > smaller VM, i.e., a c1.medium. Maybe those steal cycles are used
> to
> > > run
> > > > > > Amazon’s software firewall (security group). On Passmark-CPU
> mark,
> > a
> > > > > > c1.xlarge machine achieves 7,962.6, actually higher than an
> average
> > > > > > dual-sock E5410 system is able to achieve (average is 6,903).
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Jan 20, 2012 at 8:03 AM, Leonardo Gamas
> > > > > > <le...@jusbrasil.com.br>wrote:
> > > > > >
> > > > > > > Thanks Neil for sharing your experience with AWS! Could you
> tell
> > > what
> > > > > > > instance type are you using?
> > > > > > > We are using m1.xlarge, that has 4 virtual cores, but i
> normally
> > > see
> > > > > > > recommendations for machines with 8 cores like c1.xlarge,
> > > m2.4xlarge,
> > > > > > etc.
> > > > > > > In principle these 8-core machines don't suffer too much with
> I/O
> > > > > > problems
> > > > > > > since they don't share the physical server. Is there any piece
> of
> > > > > > > information from Amazon or other source that affirms that or
> it's
> > > > based
> > > > > > in
> > > > > > > empirical analysis?
> > > > > > >
> > > > > > > 2012/1/19 Neil Yalowitz <ne...@gmail.com>
> > > > > > >
> > > > > > > > We have experienced many problems with our cluster on EC2.
>  The
> > > > blunt
> > > > > > > > solution was to increase the Zookeeper timeout to 5 minutes
> or
> > > even
> > > > > > more.
> > > > > > > >
> > > > > > > > Even with a long timeout, however, it's not uncommon for us
> to
> > > see
> > > > an
> > > > > > EC2
> > > > > > > > instance to become unresponsive to pings and SSH several
> times
> > > > > during a
> > > > > > > > week.  It's been a very bad environment for clusters.
> > > > > > > >
> > > > > > > >
> > > > > > > > Neil
> > > > > > > >
> > > > > > > > On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas
> > > > > > > > <le...@jusbrasil.com.br>wrote:
> > > > > > > >
> > > > > > > > > Hi Guys,
> > > > > > > > >
> > > > > > > > > I have tested the parameters provided by Sandy, and it
> solved
> > > the
> > > > > GC
> > > > > > > > > problems with the -XX:+UseParallelOldGC, thanks for the
> help
> > > > Sandy.
> > > > > > > > > I'm still experiencing some difficulties, the RegionServer
> > > > > continues
> > > > > > to
> > > > > > > > > shutdown, but it seems related to I/O. It starts to timeout
> > > many
> > > > > > > > > connections, new connections to/from the machine timeout
> too,
> > > and
> > > > > > > finally
> > > > > > > > > the RegionServer dies because of YouAreDeadException. I
> will
> > > > > collect
> > > > > > > more
> > > > > > > > > data, but i think it's an Amazon/Virtualized Environment
> > > inherent
> > > > > > > issue.
> > > > > > > > >
> > > > > > > > > Thanks for the great help provided so far.
> > > > > > > > >
> > > > > > > > > 2012/1/5 Leonardo Gamas <le...@jusbrasil.com.br>
> > > > > > > > >
> > > > > > > > > > I don't think so, if Amazon stopped the machine it would
> > > cause
> > > > a
> > > > > > stop
> > > > > > > > of
> > > > > > > > > > minutes, not seconds, and since the DataNode, TaskTracker
> > and
> > > > > > > Zookeeper
> > > > > > > > > > continue to work normally.
> > > > > > > > > > But it can be related to the shared environment nature of
> > > > Amazon,
> > > > > > > maybe
> > > > > > > > > > some spike in I/O caused by another virtualized server in
> > the
> > > > > same
> > > > > > > > > physical
> > > > > > > > > > machine.
> > > > > > > > > >
> > > > > > > > > > But the intance type i'm using:
> > > > > > > > > >
> > > > > > > > > > *Extra Large Instance*
> > > > > > > > > >
> > > > > > > > > > 15 GB memory
> > > > > > > > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute
> > Units
> > > > > each)
> > > > > > > > > > 1,690 GB instance storage
> > > > > > > > > > 64-bit platform
> > > > > > > > > > I/O Performance: High
> > > > > > > > > > API name: m1.xlarge
> > > > > > > > > > I was not expecting to suffer from this problems, or at
> > least
> > > > not
> > > > > > > much.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> > > > > > > > > >
> > > > > > > > > >> You think it's an Amazon problem maybe?  Like they
> paused
> > or
> > > > > > > migrated
> > > > > > > > > >> your virtual machine, and it just happens to be during
> GC,
> > > > > leaving
> > > > > > > us
> > > > > > > > to
> > > > > > > > > >> think the GC ran long when it didn't?  I don't have a
> lot
> > of
> > > > > > > > experience
> > > > > > > > > >> with Amazon so I don't know if that sort of thing is
> > common.
> > > > > > > > > >>
> > > > > > > > > >> > -----Original Message-----
> > > > > > > > > >> > From: Leonardo Gamas [mailto:
> leogamas@jusbrasil.com.br]
> > > > > > > > > >> > Sent: Thursday, January 05, 2012 13:15
> > > > > > > > > >> > To: user@hbase.apache.org
> > > > > > > > > >> > Subject: Re: RegionServer dying every two or three
> days
> > > > > > > > > >> >
> > > > > > > > > >> > I checked the CPU Utilization graphics provided by
> > Amazon
> > > > > (it's
> > > > > > > not
> > > > > > > > > >> accurate,
> > > > > > > > > >> > since the sample time is about 5 minutes) and don't
> see
> > > any
> > > > > > > > > >> abnormality. I
> > > > > > > > > >> > will setup TSDB with Nagios to have a more reliable
> > source
> > > > of
> > > > > > > > > >> performance
> > > > > > > > > >> > data.
> > > > > > > > > >> >
> > > > > > > > > >> > The machines don't have swap space, if i run:
> > > > > > > > > >> >
> > > > > > > > > >> > $ swapon -s
> > > > > > > > > >> >
> > > > > > > > > >> > To display swap usage summary, it returns an empty
> list.
> > > > > > > > > >> >
> > > > > > > > > >> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> in
> > > my
> > > > to
> > > > > > > > tests.
> > > > > > > > > >> >
> > > > > > > > > >> > I don't have payed much attention to the value of the
> > new
> > > > size
> > > > > > > > param.
> > > > > > > > > >> >
> > > > > > > > > >> > Thanks again for the help!!
> > > > > > > > > >> >
> > > > > > > > > >> > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> > > > > > > > > >> >
> > > > > > > > > >> > > That size heap doesn't seem like it should cause a
> 36
> > > > second
> > > > > > GC
> > > > > > > (a
> > > > > > > > > >> > > minor GC even if I remember your logs correctly),
> so I
> > > > tend
> > > > > to
> > > > > > > > think
> > > > > > > > > >> > > that other things are probably going on.
> > > > > > > > > >> > >
> > > > > > > > > >> > > This line here:
> > > > > > > > > >> > >
> > > > > > > > > >> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs] 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01, real=36.96 secs]
> > > > > > > > > >> > >
> > > > > > > > > >> > > is really mysterious to me.  It seems to indicate
> that
> > > the
> > > > > > > process
> > > > > > > > > was
> > > > > > > > > >> > > blocked for almost 37 seconds during a minor
> > collection.
> > > > >  Note
> > > > > > > the
> > > > > > > > > CPU
> > > > > > > > > >> > > times are very low but the wall time is very high.
>  If
> > > it
> > > > > was
> > > > > > > > > actually
> > > > > > > > > >> > > doing GC work, I'd expect to see user time higher
> than
> > > > real
> > > > > > > time,
> > > > > > > > as
> > > > > > > > > >> > > it is in other parallel collections (see your log
> > > > snippet).
> > > > > > >  Were
> > > > > > > > > you
> > > > > > > > > >> > > really so CPU starved that it took 37 seconds to get
> > in
> > > > 50ms
> > > > > > of
> > > > > > > > > work?
> > > > > > > > > >> > > I can't make sense of that.  I'm trying to think of
> > > > > something
> > > > > > > that
> > > > > > > > > >> > > would block you for that long while all your threads
> > are
> > > > > > stopped
> > > > > > > > for
> > > > > > > > > >> > > GC, other than being in swap, but I can't come up
> with
> > > > > > anything.
> > > > > > > > > >>  You're
> > > > > > > > > >> > certain you're not in swap?
> > > > > > > > > >> > >
> > > > > > > > > >> > > Maybe drop these guys -XX:+DoEscapeAnalysis
> > > > > > -XX:+AggressiveOpts
> > > > > > > > > while
> > > > > > > > > >> > > you troubleshoot?
> > > > > > > > > >> > >
> > > > > > > > > >> > > Why is your new size so small?  This generally means
> > > that
> > > > > > > > relatively
> > > > > > > > > >> > > more objects are being tenured than would be with a
> > > larger
> > > > > new
> > > > > > > > size.
> > > > > > > > > >> > > This could make collections of the old gen worse (GC
> > > time
> > > > is
> > > > > > > said
> > > > > > > > to
> > > > > > > > > >> > > be proportional to the number of live objects in the
> > > > > > generation,
> > > > > > > > and
> > > > > > > > > >> > > CMS does indeed cause STW pauses).  A typical new to
> > > > tenured
> > > > > > > ratio
> > > > > > > > > >> > > might be 1:3.  Were the new gen GCs taking too long?
> > >  This
> > > > > is
> > > > > > > > > probably
> > > > > > > > > >> > > orthogonal to your immediate issue, though.
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> > > -----Original Message-----
> > > > > > > > > >> > > From: Leonardo Gamas [mailto:
> > leogamas@jusbrasil.com.br]
> > > > > > > > > >> > > Sent: Thursday, January 05, 2012 5:33 AM
> > > > > > > > > >> > > To: user@hbase.apache.org
> > > > > > > > > >> > > Subject: Re: RegionServer dying every two or three
> > days
> > > > > > > > > >> > >
> > > > > > > > > >> > >  St.Ack,
> > > > > > > > > >> > >
> > > > > > > > > >> > > I don't have made any attempt in GC tunning, yet.
> > > > > > > > > >> > > I will read the perf section as suggested.
> > > > > > > > > >> > > I'm currently using Nagios + JMX to monitor the
> > cluster,
> > > > but
> > > > > > > it's
> > > > > > > > > >> > > currently used for alert only, the perfdata is not
> > been
> > > > > > stored,
> > > > > > > so
> > > > > > > > > >> > > it's kind of useless right now, but i was thinking
> in
> > > use
> > > > > TSDB
> > > > > > > to
> > > > > > > > > >> > > store it, any known case of integration?
> > > > > > > > > >> > > ---
> > > > > > > > > >> > >
> > > > > > > > > >> > > Sandy,
> > > > > > > > > >> > >
> > > > > > > > > >> > > Yes, my timeout is 30 seconds:
> > > > > > > > > >> > >
> > > > > > > > > >> > > <property>
> > > > > > > > > >> > >   <name>zookeeper.session.timeout</name>
> > > > > > > > > >> > >   <value>30000</value>
> > > > > > > > > >> > > </property>
> > > > > > > > > >> > >
> > > > > > > > > >> > > To our application it's a sufferable time to wait in
> > > case
> > > > a
> > > > > > > > > >> > > RegionServer go offline.
> > > > > > > > > >> > >
> > > > > > > > > >> > > My heap is 4GB and my JVM params are:
> > > > > > > > > >> > >
> > > > > > > > > >> > > -Xmx4096m -server -XX:+UseParNewGC
> > > -XX:+UseConcMarkSweepGC
> > > > > > > > > >> > > -XX:CMSInitiatingOccupancyFraction=70
> -XX:NewSize=128m
> > > > > > > > > >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis
> > > > > -XX:+AggressiveOpts
> > > > > > > > > >> > > -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps
> > > > > > > > > >> > >
> -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> > > > > > > > > >> > >
> > > > > > > > > >> > > I will try the -XX:+UseParallelOldGC param and post
> my
> > > > > > feedback
> > > > > > > > > here.
> > > > > > > > > >> > > ---
> > > > > > > > > >> > >
> > > > > > > > > >> > > Ramkrishna,
> > > > > > > > > >> > >
> > > > > > > > > >> > > Seems the GC is the root of all evil in this case.
> > > > > > > > > >> > > ----
> > > > > > > > > >> > >
> > > > > > > > > >> > > Thank you all for the answers. I will try out these
> > > > valuable
> > > > > > > > advices
> > > > > > > > > >> > > given here and post my results.
> > > > > > > > > >> > >
> > > > > > > > > >> > > Leo Gamas.
> > > > > > > > > >> > >
> > > > > > > > > >> > > 2012/1/5 Ramkrishna S Vasudevan <
> > > > > > > ramkrishna.vasudevan@huawei.com>
> > > > > > > > > >> > >
> > > > > > > > > >> > > > Recently we faced a similar problem and it was due
> > to
> > > GC
> > > > > > > config.
> > > > > > > > > >> > > > Pls check your GC.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > Regards
> > > > > > > > > >> > > > Ram
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > -----Original Message-----
> > > > > > > > > >> > > > From: saint.ack@gmail.com [mailto:
> > saint.ack@gmail.com
> > > ]
> > > > On
> > > > > > > > Behalf
> > > > > > > > > Of
> > > > > > > > > >> > > > Stack
> > > > > > > > > >> > > > Sent: Thursday, January 05, 2012 2:50 AM
> > > > > > > > > >> > > > To: user@hbase.apache.org
> > > > > > > > > >> > > > Subject: Re: RegionServer dying every two or three
> > > days
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> > > > > > > > > >> > > > <le...@jusbrasil.com.br> wrote:
> > > > > > > > > >> > > > > The third line took 36.96 seconds to execute,
> can
> > > this
> > > > > be
> > > > > > > > > causing
> > > > > > > > > >> > > > > this problem?
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > Probably.  Have you made any attempt at GC tuning?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Reading the code a little it seems that, even if
> > > it's
> > > > > > > > disabled,
> > > > > > > > > if
> > > > > > > > > >> > > > > all files are target in a compaction, it's
> > > considered
> > > > a
> > > > > > > major
> > > > > > > > > >> > > > > compaction. Is
> > > > > > > > > >> > > > it
> > > > > > > > > >> > > > > right?
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > That is right.  They get 'upgraded' from minor to
> > > major.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > This should be fine though.  What you are avoiding
> > > > setting
> > > > > > > major
> > > > > > > > > >> > > > compactions to 0 is all regions being major
> > compacted
> > > > on a
> > > > > > > > > period, a
> > > > > > > > > >> > > > heavy weight effective rewrite of all your data
> > > (unless
> > > > > > > already
> > > > > > > > > >> major
> > > > > > > > > >> > > > compacted).   It looks like you have this disabled
> > > which
> > > > > is
> > > > > > > good
> > > > > > > > > >> until
> > > > > > > > > >> > > > you've wrestled your cluster into submission.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > The machines don't have swap, so the swappiness
> > > > > parameter
> > > > > > > > don't
> > > > > > > > > >> > > > > seem to apply here. Any other suggestion?
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > See the perf section of the hbase manual.  It has
> > our
> > > > > > current
> > > > > > > > > list.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > Are you monitoring your cluster w/ ganglia or
> tsdb?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > St.Ack
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Thanks.
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > 2012/1/4 Leonardo Gamas <
> > leogamas@jusbrasil.com.br>
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > >> I will investigate this, thanks for the
> response.
> > > > > > > > > >> > > > >>
> > > > > > > > > >> > > > >>
> > > > > > > > > >> > > > >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
> > > > > > > > > >> > > > >>
> > > > > > > > > >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn:
> > > Client
> > > > > > > session
> > > > > > > > > >> > > > >>> timed out, have not heard from server in
> 61103ms
> > > for
> > > > > > > > sessionid
> > > > > > > > > >> > > > >>> 0x23462a4cf93a8fc, closing socket connection
> and
> > > > > > > attempting
> > > > > > > > > >> > > > >>> reconnect
> > > > > > > > > >> > > > >>>
> > > > > > > > > >> > > > >>> It looks like the process has been
> unresponsive
> > > for
> > > > > some
> > > > > > > > time,
> > > > > > > > > >> > > > >>> so ZK
> > > > > > > > > >> > > > has
> > > > > > > > > >> > > > >>> terminated the session.  Did you experience a
> > long
> > > > GC
> > > > > > > pause
> > > > > > > > > >> > > > >>> right
> > > > > > > > > >> > > > before
> > > > > > > > > >> > > > >>> this?  If you don't have GC logging enabled
> for
> > > the
> > > > > RS,
> > > > > > > you
> > > > > > > > > can
> > > > > > > > > >> > > > sometimes
> > > > > > > > > >> > > > >>> tell by noticing a gap in the timestamps of
> the
> > > log
> > > > > > > > statements
> > > > > > > > > >> > > > >>> leading
> > > > > > > > > >> > > > up
> > > > > > > > > >> > > > >>> to the crash.
> > > > > > > > > >> > > > >>>
> > > > > > > > > >> > > > >>> If it turns out to be GC, you might want to
> look
> > > at
> > > > > your
> > > > > > > > > kernel
> > > > > > > > > >> > > > >>> swappiness setting (set it to 0) and your JVM
> > > > params.
> > > > > > > > > >> > > > >>>
> > > > > > > > > >> > > > >>> Sandy
> > > > > > > > > >> > > > >>>
> > > > > > > > > >> > > > >>>
> > > > > > > > > >> > > > >>> > -----Original Message-----
> > > > > > > > > >> > > > >>> > From: Leonardo Gamas [mailto:
> > > > > > leogamas@jusbrasil.com.br]
> > > > > > > > > >> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> > > > > > > > > >> > > > >>> > To: user@hbase.apache.org
> > > > > > > > > >> > > > >>> > Subject: RegionServer dying every two or
> three
> > > > days
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > Hi,
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4
> > > machines
> > > > > (1
> > > > > > > > > Master +
> > > > > > > > > >> > > > >>> > 3
> > > > > > > > > >> > > > >>> Slaves),
> > > > > > > > > >> > > > >>> > running on Amazon EC2. The master is a
> > > High-Memory
> > > > > > Extra
> > > > > > > > > Large
> > > > > > > > > >> > > > Instance
> > > > > > > > > >> > > > >>> > (m2.xlarge) with NameNode, JobTracker,
> HMaster
> > > and
> > > > > > > > > Zookeeper.
> > > > > > > > > >> > > > >>> > The slaves are Extra Large Instances
> > (m1.xlarge)
> > > > > > running
> > > > > > > > > >> > > > >>> > Datanode,
> > > > > > > > > >> > > > >>> TaskTracker,
> > > > > > > > > >> > > > >>> > RegionServer and Zookeeper.
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > From time to time, every two or three days,
> > one
> > > of
> > > > > the
> > > > > > > > > >> > > > >>> > RegionServers processes goes down, but the
> > other
> > > > > > > processes
> > > > > > > > > >> > > > >>> > (DataNode, TaskTracker,
> > > > > > > > > >> > > > >>> > Zookeeper) continue normally.
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > Reading the logs:
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > The connection with Zookeeper timed out:
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > ---------------------------
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn:
> > > > Client
> > > > > > > > session
> > > > > > > > > >> > > > >>> > timed
> > > > > > > > > >> > > > out,
> > > > > > > > > >> > > > >>> have
> > > > > > > > > >> > > > >>> > not heard from server in 61103ms for
> sessionid
> > > > > > > > > >> > > > >>> > 0x23462a4cf93a8fc,
> > > > > > > > > >> > > > >>> closing
> > > > > > > > > >> > > > >>> > socket connection and attempting reconnect
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn:
> > > > Client
> > > > > > > > session
> > > > > > > > > >> > > > >>> > timed
> > > > > > > > > >> > > > out,
> > > > > > > > > >> > > > >>> have
> > > > > > > > > >> > > > >>> > not heard from server in 61205ms for
> sessionid
> > > > > > > > > >> > > > >>> > 0x346c561a55953e,
> > > > > > > > > >> > > > closing
> > > > > > > > > >> > > > >>> > socket connection and attempting reconnect
> > > > > > > > > >> > > > >>> > ---------------------------
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > And the Handlers start to fail:
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > ---------------------------
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC
> > > Server
> > > > > > > > > Responder,
> > > > > > > > > >> > > > >>> > call
> > > > > > > > > >> > > > >>> >
> > > > > > > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf
> > > > > > > > )
> > > > > > > > > >> > > > >>> > from
> > > > > > > > > >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC
> > > Server
> > > > > > > handler
> > > > > > > > > 81
> > > > > > > > > >> > > > >>> > on
> > > > > > > > > >> > > > 60020
> > > > > > > > > >> > > > >>> > caught:
> > java.nio.channels.ClosedChannelException
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > > >> >
> > > > > > >
> > > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> > > > > > > > > >> > > > 13
> > > > > > > > > >> > > > >>> > 3)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>>
> > > > > > > >
> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > > >> >
> > > > > > > >
> > > > >
> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > > > > > > > > >> > > > >>> > 1341)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> > > > > > > > > >> > > > >>> > ns
> > > > > > > > > >> > > > >>> > e(HB
> > > > > > > > > >> > > > >>> > aseServer.java:727)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> > > > > > > > > >> > > > >>> > as
> > > > > > > > > >> > > > >>> > eSe
> > > > > > > > > >> > > > >>> > rver.java:792)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > > >> >
> > > > > > >
> > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > > > > > > >> > > > :1
> > > > > > > > > >> > > > >>> > 083)
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC
> > > Server
> > > > > > > > > Responder,
> > > > > > > > > >> > > > >>> > call
> > > > > > > > > >> > > > >>> >
> > > > > > > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430
> > > > > > > > )
> > > > > > > > > >> > > > >>> > from
> > > > > > > > > >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC
> > > Server
> > > > > > > handler
> > > > > > > > > 62
> > > > > > > > > >> > > > >>> > on
> > > > > > > > > >> > > > 60020
> > > > > > > > > >> > > > >>> > caught:
> > java.nio.channels.ClosedChannelException
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > > >> >
> > > > > > >
> > > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> > > > > > > > > >> > > > 13
> > > > > > > > > >> > > > >>> > 3)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>>
> > > > > > > >
> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > > >> >
> > > > > > > >
> > > > >
> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > > > > > > > > >> > > > >>> > 1341)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> > > > > > > > > >> > > > >>> > ns
> > > > > > > > > >> > > > >>> > e(HB
> > > > > > > > > >> > > > >>> > aseServer.java:727)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> > > > > > > > > >> > > > >>> > as
> > > > > > > > > >> > > > >>> > eSe
> > > > > > > > > >> > > > >>> > rver.java:792)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > > >> >
> > > > > > >
> > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > > > > > > >> > > > :1
> > > > > > > > > >> > > > >>> > 083)
> > > > > > > > > >> > > > >>> > ---------------------------
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > And finally the server throws a
> > > > YouAreDeadException
> > > > > > :( :
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > ---------------------------
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> > > > Opening
> > > > > > > > socket
> > > > > > > > > >> > > > connection
> > > > > > > > > >> > > > >>> to
> > > > > > > > > >> > > > >>> > server
> > > > ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> > > > Socket
> > > > > > > > > connection
> > > > > > > > > >> > > > >>> > established to
> > > > > > > > ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> > > > > > > > > >> > > > initiating
> > > > > > > > > >> > > > >>> session
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> > > > Unable
> > > > > to
> > > > > > > > > >> > > > >>> > reconnect to ZooKeeper service, session
> > > > > > > 0x23462a4cf93a8fc
> > > > > > > > > has
> > > > > > > > > >> > > > >>> > expired, closing
> > > > > > > > > >> > > > socket
> > > > > > > > > >> > > > >>> > connection
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> > > > Opening
> > > > > > > > socket
> > > > > > > > > >> > > > connection
> > > > > > > > > >> > > > >>> to
> > > > > > > > > >> > > > >>> > server
> > > > ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> > > > Socket
> > > > > > > > > connection
> > > > > > > > > >> > > > >>> > established to
> > > > > > > > ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> > > > > > > > > >> > > > initiating
> > > > > > > > > >> > > > >>> session
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> > > > Unable
> > > > > to
> > > > > > > > > >> > > > >>> > reconnect to ZooKeeper service, session
> > > > > > 0x346c561a55953e
> > > > > > > > has
> > > > > > > > > >> > > > >>> > expired, closing
> > > > > > > > > >> > > > socket
> > > > > > > > > >> > > > >>> > connection
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:03 FATAL
> > > > regionserver.HRegionServer:
> > > > > > > > ABORTING
> > > > > > > > > >> > > > >>> > region server
> > > > > > > > > >> > > > >>> >
> > > > > > > > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> > > > > > > > > >> > > > >>> > load=(requests=447, regions=206,
> > usedHeap=1584,
> > > > > > > > > >> > maxHeap=4083):
> > > > > > > > > >> > > > >>> > Unhandled
> > > > > > > > > >> > > > >>> > exception:
> > > > > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > > > > > >> > Server
> > > > > > > > > >> > > > >>> > REPORT rejected; currently processing
> > > > > > > > > >> > > > >>> >
> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > as
> > > > > > dead
> > > > > > > > > server
> > > > > > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > > > Server
> > > > > > > REPORT
> > > > > > > > > >> > > > >>> > rejected; currently processing
> > > > > > > > > >> > > > >>> >
> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > > > > > > > >> > > > as
> > > > > > > > > >> > > > >>> > dead server
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > >
> > > > > > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > > > > > > > >> > > > >>> > Method)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > >
> > > > >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruc
> > > > > > > > > >> > > > to
> > > > > > > > > >> > > > r
> > > > > > > > > >> > > > >>> > AccessorImpl.java:39)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > >
> > > > >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Delegating
> > > > > > > > > >> > > > Co
> > > > > > > > > >> > > > n
> > > > > > > > > >> > > > >>> > structorAccessorImpl.java:27)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>>
> > > > > > > > >
> > java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.ipc.RemoteException.instantiateException(Rem
> > > > > > > > > >> > > > >>> > ot
> > > > > > > > > >> > > > >>> > eExce
> > > > > > > > > >> > > > >>> > ption.java:95)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Re
> > > > > > > > > >> > > > >>> > mo
> > > > > > > > > >> > > > >>> > te
> > > > > > > > > >> > > > >>> > Exception.java:79)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > > > > > > > > >> > > > >>> > rv
> > > > > > > > > >> > > > >>> > erRep
> > > > > > > > > >> > > > >>> > ort(HRegionServer.java:735)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > > >> >
> > > > > > >
> > > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer
> > > > > > > > > >> > > > .j
> > > > > > > > > >> > > > >>> > ava:596)
> > > > > > > > > >> > > > >>> >         at
> > java.lang.Thread.run(Thread.java:662)
> > > > > > > > > >> > > > >>> > Caused by:
> > > org.apache.hadoop.ipc.RemoteException:
> > > > > > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > > > Server
> > > > > > > REPORT
> > > > > > > > > >> > > > >>> > rejected; currently processing
> > > > > > > > > >> > > > >>> >
> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > > > > > > > >> > > > as
> > > > > > > > > >> > > > >>> > dead server
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(Serve
> > > > > > > > > >> > > > >>> > rM
> > > > > > > > > >> > > > >>> > ana
> > > > > > > > > >> > > > >>> > ger.java:204)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.master.ServerManager.regionServerRepor
> > > > > > > > > >> > > > >>> > t(
> > > > > > > > > >> > > > >>> > Serv
> > > > > > > > > >> > > > >>> > erManager.java:262)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMas
> > > > > > > > > >> > > > >>> > te
> > > > > > > > > >> > > > >>> > r.jav
> > > > > > > > > >> > > > >>> > a:669)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> > > > > > > > > >> > > > Source)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMeth
> > > > > > > > > >> > > > >>> > od
> > > > > > > > > >> > > > >>> > Acces
> > > > > > > > > >> > > > >>> > sorImpl.java:25)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > java.lang.reflect.Method.invoke(Method.java:597)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > > > >
> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > >
> > > > > > > > > >> >
> > > > > > >
> > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > > > > > > >> > > > :1
> > > > > > > > > >> > > > >>> > 039)
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > > >
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.j
> > > > > > > > > >> > > > >>> > av
> > > > > > > > > >> > > > >>> > a:257
> > > > > > > > > >> > > > >>> > )
> > > > > > > > > >> > > > >>> >         at
> $Proxy6.regionServerReport(Unknown
> > > > > Source)
> > > > > > > > > >> > > > >>> >         at
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> >
> > > > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > > > > > > > > >> > > > >>> > rv
> > > > > > > > > >> > > > >>> > erRep
> > > > > > > > > >> > > > >>> > ort(HRegionServer.java:729)
> > > > > > > > > >> > > > >>> >         ... 2 more
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:03 INFO
> > > regionserver.HRegionServer:
> > > > > > Dump
> > > > > > > of
> > > > > > > > > >> > metrics:
> > > > > > > > > >> > > > >>> > requests=66, regions=206, stores=2078,
> > > > > storefiles=970,
> > > > > > > > > >> > > > >>> > storefileIndexSize=78, memstoreSize=796,
> > > > > > > > > >> > > > >>> > compactionQueueSize=0, flushQueueSize=0,
> > > > > > usedHeap=1672,
> > > > > > > > > >> > > > >>> > maxHeap=4083, blockCacheSize=705907552,
> > > > > > > > > >> > > > >>> > blockCacheFree=150412064,
> > blockCacheCount=10648,
> > > > > > > > > >> > > > >>> > blockCacheHitCount=79578618,
> > > > > > > blockCacheMissCount=3036335,
> > > > > > > > > >> > > > >>> > blockCacheEvictedCount=1401352,
> > > > > blockCacheHitRatio=96,
> > > > > > > > > >> > > > >>> > blockCacheHitCachingRatio=98
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:03 INFO
> > > regionserver.HRegionServer:
> > > > > > > > STOPPED:
> > > > > > > > > >> > > > >>> > Unhandled
> > > > > > > > > >> > > > >>> > exception:
> > > > > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > > > > > >> > Server
> > > > > > > > > >> > > > >>> > REPORT rejected; currently processing
> > > > > > > > > >> > > > >>> >
> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > as
> > > > > > dead
> > > > > > > > > server
> > > > > > > > > >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer:
> > Stopping
> > > > > > server
> > > > > > > on
> > > > > > > > > >> > > > >>> > 60020
> > > > > > > > > >> > > > >>> > ---------------------------
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > Then i restart the RegionServer and
> everything
> > > is
> > > > > back
> > > > > > > to
> > > > > > > > > >> normal.
> > > > > > > > > >> > > > >>> > Reading the DataNode, Zookeeper and
> > TaskTracker
> > > > > logs,
> > > > > > i
> > > > > > > > > don't
> > > > > > > > > >> > > > >>> > see any abnormality in the same time window.
> > > > > > > > > >> > > > >>> > I think it was caused by the lost of
> > connection
> > > to
> > > > > > > > > zookeeper.
> > > > > > > > > >> > > > >>> > Is it
> > > > > > > > > >> > > > >>> advisable to
> > > > > > > > > >> > > > >>> > run zookeeper in the same machines?
> > > > > > > > > >> > > > >>> > if the RegionServer lost it's connection to
> > > > > Zookeeper,
> > > > > > > > > there's
> > > > > > > > > >> > > > >>> > a way
> > > > > > > > > >> > > > (a
> > > > > > > > > >> > > > >>> > configuration perhaps) to re-join the
> cluster,
> > > and
> > > > > not
> > > > > > > > only
> > > > > > > > > >> die?
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > Any idea what is causing this?? Or to
> prevent
> > it
> > > > > from
> > > > > > > > > >> happening?
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > Any help is appreciated.
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > Best Regards,
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > --
> > > > > > > > > >> > > > >>> >
> > > > > > > > > >> > > > >>> > *Leonardo Gamas*
> > > > > > > > > >> > > > >>> > Software Engineer
> > > > > > > > > >> > > > >>> > +557134943514
> > > > > > > > > >> > > > >>> > +557581347440
> > > > > > > > > >> > > > >>> > leogamas@jusbrasil.com.br
> > > > > > > > > >> > > > >>> > www.jusbrasil.com.br
> > > > > > > > > >> > > > >>>
> > > > > > > > > >> > > > >>
> > > > > > > > > >> > > > >>
> > > > > > > > > >> > > > >>
> > > > > > > > > >> > > > >> --
> > > > > > > > > >> > > > >>
> > > > > > > > > >> > > > >> *Leonardo Gamas*
> > > > > > > > > >> > > > >> Software Engineer/Chaos Monkey Engineer T (71)
> > > > > 3494-3514C
> > > > > > > > > (75)
> > > > > > > > > >> > > > >> 8134-7440 leogamas@jusbrasil.com.br
> > > > > www.jusbrasil.com.br
> > > > > > > > > >> > > > >>
> > > > > > > > > >> > > > >>
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > --
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > *Leonardo Gamas*
> > > > > > > > > >> > > > > Software Engineer/Chaos Monkey Engineer T (71)
> > > > > 3494-3514C
> > > > > > > > (75)
> > > > > > > > > >> > > > > 8134-7440 leogamas@jusbrasil.com.br
> > > > > www.jusbrasil.com.br
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> > > --
> > > > > > > > > >> > >
> > > > > > > > > >> > > *Leonardo Gamas*
> > > > > > > > > >> > > Software Engineer
> > > > > > > > > >> > > +557134943514
> > > > > > > > > >> > > +557581347440
> > > > > > > > > >> > > leogamas@jusbrasil.com.br
> > > > > > > > > >> > > www.jusbrasil.com.br
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > --
> > > > > > > > > >> >
> > > > > > > > > >> > *Leonardo Gamas*
> > > > > > > > > >> > Software Engineer
> > > > > > > > > >> > T +55 (71) 3494-3514
> > > > > > > > > >> > C +55 (75) 8134-7440
> > > > > > > > > >> > leogamas@jusbrasil.com.br
> > > > > > > > > >> > www.jusbrasil.com.br
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > >
> > > > > > > > > > *Leonardo Gamas*
> > > > > > > > > >
> > > > > > > > > > Software Engineer
> > > > > > > > > > T +55 (71) 3494-3514
> > > > > > > > > > C +55 (75) 8134-7440
> > > > > > > > > > leogamas@jusbrasil.com.br
> > > > > > > > > >
> > > > > > > > > > www.jusbrasil.com.br
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > > > *Leonardo Gamas*
> > > > > > > > > Software Engineer
> > > > > > > > > T +55 (71) 3494-3514
> > > > > > > > > C +55 (75) 8134-7440
> > > > > > > > > leogamas@jusbrasil.com.br
> > > > > > > > > www.jusbrasil.com.br
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > *Leonardo Gamas*
> > > > > > > Software Engineer
> > > > > > > T +55 (71) 3494-3514
> > > > > > > C +55 (75) 8134-7440
> > > > > > > leogamas@jusbrasil.com.br
> > > > > > > www.jusbrasil.com.br
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > *Leonardo Gamas*
> > > > > Software Engineer
> > > > > T +55 (71) 3494-3514
> > > > > C +55 (75) 8134-7440
> > > > > leogamas@jusbrasil.com.br
> > > > > www.jusbrasil.com.br
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > *Leonardo Gamas*
> > > Software Engineer
> > > T +55 (71) 3494-3514
> > > C +55 (75) 8134-7440
> > > leogamas@jusbrasil.com.br
> > > www.jusbrasil.com.br
> > >
> >
>



-- 

*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514
C +55 (75) 8134-7440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br

Re: RegionServer dying every two or three days

Posted by Neil Yalowitz <ne...@gmail.com>.
Hi Leonardo, excuse the late response.

I read the link that Matt sent below RE: instance type and hardware
isolation a while ago and struggled with the same problem with c1.xlarge
and memory.  Another issue there, besides the network throughput Matt
mentions, is the type of network connection.  We decided to go with cluster
compute instances (cc1.4xlarge) instead, since the larger memory and fatter
pipe (10Gbit) suited our needs (more MR daemons/children and large DB rows,
respectively).  The c1.xlarge also seemed like a bad match, as the best
trait of that instance, the CPU units, isn't really our bottleneck (it's
more an issue of RAM and I/O).

While the cluster compute instances improved performance and stability
somewhat, the pain hasn't stopped there.  Creating/terminating instances
seems to be a lottery, possibly due to bad neighbors on the physical host
or network.  Some cluster instances are rock solid for days and weeks while
we run our tests, others are problematic within hours of creation despite
having an identical setup.  Even with the cluster compute instances, we
have test clusters where we will run benchmarks, wipe the data, and rerun
the benchmarks with wildly different performance (off by 400%).
 Occasionally, an instance will become unresponsive to pings and SSH and
will completely fall out of the cluster.

It seems the strategy for EC2 deployment is to expect everything to fail
and plan accordingly.  It hasn't been a good experience.
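
One cheap way to get an early read on a freshly created instance is to watch the CPU steal column for a few minutes while the node is idle (just a rule of thumb, not an official signal from Amazon):

  vmstat 5
  # the last column, "st", is CPU time stolen by the hypervisor for other
  # guests; a node that sits at a consistently high st while idle is a good
  # candidate for the noisy-neighbor lottery described above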



Neil Yalowitz

On Mon, Jan 23, 2012 at 1:37 PM, Matt Corgan <mc...@hotpads.com> wrote:

> You could always try going with a little smaller heap and see how it works
> for your particular workload, maybe 4G.  1G block cache, 1G memstores, ~1G
> GC overhead(?), leaving 1G for active program data.
>
> If trying to squeeze memory, you should be aware there is a limitation in
> 0.90 where storefile indexes come out of that remaining 1G as opposed to
> being stored in the block cache.  If you have big indexes, you would need
> to shrink block cache and memstore limits to compensate.
>
> http://search-hadoop.com/m/OH4cT1LiN4Q1/corgan&subj=Re+a+question+storefileIndexSize

Re: RegionServer dying every two or three days

Posted by Matt Corgan <mc...@hotpads.com>.
You could always try going with a little smaller heap and see how it works
for your particular workload, maybe 4G.  1G block cache, 1G memstores, ~1G
GC overhead(?), leaving 1G for active program data.

If trying to squeeze memory, you should be aware there is a limitation in
0.90 where storefile indexes come out of that remaining 1G as opposed to
being stored in the block cache.  If you have big indexes, you would need
to shrink block cache and memstore limits to compensate.
http://search-hadoop.com/m/OH4cT1LiN4Q1/corgan&subj=Re+a+question+storefileIndexSize
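
As a rough sketch of how that split can be expressed (illustrative values,
not a recommendation): the heap itself is set in hbase-env.sh (e.g. the
-Xmx4096m flag already shown elsewhere in the thread), while the block
cache and memstore shares are fractions of that heap in hbase-site.xml.
0.25 of a 4G heap gives roughly the 1G block cache and 1G of memstores
described above, leaving the rest for storefile indexes, GC headroom and
working data:

<property>
  <name>hfile.block.cache.size</name>
  <value>0.25</value>  <!-- ~1G of a 4G heap for the block cache -->
</property>
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.25</value>  <!-- block updates once all memstores reach ~1G -->
</property>
<property>
  <name>hbase.regionserver.global.memstore.lowerLimit</name>
  <value>0.20</value>  <!-- start forcing flushes before the upper limit -->
</property>

If the 0.90 storefile-index caveat above applies, these fractions are
exactly the knobs to shrink.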


On Mon, Jan 23, 2012 at 4:32 AM, Leonardo Gamas
<le...@jusbrasil.com.br>wrote:

> Thanks again Matt! I will try out this instance type, but i'm concerned
> about the MapReduce cluster running apart from HBase in my case, since we
> have some MapReduces running and planning to run more. Feels like losing
> the great strength of MapReduce, by running it far from data.
>
> 2012/1/21 Matt Corgan <mc...@hotpads.com>
>
> > We actually don't run map/reduce on the same machines (most of our jobs
> are
> > on an old message based system), so don't have much experience there.  We
> > run only HDFS (1G heap) and HBase (5.5G heap) with 12 * 100GB EBS volumes
> > per regionserver, and ~350 regions/server at the moment.  5.5G is
> already a
> > small heap in the hbase world, so I wouldn't recommend decreasing it to
> fit
> > M/R,  You could always run map/reduce on separate servers, adding or
> > removing servers as needed (more at night?), or use Amazon's Elastic M/R.
> >
> >
> > On Sat, Jan 21, 2012 at 5:04 AM, Leonardo Gamas
> > <le...@jusbrasil.com.br>wrote:
> >
> > > Thanks Matt for this insightful article, I will run my cluster with
> > > c1.xlarge to test it's performance. But i'm concerned with this
> machine,
> > > because the amount of RAM available, only 7GB. How many map/reduce
> slots
> > do
> > > you configure? And the amount of Heap for HBase? How many regions per
> > > RegionServer could my cluster support?
> > >
> > > 2012/1/20 Matt Corgan <mc...@hotpads.com>
> > >
> > > > I run c1.xlarge servers and have found them very stable.  I see 100
> > > Mbit/s
> > > > sustained bi-directional network throughput (200Mbit/s total),
> > sometimes
> > > up
> > > > to 150 * 2 Mbit/s.
> > > >
> > > > Here's a pretty thorough examination of the underlying hardware:
> > > >
> > > >
> > > >
> > >
> >
> http://huanliu.wordpress.com/2010/06/14/amazons-physical-hardware-and-ec2-compute-unit/
> > > >
> > > >
> > > > *High-CPU instances*
> > > >
> > > > The high-CPU instances (c1.medium, c1.xlarge) run on systems with
> > > > dual-socket Intel Xeon E5410 2.33GHz processors. It is dual-socket
> > > because
> > > > we see APIC IDs 0 to 7, and E5410 only has 4 cores. A c1.xlarge
> > instance
> > > > almost takes up the whole physical machine. However, we frequently
> > > observe
> > > > steal cycle on a c1.xlarge instance ranging from 0% to 25% with an
> > > average
> > > > of about 10%. The amount of steal cycle is not enough to host another
> > > > smaller VM, i.e., a c1.medium. Maybe those steal cycles are used to
> run
> > > > Amazon’s software firewall (security group). On Passmark-CPU mark, a
> > > > c1.xlarge machine achieves 7,962.6, actually higher than an average
> > > > dual-sock E5410 system is able to achieve (average is 6,903).
> > > >
> > > >
> > > >
> > > > On Fri, Jan 20, 2012 at 8:03 AM, Leonardo Gamas
> > > > <le...@jusbrasil.com.br>wrote:
> > > >
> > > > > Thanks Neil for sharing your experience with AWS! Could you tell
> what
> > > > > instance type are you using?
> > > > > We are using m1.xlarge, that has 4 virtual cores, but i normally
> see
> > > > > recommendations for machines with 8 cores like c1.xlarge,
> m2.4xlarge,
> > > > etc.
> > > > > In principle these 8-core machines don't suffer too much with I/O
> > > > problems
> > > > > since they don't share the physical server. Is there any piece of
> > > > > information from Amazon or other source that affirms that or it's
> > based
> > > > in
> > > > > empirical analysis?
> > > > >
> > > > > 2012/1/19 Neil Yalowitz <ne...@gmail.com>
> > > > >
> > > > > > We have experienced many problems with our cluster on EC2.  The
> > blunt
> > > > > > solution was to increase the Zookeeper timeout to 5 minutes or
> even
> > > > more.
> > > > > >
> > > > > > Even with a long timeout, however, it's not uncommon for us to
> see
> > an
> > > > EC2
> > > > > > instance to become unresponsive to pings and SSH several times
> > > during a
> > > > > > week.  It's been a very bad environment for clusters.
> > > > > >
> > > > > >
> > > > > > Neil
> > > > > >
> > > > > > On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas
> > > > > > <le...@jusbrasil.com.br>wrote:
> > > > > >
> > > > > > > Hi Guys,
> > > > > > >
> > > > > > > I have tested the parameters provided by Sandy, and it solved
> the
> > > GC
> > > > > > > problems with the -XX:+UseParallelOldGC, thanks for the help
> > Sandy.
> > > > > > > I'm still experiencing some difficulties, the RegionServer
> > > continues
> > > > to
> > > > > > > shutdown, but it seems related to I/O. It starts to timeout
> many
> > > > > > > connections, new connections to/from the machine timeout too,
> and
> > > > > finally
> > > > > > > the RegionServer dies because of YouAreDeadException. I will
> > > collect
> > > > > more
> > > > > > > data, but i think it's an Amazon/Virtualized Environment
> inherent
> > > > > issue.
> > > > > > >
> > > > > > > Thanks for the great help provided so far.
> > > > > > >
> > > > > > > 2012/1/5 Leonardo Gamas <le...@jusbrasil.com.br>
> > > > > > >
> > > > > > > > I don't think so, if Amazon stopped the machine it would
> cause
> > a
> > > > stop
> > > > > > of
> > > > > > > > minutes, not seconds, and since the DataNode, TaskTracker and
> > > > > Zookeeper
> > > > > > > > continue to work normally.
> > > > > > > > But it can be related to the shared environment nature of
> > Amazon,
> > > > > maybe
> > > > > > > > some spike in I/O caused by another virtualized server in the
> > > same
> > > > > > > physical
> > > > > > > > machine.
> > > > > > > >
> > > > > > > > But the intance type i'm using:
> > > > > > > >
> > > > > > > > *Extra Large Instance*
> > > > > > > >
> > > > > > > > 15 GB memory
> > > > > > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units
> > > each)
> > > > > > > > 1,690 GB instance storage
> > > > > > > > 64-bit platform
> > > > > > > > I/O Performance: High
> > > > > > > > API name: m1.xlarge
> > > > > > > > I was not expecting to suffer from this problems, or at least
> > not
> > > > > much.
> > > > > > > >
> > > > > > > >
> > > > > > > > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> > > > > > > >
> > > > > > > >> You think it's an Amazon problem maybe?  Like they paused or
> > > > > migrated
> > > > > > > >> your virtual machine, and it just happens to be during GC,
> > > leaving
> > > > > us
> > > > > > to
> > > > > > > >> think the GC ran long when it didn't?  I don't have a lot of
> > > > > > experience
> > > > > > > >> with Amazon so I don't know if that sort of thing is common.
> > > > > > > >>
> > > > > > > >> > -----Original Message-----
> > > > > > > >> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > > > > >> > Sent: Thursday, January 05, 2012 13:15
> > > > > > > >> > To: user@hbase.apache.org
> > > > > > > >> > Subject: Re: RegionServer dying every two or three days
> > > > > > > >> >
> > > > > > > >> > I checked the CPU Utilization graphics provided by Amazon
> > > (it's
> > > > > not
> > > > > > > >> accurate,
> > > > > > > >> > since the sample time is about 5 minutes) and don't see
> any
> > > > > > > >> abnormality. I
> > > > > > > >> > will setup TSDB with Nagios to have a more reliable source
> > of
> > > > > > > >> performance
> > > > > > > >> > data.
> > > > > > > >> >
> > > > > > > >> > The machines don't have swap space, if i run:
> > > > > > > >> >
> > > > > > > >> > $ swapon -s
> > > > > > > >> >
> > > > > > > >> > To display swap usage summary, it returns an empty list.
> > > > > > > >> >
> > > > > > > >> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in
> my
> > to
> > > > > > tests.
> > > > > > > >> >
> > > > > > > >> > I don't have payed much attention to the value of the new
> > size
> > > > > > param.
> > > > > > > >> >
> > > > > > > >> > Thanks again for the help!!
> > > > > > > >> >
> > > > > > > >> > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> > > > > > > >> >
> > > > > > > >> > > That size heap doesn't seem like it should cause a 36
> > second
> > > > GC
> > > > > (a
> > > > > > > >> > > minor GC even if I remember your logs correctly), so I
> > tend
> > > to
> > > > > > think
> > > > > > > >> > > that other things are probably going on.
> > > > > > > >> > >
> > > > > > > >> > > This line here:
> > > > > > > >> > >
> > > > > > > >> > > 14251.690: [GC 14288.620: [ParNew:
> 105352K->413K(118016K),
> > > > > > 0.0361840
> > > > > > > >> > > secs]
> > > > > > > >> > > 954388K->849478K(1705776K), 0.0364200 secs] [Times:
> > > user=0.05
> > > > > > > >> > > 954388K->sys=0.01,
> > > > > > > >> > > real=36.96 secs]
> > > > > > > >> > >
> > > > > > > >> > > is really mysterious to me.  It seems to indicate that
> the
> > > > > process
> > > > > > > was
> > > > > > > >> > > blocked for almost 37 seconds during a minor collection.
> > >  Note
> > > > > the
> > > > > > > CPU
> > > > > > > >> > > times are very low but the wall time is very high.  If
> it
> > > was
> > > > > > > actually
> > > > > > > >> > > doing GC work, I'd expect to see user time higher than
> > real
> > > > > time,
> > > > > > as
> > > > > > > >> > > it is in other parallel collections (see your log
> > snippet).
> > > > >  Were
> > > > > > > you
> > > > > > > >> > > really so CPU starved that it took 37 seconds to get in
> > 50ms
> > > > of
> > > > > > > work?
> > > > > > > >> > > I can't make sense of that.  I'm trying to think of
> > > something
> > > > > that
> > > > > > > >> > > would block you for that long while all your threads are
> > > > stopped
> > > > > > for
> > > > > > > >> > > GC, other than being in swap, but I can't come up with
> > > > anything.
> > > > > > > >>  You're
> > > > > > > >> > certain you're not in swap?
> > > > > > > >> > >
> > > > > > > >> > > Maybe drop these guys -XX:+DoEscapeAnalysis
> > > > -XX:+AggressiveOpts
> > > > > > > while
> > > > > > > >> > > you troubleshoot?
> > > > > > > >> > >
> > > > > > > >> > > Why is your new size so small?  This generally means
> that
> > > > > > relatively
> > > > > > > >> > > more objects are being tenured than would be with a
> larger
> > > new
> > > > > > size.
> > > > > > > >> > > This could make collections of the old gen worse (GC
> time
> > is
> > > > > said
> > > > > > to
> > > > > > > >> > > be proportional to the number of live objects in the
> > > > generation,
> > > > > > and
> > > > > > > >> > > CMS does indeed cause STW pauses).  A typical new to
> > tenured
> > > > > ratio
> > > > > > > >> > > might be 1:3.  Were the new gen GCs taking too long?
>  This
> > > is
> > > > > > > probably
> > > > > > > >> > > orthogonal to your immediate issue, though.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > -----Original Message-----
> > > > > > > >> > > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > > > > >> > > Sent: Thursday, January 05, 2012 5:33 AM
> > > > > > > >> > > To: user@hbase.apache.org
> > > > > > > >> > > Subject: Re: RegionServer dying every two or three days
> > > > > > > >> > >
> > > > > > > >> > >  St.Ack,
> > > > > > > >> > >
> > > > > > > >> > > I don't have made any attempt in GC tunning, yet.
> > > > > > > >> > > I will read the perf section as suggested.
> > > > > > > >> > > I'm currently using Nagios + JMX to monitor the cluster,
> > but
> > > > > it's
> > > > > > > >> > > currently used for alert only, the perfdata is not been
> > > > stored,
> > > > > so
> > > > > > > >> > > it's kind of useless right now, but i was thinking in
> use
> > > TSDB
> > > > > to
> > > > > > > >> > > store it, any known case of integration?
> > > > > > > >> > > ---
> > > > > > > >> > >
> > > > > > > >> > > Sandy,
> > > > > > > >> > >
> > > > > > > >> > > Yes, my timeout is 30 seconds:
> > > > > > > >> > >
> > > > > > > >> > > <property>
> > > > > > > >> > >   <name>zookeeper.session.timeout</name>
> > > > > > > >> > >   <value>30000</value>
> > > > > > > >> > > </property>
> > > > > > > >> > >
> > > > > > > >> > > To our application it's a sufferable time to wait in
> case
> > a
> > > > > > > >> > > RegionServer go offline.
> > > > > > > >> > >
> > > > > > > >> > > My heap is 4GB and my JVM params are:
> > > > > > > >> > >
> > > > > > > >> > > -Xmx4096m -server -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC
> > > > > > > >> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> > > > > > > >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis
> > > -XX:+AggressiveOpts
> > > > > > > >> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > > > > > > >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> > > > > > > >> > >
> > > > > > > >> > > I will try the -XX:+UseParallelOldGC param and post my
> > > > feedback
> > > > > > > here.
> > > > > > > >> > > ---
> > > > > > > >> > >
> > > > > > > >> > > Ramkrishna,
> > > > > > > >> > >
> > > > > > > >> > > Seems the GC is the root of all evil in this case.
> > > > > > > >> > > ----
> > > > > > > >> > >
> > > > > > > >> > > Thank you all for the answers. I will try out these
> > valuable
> > > > > > advices
> > > > > > > >> > > given here and post my results.
> > > > > > > >> > >
> > > > > > > >> > > Leo Gamas.
> > > > > > > >> > >
> > > > > > > >> > > 2012/1/5 Ramkrishna S Vasudevan <
> > > > > ramkrishna.vasudevan@huawei.com>
> > > > > > > >> > >
> > > > > > > >> > > > Recently we faced a similar problem and it was due to
> GC
> > > > > config.
> > > > > > > >> > > > Pls check your GC.
> > > > > > > >> > > >
> > > > > > > >> > > > Regards
> > > > > > > >> > > > Ram
> > > > > > > >> > > >
> > > > > > > >> > > > -----Original Message-----
> > > > > > > >> > > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com
> ]
> > On
> > > > > > Behalf
> > > > > > > Of
> > > > > > > >> > > > Stack
> > > > > > > >> > > > Sent: Thursday, January 05, 2012 2:50 AM
> > > > > > > >> > > > To: user@hbase.apache.org
> > > > > > > >> > > > Subject: Re: RegionServer dying every two or three
> days
> > > > > > > >> > > >
> > > > > > > >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> > > > > > > >> > > > <le...@jusbrasil.com.br> wrote:
> > > > > > > >> > > > > The third line took 36.96 seconds to execute, can
> this
> > > be
> > > > > > > causing
> > > > > > > >> > > > > this problem?
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > > > Probably.  Have you made any attempt at GC tuning?
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > > > > Reading the code a little it seems that, even if
> it's
> > > > > > disabled,
> > > > > > > if
> > > > > > > >> > > > > all files are target in a compaction, it's
> considered
> > a
> > > > > major
> > > > > > > >> > > > > compaction. Is
> > > > > > > >> > > > it
> > > > > > > >> > > > > right?
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > > > That is right.  They get 'upgraded' from minor to
> major.
> > > > > > > >> > > >
> > > > > > > >> > > > This should be fine though.  What you are avoiding
> > setting
> > > > > major
> > > > > > > >> > > > compactions to 0 is all regions being major compacted
> > on a
> > > > > > > period, a
> > > > > > > >> > > > heavy weight effective rewrite of all your data
> (unless
> > > > > already
> > > > > > > >> major
> > > > > > > >> > > > compacted).   It looks like you have this disabled
> which
> > > is
> > > > > good
> > > > > > > >> until
> > > > > > > >> > > > you've wrestled your cluster into submission.
> > > > > > > >> > > >
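A minimal sketch of the knob being discussed above, assuming the 0.90/CDH3-era
property name (hbase.hregion.majorcompaction, in milliseconds; check your
version's hbase-default.xml). Setting it to 0 disables the periodic major
compactions described here, while a minor compaction that happens to include
every store file can still be promoted to major:

<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>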
> > > > > > > >> > > >
> > > > > > > >> > > > > The machines don't have swap, so the swappiness
> > > parameter
> > > > > > don't
> > > > > > > >> > > > > seem to apply here. Any other suggestion?
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > > > See the perf section of the hbase manual.  It has our
> > > > current
> > > > > > > list.
> > > > > > > >> > > >
> > > > > > > >> > > > Are you monitoring your cluster w/ ganglia or tsdb?
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > > > St.Ack
> > > > > > > >> > > >
> > > > > > > >> > > > > Thanks.
> > > > > > > >> > > > >
> > > > > > > >> > > > > 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
> > > > > > > >> > > > >
> > > > > > > >> > > > >> I will investigate this, thanks for the response.
> > > > > > > >> > > > >>
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
> > > > > > > >> > > > >>
> > > > > > > >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn:
> Client
> > > > > session
> > > > > > > >> > > > >>> timed out, have not heard from server in 61103ms
> for
> > > > > > sessionid
> > > > > > > >> > > > >>> 0x23462a4cf93a8fc, closing socket connection and
> > > > > attempting
> > > > > > > >> > > > >>> reconnect
> > > > > > > >> > > > >>>
> > > > > > > >> > > > >>> It looks like the process has been unresponsive
> for
> > > some
> > > > > > time,
> > > > > > > >> > > > >>> so ZK
> > > > > > > >> > > > has
> > > > > > > >> > > > >>> terminated the session.  Did you experience a long
> > GC
> > > > > pause
> > > > > > > >> > > > >>> right
> > > > > > > >> > > > before
> > > > > > > >> > > > >>> this?  If you don't have GC logging enabled for
> the
> > > RS,
> > > > > you
> > > > > > > can
> > > > > > > >> > > > sometimes
> > > > > > > >> > > > >>> tell by noticing a gap in the timestamps of the
> log
> > > > > > statements
> > > > > > > >> > > > >>> leading
> > > > > > > >> > > > up
> > > > > > > >> > > > >>> to the crash.
> > > > > > > >> > > > >>>
> > > > > > > >> > > > >>> If it turns out to be GC, you might want to look
> at
> > > your
> > > > > > > kernel
> > > > > > > >> > > > >>> swappiness setting (set it to 0) and your JVM
> > params.
> > > > > > > >> > > > >>>
> > > > > > > >> > > > >>> Sandy
> > > > > > > >> > > > >>>
> > > > > > > >> > > > >>>
> > > > > > > >> > > > >>> > -----Original Message-----
> > > > > > > >> > > > >>> > From: Leonardo Gamas [mailto:
> > > > leogamas@jusbrasil.com.br]
> > > > > > > >> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> > > > > > > >> > > > >>> > To: user@hbase.apache.org
> > > > > > > >> > > > >>> > Subject: RegionServer dying every two or three
> > days
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > Hi,
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4
> machines
> > > (1
> > > > > > > Master +
> > > > > > > >> > > > >>> > 3
> > > > > > > >> > > > >>> Slaves),
> > > > > > > >> > > > >>> > running on Amazon EC2. The master is a
> High-Memory
> > > > Extra
> > > > > > > Large
> > > > > > > >> > > > Instance
> > > > > > > >> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster
> and
> > > > > > > Zookeeper.
> > > > > > > >> > > > >>> > The slaves are Extra Large Instances (m1.xlarge)
> > > > running
> > > > > > > >> > > > >>> > Datanode,
> > > > > > > >> > > > >>> TaskTracker,
> > > > > > > >> > > > >>> > RegionServer and Zookeeper.
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > From time to time, every two or three days, one
> of
> > > the
> > > > > > > >> > > > >>> > RegionServers processes goes down, but the other
> > > > > processes
> > > > > > > >> > > > >>> > (DataNode, TaskTracker,
> > > > > > > >> > > > >>> > Zookeeper) continue normally.
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > Reading the logs:
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > The connection with Zookeeper timed out:
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > ---------------------------
> > > > > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn:
> > Client
> > > > > > session
> > > > > > > >> > > > >>> > timed
> > > > > > > >> > > > out,
> > > > > > > >> > > > >>> have
> > > > > > > >> > > > >>> > not heard from server in 61103ms for sessionid
> > > > > > > >> > > > >>> > 0x23462a4cf93a8fc,
> > > > > > > >> > > > >>> closing
> > > > > > > >> > > > >>> > socket connection and attempting reconnect
> > > > > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn:
> > Client
> > > > > > session
> > > > > > > >> > > > >>> > timed
> > > > > > > >> > > > out,
> > > > > > > >> > > > >>> have
> > > > > > > >> > > > >>> > not heard from server in 61205ms for sessionid
> > > > > > > >> > > > >>> > 0x346c561a55953e,
> > > > > > > >> > > > closing
> > > > > > > >> > > > >>> > socket connection and attempting reconnect
> > > > > > > >> > > > >>> > ---------------------------
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > And the Handlers start to fail:
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > ---------------------------
> > > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC
> Server
> > > > > > > Responder,
> > > > > > > >> > > > >>> > call
> > > > > > > >> > > > >>> >
> > > > > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf
> > > > > > )
> > > > > > > >> > > > >>> > from
> > > > > > > >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> > > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC
> Server
> > > > > handler
> > > > > > > 81
> > > > > > > >> > > > >>> > on
> > > > > > > >> > > > 60020
> > > > > > > >> > > > >>> > caught: java.nio.channels.ClosedChannelException
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > >
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> > > > > > > >> > > > 13
> > > > > > > >> > > > >>> > 3)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>>
> > > > > > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > > >
> > > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > > > > > > >> > > > >>> > 1341)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> > > > > > > >> > > > >>> > ns
> > > > > > > >> > > > >>> > e(HB
> > > > > > > >> > > > >>> > aseServer.java:727)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> > > > > > > >> > > > >>> > as
> > > > > > > >> > > > >>> > eSe
> > > > > > > >> > > > >>> > rver.java:792)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > > > > >> > > > :1
> > > > > > > >> > > > >>> > 083)
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC
> Server
> > > > > > > Responder,
> > > > > > > >> > > > >>> > call
> > > > > > > >> > > > >>> >
> > > > > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430
> > > > > > )
> > > > > > > >> > > > >>> > from
> > > > > > > >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> > > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC
> Server
> > > > > handler
> > > > > > > 62
> > > > > > > >> > > > >>> > on
> > > > > > > >> > > > 60020
> > > > > > > >> > > > >>> > caught: java.nio.channels.ClosedChannelException
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > >
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> > > > > > > >> > > > 13
> > > > > > > >> > > > >>> > 3)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>>
> > > > > > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > > >
> > > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > > > > > > >> > > > >>> > 1341)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> > > > > > > >> > > > >>> > ns
> > > > > > > >> > > > >>> > e(HB
> > > > > > > >> > > > >>> > aseServer.java:727)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> > > > > > > >> > > > >>> > as
> > > > > > > >> > > > >>> > eSe
> > > > > > > >> > > > >>> > rver.java:792)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > > > > >> > > > :1
> > > > > > > >> > > > >>> > 083)
> > > > > > > >> > > > >>> > ---------------------------
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > And finally the server throws a
> > YouAreDeadException
> > > > :( :
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > ---------------------------
> > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> > Opening
> > > > > > socket
> > > > > > > >> > > > connection
> > > > > > > >> > > > >>> to
> > > > > > > >> > > > >>> > server
> > ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> > Socket
> > > > > > > connection
> > > > > > > >> > > > >>> > established to
> > > > > > ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> > > > > > > >> > > > initiating
> > > > > > > >> > > > >>> session
> > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> > Unable
> > > to
> > > > > > > >> > > > >>> > reconnect to ZooKeeper service, session
> > > > > 0x23462a4cf93a8fc
> > > > > > > has
> > > > > > > >> > > > >>> > expired, closing
> > > > > > > >> > > > socket
> > > > > > > >> > > > >>> > connection
> > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> > Opening
> > > > > > socket
> > > > > > > >> > > > connection
> > > > > > > >> > > > >>> to
> > > > > > > >> > > > >>> > server
> > ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> > Socket
> > > > > > > connection
> > > > > > > >> > > > >>> > established to
> > > > > > ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> > > > > > > >> > > > initiating
> > > > > > > >> > > > >>> session
> > > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> > Unable
> > > to
> > > > > > > >> > > > >>> > reconnect to ZooKeeper service, session
> > > > 0x346c561a55953e
> > > > > > has
> > > > > > > >> > > > >>> > expired, closing
> > > > > > > >> > > > socket
> > > > > > > >> > > > >>> > connection
> > > > > > > >> > > > >>> > 11/12/29 00:01:03 FATAL
> > regionserver.HRegionServer:
> > > > > > ABORTING
> > > > > > > >> > > > >>> > region server
> > > > > > > >> > > > >>> >
> > > > > > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> > > > > > > >> > > > >>> > load=(requests=447, regions=206, usedHeap=1584,
> > > > > > > >> > maxHeap=4083):
> > > > > > > >> > > > >>> > Unhandled
> > > > > > > >> > > > >>> > exception:
> > > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > > > >> > Server
> > > > > > > >> > > > >>> > REPORT rejected; currently processing
> > > > > > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> as
> > > > dead
> > > > > > > server
> > > > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > Server
> > > > > REPORT
> > > > > > > >> > > > >>> > rejected; currently processing
> > > > > > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > > > > > >> > > > as
> > > > > > > >> > > > >>> > dead server
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > >
> > > > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > > > > > >> > > > >>> > Method)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> > > >
> > > > > > >
> > > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruc
> > > > > > > >> > > > to
> > > > > > > >> > > > r
> > > > > > > >> > > > >>> > AccessorImpl.java:39)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> > > >
> > > > > > >
> > > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Delegating
> > > > > > > >> > > > Co
> > > > > > > >> > > > n
> > > > > > > >> > > > >>> > structorAccessorImpl.java:27)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>>
> > > > > > > java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > org.apache.hadoop.ipc.RemoteException.instantiateException(Rem
> > > > > > > >> > > > >>> > ot
> > > > > > > >> > > > >>> > eExce
> > > > > > > >> > > > >>> > ption.java:95)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Re
> > > > > > > >> > > > >>> > mo
> > > > > > > >> > > > >>> > te
> > > > > > > >> > > > >>> > Exception.java:79)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > > > > > > >> > > > >>> > rv
> > > > > > > >> > > > >>> > erRep
> > > > > > > >> > > > >>> > ort(HRegionServer.java:735)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > >
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer
> > > > > > > >> > > > .j
> > > > > > > >> > > > >>> > ava:596)
> > > > > > > >> > > > >>> >         at java.lang.Thread.run(Thread.java:662)
> > > > > > > >> > > > >>> > Caused by:
> org.apache.hadoop.ipc.RemoteException:
> > > > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > Server
> > > > > REPORT
> > > > > > > >> > > > >>> > rejected; currently processing
> > > > > > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > > > > > >> > > > as
> > > > > > > >> > > > >>> > dead server
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(Serve
> > > > > > > >> > > > >>> > rM
> > > > > > > >> > > > >>> > ana
> > > > > > > >> > > > >>> > ger.java:204)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > org.apache.hadoop.hbase.master.ServerManager.regionServerRepor
> > > > > > > >> > > > >>> > t(
> > > > > > > >> > > > >>> > Serv
> > > > > > > >> > > > >>> > erManager.java:262)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMas
> > > > > > > >> > > > >>> > te
> > > > > > > >> > > > >>> > r.jav
> > > > > > > >> > > > >>> > a:669)
> > > > > > > >> > > > >>> >         at
> > > > > > > sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> > > > > > > >> > > > Source)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMeth
> > > > > > > >> > > > >>> > od
> > > > > > > >> > > > >>> > Acces
> > > > > > > >> > > > >>> > sorImpl.java:25)
> > > > > > > >> > > > >>> >         at
> > > > > > java.lang.reflect.Method.invoke(Method.java:597)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > > > > >> > > > :1
> > > > > > > >> > > > >>> > 039)
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > > > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.j
> > > > > > > >> > > > >>> > av
> > > > > > > >> > > > >>> > a:257
> > > > > > > >> > > > >>> > )
> > > > > > > >> > > > >>> >         at $Proxy6.regionServerReport(Unknown
> > > Source)
> > > > > > > >> > > > >>> >         at
> > > > > > > >> > > > >>> >
> > > > > > > >> >
> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > > > > > > >> > > > >>> > rv
> > > > > > > >> > > > >>> > erRep
> > > > > > > >> > > > >>> > ort(HRegionServer.java:729)
> > > > > > > >> > > > >>> >         ... 2 more
> > > > > > > >> > > > >>> > 11/12/29 00:01:03 INFO
> regionserver.HRegionServer:
> > > > Dump
> > > > > of
> > > > > > > >> > metrics:
> > > > > > > >> > > > >>> > requests=66, regions=206, stores=2078,
> > > storefiles=970,
> > > > > > > >> > > > >>> > storefileIndexSize=78, memstoreSize=796,
> > > > > > > >> > > > >>> > compactionQueueSize=0, flushQueueSize=0,
> > > > usedHeap=1672,
> > > > > > > >> > > > >>> > maxHeap=4083, blockCacheSize=705907552,
> > > > > > > >> > > > >>> > blockCacheFree=150412064, blockCacheCount=10648,
> > > > > > > >> > > > >>> > blockCacheHitCount=79578618,
> > > > > blockCacheMissCount=3036335,
> > > > > > > >> > > > >>> > blockCacheEvictedCount=1401352,
> > > blockCacheHitRatio=96,
> > > > > > > >> > > > >>> > blockCacheHitCachingRatio=98
> > > > > > > >> > > > >>> > 11/12/29 00:01:03 INFO
> regionserver.HRegionServer:
> > > > > > STOPPED:
> > > > > > > >> > > > >>> > Unhandled
> > > > > > > >> > > > >>> > exception:
> > > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > > > >> > Server
> > > > > > > >> > > > >>> > REPORT rejected; currently processing
> > > > > > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> as
> > > > dead
> > > > > > > server
> > > > > > > >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping
> > > > server
> > > > > on
> > > > > > > >> > > > >>> > 60020
> > > > > > > >> > > > >>> > ---------------------------
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > Then i restart the RegionServer and everything
> is
> > > back
> > > > > to
> > > > > > > >> normal.
> > > > > > > >> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker
> > > logs,
> > > > i
> > > > > > > don't
> > > > > > > >> > > > >>> > see any abnormality in the same time window.
> > > > > > > >> > > > >>> > I think it was caused by the lost of connection
> to
> > > > > > > zookeeper.
> > > > > > > >> > > > >>> > Is it
> > > > > > > >> > > > >>> advisable to
> > > > > > > >> > > > >>> > run zookeeper in the same machines?
> > > > > > > >> > > > >>> > if the RegionServer lost it's connection to
> > > Zookeeper,
> > > > > > > there's
> > > > > > > >> > > > >>> > a way
> > > > > > > >> > > > (a
> > > > > > > >> > > > >>> > configuration perhaps) to re-join the cluster,
> and
> > > not
> > > > > > only
> > > > > > > >> die?
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > Any idea what is causing this?? Or to prevent it
> > > from
> > > > > > > >> happening?
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > Any help is appreciated.
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > Best Regards,
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > --
> > > > > > > >> > > > >>> >
> > > > > > > >> > > > >>> > *Leonardo Gamas*
> > > > > > > >> > > > >>> > Software Engineer
> > > > > > > >> > > > >>> > +557134943514
> > > > > > > >> > > > >>> > +557581347440
> > > > > > > >> > > > >>> > leogamas@jusbrasil.com.br
> > > > > > > >> > > > >>> > www.jusbrasil.com.br
> > > > > > > >> > > > >>>
> > > > > > > >> > > > >>
> > > > > > > >> > > > >>
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> --
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> *Leonardo Gamas*
> > > > > > > >> > > > >> Software Engineer/Chaos Monkey Engineer T (71)
> > > 3494-3514C
> > > > > > > (75)
> > > > > > > >> > > > >> 8134-7440 leogamas@jusbrasil.com.br
> > > www.jusbrasil.com.br
> > > > > > > >> > > > >>
> > > > > > > >> > > > >>
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > > --
> > > > > > > >> > > > >
> > > > > > > >> > > > > *Leonardo Gamas*
> > > > > > > >> > > > > Software Engineer/Chaos Monkey Engineer T (71)
> > > 3494-3514C
> > > > > > (75)
> > > > > > > >> > > > > 8134-7440 leogamas@jusbrasil.com.br
> > > www.jusbrasil.com.br
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > --
> > > > > > > >> > >
> > > > > > > >> > > *Leonardo Gamas*
> > > > > > > >> > > Software Engineer
> > > > > > > >> > > +557134943514
> > > > > > > >> > > +557581347440
> > > > > > > >> > > leogamas@jusbrasil.com.br
> > > > > > > >> > > www.jusbrasil.com.br
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > --
> > > > > > > >> >
> > > > > > > >> > *Leonardo Gamas*
> > > > > > > >> > Software Engineer
> > > > > > > >> > T +55 (71) 3494-3514
> > > > > > > >> > C +55 (75) 8134-7440
> > > > > > > >> > leogamas@jusbrasil.com.br
> > > > > > > >> > www.jusbrasil.com.br
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > *Leonardo Gamas*
> > > > > > > >
> > > > > > > > Software Engineer
> > > > > > > > T +55 (71) 3494-3514
> > > > > > > > C +55 (75) 8134-7440
> > > > > > > > leogamas@jusbrasil.com.br
> > > > > > > >
> > > > > > > > www.jusbrasil.com.br
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > *Leonardo Gamas*
> > > > > > > Software Engineer
> > > > > > > T +55 (71) 3494-3514
> > > > > > > C +55 (75) 8134-7440
> > > > > > > leogamas@jusbrasil.com.br
> > > > > > > www.jusbrasil.com.br
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > *Leonardo Gamas*
> > > > > Software Engineer
> > > > > T +55 (71) 3494-3514
> > > > > C +55 (75) 8134-7440
> > > > > leogamas@jusbrasil.com.br
> > > > > www.jusbrasil.com.br
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > *Leonardo Gamas*
> > > Software Engineer
> > > T +55 (71) 3494-3514
> > > C +55 (75) 8134-7440
> > > leogamas@jusbrasil.com.br
> > > www.jusbrasil.com.br
> > >
> >
>
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer
> T +55 (71) 3494-3514
> C +55 (75) 8134-7440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br
>

Re: RegionServer dying every two or three days

Posted by Leonardo Gamas <le...@jusbrasil.com.br>.
Thanks again Matt! I will try out this instance type, but I'm concerned about
running the MapReduce cluster apart from HBase in my case, since we have some
MapReduce jobs running and plan to run more. It feels like losing the great
strength of MapReduce by running it far from the data.

2012/1/21 Matt Corgan <mc...@hotpads.com>

> We actually don't run map/reduce on the same machines (most of our jobs are
> on an old message-based system), so don't have much experience there.  We
> run only HDFS (1G heap) and HBase (5.5G heap) with 12 * 100GB EBS volumes
> per regionserver, and ~350 regions/server at the moment.  5.5G is already a
> small heap in the hbase world, so I wouldn't recommend decreasing it to fit
> M/R.  You could always run map/reduce on separate servers, adding or
> removing servers as needed (more at night?), or use Amazon's Elastic M/R.
>
>
> On Sat, Jan 21, 2012 at 5:04 AM, Leonardo Gamas
> <le...@jusbrasil.com.br>wrote:
>
> > Thanks Matt for this insightful article, I will run my cluster with
> > c1.xlarge to test its performance. But I'm concerned about this machine
> > because of the amount of RAM available, only 7GB. How many map/reduce slots do
> > you configure? And the amount of Heap for HBase? How many regions per
> > RegionServer could my cluster support?
> >
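One hedged way to reason about the regions-per-server question is the usual
memstore back-of-the-envelope, assuming the era's defaults of
hbase.regionserver.global.memstore.upperLimit = 0.4 and
hbase.hregion.memstore.flush.size = 64 MB (verify both against your
hbase-default.xml):

  regions actively taking writes ~= (heap * memstore upper limit) / (flush size * column families)

  e.g. 4 GB heap, 1 column family: (4096 MB * 0.4) / 64 MB ~= 25 regions

Read-mostly regions are much cheaper, which is how a server can carry a few
hundred regions (as Matt describes above) even though only a few dozen can
take sustained writes before memstore pressure forces small flushes.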
> > 2012/1/20 Matt Corgan <mc...@hotpads.com>
> >
> > > I run c1.xlarge servers and have found them very stable.  I see 100
> > Mbit/s
> > > sustained bi-directional network throughput (200Mbit/s total),
> sometimes
> > up
> > > to 150 * 2 Mbit/s.
> > >
> > > Here's a pretty thorough examination of the underlying hardware:
> > >
> > >
> > >
> >
> http://huanliu.wordpress.com/2010/06/14/amazons-physical-hardware-and-ec2-compute-unit/
> > >
> > >
> > > *High-CPU instances*
> > >
> > > The high-CPU instances (c1.medium, c1.xlarge) run on systems with
> > > dual-socket Intel Xeon E5410 2.33GHz processors. It is dual-socket
> > because
> > > we see APIC IDs 0 to 7, and E5410 only has 4 cores. A c1.xlarge
> instance
> > > almost takes up the whole physical machine. However, we frequently
> > observe
> > > steal cycle on a c1.xlarge instance ranging from 0% to 25% with an
> > average
> > > of about 10%. The amount of steal cycle is not enough to host another
> > > smaller VM, i.e., a c1.medium. Maybe those steal cycles are used to run
> > > Amazon’s software firewall (security group). On Passmark-CPU mark, a
> > > c1.xlarge machine achieves 7,962.6, actually higher than an average
> > > dual-sock E5410 system is able to achieve (average is 6,903).
> > >
> > >
> > >
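Since steal cycles come up in the quoted analysis, a quick way to observe them
from inside a guest, using standard Linux tools (a sketch only):

$ vmstat 5        # the last column, st, is hypervisor steal time
$ top             # watch the %st field in the Cpu(s) summary line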
> > > On Fri, Jan 20, 2012 at 8:03 AM, Leonardo Gamas
> > > <le...@jusbrasil.com.br>wrote:
> > >
> > > > Thanks Neil for sharing your experience with AWS! Could you tell us what
> > > > instance type you are using?
> > > > We are using m1.xlarge, which has 4 virtual cores, but I normally see
> > > > recommendations for machines with 8 cores like c1.xlarge, m2.4xlarge, etc.
> > > > In principle these 8-core machines shouldn't suffer as much from I/O
> > > > contention, since they don't share the physical server. Is there any
> > > > information from Amazon or another source that confirms that, or is it
> > > > based on empirical analysis?
> > > >
> > > > 2012/1/19 Neil Yalowitz <ne...@gmail.com>
> > > >
> > > > > We have experienced many problems with our cluster on EC2.  The blunt
> > > > > solution was to increase the Zookeeper timeout to 5 minutes or even more.
> > > > >
> > > > > Even with a long timeout, however, it's not uncommon for us to see an EC2
> > > > > instance become unresponsive to pings and SSH several times during a
> > > > > week.  It's been a very bad environment for clusters.
> > > > >
> > > > >
> > > > > Neil
> > > > >
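A rough sketch of what the longer timeout Neil mentions could look like. The
client-side property is the same one quoted earlier in this thread; the
server-side cap is an assumption worth verifying, since ZooKeeper negotiates
the session timeout and by default limits it to 20 * tickTime (40 seconds with
the usual tickTime of 2000 ms):

hbase-site.xml on the RegionServers:

<property>
  <name>zookeeper.session.timeout</name>
  <value>300000</value>
</property>

zoo.cfg on the ZooKeeper ensemble, so the requested 5 minutes is not silently
capped:

tickTime=2000
maxSessionTimeout=300000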
> > > > > On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas
> > > > > <le...@jusbrasil.com.br>wrote:
> > > > >
> > > > > > Hi Guys,
> > > > > >
> > > > > > I have tested the parameters provided by Sandy, and -XX:+UseParallelOldGC
> > > > > > solved the GC problems; thanks for the help, Sandy.
> > > > > > I'm still experiencing some difficulties: the RegionServer continues to
> > > > > > shut down, but it now seems related to I/O. It starts to time out many
> > > > > > connections, new connections to/from the machine time out too, and finally
> > > > > > the RegionServer dies with a YouAreDeadException. I will collect more
> > > > > > data, but I think it's an issue inherent to the Amazon/virtualized
> > > > > > environment.
> > > > > >
> > > > > > Thanks for the great help provided so far.
> > > > > >
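For reference, a sketch of how the collector change described above might be
wired in, assuming the RegionServer JVM flags live in conf/hbase-env.sh under
HBASE_REGIONSERVER_OPTS (adjust to wherever your distribution sets them; this
is not the exact line used in this thread):

# throughput collector with parallel old-generation collection,
# replacing the CMS flags quoted further down in this thread
export HBASE_REGIONSERVER_OPTS="-Xmx4096m -server \
  -XX:+UseParallelGC -XX:+UseParallelOldGC \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log"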
> > > > > > 2012/1/5 Leonardo Gamas <le...@jusbrasil.com.br>
> > > > > >
> > > > > > > I don't think so; if Amazon had stopped the machine it would cause a
> > > > > > > stall of minutes, not seconds, and the DataNode, TaskTracker and
> > > > > > > Zookeeper continue to work normally.
> > > > > > > But it could be related to the shared nature of the Amazon environment,
> > > > > > > maybe a spike in I/O caused by another virtualized server on the same
> > > > > > > physical machine.
> > > > > > >
> > > > > > > But the instance type I'm using:
> > > > > > >
> > > > > > > *Extra Large Instance*
> > > > > > >
> > > > > > > 15 GB memory
> > > > > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> > > > > > > 1,690 GB instance storage
> > > > > > > 64-bit platform
> > > > > > > I/O Performance: High
> > > > > > > API name: m1.xlarge
> > > > > > > I was not expecting to suffer from these problems, or at least not much.
> > > > > > >
> > > > > > >
> > > > > > > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> > > > > > >
> > > > > > >> You think it's an Amazon problem maybe?  Like they paused or
> > > > migrated
> > > > > > >> your virtual machine, and it just happens to be during GC,
> > leaving
> > > > us
> > > > > to
> > > > > > >> think the GC ran long when it didn't?  I don't have a lot of
> > > > > experience
> > > > > > >> with Amazon so I don't know if that sort of thing is common.
> > > > > > >>
> > > > > > >> > -----Original Message-----
> > > > > > >> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > > > >> > Sent: Thursday, January 05, 2012 13:15
> > > > > > >> > To: user@hbase.apache.org
> > > > > > >> > Subject: Re: RegionServer dying every two or three days
> > > > > > >> >
> > > > > > >> > I checked the CPU utilization graphs provided by Amazon (they're not
> > > > > > >> > very accurate, since the sample interval is about 5 minutes) and don't
> > > > > > >> > see any abnormality. I will set up TSDB with Nagios to have a more
> > > > > > >> > reliable source of performance data.
> > > > > > >> >
> > > > > > >> > The machines don't have swap space; if I run:
> > > > > > >> >
> > > > > > >> > $ swapon -s
> > > > > > >> >
> > > > > > >> > to display a swap usage summary, it returns an empty list.
> > > > > > >> >
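A quick sanity check that covers both swap devices and the kernel setting
Sandy mentioned, using standard Linux commands (a sketch only):

$ swapon -s                       # empty output means no swap devices
$ free -m                         # the Swap: row should show 0 total
$ cat /proc/sys/vm/swappiness     # the kernel knob Sandy referred to
$ sudo sysctl -w vm.swappiness=0  # only matters if swap is ever added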
> > > > > > >> > I will drop -XX:+DoEscapeAnalysis and -XX:+AggressiveOpts in my tests.
> > > > > > >> >
> > > > > > >> > I haven't paid much attention to the value of the new size param.
> > > > > > >> >
> > > > > > >> > Thanks again for the help!!
> > > > > > >> >
> > > > > > >> > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> > > > > > >> >
> > > > > > >> > > That size heap doesn't seem like it should cause a 36
> second
> > > GC
> > > > (a
> > > > > > >> > > minor GC even if I remember your logs correctly), so I
> tend
> > to
> > > > > think
> > > > > > >> > > that other things are probably going on.
> > > > > > >> > >
> > > > > > >> > > This line here:
> > > > > > >> > >
> > > > > > >> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
> > > > > > >> > > 954388K->849478K(1705776K), 0.0364200 secs]
> > > > > > >> > > [Times: user=0.05 sys=0.01, real=36.96 secs]
> > > > > > >> > >
> > > > > > >> > > is really mysterious to me.  It seems to indicate that the
> > > > process
> > > > > > was
> > > > > > >> > > blocked for almost 37 seconds during a minor collection.
> >  Note
> > > > the
> > > > > > CPU
> > > > > > >> > > times are very low but the wall time is very high.  If it
> > was
> > > > > > actually
> > > > > > >> > > doing GC work, I'd expect to see user time higher than
> real
> > > > time,
> > > > > as
> > > > > > >> > > it is in other parallel collections (see your log
> snippet).
> > > >  Were
> > > > > > you
> > > > > > >> > > really so CPU starved that it took 37 seconds to get in
> 50ms
> > > of
> > > > > > work?
> > > > > > >> > > I can't make sense of that.  I'm trying to think of
> > something
> > > > that
> > > > > > >> > > would block you for that long while all your threads are
> > > stopped
> > > > > for
> > > > > > >> > > GC, other than being in swap, but I can't come up with
> > > anything.
> > > > > > >>  You're
> > > > > > >> > certain you're not in swap?
> > > > > > >> > >
> > > > > > >> > > Maybe drop these guys -XX:+DoEscapeAnalysis
> > > -XX:+AggressiveOpts
> > > > > > while
> > > > > > >> > > you troubleshoot?
> > > > > > >> > >
> > > > > > >> > > Why is your new size so small?  This generally means that
> > > > > relatively
> > > > > > >> > > more objects are being tenured than would be with a larger
> > new
> > > > > size.
> > > > > > >> > > This could make collections of the old gen worse (GC time
> is
> > > > said
> > > > > to
> > > > > > >> > > be proportional to the number of live objects in the
> > > generation,
> > > > > and
> > > > > > >> > > CMS does indeed cause STW pauses).  A typical new to
> tenured
> > > > ratio
> > > > > > >> > > might be 1:3.  Were the new gen GCs taking too long?  This
> > is
> > > > > > probably
> > > > > > >> > > orthogonal to your immediate issue, though.
> > > > > > >> > >
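To make the new-size point concrete: with the 4 GB heap and -XX:NewSize=128m
quoted in this thread, the new:tenured ratio is roughly 1:31. A 1:3 split
would look something like the following (illustrative numbers only, not a
recommendation made in this thread):

-Xmx4096m -XX:NewSize=1024m -XX:MaxNewSize=1024m
(or, equivalently, -Xmn1024m)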
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > -----Original Message-----
> > > > > > >> > > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > > > >> > > Sent: Thursday, January 05, 2012 5:33 AM
> > > > > > >> > > To: user@hbase.apache.org
> > > > > > >> > > Subject: Re: RegionServer dying every two or three days
> > > > > > >> > >
> > > > > > >> > >  St.Ack,
> > > > > > >> > >
> > > > > > >> > > I haven't made any attempt at GC tuning yet.
> > > > > > >> > > I will read the perf section as suggested.
> > > > > > >> > > I'm currently using Nagios + JMX to monitor the cluster, but it's
> > > > > > >> > > currently used for alerting only; the perfdata is not being stored,
> > > > > > >> > > so it's kind of useless right now. I was thinking of using TSDB to
> > > > > > >> > > store it; any known case of integration?
> > > > > > >> > > ---
> > > > > > >> > >
> > > > > > >> > > Sandy,
> > > > > > >> > >
> > > > > > >> > > Yes, my timeout is 30 seconds:
> > > > > > >> > >
> > > > > > >> > > <property>
> > > > > > >> > >   <name>zookeeper.session.timeout</name>
> > > > > > >> > >   <value>30000</value>
> > > > > > >> > > </property>
> > > > > > >> > >
> > > > > > >> > > To our application it's a tolerable time to wait in case a
> > > > > > >> > > RegionServer goes offline.
> > > > > > >> > >
> > > > > > >> > > My heap is 4GB and my JVM params are:
> > > > > > >> > >
> > > > > > >> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> > > > > > >> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> > > > > > >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis
> > -XX:+AggressiveOpts
> > > > > > >> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > > > > > >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> > > > > > >> > >
> > > > > > >> > > I will try the -XX:+UseParallelOldGC param and post my
> > > feedback
> > > > > > here.
> > > > > > >> > > ---
> > > > > > >> > >
> > > > > > >> > > Ramkrishna,
> > > > > > >> > >
> > > > > > >> > > Seems the GC is the root of all evil in this case.
> > > > > > >> > > ----
> > > > > > >> > >
> > > > > > >> > > Thank you all for the answers. I will try out the valuable advice
> > > > > > >> > > given here and post my results.
> > > > > > >> > >
> > > > > > >> > > Leo Gamas.
> > > > > > >> > >
> > > > > > >> > > 2012/1/5 Ramkrishna S Vasudevan <
> > > > ramkrishna.vasudevan@huawei.com>
> > > > > > >> > >
> > > > > > >> > > > Recently we faced a similar problem and it was due to GC
> > > > config.
> > > > > > >> > > > Pls check your GC.
> > > > > > >> > > >
> > > > > > >> > > > Regards
> > > > > > >> > > > Ram
> > > > > > >> > > >
> > > > > > >> > > > -----Original Message-----
> > > > > > >> > > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com]
> On
> > > > > Behalf
> > > > > > Of
> > > > > > >> > > > Stack
> > > > > > >> > > > Sent: Thursday, January 05, 2012 2:50 AM
> > > > > > >> > > > To: user@hbase.apache.org
> > > > > > >> > > > Subject: Re: RegionServer dying every two or three days
> > > > > > >> > > >
> > > > > > >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> > > > > > >> > > > <le...@jusbrasil.com.br> wrote:
> > > > > > >> > > > > The third line took 36.96 seconds to execute, can this
> > be
> > > > > > causing
> > > > > > >> > > > > this problem?
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > > > Probably.  Have you made any attempt at GC tuning?
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > > Reading the code a little it seems that, even if it's
> > > > > disabled,
> > > > > > if
> > > > > > >> > > > > all files are target in a compaction, it's considered
> a
> > > > major
> > > > > > >> > > > > compaction. Is
> > > > > > >> > > > it
> > > > > > >> > > > > right?
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > > > That is right.  They get 'upgraded' from minor to major.
> > > > > > >> > > >
> > > > > > >> > > > This should be fine though.  What you are avoiding
> setting
> > > > major
> > > > > > >> > > > compactions to 0 is all regions being major compacted
> on a
> > > > > > period, a
> > > > > > >> > > > heavy weight effective rewrite of all your data (unless
> > > > already
> > > > > > >> major
> > > > > > >> > > > compacted).   It looks like you have this disabled which
> > is
> > > > good
> > > > > > >> until
> > > > > > >> > > > you've wrestled your cluster into submission.
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > > The machines don't have swap, so the swappiness
> > parameter
> > > > > don't
> > > > > > >> > > > > seem to apply here. Any other suggestion?
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > > > See the perf section of the hbase manual.  It has our
> > > current
> > > > > > list.
> > > > > > >> > > >
> > > > > > >> > > > Are you monitoring your cluster w/ ganglia or tsdb?
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > St.Ack
> > > > > > >> > > >
> > > > > > >> > > > > Thanks.
> > > > > > >> > > > >
> > > > > > >> > > > > 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
> > > > > > >> > > > >
> > > > > > >> > > > >> I will investigate this, thanks for the response.
> > > > > > >> > > > >>
> > > > > > >> > > > >>
> > > > > > >> > > > >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
> > > > > > >> > > > >>
> > > > > > >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client
> > > > session
> > > > > > >> > > > >>> timed out, have not heard from server in 61103ms for
> > > > > sessionid
> > > > > > >> > > > >>> 0x23462a4cf93a8fc, closing socket connection and
> > > > attempting
> > > > > > >> > > > >>> reconnect
> > > > > > >> > > > >>>
> > > > > > >> > > > >>> It looks like the process has been unresponsive for
> > some
> > > > > time,
> > > > > > >> > > > >>> so ZK
> > > > > > >> > > > has
> > > > > > >> > > > >>> terminated the session.  Did you experience a long
> GC
> > > > pause
> > > > > > >> > > > >>> right
> > > > > > >> > > > before
> > > > > > >> > > > >>> this?  If you don't have GC logging enabled for the
> > RS,
> > > > you
> > > > > > can
> > > > > > >> > > > sometimes
> > > > > > >> > > > >>> tell by noticing a gap in the timestamps of the log
> > > > > statements
> > > > > > >> > > > >>> leading
> > > > > > >> > > > up
> > > > > > >> > > > >>> to the crash.
> > > > > > >> > > > >>>
> > > > > > >> > > > >>> If it turns out to be GC, you might want to look at
> > your
> > > > > > kernel
> > > > > > >> > > > >>> swappiness setting (set it to 0) and your JVM
> params.
> > > > > > >> > > > >>>
> > > > > > >> > > > >>> Sandy
> > > > > > >> > > > >>>
> > > > > > >> > > > >>>
> > > > > > >> > > > >>> > -----Original Message-----
> > > > > > >> > > > >>> > From: Leonardo Gamas [mailto:
> > > leogamas@jusbrasil.com.br]
> > > > > > >> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> > > > > > >> > > > >>> > To: user@hbase.apache.org
> > > > > > >> > > > >>> > Subject: RegionServer dying every two or three
> days
> > > > > > >> > > > >>> >
> > > > > > >> > > > >>> > Hi,
> > > > > > >> > > > >>> >
> > > > > > >> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines
> > (1
> > > > > > Master +
> > > > > > >> > > > >>> > 3
> > > > > > >> > > > >>> Slaves),
> > > > > > >> > > > >>> > running on Amazon EC2. The master is a High-Memory
> > > Extra
> > > > > > Large
> > > > > > >> > > > Instance
> > > > > > >> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and
> > > > > > Zookeeper.
> > > > > > >> > > > >>> > The slaves are Extra Large Instances (m1.xlarge)
> > > running
> > > > > > >> > > > >>> > Datanode,
> > > > > > >> > > > >>> TaskTracker,
> > > > > > >> > > > >>> > RegionServer and Zookeeper.
> > > > > > >> > > > >>> >
> > > > > > >> > > > >>> > From time to time, every two or three days, one of
> > the
> > > > > > >> > > > >>> > RegionServers processes goes down, but the other
> > > > processes
> > > > > > >> > > > >>> > (DataNode, TaskTracker,
> > > > > > >> > > > >>> > Zookeeper) continue normally.
> > > > > > >> > > > >>> >
> > > > > > >> > > > >>> > Reading the logs:
> > > > > > >> > > > >>> >
> > > > > > >> > > > >>> > The connection with Zookeeper timed out:
> > > > > > >> > > > >>> >
> > > > > > >> > > > >>> > ---------------------------
> > > > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn:
> Client
> > > > > session
> > > > > > >> > > > >>> > timed
> > > > > > >> > > > out,
> > > > > > >> > > > >>> have
> > > > > > >> > > > >>> > not heard from server in 61103ms for sessionid
> > > > > > >> > > > >>> > 0x23462a4cf93a8fc,
> > > > > > >> > > > >>> closing
> > > > > > >> > > > >>> > socket connection and attempting reconnect
> > > > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn:
> Client
> > > > > session
> > > > > > >> > > > >>> > timed
> > > > > > >> > > > out,
> > > > > > >> > > > >>> have
> > > > > > >> > > > >>> > not heard from server in 61205ms for sessionid
> > > > > > >> > > > >>> > 0x346c561a55953e,
> > > > > > >> > > > closing
> > > > > > >> > > > >>> > socket connection and attempting reconnect
> > > > > > >> > > > >>> > ---------------------------
> > > > > > >> > > > >>> >
> > > > > > >> > > > >>> > And the Handlers start to fail:
> > > > > > >> > > > >>> >
> > > > > > >> > > > >>> > ---------------------------
> > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server
> > > > > > Responder,
> > > > > > >> > > > >>> > call
> > > > > > >> > > > >>> >
> > > > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf
> > > > > )
> > > > > > >> > > > >>> > from
> > > > > > >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server
> > > > handler
> > > > > > 81
> > > > > > >> > > > >>> > on
> > > > > > >> > > > 60020
> > > > > > >> > > > >>> > caught: java.nio.channels.ClosedChannelException
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> > > >
> > > > > > >> >
> > > > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> > > > > > >> > > > 13
> > > > > > >> > > > >>> > 3)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>>
> > > > > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> > > >
> > > > > > >> >
> > > > >
> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > > > > > >> > > > >>> > 1341)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> > > > > > >> > > > >>> > ns
> > > > > > >> > > > >>> > e(HB
> > > > > > >> > > > >>> > aseServer.java:727)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> > > > > > >> > > > >>> > as
> > > > > > >> > > > >>> > eSe
> > > > > > >> > > > >>> > rver.java:792)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> > > >
> > > > > > >> >
> > > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > > > >> > > > :1
> > > > > > >> > > > >>> > 083)
> > > > > > >> > > > >>> >
> > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server
> > > > > > Responder,
> > > > > > >> > > > >>> > call
> > > > > > >> > > > >>> >
> > > > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430
> > > > > )
> > > > > > >> > > > >>> > from
> > > > > > >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> > > > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server
> > > > handler
> > > > > > 62
> > > > > > >> > > > >>> > on
> > > > > > >> > > > 60020
> > > > > > >> > > > >>> > caught: java.nio.channels.ClosedChannelException
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> > > >
> > > > > > >> >
> > > > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> > > > > > >> > > > 13
> > > > > > >> > > > >>> > 3)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>>
> > > > > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> > > >
> > > > > > >> >
> > > > >
> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > > > > > >> > > > >>> > 1341)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> > > > > > >> > > > >>> > ns
> > > > > > >> > > > >>> > e(HB
> > > > > > >> > > > >>> > aseServer.java:727)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> > > > > > >> > > > >>> > as
> > > > > > >> > > > >>> > eSe
> > > > > > >> > > > >>> > rver.java:792)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> > > >
> > > > > > >> >
> > > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > > > >> > > > :1
> > > > > > >> > > > >>> > 083)
> > > > > > >> > > > >>> > ---------------------------
> > > > > > >> > > > >>> >
> > > > > > >> > > > >>> > And finally the server throws a
> YouAreDeadException
> > > :( :
> > > > > > >> > > > >>> >
> > > > > > >> > > > >>> > ---------------------------
> > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> Opening
> > > > > socket
> > > > > > >> > > > connection
> > > > > > >> > > > >>> to
> > > > > > >> > > > >>> > server
> ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> Socket
> > > > > > connection
> > > > > > >> > > > >>> > established to
> > > > > ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> > > > > > >> > > > initiating
> > > > > > >> > > > >>> session
> > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> Unable
> > to
> > > > > > >> > > > >>> > reconnect to ZooKeeper service, session
> > > > 0x23462a4cf93a8fc
> > > > > > has
> > > > > > >> > > > >>> > expired, closing
> > > > > > >> > > > socket
> > > > > > >> > > > >>> > connection
> > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> Opening
> > > > > socket
> > > > > > >> > > > connection
> > > > > > >> > > > >>> to
> > > > > > >> > > > >>> > server
> ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> Socket
> > > > > > connection
> > > > > > >> > > > >>> > established to
> > > > > ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> > > > > > >> > > > initiating
> > > > > > >> > > > >>> session
> > > > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn:
> Unable
> > to
> > > > > > >> > > > >>> > reconnect to ZooKeeper service, session
> > > 0x346c561a55953e
> > > > > has
> > > > > > >> > > > >>> > expired, closing
> > > > > > >> > > > socket
> > > > > > >> > > > >>> > connection
> > > > > > >> > > > >>> > 11/12/29 00:01:03 FATAL
> regionserver.HRegionServer:
> > > > > ABORTING
> > > > > > >> > > > >>> > region server
> > > > > > >> > > > >>> >
> > > > > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> > > > > > >> > > > >>> > load=(requests=447, regions=206, usedHeap=1584,
> > > > > > >> > maxHeap=4083):
> > > > > > >> > > > >>> > Unhandled
> > > > > > >> > > > >>> > exception:
> > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > > >> > Server
> > > > > > >> > > > >>> > REPORT rejected; currently processing
> > > > > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
> > > dead
> > > > > > server
> > > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> Server
> > > > REPORT
> > > > > > >> > > > >>> > rejected; currently processing
> > > > > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > > > > >> > > > as
> > > > > > >> > > > >>> > dead server
> > > > > > >> > > > >>> >         at
> > > > > > >> > > >
> > > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > > > > >> > > > >>> > Method)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> > > >
> > > > > >
> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruc
> > > > > > >> > > > to
> > > > > > >> > > > r
> > > > > > >> > > > >>> > AccessorImpl.java:39)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> > > >
> > > > > >
> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Delegating
> > > > > > >> > > > Co
> > > > > > >> > > > n
> > > > > > >> > > > >>> > structorAccessorImpl.java:27)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>>
> > > > > > java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> org.apache.hadoop.ipc.RemoteException.instantiateException(Rem
> > > > > > >> > > > >>> > ot
> > > > > > >> > > > >>> > eExce
> > > > > > >> > > > >>> > ption.java:95)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Re
> > > > > > >> > > > >>> > mo
> > > > > > >> > > > >>> > te
> > > > > > >> > > > >>> > Exception.java:79)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > > > > > >> > > > >>> > rv
> > > > > > >> > > > >>> > erRep
> > > > > > >> > > > >>> > ort(HRegionServer.java:735)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> > > >
> > > > > > >> >
> > > > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer
> > > > > > >> > > > .j
> > > > > > >> > > > >>> > ava:596)
> > > > > > >> > > > >>> >         at java.lang.Thread.run(Thread.java:662)
> > > > > > >> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
> > > > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> Server
> > > > REPORT
> > > > > > >> > > > >>> > rejected; currently processing
> > > > > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > > > > >> > > > as
> > > > > > >> > > > >>> > dead server
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(Serve
> > > > > > >> > > > >>> > rM
> > > > > > >> > > > >>> > ana
> > > > > > >> > > > >>> > ger.java:204)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> org.apache.hadoop.hbase.master.ServerManager.regionServerRepor
> > > > > > >> > > > >>> > t(
> > > > > > >> > > > >>> > Serv
> > > > > > >> > > > >>> > erManager.java:262)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMas
> > > > > > >> > > > >>> > te
> > > > > > >> > > > >>> > r.jav
> > > > > > >> > > > >>> > a:669)
> > > > > > >> > > > >>> >         at
> > > > > > sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> > > > > > >> > > > Source)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMeth
> > > > > > >> > > > >>> > od
> > > > > > >> > > > >>> > Acces
> > > > > > >> > > > >>> > sorImpl.java:25)
> > > > > > >> > > > >>> >         at
> > > > > java.lang.reflect.Method.invoke(Method.java:597)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> > > >
> > > > > > >> >
> > > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > > > >> > > > :1
> > > > > > >> > > > >>> > 039)
> > > > > > >> > > > >>> >
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> > > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.j
> > > > > > >> > > > >>> > av
> > > > > > >> > > > >>> > a:257
> > > > > > >> > > > >>> > )
> > > > > > >> > > > >>> >         at $Proxy6.regionServerReport(Unknown
> > Source)
> > > > > > >> > > > >>> >         at
> > > > > > >> > > > >>> >
> > > > > > >> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > > > > > >> > > > >>> > rv
> > > > > > >> > > > >>> > erRep
> > > > > > >> > > > >>> > ort(HRegionServer.java:729)
> > > > > > >> > > > >>> >         ... 2 more
> > > > > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer:
> > > Dump
> > > > of
> > > > > > >> > metrics:
> > > > > > >> > > > >>> > requests=66, regions=206, stores=2078,
> > storefiles=970,
> > > > > > >> > > > >>> > storefileIndexSize=78, memstoreSize=796,
> > > > > > >> > > > >>> > compactionQueueSize=0, flushQueueSize=0,
> > > usedHeap=1672,
> > > > > > >> > > > >>> > maxHeap=4083, blockCacheSize=705907552,
> > > > > > >> > > > >>> > blockCacheFree=150412064, blockCacheCount=10648,
> > > > > > >> > > > >>> > blockCacheHitCount=79578618,
> > > > blockCacheMissCount=3036335,
> > > > > > >> > > > >>> > blockCacheEvictedCount=1401352,
> > blockCacheHitRatio=96,
> > > > > > >> > > > >>> > blockCacheHitCachingRatio=98
> > > > > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer:
> > > > > STOPPED:
> > > > > > >> > > > >>> > Unhandled
> > > > > > >> > > > >>> > exception:
> > > org.apache.hadoop.hbase.YouAreDeadException:
> > > > > > >> > Server
> > > > > > >> > > > >>> > REPORT rejected; currently processing
> > > > > > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
> > > dead
> > > > > > server
> > > > > > >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping
> > > server
> > > > on
> > > > > > >> > > > >>> > 60020
> > > > > > >> > > > >>> > ---------------------------
>
> Then I restart the RegionServer and everything is back to normal.
> Reading the DataNode, Zookeeper and TaskTracker logs, I don't see any
> abnormality in the same time window.
> I think it was caused by the loss of connection to Zookeeper. Is it
> advisable to run Zookeeper on the same machines?
> If the RegionServer loses its connection to Zookeeper, is there a way
> (a configuration, perhaps) to re-join the cluster instead of dying?
>
> Any idea what is causing this? Or how to prevent it from happening?
>
> Any help is appreciated.
>
> Best Regards,
>
> --
>
> *Leonardo Gamas*
> Software Engineer
> +557134943514
> +557581347440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br



-- 

*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514
C +55 (75) 8134-7440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br

Re: RegionServer dying every two or three days

Posted by Matt Corgan <mc...@hotpads.com>.
We actually don't run map/reduce on the same machines (most of our jobs are
on an old message-based system), so we don't have much experience there.  We
run only HDFS (1G heap) and HBase (5.5G heap) with 12 * 100GB EBS volumes
per regionserver, and ~350 regions/server at the moment.  5.5G is already a
small heap in the hbase world, so I wouldn't recommend decreasing it to fit
M/R.  You could always run map/reduce on separate servers, adding or
removing servers as needed (more at night?), or use Amazon's Elastic M/R.
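
For a very rough feel of how many regions one RegionServer heap can serve, the
arithmetic below is the kind of back-of-the-envelope I'd start from.  It is
only a sketch: the 0.4 memstore fraction and 128 MB flush size are assumed
defaults (check hbase.regionserver.global.memstore.upperLimit and
hbase.hregion.memstore.flush.size in hbase-site.xml), and read-mostly regions
cost far less than write-hot ones.

public class RegionBudget {
    public static void main(String[] args) {
        double heapMb = 5.5 * 1024;       // e.g. a 5.5G RegionServer heap
        double memstoreFraction = 0.4;    // assumed global memstore upper limit
        double flushSizeMb = 128;         // assumed memstore flush size
        int writtenFamilies = 1;          // column families actually taking writes

        double writeHotRegions =
                (heapMb * memstoreFraction) / (flushSizeMb * writtenFamilies);
        // ~17 regions can be write-hot at once before memstore pressure forces
        // tiny flushes; hundreds of mostly-idle regions on top of that are fine.
        System.out.printf("roughly %.0f write-hot regions per server%n",
                writeHotRegions);
    }
}

That is why a few hundred regions per server works out in practice, as long as
only a fraction of them are taking writes at any given moment.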


On Sat, Jan 21, 2012 at 5:04 AM, Leonardo Gamas <le...@jusbrasil.com.br> wrote:

> Thanks Matt for this insightful article, I will run my cluster with
> c1.xlarge to test its performance. But I'm concerned about this machine
> because of the amount of RAM available, only 7 GB. How many map/reduce slots
> do you configure? And how much heap for HBase? How many regions per
> RegionServer could my cluster support?
>
> 2012/1/20 Matt Corgan <mc...@hotpads.com>
>
> > I run c1.xlarge servers and have found them very stable.  I see 100
> Mbit/s
> > sustained bi-directional network throughput (200Mbit/s total), sometimes
> up
> > to 150 * 2 Mbit/s.
> >
> > Here's a pretty thorough examination of the underlying hardware:
> >
> >
> >
> http://huanliu.wordpress.com/2010/06/14/amazons-physical-hardware-and-ec2-compute-unit/
> >
> >
> > *High-CPU instances*
> >
> > The high-CPU instances (c1.medium, c1.xlarge) run on systems with
> > dual-socket Intel Xeon E5410 2.33GHz processors. It is dual-socket
> because
> > we see APIC IDs 0 to 7, and E5410 only has 4 cores. A c1.xlarge instance
> > almost takes up the whole physical machine. However, we frequently
> observe
> > steal cycle on a c1.xlarge instance ranging from 0% to 25% with an
> average
> > of about 10%. The amount of steal cycle is not enough to host another
> > smaller VM, i.e., a c1.medium. Maybe those steal cycles are used to run
> > Amazon’s software firewall (security group). On Passmark-CPU mark, a
> > c1.xlarge machine achieves 7,962.6, actually higher than an average
> > dual-socket E5410 system is able to achieve (average is 6,903).
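
Those steal cycles are easy to keep an eye on from inside the guest: top and
vmstat both report them in the "st" column, and /proc/stat exposes the raw
counters.  Below is a minimal sketch of sampling it yourself -- it assumes a
Linux kernel new enough to export the steal field (the 8th value on the
aggregate cpu line), so treat it as illustrative rather than gospel.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StealSampler {
    // Returns {totalJiffies, stealJiffies} parsed from the aggregate "cpu" line.
    private static long[] readCpuLine() throws IOException {
        BufferedReader r = new BufferedReader(new FileReader("/proc/stat"));
        try {
            // Format: cpu user nice system idle iowait irq softirq steal ...
            String[] f = r.readLine().trim().split("\\s+");
            long total = 0;
            for (int i = 1; i < f.length; i++) total += Long.parseLong(f[i]);
            return new long[] { total, Long.parseLong(f[8]) };
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws Exception {
        long[] before = readCpuLine();
        Thread.sleep(5000);                       // 5 second sampling window
        long[] after = readCpuLine();
        double stealPct = 100.0 * (after[1] - before[1]) / (after[0] - before[0]);
        System.out.printf("CPU steal over the last 5s: %.1f%%%n", stealPct);
    }
}

If steal regularly spikes at the same moments the RegionServer logs show long
"real" pauses with almost no "user" time, that points at the hypervisor rather
than at the collector.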

Re: RegionServer dying every two or three days

Posted by Leonardo Gamas <le...@jusbrasil.com.br>.
Thanks Matt for this insightful article, I will run my cluster with
c1.xlarge to test its performance. But I'm concerned about this machine
because of the amount of RAM available, only 7 GB. How many map/reduce slots
do you configure? And how much heap for HBase? How many regions per
RegionServer could my cluster support?
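
Back-of-the-envelope, the memory budget on a 7 GB box gets tight once HBase,
the DataNode, the TaskTracker and the task children all want a share.  The
sketch below is purely illustrative -- every figure in it is an assumption to
adjust, not something measured on this cluster:

public class C1XlargeBudget {
    public static void main(String[] args) {
        double totalGb = 7.0;             // c1.xlarge memory
        double osAndCacheGb = 1.0;        // headroom for the OS and page cache (no swap)
        double dataNodeGb = 1.0;
        double taskTrackerGb = 0.5;
        double hbaseHeapGb = 3.0;         // smaller than the 4-5.5 GB heaps discussed in this thread
        double childHeapGb = 0.5;         // -Xmx512m per map/reduce child

        double leftForTasks = totalGb - osAndCacheGb - dataNodeGb
                - taskTrackerGb - hbaseHeapGb;
        int slots = (int) Math.floor(leftForTasks / childHeapGb);
        System.out.printf("room for roughly %d task slots of %.1f GB each%n",
                slots, childHeapGb);
    }
}

With numbers like these only two or three small slots fit, which is part of
why keeping map/reduce on separate (or elastic) nodes looks attractive.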

2012/1/20 Matt Corgan <mc...@hotpads.com>

> I run c1.xlarge servers and have found them very stable.  I see 100 Mbit/s
> sustained bi-directional network throughput (200Mbit/s total), sometimes up
> to 150 * 2 Mbit/s.
>
> Here's a pretty thorough examination of the underlying hardware:
>
>
> http://huanliu.wordpress.com/2010/06/14/amazons-physical-hardware-and-ec2-compute-unit/
>
>
> *High-CPU instances*
>
> The high-CPU instances (c1.medium, c1.xlarge) run on systems with
> dual-socket Intel Xeon E5410 2.33GHz processors. It is dual-socket because
> we see APIC IDs 0 to 7, and E5410 only has 4 cores. A c1.xlarge instance
> almost takes up the whole physical machine. However, we frequently observe
> steal cycle on a c1.xlarge instance ranging from 0% to 25% with an average
> of about 10%. The amount of steal cycle is not enough to host another
> smaller VM, i.e., a c1.medium. Maybe those steal cycles are used to run
> Amazon’s software firewall (security group). On Passmark-CPU mark, a
> c1.xlarge machine achieves 7,962.6, actually higher than an average
> dual-socket E5410 system is able to achieve (average is 6,903).
>
>
>
> On Fri, Jan 20, 2012 at 8:03 AM, Leonardo Gamas
> <le...@jusbrasil.com.br>wrote:
>
> > Thanks Neil for sharing your experience with AWS! Could you tell what
> > instance type are you using?
> > We are using m1.xlarge, that has 4 virtual cores, but i normally see
> > recommendations for machines with 8 cores like c1.xlarge, m2.4xlarge,
> etc.
> > In principle these 8-core machines don't suffer too much with I/O
> problems
> > since they don't share the physical server. Is there any piece of
> > information from Amazon or other source that affirms that or it's based
> in
> > empirical analysis?
> >
> > 2012/1/19 Neil Yalowitz <ne...@gmail.com>
> >
> > > We have experienced many problems with our cluster on EC2.  The blunt
> > > solution was to increase the Zookeeper timeout to 5 minutes or even
> more.
> > >
> > > Even with a long timeout, however, it's not uncommon for us to see an
> EC2
> > > instance to become unresponsive to pings and SSH several times during a
> > > week.  It's been a very bad environment for clusters.
> > >
> > >
> > > Neil
> > >
> > > On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas
> > > <le...@jusbrasil.com.br>wrote:
> > >
> > > > Hi Guys,
> > > >
> > > > I have tested the parameters provided by Sandy, and it solved the GC
> > > > problems with the -XX:+UseParallelOldGC, thanks for the help Sandy.
> > > > I'm still experiencing some difficulties, the RegionServer continues
> to
> > > > shutdown, but it seems related to I/O. It starts to timeout many
> > > > connections, new connections to/from the machine timeout too, and
> > finally
> > > > the RegionServer dies because of YouAreDeadException. I will collect
> > more
> > > > data, but i think it's an Amazon/Virtualized Environment inherent
> > issue.
> > > >
> > > > Thanks for the great help provided so far.
> > > >
> > > > 2012/1/5 Leonardo Gamas <le...@jusbrasil.com.br>
> > > >
> > > > > I don't think so, if Amazon stopped the machine it would cause a
> stop
> > > of
> > > > > minutes, not seconds, and since the DataNode, TaskTracker and
> > Zookeeper
> > > > > continue to work normally.
> > > > > But it can be related to the shared environment nature of Amazon,
> > maybe
> > > > > some spike in I/O caused by another virtualized server in the same
> > > > physical
> > > > > machine.
> > > > >
> > > > > But the intance type i'm using:
> > > > >
> > > > > *Extra Large Instance*
> > > > >
> > > > > 15 GB memory
> > > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> > > > > 1,690 GB instance storage
> > > > > 64-bit platform
> > > > > I/O Performance: High
> > > > > API name: m1.xlarge
> > > > > I was not expecting to suffer from this problems, or at least not
> > much.
> > > > >
> > > > >
> > > > > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> > > > >
> > > > >> You think it's an Amazon problem maybe?  Like they paused or
> > migrated
> > > > >> your virtual machine, and it just happens to be during GC, leaving
> > us
> > > to
> > > > >> think the GC ran long when it didn't?  I don't have a lot of
> > > experience
> > > > >> with Amazon so I don't know if that sort of thing is common.
> > > > >>
> > > > >> > -----Original Message-----
> > > > >> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > >> > Sent: Thursday, January 05, 2012 13:15
> > > > >> > To: user@hbase.apache.org
> > > > >> > Subject: Re: RegionServer dying every two or three days
> > > > >> >
> > > > >> > I checked the CPU Utilization graphics provided by Amazon (it's
> > not
> > > > >> accurate,
> > > > >> > since the sample time is about 5 minutes) and don't see any
> > > > >> abnormality. I
> > > > >> > will setup TSDB with Nagios to have a more reliable source of
> > > > >> performance
> > > > >> > data.
> > > > >> >
> > > > >> > The machines don't have swap space, if i run:
> > > > >> >
> > > > >> > $ swapon -s
> > > > >> >
> > > > >> > To display swap usage summary, it returns an empty list.
> > > > >> >
> > > > >> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my to
> > > tests.
> > > > >> >
> > > > >> > I don't have payed much attention to the value of the new size
> > > param.
> > > > >> >
> > > > >> > Thanks again for the help!!
> > > > >> >
> > > > >> > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> > > > >> >
> > > > >> > > That size heap doesn't seem like it should cause a 36 second
> GC
> > (a
> > > > >> > > minor GC even if I remember your logs correctly), so I tend to
> > > think
> > > > >> > > that other things are probably going on.
> > > > >> > >
> > > > >> > > This line here:
> > > > >> > >
> > > > >> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K),
> > > 0.0361840
> > > > >> > > secs]
> > > > >> > > 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05
> > > > >> > > 954388K->sys=0.01,
> > > > >> > > real=36.96 secs]
> > > > >> > >
> > > > >> > > is really mysterious to me.  It seems to indicate that the
> > process
> > > > was
> > > > >> > > blocked for almost 37 seconds during a minor collection.  Note
> > the
> > > > CPU
> > > > >> > > times are very low but the wall time is very high.  If it was
> > > > actually
> > > > >> > > doing GC work, I'd expect to see user time higher than real
> > time,
> > > as
> > > > >> > > it is in other parallel collections (see your log snippet).
> >  Were
> > > > you
> > > > >> > > really so CPU starved that it took 37 seconds to get in 50ms
> of
> > > > work?
> > > > >> > > I can't make sense of that.  I'm trying to think of something
> > that
> > > > >> > > would block you for that long while all your threads are
> stopped
> > > for
> > > > >> > > GC, other than being in swap, but I can't come up with
> anything.
> > > > >>  You're
> > > > >> > certain you're not in swap?
> > > > >> > >
> > > > >> > > Maybe drop these guys -XX:+DoEscapeAnalysis
> -XX:+AggressiveOpts
> > > > while
> > > > >> > > you troubleshoot?
> > > > >> > >
> > > > >> > > Why is your new size so small?  This generally means that
> > > relatively
> > > > >> > > more objects are being tenured than would be with a larger new
> > > size.
> > > > >> > > This could make collections of the old gen worse (GC time is
> > said
> > > to
> > > > >> > > be proportional to the number of live objects in the
> generation,
> > > and
> > > > >> > > CMS does indeed cause STW pauses).  A typical new to tenured
> > ratio
> > > > >> > > might be 1:3.  Were the new gen GCs taking too long?  This is
> > > > probably
> > > > >> > > orthogonal to your immediate issue, though.
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > -----Original Message-----
> > > > >> > > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > >> > > Sent: Thursday, January 05, 2012 5:33 AM
> > > > >> > > To: user@hbase.apache.org
> > > > >> > > Subject: Re: RegionServer dying every two or three days
> > > > >> > >
> > > > >> > >  St.Ack,
> > > > >> > >
> > > > >> > > I don't have made any attempt in GC tunning, yet.
> > > > >> > > I will read the perf section as suggested.
> > > > >> > > I'm currently using Nagios + JMX to monitor the cluster, but
> > it's
> > > > >> > > currently used for alert only, the perfdata is not been
> stored,
> > so
> > > > >> > > it's kind of useless right now, but i was thinking in use TSDB
> > to
> > > > >> > > store it, any known case of integration?
> > > > >> > > ---
> > > > >> > >
> > > > >> > > Sandy,
> > > > >> > >
> > > > >> > > Yes, my timeout is 30 seconds:
> > > > >> > >
> > > > >> > > <property>
> > > > >> > >   <name>zookeeper.session.timeout</name>
> > > > >> > >   <value>30000</value>
> > > > >> > > </property>
> > > > >> > >
> > > > >> > > To our application it's a sufferable time to wait in case a
> > > > >> > > RegionServer go offline.
> > > > >> > >
> > > > >> > > My heap is 4GB and my JVM params are:
> > > > >> > >
> > > > >> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> > > > >> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> > > > >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> > > > >> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > > > >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> > > > >> > >
> > > > >> > > I will try the -XX:+UseParallelOldGC param and post my
> feedback
> > > > here.
> > > > >> > > ---
> > > > >> > >
> > > > >> > > Ramkrishna,
> > > > >> > >
> > > > >> > > Seems the GC is the root of all evil in this case.
> > > > >> > > ----
> > > > >> > >
> > > > >> > > Thank you all for the answers. I will try out these valuable
> > > advices
> > > > >> > > given here and post my results.
> > > > >> > >
> > > > >> > > Leo Gamas.
> > > > >> > >
> > > > >> > > 2012/1/5 Ramkrishna S Vasudevan <
> > ramkrishna.vasudevan@huawei.com>
> > > > >> > >
> > > > >> > > > Recently we faced a similar problem and it was due to GC
> > config.
> > > > >> > > > Pls check your GC.
> > > > >> > > >
> > > > >> > > > Regards
> > > > >> > > > Ram
> > > > >> > > >
> > > > >> > > > -----Original Message-----
> > > > >> > > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On
> > > Behalf
> > > > Of
> > > > >> > > > Stack
> > > > >> > > > Sent: Thursday, January 05, 2012 2:50 AM
> > > > >> > > > To: user@hbase.apache.org
> > > > >> > > > Subject: Re: RegionServer dying every two or three days
> > > > >> > > >
> > > > >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> > > > >> > > > <le...@jusbrasil.com.br> wrote:
> > > > >> > > > > The third line took 36.96 seconds to execute, can this be
> > > > causing
> > > > >> > > > > this problem?
> > > > >> > > > >
> > > > >> > > >
> > > > >> > > > Probably.  Have you made any attempt at GC tuning?
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > > Reading the code a little it seems that, even if it's
> > > disabled,
> > > > if
> > > > >> > > > > all files are target in a compaction, it's considered a
> > major
> > > > >> > > > > compaction. Is
> > > > >> > > > it
> > > > >> > > > > right?
> > > > >> > > > >
> > > > >> > > >
> > > > >> > > > That is right.  They get 'upgraded' from minor to major.
> > > > >> > > >
> > > > >> > > > This should be fine though.  What you are avoiding setting
> > major
> > > > >> > > > compactions to 0 is all regions being major compacted on a
> > > > period, a
> > > > >> > > > heavy weight effective rewrite of all your data (unless
> > already
> > > > >> major
> > > > >> > > > compacted).   It looks like you have this disabled which is
> > good
> > > > >> until
> > > > >> > > > you've wrestled your cluster into submission.
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > > The machines don't have swap, so the swappiness parameter
> > > don't
> > > > >> > > > > seem to apply here. Any other suggestion?
> > > > >> > > > >
> > > > >> > > >
> > > > >> > > > See the perf section of the hbase manual.  It has our
> current
> > > > list.
> > > > >> > > >
> > > > >> > > > Are you monitoring your cluster w/ ganglia or tsdb?
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > St.Ack
> > > > >> > > >
> > > > >> > > > > Thanks.
> > > > >> > > > >
> > > > >> > > > > 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
> > > > >> > > > >
> > > > >> > > > >> I will investigate this, thanks for the response.
> > > > >> > > > >>
> > > > >> > > > >>
> > > > >> > > > >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
> > > > >> > > > >>
> > > > >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client
> > session
> > > > >> > > > >>> timed out, have not heard from server in 61103ms for
> > > sessionid
> > > > >> > > > >>> 0x23462a4cf93a8fc, closing socket connection and
> > attempting
> > > > >> > > > >>> reconnect
> > > > >> > > > >>>
> > > > >> > > > >>> It looks like the process has been unresponsive for some
> > > time,
> > > > >> > > > >>> so ZK
> > > > >> > > > has
> > > > >> > > > >>> terminated the session.  Did you experience a long GC
> > pause
> > > > >> > > > >>> right
> > > > >> > > > before
> > > > >> > > > >>> this?  If you don't have GC logging enabled for the RS,
> > you
> > > > can
> > > > >> > > > sometimes
> > > > >> > > > >>> tell by noticing a gap in the timestamps of the log
> > > statements
> > > > >> > > > >>> leading
> > > > >> > > > up
> > > > >> > > > >>> to the crash.
> > > > >> > > > >>>
> > > > >> > > > >>> If it turns out to be GC, you might want to look at your
> > > > kernel
> > > > >> > > > >>> swappiness setting (set it to 0) and your JVM params.
> > > > >> > > > >>>
> > > > >> > > > >>> Sandy
> > > > >> > > > >>>
> > > > >> > > > >>>
> > > > >> > > > >>> > -----Original Message-----
> > > > >> > > > >>> > From: Leonardo Gamas [mailto:
> leogamas@jusbrasil.com.br]
> > > > >> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> > > > >> > > > >>> > To: user@hbase.apache.org
> > > > >> > > > >>> > Subject: RegionServer dying every two or three days
> > > > >> > > > >>> >
> > > > >> > > > >>> > Hi,
> > > > >> > > > >>> >
> > > > >> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1
> > > > Master +
> > > > >> > > > >>> > 3
> > > > >> > > > >>> Slaves),
> > > > >> > > > >>> > running on Amazon EC2. The master is a High-Memory
> Extra
> > > > Large
> > > > >> > > > Instance
> > > > >> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and
> > > > Zookeeper.
> > > > >> > > > >>> > The slaves are Extra Large Instances (m1.xlarge)
> running
> > > > >> > > > >>> > Datanode,
> > > > >> > > > >>> TaskTracker,
> > > > >> > > > >>> > RegionServer and Zookeeper.
> > > > >> > > > >>> >
> > > > >> > > > >>> > From time to time, every two or three days, one of the
> > > > >> > > > >>> > RegionServers processes goes down, but the other
> > processes
> > > > >> > > > >>> > (DataNode, TaskTracker,
> > > > >> > > > >>> > Zookeeper) continue normally.
> > > > >> > > > >>> >
> > > > >> > > > >>> > Reading the logs:
> > > > >> > > > >>> >
> > > > >> > > > >>> > The connection with Zookeeper timed out:
> > > > >> > > > >>> >
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client
> > > session
> > > > >> > > > >>> > timed
> > > > >> > > > out,
> > > > >> > > > >>> have
> > > > >> > > > >>> > not heard from server in 61103ms for sessionid
> > > > >> > > > >>> > 0x23462a4cf93a8fc,
> > > > >> > > > >>> closing
> > > > >> > > > >>> > socket connection and attempting reconnect
> > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client
> > > session
> > > > >> > > > >>> > timed
> > > > >> > > > out,
> > > > >> > > > >>> have
> > > > >> > > > >>> > not heard from server in 61205ms for sessionid
> > > > >> > > > >>> > 0x346c561a55953e,
> > > > >> > > > closing
> > > > >> > > > >>> > socket connection and attempting reconnect
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> >
> > > > >> > > > >>> > And the Handlers start to fail:
> > > > >> > > > >>> >
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server
> > > > Responder,
> > > > >> > > > >>> > call
> > > > >> > > > >>> >
> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf
> > > )
> > > > >> > > > >>> > from
> > > > >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server
> > handler
> > > > 81
> > > > >> > > > >>> > on
> > > > >> > > > 60020
> > > > >> > > > >>> > caught: java.nio.channels.ClosedChannelException
> > > > >> > > > >>> >         at
> > > > >> > > > >>> >
> > > > >> > > >
> > > > >> >
> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> > > > >> > > > 13
> > > > >> > > > >>> > 3)
> > > > >> > > > >>> >         at
> > > > >> > > > >>>
> > > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > >> > > > >>> >         at
> > > > >> > > > >>> >
> > > > >> > > >
> > > > >> >
> > > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > > > >> > > > >>> > 1341)
> > > > >> > > > >>> >         at
> > > > >> > > > >>> >
> > > > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> > > > >> > > > >>> > ns
> > > > >> > > > >>> > e(HB
> > > > >> > > > >>> > aseServer.java:727)
> > > > >> > > > >>> >         at
> > > > >> > > > >>> >
> > > > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> > > > >> > > > >>> > as
> > > > >> > > > >>> > eSe
> > > > >> > > > >>> > rver.java:792)
> > > > >> > > > >>> >         at
> > > > >> > > > >>> >
> > > > >> > > >
> > > > >> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > >> > > > :1
> > > > >> > > > >>> > 083)
> > > > >> > > > >>> >
> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server
> > > > Responder,
> > > > >> > > > >>> > call
> > > > >> > > > >>> >
> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430
> > > )
> > > > >> > > > >>> > from
> > > > >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server
> > handler
> > > > 62
> > > > >> > > > >>> > on
> > > > >> > > > 60020
> > > > >> > > > >>> > caught: java.nio.channels.ClosedChannelException
> > > > >> > > > >>> >         at
> > > > >> > > > >>> >
> > > > >> > > >
> > > > >> >
> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> > > > >> > > > 13
> > > > >> > > > >>> > 3)
> > > > >> > > > >>> >         at
> > > > >> > > > >>>
> > > sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > >> > > > >>> >         at
> > > > >> > > > >>> >
> > > > >> > > >
> > > > >> >
> > > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > > > >> > > > >>> > 1341)
> > > > >> > > > >>> >         at
> > > > >> > > > >>> >
> > > > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> > > > >> > > > >>> > ns
> > > > >> > > > >>> > e(HB
> > > > >> > > > >>> > aseServer.java:727)
> > > > >> > > > >>> >         at
> > > > >> > > > >>> >
> > > > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> > > > >> > > > >>> > as
> > > > >> > > > >>> > eSe
> > > > >> > > > >>> > rver.java:792)
> > > > >> > > > >>> >         at
> > > > >> > > > >>> >
> > > > >> > > >
> > > > >> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > >> > > > :1
> > > > >> > > > >>> > 083)
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> >
> > > > >> > > > >>> > And finally the server throws a YouAreDeadException :( :
> > > > >> > > > >>> >
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
> > > > >> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >> > > > >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > >> > > > >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > > >> > > > >>> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > > > >> > > > >>> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > > >> > > > >>> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> > > > >> > > > >>> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> > > > >> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> > > > >> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> > > > >> > > > >>> >         at java.lang.Thread.run(Thread.java:662)
> > > > >> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >> > > > >>> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> > > > >> > > > >>> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> > > > >> > > > >>> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> > > > >> > > > >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> > > > >> > > > >>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > >> > > > >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> > > > >> > > > >>> >
> > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> > > > >> > > > >>> >         at $Proxy6.regionServerReport(Unknown Source)
> > > > >> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> > > > >> > > > >>> >         ... 2 more
> > > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: requests=66, regions=206, stores=2078, storefiles=970, storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, flushQueueSize=0, usedHeap=1672, maxHeap=4083, blockCacheSize=705907552, blockCacheFree=150412064, blockCacheCount=10648, blockCacheHitCount=79578618, blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> > > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> >
> > > > >> > > > >>> > Then i restart the RegionServer and everything is back to normal.
> > > > >> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i don't see any abnormality in the same time window.
> > > > >> > > > >>> > I think it was caused by the lost of connection to zookeeper. Is it advisable to run zookeeper in the same machines?
> > > > >> > > > >>> > if the RegionServer lost it's connection to Zookeeper, there's a way (a configuration perhaps) to re-join the cluster, and not only die?
> > > > >> > > > >>> >
> > > > >> > > > >>> > Any idea what is causing this?? Or to prevent it from happening?
> > > > >> > > > >>> >
> > > > >> > > > >>> > Any help is appreciated.
> > > > >> > > > >>> >
> > > > >> > > > >>> > Best Regards,
> > > > >> > > > >>> >
> > > > >> > > > >>> > --
> > > > >> > > > >>> >
> > > > >> > > > >>> > *Leonardo Gamas*
> > > > >> > > > >>> > Software Engineer
> > > > >> > > > >>> > +557134943514
> > > > >> > > > >>> > +557581347440
> > > > >> > > > >>> > leogamas@jusbrasil.com.br
> > > > >> > > > >>> > www.jusbrasil.com.br
> > > > >> > > > >>>
> > > > >> > > > >>
> > > > >> > > > >>
> > > > >> > > > >>
> > > > >> > > > >> --
> > > > >> > > > >>
> > > > >> > > > >> *Leonardo Gamas*
> > > > >> > > > >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514C
> > > > (75)
> > > > >> > > > >> 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> > > > >> > > > >>
> > > > >> > > > >>
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > --
> > > > >> > > > >
> > > > >> > > > > *Leonardo Gamas*
> > > > >> > > > > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514C
> > > (75)
> > > > >> > > > > 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> > >
> > > > >> > > --
> > > > >> > >
> > > > >> > > *Leonardo Gamas*
> > > > >> > > Software Engineer
> > > > >> > > +557134943514
> > > > >> > > +557581347440
> > > > >> > > leogamas@jusbrasil.com.br
> > > > >> > > www.jusbrasil.com.br
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> >
> > > > >> > *Leonardo Gamas*
> > > > >> > Software Engineer
> > > > >> > T +55 (71) 3494-3514
> > > > >> > C +55 (75) 8134-7440
> > > > >> > leogamas@jusbrasil.com.br
> > > > >> > www.jusbrasil.com.br
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > *Leonardo Gamas*
> > > > >
> > > > > Software Engineer
> > > > > T +55 (71) 3494-3514
> > > > > C +55 (75) 8134-7440
> > > > > leogamas@jusbrasil.com.br
> > > > >
> > > > > www.jusbrasil.com.br
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > > > *Leonardo Gamas*
> > > > Software Engineer
> > > > T +55 (71) 3494-3514
> > > > C +55 (75) 8134-7440
> > > > leogamas@jusbrasil.com.br
> > > > www.jusbrasil.com.br
> > > >
> > >
> >
> >
> >
> > --
> >
> > *Leonardo Gamas*
> > Software Engineer
> > T +55 (71) 3494-3514
> > C +55 (75) 8134-7440
> > leogamas@jusbrasil.com.br
> > www.jusbrasil.com.br
> >
>



-- 

*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514
C +55 (75) 8134-7440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br

Re: RegionServer dying every two or three days

Posted by Matt Corgan <mc...@hotpads.com>.
I run c1.xlarge servers and have found them very stable.  I see 100 Mbit/s of
sustained bi-directional network throughput (200 Mbit/s total), sometimes up
to 150 Mbit/s in each direction.

Here's a pretty thorough examination of the underlying hardware:

http://huanliu.wordpress.com/2010/06/14/amazons-physical-hardware-and-ec2-compute-unit/


*High-CPU instances*

The high-CPU instances (c1.medium, c1.xlarge) run on systems with
dual-socket Intel Xeon E5410 2.33GHz processors. It is dual-socket because
we see APIC IDs 0 to 7, and the E5410 only has 4 cores. A c1.xlarge instance
almost takes up the whole physical machine. However, we frequently observe
steal cycles on a c1.xlarge instance ranging from 0% to 25%, with an average
of about 10%. The amount of steal is not enough to host another
smaller VM, i.e., a c1.medium. Maybe those steal cycles are used to run
Amazon’s software firewall (security group). On the PassMark CPU Mark benchmark, a
c1.xlarge machine achieves 7,962.6, actually higher than what an average
dual-socket E5410 system is able to achieve (the average is 6,903).
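
A quick way to confirm steal numbers like these on your own instances is to watch
the steal column of /proc/stat (top reports the same counter as %st). Below is a
minimal sketch, purely for illustration and not code from this thread; it assumes
the standard Linux /proc/stat layout in which the eighth value on the aggregate
"cpu" line is steal time:

import java.nio.file.Files;
import java.nio.file.Paths;

public class StealSampler {
    // Returns {totalJiffies, stealJiffies} parsed from the aggregate "cpu " line.
    static long[] readCpuLine() throws Exception {
        for (String line : Files.readAllLines(Paths.get("/proc/stat"))) {
            if (line.startsWith("cpu ")) {
                String[] f = line.trim().split("\\s+");
                long total = 0;
                for (int i = 1; i < f.length; i++) total += Long.parseLong(f[i]);
                long steal = f.length > 8 ? Long.parseLong(f[8]) : 0; // eighth value = steal
                return new long[] { total, steal };
            }
        }
        throw new IllegalStateException("no cpu line in /proc/stat");
    }

    public static void main(String[] args) throws Exception {
        long[] prev = readCpuLine();
        while (true) {
            Thread.sleep(5000);
            long[] cur = readCpuLine();
            double pct = 100.0 * (cur[1] - prev[1]) / Math.max(1, cur[0] - prev[0]);
            System.out.printf("steal over last 5s: %.1f%%%n", pct);
            prev = cur;
        }
    }
}

Anything persistently above a few percent is the hypervisor taking cycles away from
the guest, which lines up with the 0% to 25% range described above.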



On Fri, Jan 20, 2012 at 8:03 AM, Leonardo Gamas
<le...@jusbrasil.com.br>wrote:

> Thanks Neil for sharing your experience with AWS! Could you tell what
> instance type are you using?
> We are using m1.xlarge, that has 4 virtual cores, but i normally see
> recommendations for machines with 8 cores like c1.xlarge, m2.4xlarge, etc.
> In principle these 8-core machines don't suffer too much with I/O problems
> since they don't share the physical server. Is there any piece of
> information from Amazon or other source that affirms that or it's based in
> empirical analysis?
>
> 2012/1/19 Neil Yalowitz <ne...@gmail.com>
>
> > We have experienced many problems with our cluster on EC2.  The blunt
> > solution was to increase the Zookeeper timeout to 5 minutes or even more.
> >
> > Even with a long timeout, however, it's not uncommon for us to see an EC2
> > instance to become unresponsive to pings and SSH several times during a
> > week.  It's been a very bad environment for clusters.
> >
> >
> > Neil
> >
> > On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas
> > <le...@jusbrasil.com.br>wrote:
> >
> > > Hi Guys,
> > >
> > > I have tested the parameters provided by Sandy, and it solved the GC
> > > problems with the -XX:+UseParallelOldGC, thanks for the help Sandy.
> > > I'm still experiencing some difficulties, the RegionServer continues to
> > > shutdown, but it seems related to I/O. It starts to timeout many
> > > connections, new connections to/from the machine timeout too, and
> finally
> > > the RegionServer dies because of YouAreDeadException. I will collect
> more
> > > data, but i think it's an Amazon/Virtualized Environment inherent
> issue.
> > >
> > > Thanks for the great help provided so far.
> > >
> > > 2012/1/5 Leonardo Gamas <le...@jusbrasil.com.br>
> > >
> > > > I don't think so, if Amazon stopped the machine it would cause a stop
> > of
> > > > minutes, not seconds, and since the DataNode, TaskTracker and
> Zookeeper
> > > > continue to work normally.
> > > > But it can be related to the shared environment nature of Amazon,
> maybe
> > > > some spike in I/O caused by another virtualized server in the same
> > > physical
> > > > machine.
> > > >
> > > > But the intance type i'm using:
> > > >
> > > > *Extra Large Instance*
> > > >
> > > > 15 GB memory
> > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> > > > 1,690 GB instance storage
> > > > 64-bit platform
> > > > I/O Performance: High
> > > > API name: m1.xlarge
> > > > I was not expecting to suffer from this problems, or at least not
> much.
> > > >
> > > >
> > > > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> > > >
> > > >> You think it's an Amazon problem maybe?  Like they paused or
> migrated
> > > >> your virtual machine, and it just happens to be during GC, leaving
> us
> > to
> > > >> think the GC ran long when it didn't?  I don't have a lot of
> > experience
> > > >> with Amazon so I don't know if that sort of thing is common.
> > > >>
> > > >> > -----Original Message-----
> > > >> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > >> > Sent: Thursday, January 05, 2012 13:15
> > > >> > To: user@hbase.apache.org
> > > >> > Subject: Re: RegionServer dying every two or three days
> > > >> >
> > > >> > I checked the CPU Utilization graphics provided by Amazon (it's
> not
> > > >> accurate,
> > > >> > since the sample time is about 5 minutes) and don't see any
> > > >> abnormality. I
> > > >> > will setup TSDB with Nagios to have a more reliable source of
> > > >> performance
> > > >> > data.
> > > >> >
> > > >> > The machines don't have swap space, if i run:
> > > >> >
> > > >> > $ swapon -s
> > > >> >
> > > >> > To display swap usage summary, it returns an empty list.
> > > >> >
> > > >> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my to
> > tests.
> > > >> >
> > > >> > I don't have payed much attention to the value of the new size
> > param.
> > > >> >
> > > >> > Thanks again for the help!!
> > > >> >
> > > >> > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> > > >> >
> > > >> > > That size heap doesn't seem like it should cause a 36 second GC
> (a
> > > >> > > minor GC even if I remember your logs correctly), so I tend to
> > think
> > > >> > > that other things are probably going on.
> > > >> > >
> > > >> > > This line here:
> > > >> > >
> > > >> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs] 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01, real=36.96 secs]
> > > >> > >
> > > >> > > is really mysterious to me.  It seems to indicate that the
> process
> > > was
> > > >> > > blocked for almost 37 seconds during a minor collection.  Note
> the
> > > CPU
> > > >> > > times are very low but the wall time is very high.  If it was
> > > actually
> > > >> > > doing GC work, I'd expect to see user time higher than real
> time,
> > as
> > > >> > > it is in other parallel collections (see your log snippet).
>  Were
> > > you
> > > >> > > really so CPU starved that it took 37 seconds to get in 50ms of
> > > work?
> > > >> > > I can't make sense of that.  I'm trying to think of something
> that
> > > >> > > would block you for that long while all your threads are stopped
> > for
> > > >> > > GC, other than being in swap, but I can't come up with anything.
> > > >>  You're
> > > >> > certain you're not in swap?
> > > >> > >
> > > >> > > Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> > > while
> > > >> > > you troubleshoot?
> > > >> > >
> > > >> > > Why is your new size so small?  This generally means that
> > relatively
> > > >> > > more objects are being tenured than would be with a larger new
> > size.
> > > >> > > This could make collections of the old gen worse (GC time is
> said
> > to
> > > >> > > be proportional to the number of live objects in the generation,
> > and
> > > >> > > CMS does indeed cause STW pauses).  A typical new to tenured
> ratio
> > > >> > > might be 1:3.  Were the new gen GCs taking too long?  This is
> > > probably
> > > >> > > orthogonal to your immediate issue, though.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > -----Original Message-----
> > > >> > > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > >> > > Sent: Thursday, January 05, 2012 5:33 AM
> > > >> > > To: user@hbase.apache.org
> > > >> > > Subject: Re: RegionServer dying every two or three days
> > > >> > >
> > > >> > >  St.Ack,
> > > >> > >
> > > >> > > I don't have made any attempt in GC tunning, yet.
> > > >> > > I will read the perf section as suggested.
> > > >> > > I'm currently using Nagios + JMX to monitor the cluster, but
> it's
> > > >> > > currently used for alert only, the perfdata is not been stored,
> so
> > > >> > > it's kind of useless right now, but i was thinking in use TSDB
> to
> > > >> > > store it, any known case of integration?
> > > >> > > ---
> > > >> > >
> > > >> > > Sandy,
> > > >> > >
> > > >> > > Yes, my timeout is 30 seconds:
> > > >> > >
> > > >> > > <property>
> > > >> > >   <name>zookeeper.session.timeout</name>
> > > >> > >   <value>30000</value>
> > > >> > > </property>
> > > >> > >
> > > >> > > To our application it's a sufferable time to wait in case a
> > > >> > > RegionServer go offline.
> > > >> > >
> > > >> > > My heap is 4GB and my JVM params are:
> > > >> > >
> > > >> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> > > >> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> > > >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> > > >> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > > >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> > > >> > >
> > > >> > > I will try the -XX:+UseParallelOldGC param and post my feedback
> > > here.
> > > >> > > ---
> > > >> > >
> > > >> > > Ramkrishna,
> > > >> > >
> > > >> > > Seems the GC is the root of all evil in this case.
> > > >> > > ----
> > > >> > >
> > > >> > > Thank you all for the answers. I will try out these valuable
> > advices
> > > >> > > given here and post my results.
> > > >> > >
> > > >> > > Leo Gamas.
> > > >> > >
> > > >> > > 2012/1/5 Ramkrishna S Vasudevan <
> ramkrishna.vasudevan@huawei.com>
> > > >> > >
> > > >> > > > Recently we faced a similar problem and it was due to GC
> config.
> > > >> > > > Pls check your GC.
> > > >> > > >
> > > >> > > > Regards
> > > >> > > > Ram
> > > >> > > >
> > > >> > > > -----Original Message-----
> > > >> > > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On
> > Behalf
> > > Of
> > > >> > > > Stack
> > > >> > > > Sent: Thursday, January 05, 2012 2:50 AM
> > > >> > > > To: user@hbase.apache.org
> > > >> > > > Subject: Re: RegionServer dying every two or three days
> > > >> > > >
> > > >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> > > >> > > > <le...@jusbrasil.com.br> wrote:
> > > >> > > > > The third line took 36.96 seconds to execute, can this be
> > > causing
> > > >> > > > > this problem?
> > > >> > > > >
> > > >> > > >
> > > >> > > > Probably.  Have you made any attempt at GC tuning?
> > > >> > > >
> > > >> > > >
> > > >> > > > > Reading the code a little it seems that, even if it's
> > disabled,
> > > if
> > > >> > > > > all files are target in a compaction, it's considered a
> major
> > > >> > > > > compaction. Is
> > > >> > > > it
> > > >> > > > > right?
> > > >> > > > >
> > > >> > > >
> > > >> > > > That is right.  They get 'upgraded' from minor to major.
> > > >> > > >
> > > >> > > > This should be fine though.  What you are avoiding setting
> major
> > > >> > > > compactions to 0 is all regions being major compacted on a
> > > period, a
> > > >> > > > heavy weight effective rewrite of all your data (unless
> already
> > > >> major
> > > >> > > > compacted).   It looks like you have this disabled which is
> good
> > > >> until
> > > >> > > > you've wrestled your cluster into submission.
> > > >> > > >
> > > >> > > >
> > > >> > > > > The machines don't have swap, so the swappiness parameter
> > don't
> > > >> > > > > seem to apply here. Any other suggestion?
> > > >> > > > >
> > > >> > > >
> > > >> > > > See the perf section of the hbase manual.  It has our current
> > > list.
> > > >> > > >
> > > >> > > > Are you monitoring your cluster w/ ganglia or tsdb?
> > > >> > > >
> > > >> > > >
> > > >> > > > St.Ack
> > > >> > > >
> > > >> > > > > Thanks.
> > > >> > > > >
> > > >> > > > > 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
> > > >> > > > >
> > > >> > > > >> I will investigate this, thanks for the response.
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
> > > >> > > > >>
> > > >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client
> session
> > > >> > > > >>> timed out, have not heard from server in 61103ms for
> > sessionid
> > > >> > > > >>> 0x23462a4cf93a8fc, closing socket connection and
> attempting
> > > >> > > > >>> reconnect
> > > >> > > > >>>
> > > >> > > > >>> It looks like the process has been unresponsive for some
> > time,
> > > >> > > > >>> so ZK
> > > >> > > > has
> > > >> > > > >>> terminated the session.  Did you experience a long GC
> pause
> > > >> > > > >>> right
> > > >> > > > before
> > > >> > > > >>> this?  If you don't have GC logging enabled for the RS,
> you
> > > can
> > > >> > > > sometimes
> > > >> > > > >>> tell by noticing a gap in the timestamps of the log
> > statements
> > > >> > > > >>> leading
> > > >> > > > up
> > > >> > > > >>> to the crash.
> > > >> > > > >>>
> > > >> > > > >>> If it turns out to be GC, you might want to look at your
> > > kernel
> > > >> > > > >>> swappiness setting (set it to 0) and your JVM params.
> > > >> > > > >>>
> > > >> > > > >>> Sandy
> > > >> > > > >>>
> > > >> > > > >>>
> > > >> > > > >>> > -----Original Message-----
> > > >> > > > >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > >> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> > > >> > > > >>> > To: user@hbase.apache.org
> > > >> > > > >>> > Subject: RegionServer dying every two or three days
> > > >> > > > >>> >
> > > >> > > > >>> > Hi,
> > > >> > > > >>> >
> > > >> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1
> > > Master +
> > > >> > > > >>> > 3
> > > >> > > > >>> Slaves),
> > > >> > > > >>> > running on Amazon EC2. The master is a High-Memory Extra
> > > Large
> > > >> > > > Instance
> > > >> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and
> > > Zookeeper.
> > > >> > > > >>> > The slaves are Extra Large Instances (m1.xlarge) running
> > > >> > > > >>> > Datanode,
> > > >> > > > >>> TaskTracker,
> > > >> > > > >>> > RegionServer and Zookeeper.
> > > >> > > > >>> >
> > > >> > > > >>> > From time to time, every two or three days, one of the
> > > >> > > > >>> > RegionServers processes goes down, but the other
> processes
> > > >> > > > >>> > (DataNode, TaskTracker,
> > > >> > > > >>> > Zookeeper) continue normally.
> > > >> > > > >>> >
> > > >> > > > >>> > Reading the logs:
> > > >> > > > >>> >
> > > >> > > > >>> > The connection with Zookeeper timed out:
> > > >> > > > >>> >
> > > >> > > > >>> > ---------------------------
> > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
> > > >> > > > >>> > ---------------------------
> > > >> > > > >>> >
> > > >> > > > >>> > And the Handlers start to fail:
> > > >> > > > >>> >
> > > >> > > > >>> > ---------------------------
> > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
> > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
> > > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > > >> > > > >>> >
> > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
> > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
> > > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > > >> > > > >>> > ---------------------------
> > > >> > > > >>> >
> > > >> > > > >>> > And finally the server throws a YouAreDeadException :( :
> > > >> > > > >>> >
> > > >> > > > >>> > ---------------------------
> > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
> > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
> > > >> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > >> > > > >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > >> > > > >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > >> > > > >>> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > > >> > > > >>> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > >> > > > >>> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> > > >> > > > >>> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> > > >> > > > >>> >         at java.lang.Thread.run(Thread.java:662)
> > > >> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > >> > > > >>> >         at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> > > >> > > > >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> > > >> > > > >>> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > >> > > > >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> > > >> > > > >>> >
> > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> > > >> > > > >>> >         at $Proxy6.regionServerReport(Unknown Source)
> > > >> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> > > >> > > > >>> >         ... 2 more
> > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: requests=66, regions=206, stores=2078, storefiles=970, storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, flushQueueSize=0, usedHeap=1672, maxHeap=4083, blockCacheSize=705907552, blockCacheFree=150412064, blockCacheCount=10648, blockCacheHitCount=79578618, blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> > > >> > > > >>> > ---------------------------
> > > >> > > > >>> >
> > > >> > > > >>> > Then i restart the RegionServer and everything is back to normal.
> > > >> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i don't see any abnormality in the same time window.
> > > >> > > > >>> > I think it was caused by the lost of connection to zookeeper. Is it advisable to run zookeeper in the same machines?
> > > >> > > > >>> > if the RegionServer lost it's connection to Zookeeper, there's a way (a configuration perhaps) to re-join the cluster, and not only die?
> > > >> > > > >>> >
> > > >> > > > >>> > Any idea what is causing this?? Or to prevent it from happening?
> > > >> > > > >>> >
> > > >> > > > >>> > Any help is appreciated.
> > > >> > > > >>> >
> > > >> > > > >>> > Best Regards,
> > > >> > > > >>> >
> > > >> > > > >>> > --
> > > >> > > > >>> >
> > > >> > > > >>> > *Leonardo Gamas*
> > > >> > > > >>> > Software Engineer
> > > >> > > > >>> > +557134943514
> > > >> > > > >>> > +557581347440
> > > >> > > > >>> > leogamas@jusbrasil.com.br
> > > >> > > > >>> > www.jusbrasil.com.br
> > > >> > > > >>>
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >> --
> > > >> > > > >>
> > > >> > > > >> *Leonardo Gamas*
> > > >> > > > >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C
> > > (75)
> > > >> > > > >> 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > --
> > > >> > > > >
> > > >> > > > > *Leonardo Gamas*
> > > >> > > > > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C
> > (75)
> > > >> > > > > 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > >
> > > >> > > *Leonardo Gamas*
> > > >> > > Software Engineer
> > > >> > > +557134943514
> > > >> > > +557581347440
> > > >> > > leogamas@jusbrasil.com.br
> > > >> > > www.jusbrasil.com.br
> > > >> > >
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> >
> > > >> > *Leonardo Gamas*
> > > >> > Software Engineer
> > > >> > T +55 (71) 3494-3514
> > > >> > C +55 (75) 8134-7440
> > > >> > leogamas@jusbrasil.com.br
> > > >> > www.jusbrasil.com.br
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > *Leonardo Gamas*
> > > >
> > > > Software Engineer
> > > > T +55 (71) 3494-3514
> > > > C +55 (75) 8134-7440
> > > > leogamas@jusbrasil.com.br
> > > >
> > > > www.jusbrasil.com.br
> > > >
> > > >
> > >
> > >
> > > --
> > >
> > > *Leonardo Gamas*
> > > Software Engineer
> > > T +55 (71) 3494-3514
> > > C +55 (75) 8134-7440
> > > leogamas@jusbrasil.com.br
> > > www.jusbrasil.com.br
> > >
> >
>
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer
> T +55 (71) 3494-3514
> C +55 (75) 8134-7440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br
>

Re: RegionServer dying every two or three days

Posted by Leonardo Gamas <le...@jusbrasil.com.br>.
Thanks, Neil, for sharing your experience with AWS! Could you tell us which
instance type you are using?
We are using m1.xlarge, which has 4 virtual cores, but I normally see
recommendations for machines with 8 cores, like c1.xlarge, m2.4xlarge, etc.
In principle these 8-core machines should suffer less from I/O contention,
since they don't share the physical server with other tenants. Is there any
information from Amazon or another source that confirms this, or is it based
on empirical observation?
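
A small, independent pause detector running next to the RegionServer can help
separate stalls like these from GC: if a second process also logs multi-second gaps
at the moment the RegionServer freezes, the stall is machine-wide (hypervisor
scheduling, I/O starvation, swap) rather than something the collector is doing.
A minimal sketch, my own illustration rather than anything that was actually run here:

public class HiccupMeter {
    public static void main(String[] args) throws InterruptedException {
        final long intervalMs = 100;      // how often we expect to wake up
        final long reportAboveMs = 1000;  // log anything that delays us by more than 1s
        long last = System.currentTimeMillis();
        while (true) {
            Thread.sleep(intervalMs);
            long now = System.currentTimeMillis();
            long stall = (now - last) - intervalMs;
            if (stall > reportAboveMs) {
                System.out.println(new java.util.Date(now) + " observed stall of ~" + stall + " ms");
            }
            last = now;
        }
    }
}

Running it as its own small JVM keeps the RegionServer's GC out of the picture, so its
gaps can be compared directly against the timestamps in the RegionServer GC log.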

2012/1/19 Neil Yalowitz <ne...@gmail.com>

> We have experienced many problems with our cluster on EC2.  The blunt
> solution was to increase the Zookeeper timeout to 5 minutes or even more.
>
> Even with a long timeout, however, it's not uncommon for us to see an EC2
> instance to become unresponsive to pings and SSH several times during a
> week.  It's been a very bad environment for clusters.
>
>
> Neil
>
> On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas
> <le...@jusbrasil.com.br>wrote:
>
> > Hi Guys,
> >
> > I have tested the parameters provided by Sandy, and it solved the GC
> > problems with the -XX:+UseParallelOldGC, thanks for the help Sandy.
> > I'm still experiencing some difficulties, the RegionServer continues to
> > shutdown, but it seems related to I/O. It starts to timeout many
> > connections, new connections to/from the machine timeout too, and finally
> > the RegionServer dies because of YouAreDeadException. I will collect more
> > data, but i think it's an Amazon/Virtualized Environment inherent issue.
> >
> > Thanks for the great help provided so far.
> >
> > 2012/1/5 Leonardo Gamas <le...@jusbrasil.com.br>
> >
> > > I don't think so, if Amazon stopped the machine it would cause a stop
> of
> > > minutes, not seconds, and since the DataNode, TaskTracker and Zookeeper
> > > continue to work normally.
> > > But it can be related to the shared environment nature of Amazon, maybe
> > > some spike in I/O caused by another virtualized server in the same
> > physical
> > > machine.
> > >
> > > But the intance type i'm using:
> > >
> > > *Extra Large Instance*
> > >
> > > 15 GB memory
> > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> > > 1,690 GB instance storage
> > > 64-bit platform
> > > I/O Performance: High
> > > API name: m1.xlarge
> > > I was not expecting to suffer from this problems, or at least not much.
> > >
> > >
> > > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> > >
> > >> You think it's an Amazon problem maybe?  Like they paused or migrated
> > >> your virtual machine, and it just happens to be during GC, leaving us
> to
> > >> think the GC ran long when it didn't?  I don't have a lot of
> experience
> > >> with Amazon so I don't know if that sort of thing is common.
> > >>
> > >> > -----Original Message-----
> > >> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > >> > Sent: Thursday, January 05, 2012 13:15
> > >> > To: user@hbase.apache.org
> > >> > Subject: Re: RegionServer dying every two or three days
> > >> >
> > >> > I checked the CPU Utilization graphics provided by Amazon (it's not
> > >> accurate,
> > >> > since the sample time is about 5 minutes) and don't see any
> > >> abnormality. I
> > >> > will setup TSDB with Nagios to have a more reliable source of
> > >> performance
> > >> > data.
> > >> >
> > >> > The machines don't have swap space, if i run:
> > >> >
> > >> > $ swapon -s
> > >> >
> > >> > To display swap usage summary, it returns an empty list.
> > >> >
> > >> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my to
> tests.
> > >> >
> > >> > I don't have payed much attention to the value of the new size
> param.
> > >> >
> > >> > Thanks again for the help!!
> > >> >
> > >> > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> > >> >
> > >> > > That size heap doesn't seem like it should cause a 36 second GC (a
> > >> > > minor GC even if I remember your logs correctly), so I tend to
> think
> > >> > > that other things are probably going on.
> > >> > >
> > >> > > This line here:
> > >> > >
> > >> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs] 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01, real=36.96 secs]
> > >> > >
> > >> > > is really mysterious to me.  It seems to indicate that the process
> > was
> > >> > > blocked for almost 37 seconds during a minor collection.  Note the
> > CPU
> > >> > > times are very low but the wall time is very high.  If it was
> > actually
> > >> > > doing GC work, I'd expect to see user time higher than real time,
> as
> > >> > > it is in other parallel collections (see your log snippet).  Were
> > you
> > >> > > really so CPU starved that it took 37 seconds to get in 50ms of
> > work?
> > >> > > I can't make sense of that.  I'm trying to think of something that
> > >> > > would block you for that long while all your threads are stopped
> for
> > >> > > GC, other than being in swap, but I can't come up with anything.
> > >>  You're
> > >> > certain you're not in swap?
> > >> > >
> > >> > > Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> > while
> > >> > > you troubleshoot?
> > >> > >
> > >> > > Why is your new size so small?  This generally means that
> relatively
> > >> > > more objects are being tenured than would be with a larger new
> size.
> > >> > > This could make collections of the old gen worse (GC time is said
> to
> > >> > > be proportional to the number of live objects in the generation,
> and
> > >> > > CMS does indeed cause STW pauses).  A typical new to tenured ratio
> > >> > > might be 1:3.  Were the new gen GCs taking too long?  This is
> > probably
> > >> > > orthogonal to your immediate issue, though.
> > >> > >
> > >> > >
> > >> > >
> > >> > > -----Original Message-----
> > >> > > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > >> > > Sent: Thursday, January 05, 2012 5:33 AM
> > >> > > To: user@hbase.apache.org
> > >> > > Subject: Re: RegionServer dying every two or three days
> > >> > >
> > >> > >  St.Ack,
> > >> > >
> > >> > > I don't have made any attempt in GC tunning, yet.
> > >> > > I will read the perf section as suggested.
> > >> > > I'm currently using Nagios + JMX to monitor the cluster, but it's
> > >> > > currently used for alert only, the perfdata is not been stored, so
> > >> > > it's kind of useless right now, but i was thinking in use TSDB to
> > >> > > store it, any known case of integration?
> > >> > > ---
> > >> > >
> > >> > > Sandy,
> > >> > >
> > >> > > Yes, my timeout is 30 seconds:
> > >> > >
> > >> > > <property>
> > >> > >   <name>zookeeper.session.timeout</name>
> > >> > >   <value>30000</value>
> > >> > > </property>
> > >> > >
> > >> > > To our application it's a sufferable time to wait in case a
> > >> > > RegionServer go offline.
> > >> > >
> > >> > > My heap is 4GB and my JVM params are:
> > >> > >
> > >> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> > >> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> > >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> > >> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> > >> > >
> > >> > > I will try the -XX:+UseParallelOldGC param and post my feedback
> > here.
> > >> > > ---
> > >> > >
> > >> > > Ramkrishna,
> > >> > >
> > >> > > Seems the GC is the root of all evil in this case.
> > >> > > ----
> > >> > >
> > >> > > Thank you all for the answers. I will try out these valuable
> advices
> > >> > > given here and post my results.
> > >> > >
> > >> > > Leo Gamas.
> > >> > >
> > >> > > 2012/1/5 Ramkrishna S Vasudevan <ra...@huawei.com>
> > >> > >
> > >> > > > Recently we faced a similar problem and it was due to GC config.
> > >> > > > Pls check your GC.
> > >> > > >
> > >> > > > Regards
> > >> > > > Ram
> > >> > > >
> > >> > > > -----Original Message-----
> > >> > > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On
> Behalf
> > Of
> > >> > > > Stack
> > >> > > > Sent: Thursday, January 05, 2012 2:50 AM
> > >> > > > To: user@hbase.apache.org
> > >> > > > Subject: Re: RegionServer dying every two or three days
> > >> > > >
> > >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> > >> > > > <le...@jusbrasil.com.br> wrote:
> > >> > > > > The third line took 36.96 seconds to execute, can this be
> > causing
> > >> > > > > this problem?
> > >> > > > >
> > >> > > >
> > >> > > > Probably.  Have you made any attempt at GC tuning?
> > >> > > >
> > >> > > >
> > >> > > > > Reading the code a little it seems that, even if it's
> disabled,
> > if
> > >> > > > > all files are target in a compaction, it's considered a major
> > >> > > > > compaction. Is
> > >> > > > it
> > >> > > > > right?
> > >> > > > >
> > >> > > >
> > >> > > > That is right.  They get 'upgraded' from minor to major.
> > >> > > >
> > >> > > > This should be fine though.  What you are avoiding setting major
> > >> > > > compactions to 0 is all regions being major compacted on a
> > period, a
> > >> > > > heavy weight effective rewrite of all your data (unless already
> > >> major
> > >> > > > compacted).   It looks like you have this disabled which is good
> > >> until
> > >> > > > you've wrestled your cluster into submission.
> > >> > > >
> > >> > > >
> > >> > > > > The machines don't have swap, so the swappiness parameter
> don't
> > >> > > > > seem to apply here. Any other suggestion?
> > >> > > > >
> > >> > > >
> > >> > > > See the perf section of the hbase manual.  It has our current
> > list.
> > >> > > >
> > >> > > > Are you monitoring your cluster w/ ganglia or tsdb?
> > >> > > >
> > >> > > >
> > >> > > > St.Ack
> > >> > > >
> > >> > > > > Thanks.
> > >> > > > >
> > >> > > > > 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
> > >> > > > >
> > >> > > > >> I will investigate this, thanks for the response.
> > >> > > > >>
> > >> > > > >>
> > >> > > > >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
> > >> > > > >>
> > >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> > >> > > > >>> timed out, have not heard from server in 61103ms for
> sessionid
> > >> > > > >>> 0x23462a4cf93a8fc, closing socket connection and attempting
> > >> > > > >>> reconnect
> > >> > > > >>>
> > >> > > > >>> It looks like the process has been unresponsive for some
> time,
> > >> > > > >>> so ZK
> > >> > > > has
> > >> > > > >>> terminated the session.  Did you experience a long GC pause
> > >> > > > >>> right
> > >> > > > before
> > >> > > > >>> this?  If you don't have GC logging enabled for the RS, you
> > can
> > >> > > > sometimes
> > >> > > > >>> tell by noticing a gap in the timestamps of the log
> statements
> > >> > > > >>> leading
> > >> > > > up
> > >> > > > >>> to the crash.
> > >> > > > >>>
> > >> > > > >>> If it turns out to be GC, you might want to look at your
> > kernel
> > >> > > > >>> swappiness setting (set it to 0) and your JVM params.
> > >> > > > >>>
> > >> > > > >>> Sandy
> > >> > > > >>>
> > >> > > > >>>
> > >> > > > >>> > -----Original Message-----
> > >> > > > >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > >> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> > >> > > > >>> > To: user@hbase.apache.org
> > >> > > > >>> > Subject: RegionServer dying every two or three days
> > >> > > > >>> >
> > >> > > > >>> > Hi,
> > >> > > > >>> >
> > >> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1
> > Master +
> > >> > > > >>> > 3
> > >> > > > >>> Slaves),
> > >> > > > >>> > running on Amazon EC2. The master is a High-Memory Extra
> > Large
> > >> > > > Instance
> > >> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and
> > Zookeeper.
> > >> > > > >>> > The slaves are Extra Large Instances (m1.xlarge) running
> > >> > > > >>> > Datanode,
> > >> > > > >>> TaskTracker,
> > >> > > > >>> > RegionServer and Zookeeper.
> > >> > > > >>> >
> > >> > > > >>> > From time to time, every two or three days, one of the
> > >> > > > >>> > RegionServers processes goes down, but the other processes
> > >> > > > >>> > (DataNode, TaskTracker,
> > >> > > > >>> > Zookeeper) continue normally.
> > >> > > > >>> >
> > >> > > > >>> > Reading the logs:
> > >> > > > >>> >
> > >> > > > >>> > The connection with Zookeeper timed out:
> > >> > > > >>> >
> > >> > > > >>> > ---------------------------
> > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
> > >> > > > >>> > ---------------------------
> > >> > > > >>> >
> > >> > > > >>> > And the Handlers start to fail:
> > >> > > > >>> >
> > >> > > > >>> > ---------------------------
> > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
> > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
> > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > >> > > > >>> >
> > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
> > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
> > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > >> > > > >>> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > >> > > > >>> >         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > >> > > > >>> > ---------------------------
> > >> > > > >>> >
> > >> > > > >>> > And finally the server throws a YouAreDeadException :( :
> > >> > > > >>> >
> > >> > > > >>> > ---------------------------
> > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
> > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
> > >> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > >> > > > >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > >> > > > >>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > >> > > > >>> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > >> > > > >>> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > >> > > > >>> >         at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> > >> > > > >>> >         at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> > >> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> > >> > > > >>> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> > >> > > > >>> >         at java.lang.Thread.run(Thread.java:662)
> > >> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
> > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > >> > > > >>> > rejected; currently processing
> > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > >> > > > as
> > >> > > > >>> > dead server
> > >> > > > >>> >         at
> > >> > > > >>> >
> > >> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(Serve
> > >> > > > >>> > rM
> > >> > > > >>> > ana
> > >> > > > >>> > ger.java:204)
> > >> > > > >>> >         at
> > >> > > > >>> >
> > >> > org.apache.hadoop.hbase.master.ServerManager.regionServerRepor
> > >> > > > >>> > t(
> > >> > > > >>> > Serv
> > >> > > > >>> > erManager.java:262)
> > >> > > > >>> >         at
> > >> > > > >>> >
> > >> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMas
> > >> > > > >>> > te
> > >> > > > >>> > r.jav
> > >> > > > >>> > a:669)
> > >> > > > >>> >         at
> > sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> > >> > > > Source)
> > >> > > > >>> >         at
> > >> > > > >>> >
> > >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMeth
> > >> > > > >>> > od
> > >> > > > >>> > Acces
> > >> > > > >>> > sorImpl.java:25)
> > >> > > > >>> >         at
> java.lang.reflect.Method.invoke(Method.java:597)
> > >> > > > >>> >         at
> > >> > > > >>> >
> > >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > >> > > > >>> >         at
> > >> > > > >>> >
> > >> > > >
> > >> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > >> > > > :1
> > >> > > > >>> > 039)
> > >> > > > >>> >
> > >> > > > >>> >         at
> > >> > > > >>> >
> > >> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > >> > > > >>> >         at
> > >> > > > >>> >
> > >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.j
> > >> > > > >>> > av
> > >> > > > >>> > a:257
> > >> > > > >>> > )
> > >> > > > >>> >         at $Proxy6.regionServerReport(Unknown Source)
> > >> > > > >>> >         at
> > >> > > > >>> >
> > >> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > >> > > > >>> > rv
> > >> > > > >>> > erRep
> > >> > > > >>> > ort(HRegionServer.java:729)
> > >> > > > >>> >         ... 2 more
> > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of
> > >> > metrics:
> > >> > > > >>> > requests=66, regions=206, stores=2078, storefiles=970,
> > >> > > > >>> > storefileIndexSize=78, memstoreSize=796,
> > >> > > > >>> > compactionQueueSize=0, flushQueueSize=0, usedHeap=1672,
> > >> > > > >>> > maxHeap=4083, blockCacheSize=705907552,
> > >> > > > >>> > blockCacheFree=150412064, blockCacheCount=10648,
> > >> > > > >>> > blockCacheHitCount=79578618, blockCacheMissCount=3036335,
> > >> > > > >>> > blockCacheEvictedCount=1401352, blockCacheHitRatio=96,
> > >> > > > >>> > blockCacheHitCachingRatio=98
> > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer:
> STOPPED:
> > >> > > > >>> > Unhandled
> > >> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException:
> > >> > Server
> > >> > > > >>> > REPORT rejected; currently processing
> > >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead
> > server
> > >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on
> > >> > > > >>> > 60020
> > >> > > > >>> > ---------------------------
> > >> > > > >>> >
> > >> > > > >>> > Then i restart the RegionServer and everything is back to
> > >> normal.
> > >> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i
> > don't
> > >> > > > >>> > see any abnormality in the same time window.
> > >> > > > >>> > I think it was caused by the lost of connection to
> > zookeeper.
> > >> > > > >>> > Is it
> > >> > > > >>> advisable to
> > >> > > > >>> > run zookeeper in the same machines?
> > >> > > > >>> > if the RegionServer lost it's connection to Zookeeper,
> > there's
> > >> > > > >>> > a way
> > >> > > > (a
> > >> > > > >>> > configuration perhaps) to re-join the cluster, and not
> only
> > >> die?
> > >> > > > >>> >
> > >> > > > >>> > Any idea what is causing this?? Or to prevent it from
> > >> happening?
> > >> > > > >>> >
> > >> > > > >>> > Any help is appreciated.
> > >> > > > >>> >
> > >> > > > >>> > Best Regards,
> > >> > > > >>> >
> > >> > > > >>> > --
> > >> > > > >>> >
> > >> > > > >>> > *Leonardo Gamas*
> > >> > > > >>> > Software Engineer
> > >> > > > >>> > +557134943514
> > >> > > > >>> > +557581347440
> > >> > > > >>> > leogamas@jusbrasil.com.br
> > >> > > > >>> > www.jusbrasil.com.br
> > >> > > > >>>
> > >> > > > >>
> > >> > > > >>
> > >> > > > >>
> > >> > > > >> --
> > >> > > > >>
> > >> > > > >> *Leonardo Gamas*
> > >> > > > >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C
> > (75)
> > >> > > > >> 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> > >> > > > >>
> > >> > > > >>
> > >> > > > >
> > >> > > > >
> > >> > > > > --
> > >> > > > >
> > >> > > > > *Leonardo Gamas*
> > >> > > > > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C
> (75)
> > >> > > > > 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> > >> > > >
> > >> > > >
> > >> > >
> > >> > >
> > >> > > --
> > >> > >
> > >> > > *Leonardo Gamas*
> > >> > > Software Engineer
> > >> > > +557134943514
> > >> > > +557581347440
> > >> > > leogamas@jusbrasil.com.br
> > >> > > www.jusbrasil.com.br
> > >> > >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> >
> > >> > *Leonardo Gamas*
> > >> > Software Engineer
> > >> > T +55 (71) 3494-3514
> > >> > C +55 (75) 8134-7440
> > >> > leogamas@jusbrasil.com.br
> > >> > www.jusbrasil.com.br
> > >>
> > >
> > >
> > >
> > > --
> > >
> > > *Leonardo Gamas*
> > >
> > > Software Engineer
> > > T +55 (71) 3494-3514
> > > C +55 (75) 8134-7440
> > > leogamas@jusbrasil.com.br
> > >
> > > www.jusbrasil.com.br
> > >
> > >
> >
> >
> > --
> >
> > *Leonardo Gamas*
> > Software Engineer
> > T +55 (71) 3494-3514
> > C +55 (75) 8134-7440
> > leogamas@jusbrasil.com.br
> > www.jusbrasil.com.br
> >
>



-- 

*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514
C +55 (75) 8134-7440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br

Re: RegionServer dying every two or three days

Posted by Neil Yalowitz <ne...@gmail.com>.
We have experienced many problems with our cluster on EC2.  The blunt
solution was to increase the Zookeeper timeout to 5 minutes or even more.

Even with a long timeout, however, it's not uncommon for us to see an EC2
instance become unresponsive to pings and SSH several times a week.  It's
been a very bad environment for clusters.
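
For reference, a minimal sketch of the settings involved for a ZooKeeper
ensemble you run yourself, as in this thread (the 300000 ms value simply
mirrors the five minutes mentioned above, not a recommendation; raising only
the HBase side is not enough, because the ZooKeeper server caps negotiated
sessions at maxSessionTimeout, which defaults to 20 * tickTime):

  <!-- hbase-site.xml: session timeout the RegionServer requests -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>300000</value>
  </property>

  # zoo.cfg on the ZooKeeper servers: allow sessions that long
  maxSessionTimeout=300000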


Neil

On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas
<le...@jusbrasil.com.br>wrote:

> Hi Guys,
>
> I have tested the parameters provided by Sandy, and it solved the GC
> problems with the -XX:+UseParallelOldGC, thanks for the help Sandy.
> I'm still experiencing some difficulties, the RegionServer continues to
> shutdown, but it seems related to I/O. It starts to timeout many
> connections, new connections to/from the machine timeout too, and finally
> the RegionServer dies because of YouAreDeadException. I will collect more
> data, but i think it's an Amazon/Virtualized Environment inherent issue.
>
> Thanks for the great help provided so far.
>
> 2012/1/5 Leonardo Gamas <le...@jusbrasil.com.br>
>
> > I don't think so, if Amazon stopped the machine it would cause a stop of
> > minutes, not seconds, and since the DataNode, TaskTracker and Zookeeper
> > continue to work normally.
> > But it can be related to the shared environment nature of Amazon, maybe
> > some spike in I/O caused by another virtualized server in the same
> physical
> > machine.
> >
> > But the intance type i'm using:
> >
> > *Extra Large Instance*
> >
> > 15 GB memory
> > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> > 1,690 GB instance storage
> > 64-bit platform
> > I/O Performance: High
> > API name: m1.xlarge
> > I was not expecting to suffer from this problems, or at least not much.
> >
> >
> > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> >
> >> You think it's an Amazon problem maybe?  Like they paused or migrated
> >> your virtual machine, and it just happens to be during GC, leaving us to
> >> think the GC ran long when it didn't?  I don't have a lot of experience
> >> with Amazon so I don't know if that sort of thing is common.
> >>
> >> > -----Original Message-----
> >> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> >> > Sent: Thursday, January 05, 2012 13:15
> >> > To: user@hbase.apache.org
> >> > Subject: Re: RegionServer dying every two or three days
> >> >
> >> > I checked the CPU Utilization graphics provided by Amazon (it's not
> >> accurate,
> >> > since the sample time is about 5 minutes) and don't see any
> >> abnormality. I
> >> > will setup TSDB with Nagios to have a more reliable source of
> >> performance
> >> > data.
> >> >
> >> > The machines don't have swap space, if i run:
> >> >
> >> > $ swapon -s
> >> >
> >> > To display swap usage summary, it returns an empty list.
> >> >
> >> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my to tests.
> >> >
> >> > I don't have payed much attention to the value of the new size param.
> >> >
> >> > Thanks again for the help!!
> >> >
> >> > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> >> >
> >> > > That size heap doesn't seem like it should cause a 36 second GC (a
> >> > > minor GC even if I remember your logs correctly), so I tend to think
> >> > > that other things are probably going on.
> >> > >
> >> > > This line here:
> >> > >
> >> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
> >> > > 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01,
> >> > > real=36.96 secs]
> >> > >
> >> > > is really mysterious to me.  It seems to indicate that the process
> was
> >> > > blocked for almost 37 seconds during a minor collection.  Note the
> CPU
> >> > > times are very low but the wall time is very high.  If it was
> actually
> >> > > doing GC work, I'd expect to see user time higher than real time, as
> >> > > it is in other parallel collections (see your log snippet).  Were
> you
> >> > > really so CPU starved that it took 37 seconds to get in 50ms of
> work?
> >> > > I can't make sense of that.  I'm trying to think of something that
> >> > > would block you for that long while all your threads are stopped for
> >> > > GC, other than being in swap, but I can't come up with anything.
> >>  You're
> >> > certain you're not in swap?
> >> > >
> >> > > Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> while
> >> > > you troubleshoot?
> >> > >
> >> > > Why is your new size so small?  This generally means that relatively
> >> > > more objects are being tenured than would be with a larger new size.
> >> > > This could make collections of the old gen worse (GC time is said to
> >> > > be proportional to the number of live objects in the generation, and
> >> > > CMS does indeed cause STW pauses).  A typical new to tenured ratio
> >> > > might be 1:3.  Were the new gen GCs taking too long?  This is
> probably
> >> > > orthogonal to your immediate issue, though.
> >> > >
> >> > >
> >> > >
> >> > > -----Original Message-----
> >> > > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> >> > > Sent: Thursday, January 05, 2012 5:33 AM
> >> > > To: user@hbase.apache.org
> >> > > Subject: Re: RegionServer dying every two or three days
> >> > >
> >> > >  St.Ack,
> >> > >
> >> > > I don't have made any attempt in GC tunning, yet.
> >> > > I will read the perf section as suggested.
> >> > > I'm currently using Nagios + JMX to monitor the cluster, but it's
> >> > > currently used for alert only, the perfdata is not been stored, so
> >> > > it's kind of useless right now, but i was thinking in use TSDB to
> >> > > store it, any known case of integration?
> >> > > ---
> >> > >
> >> > > Sandy,
> >> > >
> >> > > Yes, my timeout is 30 seconds:
> >> > >
> >> > > <property>
> >> > >   <name>zookeeper.session.timeout</name>
> >> > >   <value>30000</value>
> >> > > </property>
> >> > >
> >> > > To our application it's a sufferable time to wait in case a
> >> > > RegionServer go offline.
> >> > >
> >> > > My heap is 4GB and my JVM params are:
> >> > >
> >> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> >> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> >> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> >> > >
> >> > > I will try the -XX:+UseParallelOldGC param and post my feedback
> here.
> >> > > ---
> >> > >
> >> > > Ramkrishna,
> >> > >
> >> > > Seems the GC is the root of all evil in this case.
> >> > > ----
> >> > >
> >> > > Thank you all for the answers. I will try out these valuable advices
> >> > > given here and post my results.
> >> > >
> >> > > Leo Gamas.
> >> > >
> >> > > 2012/1/5 Ramkrishna S Vasudevan <ra...@huawei.com>
> >> > >
> >> > > > Recently we faced a similar problem and it was due to GC config.
> >> > > > Pls check your GC.
> >> > > >
> >> > > > Regards
> >> > > > Ram
> >> > > >
> >> > > > -----Original Message-----
> >> > > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf
> Of
> >> > > > Stack
> >> > > > Sent: Thursday, January 05, 2012 2:50 AM
> >> > > > To: user@hbase.apache.org
> >> > > > Subject: Re: RegionServer dying every two or three days
> >> > > >
> >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> >> > > > <le...@jusbrasil.com.br> wrote:
> >> > > > > The third line took 36.96 seconds to execute, can this be
> causing
> >> > > > > this problem?
> >> > > > >
> >> > > >
> >> > > > Probably.  Have you made any attempt at GC tuning?
> >> > > >
> >> > > >
> >> > > > > Reading the code a little it seems that, even if it's disabled,
> if
> >> > > > > all files are target in a compaction, it's considered a major
> >> > > > > compaction. Is
> >> > > > it
> >> > > > > right?
> >> > > > >
> >> > > >
> >> > > > That is right.  They get 'upgraded' from minor to major.
> >> > > >
> >> > > > This should be fine though.  What you are avoiding setting major
> >> > > > compactions to 0 is all regions being major compacted on a
> period, a
> >> > > > heavy weight effective rewrite of all your data (unless already
> >> major
> >> > > > compacted).   It looks like you have this disabled which is good
> >> until
> >> > > > you've wrestled your cluster into submission.
> >> > > >
> >> > > >
> >> > > > > The machines don't have swap, so the swappiness parameter don't
> >> > > > > seem to apply here. Any other suggestion?
> >> > > > >
> >> > > >
> >> > > > See the perf section of the hbase manual.  It has our current
> list.
> >> > > >
> >> > > > Are you monitoring your cluster w/ ganglia or tsdb?
> >> > > >
> >> > > >
> >> > > > St.Ack
> >> > > >
> >> > > > > Thanks.
> >> > > > >
> >> > > > > 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
> >> > > > >
> >> > > > >> I will investigate this, thanks for the response.
> >> > > > >>
> >> > > > >>
> >> > > > >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
> >> > > > >>
> >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> >> > > > >>> timed out, have not heard from server in 61103ms for sessionid
> >> > > > >>> 0x23462a4cf93a8fc, closing socket connection and attempting
> >> > > > >>> reconnect
> >> > > > >>>
> >> > > > >>> It looks like the process has been unresponsive for some time,
> >> > > > >>> so ZK
> >> > > > has
> >> > > > >>> terminated the session.  Did you experience a long GC pause
> >> > > > >>> right
> >> > > > before
> >> > > > >>> this?  If you don't have GC logging enabled for the RS, you
> can
> >> > > > sometimes
> >> > > > >>> tell by noticing a gap in the timestamps of the log statements
> >> > > > >>> leading
> >> > > > up
> >> > > > >>> to the crash.
> >> > > > >>>
> >> > > > >>> If it turns out to be GC, you might want to look at your
> kernel
> >> > > > >>> swappiness setting (set it to 0) and your JVM params.
> >> > > > >>>
> >> > > > >>> Sandy
> >> > > > >>>
> >> > > > >>>
> >> > > > >>> > -----Original Message-----
> >> > > > >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> >> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> >> > > > >>> > To: user@hbase.apache.org
> >> > > > >>> > Subject: RegionServer dying every two or three days
> >> > > > >>> >
> >> > > > >>> > Hi,
> >> > > > >>> >
> >> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1
> Master +
> >> > > > >>> > 3
> >> > > > >>> Slaves),
> >> > > > >>> > running on Amazon EC2. The master is a High-Memory Extra
> Large
> >> > > > Instance
> >> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and
> Zookeeper.
> >> > > > >>> > The slaves are Extra Large Instances (m1.xlarge) running
> >> > > > >>> > Datanode,
> >> > > > >>> TaskTracker,
> >> > > > >>> > RegionServer and Zookeeper.
> >> > > > >>> >
> >> > > > >>> > From time to time, every two or three days, one of the
> >> > > > >>> > RegionServers processes goes down, but the other processes
> >> > > > >>> > (DataNode, TaskTracker,
> >> > > > >>> > Zookeeper) continue normally.
> >> > > > >>> >
> >> > > > >>> > Reading the logs:
> >> > > > >>> >
> >> > > > >>> > The connection with Zookeeper timed out:
> >> > > > >>> >
> >> > > > >>> > ---------------------------
> >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> >> > > > >>> > timed
> >> > > > out,
> >> > > > >>> have
> >> > > > >>> > not heard from server in 61103ms for sessionid
> >> > > > >>> > 0x23462a4cf93a8fc,
> >> > > > >>> closing
> >> > > > >>> > socket connection and attempting reconnect
> >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> >> > > > >>> > timed
> >> > > > out,
> >> > > > >>> have
> >> > > > >>> > not heard from server in 61205ms for sessionid
> >> > > > >>> > 0x346c561a55953e,
> >> > > > closing
> >> > > > >>> > socket connection and attempting reconnect
> >> > > > >>> > ---------------------------
> >> > > > >>> >
> >> > > > >>> > And the Handlers start to fail:
> >> > > > >>> >
> >> > > > >>> > ---------------------------
> >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server
> Responder,
> >> > > > >>> > call
> >> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf)
> >> > > > >>> > from
> >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler
> 81
> >> > > > >>> > on
> >> > > > 60020
> >> > > > >>> > caught: java.nio.channels.ClosedChannelException
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> >> > > > 13
> >> > > > >>> > 3)
> >> > > > >>> >         at
> >> > > > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> >> > > > >>> > 1341)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> >> > > > >>> > ns
> >> > > > >>> > e(HB
> >> > > > >>> > aseServer.java:727)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> >> > > > >>> > as
> >> > > > >>> > eSe
> >> > > > >>> > rver.java:792)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> >> > > > :1
> >> > > > >>> > 083)
> >> > > > >>> >
> >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server
> Responder,
> >> > > > >>> > call
> >> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430)
> >> > > > >>> > from
> >> > > > >>> > xx.xx.xx.xx:xxxx: output error
> >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler
> 62
> >> > > > >>> > on
> >> > > > 60020
> >> > > > >>> > caught: java.nio.channels.ClosedChannelException
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> >> > > > 13
> >> > > > >>> > 3)
> >> > > > >>> >         at
> >> > > > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> >> > > > >>> > 1341)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> >> > > > >>> > ns
> >> > > > >>> > e(HB
> >> > > > >>> > aseServer.java:727)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> >> > > > >>> > as
> >> > > > >>> > eSe
> >> > > > >>> > rver.java:792)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> >> > > > :1
> >> > > > >>> > 083)
> >> > > > >>> > ---------------------------
> >> > > > >>> >
> >> > > > >>> > And finally the server throws a YouAreDeadException :( :
> >> > > > >>> >
> >> > > > >>> > ---------------------------
> >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> >> > > > connection
> >> > > > >>> to
> >> > > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket
> connection
> >> > > > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> >> > > > initiating
> >> > > > >>> session
> >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to
> >> > > > >>> > reconnect to ZooKeeper service, session 0x23462a4cf93a8fc
> has
> >> > > > >>> > expired, closing
> >> > > > socket
> >> > > > >>> > connection
> >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> >> > > > connection
> >> > > > >>> to
> >> > > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket
> connection
> >> > > > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> >> > > > initiating
> >> > > > >>> session
> >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to
> >> > > > >>> > reconnect to ZooKeeper service, session 0x346c561a55953e has
> >> > > > >>> > expired, closing
> >> > > > socket
> >> > > > >>> > connection
> >> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING
> >> > > > >>> > region server
> >> > > > >>> > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> >> > > > >>> > load=(requests=447, regions=206, usedHeap=1584,
> >> > maxHeap=4083):
> >> > > > >>> > Unhandled
> >> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException:
> >> > Server
> >> > > > >>> > REPORT rejected; currently processing
> >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead
> server
> >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> >> > > > >>> > rejected; currently processing
> >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> >> > > > as
> >> > > > >>> > dead server
> >> > > > >>> >         at
> >> > > > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >> > > > >>> > Method)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruc
> >> > > > to
> >> > > > r
> >> > > > >>> > AccessorImpl.java:39)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Delegating
> >> > > > Co
> >> > > > n
> >> > > > >>> > structorAccessorImpl.java:27)
> >> > > > >>> >         at
> >> > > > >>>
> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.ipc.RemoteException.instantiateException(Rem
> >> > > > >>> > ot
> >> > > > >>> > eExce
> >> > > > >>> > ption.java:95)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Re
> >> > > > >>> > mo
> >> > > > >>> > te
> >> > > > >>> > Exception.java:79)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> >> > > > >>> > rv
> >> > > > >>> > erRep
> >> > > > >>> > ort(HRegionServer.java:735)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer
> >> > > > .j
> >> > > > >>> > ava:596)
> >> > > > >>> >         at java.lang.Thread.run(Thread.java:662)
> >> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
> >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> >> > > > >>> > rejected; currently processing
> >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> >> > > > as
> >> > > > >>> > dead server
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(Serve
> >> > > > >>> > rM
> >> > > > >>> > ana
> >> > > > >>> > ger.java:204)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.master.ServerManager.regionServerRepor
> >> > > > >>> > t(
> >> > > > >>> > Serv
> >> > > > >>> > erManager.java:262)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMas
> >> > > > >>> > te
> >> > > > >>> > r.jav
> >> > > > >>> > a:669)
> >> > > > >>> >         at
> sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> >> > > > Source)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMeth
> >> > > > >>> > od
> >> > > > >>> > Acces
> >> > > > >>> > sorImpl.java:25)
> >> > > > >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > > >
> >> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> >> > > > :1
> >> > > > >>> > 039)
> >> > > > >>> >
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.j
> >> > > > >>> > av
> >> > > > >>> > a:257
> >> > > > >>> > )
> >> > > > >>> >         at $Proxy6.regionServerReport(Unknown Source)
> >> > > > >>> >         at
> >> > > > >>> >
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> >> > > > >>> > rv
> >> > > > >>> > erRep
> >> > > > >>> > ort(HRegionServer.java:729)
> >> > > > >>> >         ... 2 more
> >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of
> >> > metrics:
> >> > > > >>> > requests=66, regions=206, stores=2078, storefiles=970,
> >> > > > >>> > storefileIndexSize=78, memstoreSize=796,
> >> > > > >>> > compactionQueueSize=0, flushQueueSize=0, usedHeap=1672,
> >> > > > >>> > maxHeap=4083, blockCacheSize=705907552,
> >> > > > >>> > blockCacheFree=150412064, blockCacheCount=10648,
> >> > > > >>> > blockCacheHitCount=79578618, blockCacheMissCount=3036335,
> >> > > > >>> > blockCacheEvictedCount=1401352, blockCacheHitRatio=96,
> >> > > > >>> > blockCacheHitCachingRatio=98
> >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED:
> >> > > > >>> > Unhandled
> >> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException:
> >> > Server
> >> > > > >>> > REPORT rejected; currently processing
> >> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead
> server
> >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on
> >> > > > >>> > 60020
> >> > > > >>> > ---------------------------
> >> > > > >>> >
> >> > > > >>> > Then i restart the RegionServer and everything is back to
> >> normal.
> >> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i
> don't
> >> > > > >>> > see any abnormality in the same time window.
> >> > > > >>> > I think it was caused by the lost of connection to
> zookeeper.
> >> > > > >>> > Is it
> >> > > > >>> advisable to
> >> > > > >>> > run zookeeper in the same machines?
> >> > > > >>> > if the RegionServer lost it's connection to Zookeeper,
> there's
> >> > > > >>> > a way
> >> > > > (a
> >> > > > >>> > configuration perhaps) to re-join the cluster, and not only
> >> die?
> >> > > > >>> >
> >> > > > >>> > Any idea what is causing this?? Or to prevent it from
> >> happening?
> >> > > > >>> >
> >> > > > >>> > Any help is appreciated.
> >> > > > >>> >
> >> > > > >>> > Best Regards,
> >> > > > >>> >
> >> > > > >>> > --
> >> > > > >>> >
> >> > > > >>> > *Leonardo Gamas*
> >> > > > >>> > Software Engineer
> >> > > > >>> > +557134943514
> >> > > > >>> > +557581347440
> >> > > > >>> > leogamas@jusbrasil.com.br
> >> > > > >>> > www.jusbrasil.com.br
> >> > > > >>>
> >> > > > >>
> >> > > > >>
> >> > > > >>
> >> > > > >> --
> >> > > > >>
> >> > > > >> *Leonardo Gamas*
> >> > > > >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C
> (75)
> >> > > > >> 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> >> > > > >>
> >> > > > >>
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > >
> >> > > > > *Leonardo Gamas*
> >> > > > > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> >> > > > > 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> >> > > >
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > >
> >> > > *Leonardo Gamas*
> >> > > Software Engineer
> >> > > +557134943514
> >> > > +557581347440
> >> > > leogamas@jusbrasil.com.br
> >> > > www.jusbrasil.com.br
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> >
> >> > *Leonardo Gamas*
> >> > Software Engineer
> >> > T +55 (71) 3494-3514
> >> > C +55 (75) 8134-7440
> >> > leogamas@jusbrasil.com.br
> >> > www.jusbrasil.com.br
> >>
> >
> >
> >
> > --
> >
> > *Leonardo Gamas*
> >
> > Software Engineer
> > T +55 (71) 3494-3514
> > C +55 (75) 8134-7440
> > leogamas@jusbrasil.com.br
> >
> > www.jusbrasil.com.br
> >
> >
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer
> T +55 (71) 3494-3514
> C +55 (75) 8134-7440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br
>

Re: RegionServer dying every two or three days

Posted by Leonardo Gamas <le...@jusbrasil.com.br>.
Hi Guys,

I have tested the parameters provided by Sandy, and -XX:+UseParallelOldGC
solved the GC problems. Thanks for the help, Sandy.
I'm still experiencing some difficulties: the RegionServer continues to
shut down, but it now seems related to I/O. Many connections start to time
out, new connections to/from the machine time out too, and finally the
RegionServer dies with a YouAreDeadException. I will collect more data, but
I think it's an issue inherent to Amazon's virtualized environment.
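
For reference, a rough sketch of what the revised JVM options might look
like with that change (the heap size and GC log path are carried over from
earlier in the thread, the new-generation size here is only an assumption,
and since -XX:+UseParallelOldGC pairs with the throughput collector the
CMS/ParNew flags from the original set are dropped):

  -Xmx4096m -server \
    -XX:+UseParallelGC -XX:+UseParallelOldGC \
    -XX:NewSize=512m -XX:MaxNewSize=512m \
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
    -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log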

Thanks for the great help provided so far.

2012/1/5 Leonardo Gamas <le...@jusbrasil.com.br>

> I don't think so, if Amazon stopped the machine it would cause a stop of
> minutes, not seconds, and since the DataNode, TaskTracker and Zookeeper
> continue to work normally.
> But it can be related to the shared environment nature of Amazon, maybe
> some spike in I/O caused by another virtualized server in the same physical
> machine.
>
> But the intance type i'm using:
>
> *Extra Large Instance*
>
> 15 GB memory
> 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> 1,690 GB instance storage
> 64-bit platform
> I/O Performance: High
> API name: m1.xlarge
> I was not expecting to suffer from this problems, or at least not much.
>
>
> 2012/1/5 Sandy Pratt <pr...@adobe.com>
>
>> You think it's an Amazon problem maybe?  Like they paused or migrated
>> your virtual machine, and it just happens to be during GC, leaving us to
>> think the GC ran long when it didn't?  I don't have a lot of experience
>> with Amazon so I don't know if that sort of thing is common.
>>
>> > -----Original Message-----
>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
>> > Sent: Thursday, January 05, 2012 13:15
>> > To: user@hbase.apache.org
>> > Subject: Re: RegionServer dying every two or three days
>> >
>> > I checked the CPU Utilization graphics provided by Amazon (it's not
>> accurate,
>> > since the sample time is about 5 minutes) and don't see any
>> abnormality. I
>> > will setup TSDB with Nagios to have a more reliable source of
>> performance
>> > data.
>> >
>> > The machines don't have swap space, if i run:
>> >
>> > $ swapon -s
>> >
>> > To display swap usage summary, it returns an empty list.
>> >
>> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my to tests.
>> >
>> > I don't have payed much attention to the value of the new size param.
>> >
>> > Thanks again for the help!!
>> >
>> > 2012/1/5 Sandy Pratt <pr...@adobe.com>
>> >
>> > > That size heap doesn't seem like it should cause a 36 second GC (a
>> > > minor GC even if I remember your logs correctly), so I tend to think
>> > > that other things are probably going on.
>> > >
>> > > This line here:
>> > >
>> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
>> > > 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01,
>> > > real=36.96 secs]
>> > >
>> > > is really mysterious to me.  It seems to indicate that the process was
>> > > blocked for almost 37 seconds during a minor collection.  Note the CPU
>> > > times are very low but the wall time is very high.  If it was actually
>> > > doing GC work, I'd expect to see user time higher than real time, as
>> > > it is in other parallel collections (see your log snippet).  Were you
>> > > really so CPU starved that it took 37 seconds to get in 50ms of work?
>> > > I can't make sense of that.  I'm trying to think of something that
>> > > would block you for that long while all your threads are stopped for
>> > > GC, other than being in swap, but I can't come up with anything.
>>  You're
>> > certain you're not in swap?
>> > >
>> > > Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts while
>> > > you troubleshoot?
>> > >
>> > > Why is your new size so small?  This generally means that relatively
>> > > more objects are being tenured than would be with a larger new size.
>> > > This could make collections of the old gen worse (GC time is said to
>> > > be proportional to the number of live objects in the generation, and
>> > > CMS does indeed cause STW pauses).  A typical new to tenured ratio
>> > > might be 1:3.  Were the new gen GCs taking too long?  This is probably
>> > > orthogonal to your immediate issue, though.
>> > >
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
>> > > Sent: Thursday, January 05, 2012 5:33 AM
>> > > To: user@hbase.apache.org
>> > > Subject: Re: RegionServer dying every two or three days
>> > >
>> > >  St.Ack,
>> > >
>> > > I don't have made any attempt in GC tunning, yet.
>> > > I will read the perf section as suggested.
>> > > I'm currently using Nagios + JMX to monitor the cluster, but it's
>> > > currently used for alert only, the perfdata is not been stored, so
>> > > it's kind of useless right now, but i was thinking in use TSDB to
>> > > store it, any known case of integration?
>> > > ---
>> > >
>> > > Sandy,
>> > >
>> > > Yes, my timeout is 30 seconds:
>> > >
>> > > <property>
>> > >   <name>zookeeper.session.timeout</name>
>> > >   <value>30000</value>
>> > > </property>
>> > >
>> > > To our application it's a sufferable time to wait in case a
>> > > RegionServer go offline.
>> > >
>> > > My heap is 4GB and my JVM params are:
>> > >
>> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
>> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
>> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
>> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
>> > >
>> > > I will try the -XX:+UseParallelOldGC param and post my feedback here.
>> > > ---
>> > >
>> > > Ramkrishna,
>> > >
>> > > Seems the GC is the root of all evil in this case.
>> > > ----
>> > >
>> > > Thank you all for the answers. I will try out these valuable advices
>> > > given here and post my results.
>> > >
>> > > Leo Gamas.
>> > >
>> > > 2012/1/5 Ramkrishna S Vasudevan <ra...@huawei.com>
>> > >
>> > > > Recently we faced a similar problem and it was due to GC config.
>> > > > Pls check your GC.
>> > > >
>> > > > Regards
>> > > > Ram
>> > > >
>> > > > -----Original Message-----
>> > > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>> > > > Stack
>> > > > Sent: Thursday, January 05, 2012 2:50 AM
>> > > > To: user@hbase.apache.org
>> > > > Subject: Re: RegionServer dying every two or three days
>> > > >
>> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
>> > > > <le...@jusbrasil.com.br> wrote:
>> > > > > The third line took 36.96 seconds to execute, can this be causing
>> > > > > this problem?
>> > > > >
>> > > >
>> > > > Probably.  Have you made any attempt at GC tuning?
>> > > >
>> > > >
>> > > > > Reading the code a little it seems that, even if it's disabled, if
>> > > > > all files are target in a compaction, it's considered a major
>> > > > > compaction. Is
>> > > > it
>> > > > > right?
>> > > > >
>> > > >
>> > > > That is right.  They get 'upgraded' from minor to major.
>> > > >
>> > > > This should be fine though.  What you are avoiding setting major
>> > > > compactions to 0 is all regions being major compacted on a period, a
>> > > > heavy weight effective rewrite of all your data (unless already
>> major
>> > > > compacted).   It looks like you have this disabled which is good
>> until
>> > > > you've wrestled your cluster into submission.
>> > > >
>> > > >
>> > > > > The machines don't have swap, so the swappiness parameter don't
>> > > > > seem to apply here. Any other suggestion?
>> > > > >
>> > > >
>> > > > See the perf section of the hbase manual.  It has our current list.
>> > > >
>> > > > Are you monitoring your cluster w/ ganglia or tsdb?
>> > > >
>> > > >
>> > > > St.Ack
>> > > >
>> > > > > Thanks.
>> > > > >
>> > > > > 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
>> > > > >
>> > > > >> I will investigate this, thanks for the response.
>> > > > >>
>> > > > >>
>> > > > >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
>> > > > >>
>> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
>> > > > >>> timed out, have not heard from server in 61103ms for sessionid
>> > > > >>> 0x23462a4cf93a8fc, closing socket connection and attempting
>> > > > >>> reconnect
>> > > > >>>
>> > > > >>> It looks like the process has been unresponsive for some time,
>> > > > >>> so ZK
>> > > > has
>> > > > >>> terminated the session.  Did you experience a long GC pause
>> > > > >>> right
>> > > > before
>> > > > >>> this?  If you don't have GC logging enabled for the RS, you can
>> > > > sometimes
>> > > > >>> tell by noticing a gap in the timestamps of the log statements
>> > > > >>> leading
>> > > > up
>> > > > >>> to the crash.
>> > > > >>>
>> > > > >>> If it turns out to be GC, you might want to look at your kernel
>> > > > >>> swappiness setting (set it to 0) and your JVM params.
>> > > > >>>
>> > > > >>> Sandy
>> > > > >>>
>> > > > >>>
>> > > > >>> > -----Original Message-----
>> > > > >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
>> > > > >>> > Sent: Thursday, December 29, 2011 07:44
>> > > > >>> > To: user@hbase.apache.org
>> > > > >>> > Subject: RegionServer dying every two or three days
>> > > > >>> >
>> > > > >>> > Hi,
>> > > > >>> >
>> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master +
>> > > > >>> > 3
>> > > > >>> Slaves),
>> > > > >>> > running on Amazon EC2. The master is a High-Memory Extra Large
>> > > > Instance
>> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper.
>> > > > >>> > The slaves are Extra Large Instances (m1.xlarge) running
>> > > > >>> > Datanode,
>> > > > >>> TaskTracker,
>> > > > >>> > RegionServer and Zookeeper.
>> > > > >>> >
>> > > > >>> > From time to time, every two or three days, one of the
>> > > > >>> > RegionServers processes goes down, but the other processes
>> > > > >>> > (DataNode, TaskTracker,
>> > > > >>> > Zookeeper) continue normally.
>> > > > >>> >
>> > > > >>> > Reading the logs:
>> > > > >>> >
>> > > > >>> > The connection with Zookeeper timed out:
>> > > > >>> >
>> > > > >>> > ---------------------------
>> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
>> > > > >>> > timed
>> > > > out,
>> > > > >>> have
>> > > > >>> > not heard from server in 61103ms for sessionid
>> > > > >>> > 0x23462a4cf93a8fc,
>> > > > >>> closing
>> > > > >>> > socket connection and attempting reconnect
>> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
>> > > > >>> > timed
>> > > > out,
>> > > > >>> have
>> > > > >>> > not heard from server in 61205ms for sessionid
>> > > > >>> > 0x346c561a55953e,
>> > > > closing
>> > > > >>> > socket connection and attempting reconnect
>> > > > >>> > ---------------------------
>> > > > >>> >
>> > > > >>> > And the Handlers start to fail:
>> > > > >>> >
>> > > > >>> > ---------------------------
>> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder,
>> > > > >>> > call
>> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf)
>> > > > >>> > from
>> > > > >>> > xx.xx.xx.xx:xxxx: output error
>> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81
>> > > > >>> > on
>> > > > 60020
>> > > > >>> > caught: java.nio.channels.ClosedChannelException
>> > > > >>> >         at
>> > > > >>> >
>> > > >
>> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
>> > > > 13
>> > > > >>> > 3)
>> > > > >>> >         at
>> > > > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>> > > > >>> >         at
>> > > > >>> >
>> > > >
>> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
>> > > > >>> > 1341)
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
>> > > > >>> > ns
>> > > > >>> > e(HB
>> > > > >>> > aseServer.java:727)
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
>> > > > >>> > as
>> > > > >>> > eSe
>> > > > >>> > rver.java:792)
>> > > > >>> >         at
>> > > > >>> >
>> > > >
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
>> > > > :1
>> > > > >>> > 083)
>> > > > >>> >
>> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder,
>> > > > >>> > call
>> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430)
>> > > > >>> > from
>> > > > >>> > xx.xx.xx.xx:xxxx: output error
>> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62
>> > > > >>> > on
>> > > > 60020
>> > > > >>> > caught: java.nio.channels.ClosedChannelException
>> > > > >>> >         at
>> > > > >>> >
>> > > >
>> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
>> > > > 13
>> > > > >>> > 3)
>> > > > >>> >         at
>> > > > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>> > > > >>> >         at
>> > > > >>> >
>> > > >
>> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
>> > > > >>> > 1341)
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
>> > > > >>> > ns
>> > > > >>> > e(HB
>> > > > >>> > aseServer.java:727)
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
>> > > > >>> > as
>> > > > >>> > eSe
>> > > > >>> > rver.java:792)
>> > > > >>> >         at
>> > > > >>> >
>> > > >
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
>> > > > :1
>> > > > >>> > 083)
>> > > > >>> > ---------------------------
>> > > > >>> >
>> > > > >>> > And finally the server throws a YouAreDeadException :( :
>> > > > >>> >
>> > > > >>> > ---------------------------
>> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
>> > > > connection
>> > > > >>> to
>> > > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
>> > > > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
>> > > > initiating
>> > > > >>> session
>> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to
>> > > > >>> > reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has
>> > > > >>> > expired, closing
>> > > > socket
>> > > > >>> > connection
>> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
>> > > > connection
>> > > > >>> to
>> > > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
>> > > > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
>> > > > initiating
>> > > > >>> session
>> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to
>> > > > >>> > reconnect to ZooKeeper service, session 0x346c561a55953e has
>> > > > >>> > expired, closing
>> > > > socket
>> > > > >>> > connection
>> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING
>> > > > >>> > region server
>> > > > >>> > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
>> > > > >>> > load=(requests=447, regions=206, usedHeap=1584,
>> > maxHeap=4083):
>> > > > >>> > Unhandled
>> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException:
>> > Server
>> > > > >>> > REPORT rejected; currently processing
>> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
>> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>> > > > >>> > rejected; currently processing
>> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
>> > > > as
>> > > > >>> > dead server
>> > > > >>> >         at
>> > > > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> > > > >>> > Method)
>> > > > >>> >         at
>> > > > >>> >
>> > > > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruc
>> > > > to
>> > > > r
>> > > > >>> > AccessorImpl.java:39)
>> > > > >>> >         at
>> > > > >>> >
>> > > > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Delegating
>> > > > Co
>> > > > n
>> > > > >>> > structorAccessorImpl.java:27)
>> > > > >>> >         at
>> > > > >>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.ipc.RemoteException.instantiateException(Rem
>> > > > >>> > ot
>> > > > >>> > eExce
>> > > > >>> > ption.java:95)
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Re
>> > > > >>> > mo
>> > > > >>> > te
>> > > > >>> > Exception.java:79)
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
>> > > > >>> > rv
>> > > > >>> > erRep
>> > > > >>> > ort(HRegionServer.java:735)
>> > > > >>> >         at
>> > > > >>> >
>> > > >
>> > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer
>> > > > .j
>> > > > >>> > ava:596)
>> > > > >>> >         at java.lang.Thread.run(Thread.java:662)
>> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
>> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>> > > > >>> > rejected; currently processing
>> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
>> > > > as
>> > > > >>> > dead server
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(Serve
>> > > > >>> > rM
>> > > > >>> > ana
>> > > > >>> > ger.java:204)
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.hbase.master.ServerManager.regionServerRepor
>> > > > >>> > t(
>> > > > >>> > Serv
>> > > > >>> > erManager.java:262)
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMas
>> > > > >>> > te
>> > > > >>> > r.jav
>> > > > >>> > a:669)
>> > > > >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
>> > > > Source)
>> > > > >>> >         at
>> > > > >>> >
>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMeth
>> > > > >>> > od
>> > > > >>> > Acces
>> > > > >>> > sorImpl.java:25)
>> > > > >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> > > > >>> >         at
>> > > > >>> >
>> > > >
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
>> > > > :1
>> > > > >>> > 039)
>> > > > >>> >
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.j
>> > > > >>> > av
>> > > > >>> > a:257
>> > > > >>> > )
>> > > > >>> >         at $Proxy6.regionServerReport(Unknown Source)
>> > > > >>> >         at
>> > > > >>> >
>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
>> > > > >>> > rv
>> > > > >>> > erRep
>> > > > >>> > ort(HRegionServer.java:729)
>> > > > >>> >         ... 2 more
>> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of
>> > metrics:
>> > > > >>> > requests=66, regions=206, stores=2078, storefiles=970,
>> > > > >>> > storefileIndexSize=78, memstoreSize=796,
>> > > > >>> > compactionQueueSize=0, flushQueueSize=0, usedHeap=1672,
>> > > > >>> > maxHeap=4083, blockCacheSize=705907552,
>> > > > >>> > blockCacheFree=150412064, blockCacheCount=10648,
>> > > > >>> > blockCacheHitCount=79578618, blockCacheMissCount=3036335,
>> > > > >>> > blockCacheEvictedCount=1401352, blockCacheHitRatio=96,
>> > > > >>> > blockCacheHitCachingRatio=98
>> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED:
>> > > > >>> > Unhandled
>> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException:
>> > Server
>> > > > >>> > REPORT rejected; currently processing
>> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on
>> > > > >>> > 60020
>> > > > >>> > ---------------------------
>> > > > >>> >
>> > > > >>> > Then i restart the RegionServer and everything is back to
>> normal.
>> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i don't
>> > > > >>> > see any abnormality in the same time window.
>> > > > >>> > I think it was caused by the lost of connection to zookeeper.
>> > > > >>> > Is it
>> > > > >>> advisable to
>> > > > >>> > run zookeeper in the same machines?
>> > > > >>> > if the RegionServer lost it's connection to Zookeeper, there's
>> > > > >>> > a way
>> > > > (a
>> > > > >>> > configuration perhaps) to re-join the cluster, and not only
>> die?
>> > > > >>> >
>> > > > >>> > Any idea what is causing this?? Or to prevent it from
>> happening?
>> > > > >>> >
>> > > > >>> > Any help is appreciated.
>> > > > >>> >
>> > > > >>> > Best Regards,
>> > > > >>> >
>> > > > >>> > --
>> > > > >>> >
>> > > > >>> > *Leonardo Gamas*
>> > > > >>> > Software Engineer
>> > > > >>> > +557134943514
>> > > > >>> > +557581347440
>> > > > >>> > leogamas@jusbrasil.com.br
>> > > > >>> > www.jusbrasil.com.br
>> > > > >>>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> --
>> > > > >>
>> > > > >> *Leonardo Gamas*
>> > > > >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
>> > > > >> 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
>> > > > >>
>> > > > >>
>> > > > >
>> > > > >
>> > > > > --
>> > > > >
>> > > > > *Leonardo Gamas*
>> > > > > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
>> > > > > 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
>> > > >
>> > > >
>> > >
>> > >
>> > > --
>> > >
>> > > *Leonardo Gamas*
>> > > Software Engineer
>> > > +557134943514
>> > > +557581347440
>> > > leogamas@jusbrasil.com.br
>> > > www.jusbrasil.com.br
>> > >
>> >
>> >
>> >
>> > --
>> >
>> > *Leonardo Gamas*
>> > Software Engineer
>> > T +55 (71) 3494-3514
>> > C +55 (75) 8134-7440
>> > leogamas@jusbrasil.com.br
>> > www.jusbrasil.com.br
>>
>
>
>
> --
>
> *Leonardo Gamas*
>
> Software Engineer
> T +55 (71) 3494-3514
> C +55 (75) 8134-7440
> leogamas@jusbrasil.com.br
>
> www.jusbrasil.com.br
>
>


-- 

*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514
C +55 (75) 8134-7440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br

Re: RegionServer dying every two or three days

Posted by Leonardo Gamas <le...@jusbrasil.com.br>.
I don't think so. If Amazon had stopped the machine it would have caused a
pause of minutes, not seconds, and besides, the DataNode, TaskTracker and
Zookeeper continued to work normally.
It could be related to the shared nature of Amazon's environment, though,
maybe a spike in I/O caused by another virtualized server on the same
physical machine.

But given the instance type I'm using:

*Extra Large Instance*

15 GB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
1,690 GB instance storage
64-bit platform
I/O Performance: High
API name: m1.xlarge
I was not expecting to suffer from these problems, or at least not this much.
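
If it is a noisy-neighbor problem, one thing I can do is watch the hypervisor
steal time on the RegionServer host while the cluster is under load. A minimal
sketch, assuming the stock procps vmstat is available on the instance:

$ vmstat 5
# the rightmost "st" column is the percentage of CPU time the hypervisor
# "stole" from this guest; sustained non-zero values around the time of the
# pauses would point at contention from another tenant on the same hardware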

2012/1/5 Sandy Pratt <pr...@adobe.com>

> You think it's an Amazon problem maybe?  Like they paused or migrated your
> virtual machine, and it just happens to be during GC, leaving us to think
> the GC ran long when it didn't?  I don't have a lot of experience with
> Amazon so I don't know if that sort of thing is common.
>
> > -----Original Message-----
> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > Sent: Thursday, January 05, 2012 13:15
> > To: user@hbase.apache.org
> > Subject: Re: RegionServer dying every two or three days
> >
> > I checked the CPU Utilization graphics provided by Amazon (it's not
> accurate,
> > since the sample time is about 5 minutes) and don't see any abnormality.
> I
> > will setup TSDB with Nagios to have a more reliable source of performance
> > data.
> >
> > The machines don't have swap space, if i run:
> >
> > $ swapon -s
> >
> > To display swap usage summary, it returns an empty list.
> >
> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my to tests.
> >
> > I don't have payed much attention to the value of the new size param.
> >
> > Thanks again for the help!!
> >
> > 2012/1/5 Sandy Pratt <pr...@adobe.com>
> >
> > > That size heap doesn't seem like it should cause a 36 second GC (a
> > > minor GC even if I remember your logs correctly), so I tend to think
> > > that other things are probably going on.
> > >
> > > This line here:
> > >
> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840
> > > secs]
> > > 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05
> > > sys=0.01,
> > > real=36.96 secs]
> > >
> > > is really mysterious to me.  It seems to indicate that the process was
> > > blocked for almost 37 seconds during a minor collection.  Note the CPU
> > > times are very low but the wall time is very high.  If it was actually
> > > doing GC work, I'd expect to see user time higher than real time, as
> > > it is in other parallel collections (see your log snippet).  Were you
> > > really so CPU starved that it took 37 seconds to get in 50ms of work?
> > > I can't make sense of that.  I'm trying to think of something that
> > > would block you for that long while all your threads are stopped for
> > > GC, other than being in swap, but I can't come up with anything.
>  You're
> > certain you're not in swap?
> > >
> > > Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts while
> > > you troubleshoot?
> > >
> > > Why is your new size so small?  This generally means that relatively
> > > more objects are being tenured than would be with a larger new size.
> > > This could make collections of the old gen worse (GC time is said to
> > > be proportional to the number of live objects in the generation, and
> > > CMS does indeed cause STW pauses).  A typical new to tenured ratio
> > > might be 1:3.  Were the new gen GCs taking too long?  This is probably
> > > orthogonal to your immediate issue, though.
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > Sent: Thursday, January 05, 2012 5:33 AM
> > > To: user@hbase.apache.org
> > > Subject: Re: RegionServer dying every two or three days
> > >
> > >  St.Ack,
> > >
> > > I don't have made any attempt in GC tunning, yet.
> > > I will read the perf section as suggested.
> > > I'm currently using Nagios + JMX to monitor the cluster, but it's
> > > currently used for alert only, the perfdata is not been stored, so
> > > it's kind of useless right now, but i was thinking in use TSDB to
> > > store it, any known case of integration?
> > > ---
> > >
> > > Sandy,
> > >
> > > Yes, my timeout is 30 seconds:
> > >
> > > <property>
> > >   <name>zookeeper.session.timeout</name>
> > >   <value>30000</value>
> > > </property>
> > >
> > > To our application it's a sufferable time to wait in case a
> > > RegionServer go offline.
> > >
> > > My heap is 4GB and my JVM params are:
> > >
> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> > >
> > > I will try the -XX:+UseParallelOldGC param and post my feedback here.
> > > ---
> > >
> > > Ramkrishna,
> > >
> > > Seems the GC is the root of all evil in this case.
> > > ----
> > >
> > > Thank you all for the answers. I will try out these valuable advices
> > > given here and post my results.
> > >
> > > Leo Gamas.
> > >
> > > 2012/1/5 Ramkrishna S Vasudevan <ra...@huawei.com>
> > >
> > > > Recently we faced a similar problem and it was due to GC config.
> > > > Pls check your GC.
> > > >
> > > > Regards
> > > > Ram
> > > >
> > > > -----Original Message-----
> > > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > > > Stack
> > > > Sent: Thursday, January 05, 2012 2:50 AM
> > > > To: user@hbase.apache.org
> > > > Subject: Re: RegionServer dying every two or three days
> > > >
> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> > > > <le...@jusbrasil.com.br> wrote:
> > > > > The third line took 36.96 seconds to execute, can this be causing
> > > > > this problem?
> > > > >
> > > >
> > > > Probably.  Have you made any attempt at GC tuning?
> > > >
> > > >
> > > > > Reading the code a little it seems that, even if it's disabled, if
> > > > > all files are target in a compaction, it's considered a major
> > > > > compaction. Is
> > > > it
> > > > > right?
> > > > >
> > > >
> > > > That is right.  They get 'upgraded' from minor to major.
> > > >
> > > > This should be fine though.  What you are avoiding setting major
> > > > compactions to 0 is all regions being major compacted on a period, a
> > > > heavy weight effective rewrite of all your data (unless already major
> > > > compacted).   It looks like you have this disabled which is good
> until
> > > > you've wrestled your cluster into submission.
> > > >
> > > >
> > > > > The machines don't have swap, so the swappiness parameter don't
> > > > > seem to apply here. Any other suggestion?
> > > > >
> > > >
> > > > See the perf section of the hbase manual.  It has our current list.
> > > >
> > > > Are you monitoring your cluster w/ ganglia or tsdb?
> > > >
> > > >
> > > > St.Ack
> > > >
> > > > > Thanks.
> > > > >
> > > > > 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
> > > > >
> > > > >> I will investigate this, thanks for the response.
> > > > >>
> > > > >>
> > > > >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
> > > > >>
> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> > > > >>> timed out, have not heard from server in 61103ms for sessionid
> > > > >>> 0x23462a4cf93a8fc, closing socket connection and attempting
> > > > >>> reconnect
> > > > >>>
> > > > >>> It looks like the process has been unresponsive for some time,
> > > > >>> so ZK
> > > > has
> > > > >>> terminated the session.  Did you experience a long GC pause
> > > > >>> right
> > > > before
> > > > >>> this?  If you don't have GC logging enabled for the RS, you can
> > > > sometimes
> > > > >>> tell by noticing a gap in the timestamps of the log statements
> > > > >>> leading
> > > > up
> > > > >>> to the crash.
> > > > >>>
> > > > >>> If it turns out to be GC, you might want to look at your kernel
> > > > >>> swappiness setting (set it to 0) and your JVM params.
> > > > >>>
> > > > >>> Sandy
> > > > >>>
> > > > >>>
> > > > >>> > -----Original Message-----
> > > > >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> > > > >>> > To: user@hbase.apache.org
> > > > >>> > Subject: RegionServer dying every two or three days
> > > > >>> >
> > > > >>> > Hi,
> > > > >>> >
> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master +
> > > > >>> > 3
> > > > >>> Slaves),
> > > > >>> > running on Amazon EC2. The master is a High-Memory Extra Large
> > > > Instance
> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper.
> > > > >>> > The slaves are Extra Large Instances (m1.xlarge) running
> > > > >>> > Datanode,
> > > > >>> TaskTracker,
> > > > >>> > RegionServer and Zookeeper.
> > > > >>> >
> > > > >>> > From time to time, every two or three days, one of the
> > > > >>> > RegionServers processes goes down, but the other processes
> > > > >>> > (DataNode, TaskTracker,
> > > > >>> > Zookeeper) continue normally.
> > > > >>> >
> > > > >>> > Reading the logs:
> > > > >>> >
> > > > >>> > The connection with Zookeeper timed out:
> > > > >>> >
> > > > >>> > ---------------------------
> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> > > > >>> > timed
> > > > out,
> > > > >>> have
> > > > >>> > not heard from server in 61103ms for sessionid
> > > > >>> > 0x23462a4cf93a8fc,
> > > > >>> closing
> > > > >>> > socket connection and attempting reconnect
> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> > > > >>> > timed
> > > > out,
> > > > >>> have
> > > > >>> > not heard from server in 61205ms for sessionid
> > > > >>> > 0x346c561a55953e,
> > > > closing
> > > > >>> > socket connection and attempting reconnect
> > > > >>> > ---------------------------
> > > > >>> >
> > > > >>> > And the Handlers start to fail:
> > > > >>> >
> > > > >>> > ---------------------------
> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder,
> > > > >>> > call
> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf)
> > > > >>> > from
> > > > >>> > xx.xx.xx.xx:xxxx: output error
> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81
> > > > >>> > on
> > > > 60020
> > > > >>> > caught: java.nio.channels.ClosedChannelException
> > > > >>> >         at
> > > > >>> >
> > > >
> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> > > > 13
> > > > >>> > 3)
> > > > >>> >         at
> > > > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > >>> >         at
> > > > >>> >
> > > >
> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > > > >>> > 1341)
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> > > > >>> > ns
> > > > >>> > e(HB
> > > > >>> > aseServer.java:727)
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> > > > >>> > as
> > > > >>> > eSe
> > > > >>> > rver.java:792)
> > > > >>> >         at
> > > > >>> >
> > > >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > :1
> > > > >>> > 083)
> > > > >>> >
> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder,
> > > > >>> > call
> > > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430)
> > > > >>> > from
> > > > >>> > xx.xx.xx.xx:xxxx: output error
> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62
> > > > >>> > on
> > > > 60020
> > > > >>> > caught: java.nio.channels.ClosedChannelException
> > > > >>> >         at
> > > > >>> >
> > > >
> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> > > > 13
> > > > >>> > 3)
> > > > >>> >         at
> > > > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > >>> >         at
> > > > >>> >
> > > >
> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > > > >>> > 1341)
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> > > > >>> > ns
> > > > >>> > e(HB
> > > > >>> > aseServer.java:727)
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> > > > >>> > as
> > > > >>> > eSe
> > > > >>> > rver.java:792)
> > > > >>> >         at
> > > > >>> >
> > > >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > :1
> > > > >>> > 083)
> > > > >>> > ---------------------------
> > > > >>> >
> > > > >>> > And finally the server throws a YouAreDeadException :( :
> > > > >>> >
> > > > >>> > ---------------------------
> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> > > > connection
> > > > >>> to
> > > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > > > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> > > > initiating
> > > > >>> session
> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to
> > > > >>> > reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has
> > > > >>> > expired, closing
> > > > socket
> > > > >>> > connection
> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> > > > connection
> > > > >>> to
> > > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > > > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> > > > initiating
> > > > >>> session
> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to
> > > > >>> > reconnect to ZooKeeper service, session 0x346c561a55953e has
> > > > >>> > expired, closing
> > > > socket
> > > > >>> > connection
> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING
> > > > >>> > region server
> > > > >>> > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> > > > >>> > load=(requests=447, regions=206, usedHeap=1584,
> > maxHeap=4083):
> > > > >>> > Unhandled
> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException:
> > Server
> > > > >>> > REPORT rejected; currently processing
> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > > > >>> > rejected; currently processing
> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > > as
> > > > >>> > dead server
> > > > >>> >         at
> > > > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > > >>> > Method)
> > > > >>> >         at
> > > > >>> >
> > > > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruc
> > > > to
> > > > r
> > > > >>> > AccessorImpl.java:39)
> > > > >>> >         at
> > > > >>> >
> > > > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Delegating
> > > > Co
> > > > n
> > > > >>> > structorAccessorImpl.java:27)
> > > > >>> >         at
> > > > >>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.ipc.RemoteException.instantiateException(Rem
> > > > >>> > ot
> > > > >>> > eExce
> > > > >>> > ption.java:95)
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Re
> > > > >>> > mo
> > > > >>> > te
> > > > >>> > Exception.java:79)
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > > > >>> > rv
> > > > >>> > erRep
> > > > >>> > ort(HRegionServer.java:735)
> > > > >>> >         at
> > > > >>> >
> > > >
> > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer
> > > > .j
> > > > >>> > ava:596)
> > > > >>> >         at java.lang.Thread.run(Thread.java:662)
> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > > > >>> > rejected; currently processing
> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > > as
> > > > >>> > dead server
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(Serve
> > > > >>> > rM
> > > > >>> > ana
> > > > >>> > ger.java:204)
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.hbase.master.ServerManager.regionServerRepor
> > > > >>> > t(
> > > > >>> > Serv
> > > > >>> > erManager.java:262)
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMas
> > > > >>> > te
> > > > >>> > r.jav
> > > > >>> > a:669)
> > > > >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> > > > Source)
> > > > >>> >         at
> > > > >>> >
> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMeth
> > > > >>> > od
> > > > >>> > Acces
> > > > >>> > sorImpl.java:25)
> > > > >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > > > >>> >         at
> > > > >>> >
> > > >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > > :1
> > > > >>> > 039)
> > > > >>> >
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.j
> > > > >>> > av
> > > > >>> > a:257
> > > > >>> > )
> > > > >>> >         at $Proxy6.regionServerReport(Unknown Source)
> > > > >>> >         at
> > > > >>> >
> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > > > >>> > rv
> > > > >>> > erRep
> > > > >>> > ort(HRegionServer.java:729)
> > > > >>> >         ... 2 more
> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of
> > metrics:
> > > > >>> > requests=66, regions=206, stores=2078, storefiles=970,
> > > > >>> > storefileIndexSize=78, memstoreSize=796,
> > > > >>> > compactionQueueSize=0, flushQueueSize=0, usedHeap=1672,
> > > > >>> > maxHeap=4083, blockCacheSize=705907552,
> > > > >>> > blockCacheFree=150412064, blockCacheCount=10648,
> > > > >>> > blockCacheHitCount=79578618, blockCacheMissCount=3036335,
> > > > >>> > blockCacheEvictedCount=1401352, blockCacheHitRatio=96,
> > > > >>> > blockCacheHitCachingRatio=98
> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED:
> > > > >>> > Unhandled
> > > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException:
> > Server
> > > > >>> > REPORT rejected; currently processing
> > > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on
> > > > >>> > 60020
> > > > >>> > ---------------------------
> > > > >>> >
> > > > >>> > Then i restart the RegionServer and everything is back to
> normal.
> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i don't
> > > > >>> > see any abnormality in the same time window.
> > > > >>> > I think it was caused by the lost of connection to zookeeper.
> > > > >>> > Is it
> > > > >>> advisable to
> > > > >>> > run zookeeper in the same machines?
> > > > >>> > if the RegionServer lost it's connection to Zookeeper, there's
> > > > >>> > a way
> > > > (a
> > > > >>> > configuration perhaps) to re-join the cluster, and not only
> die?
> > > > >>> >
> > > > >>> > Any idea what is causing this?? Or to prevent it from
> happening?
> > > > >>> >
> > > > >>> > Any help is appreciated.
> > > > >>> >
> > > > >>> > Best Regards,
> > > > >>> >
> > > > >>> > --
> > > > >>> >
> > > > >>> > *Leonardo Gamas*
> > > > >>> > Software Engineer
> > > > >>> > +557134943514
> > > > >>> > +557581347440
> > > > >>> > leogamas@jusbrasil.com.br
> > > > >>> > www.jusbrasil.com.br
> > > > >>>
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >>
> > > > >> *Leonardo Gamas*
> > > > >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> > > > >> 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> > > > >>
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > *Leonardo Gamas*
> > > > > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> > > > > 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> > > >
> > > >
> > >
> > >
> > > --
> > >
> > > *Leonardo Gamas*
> > > Software Engineer
> > > +557134943514
> > > +557581347440
> > > leogamas@jusbrasil.com.br
> > > www.jusbrasil.com.br
> > >
> >
> >
> >
> > --
> >
> > *Leonardo Gamas*
> > Software Engineer
> > T +55 (71) 3494-3514
> > C +55 (75) 8134-7440
> > leogamas@jusbrasil.com.br
> > www.jusbrasil.com.br
>



-- 

*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514
C +55 (75) 8134-7440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br

RE: RegionServer dying every two or three days

Posted by Sandy Pratt <pr...@adobe.com>.
You think it's an Amazon problem maybe?  Like they paused or migrated your virtual machine, and it just happens to be during GC, leaving us to think the GC ran long when it didn't?  I don't have a lot of experience with Amazon so I don't know if that sort of thing is common.

> -----Original Message-----
> From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> Sent: Thursday, January 05, 2012 13:15
> To: user@hbase.apache.org
> Subject: Re: RegionServer dying every two or three days
>
> I checked the CPU Utilization graphics provided by Amazon (it's not accurate,
> since the sample time is about 5 minutes) and don't see any abnormality. I
> will setup TSDB with Nagios to have a more reliable source of performance
> data.
>
> The machines don't have swap space, if i run:
>
> $ swapon -s
>
> To display swap usage summary, it returns an empty list.
>
> I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my to tests.
>
> I don't have payed much attention to the value of the new size param.
>
> Thanks again for the help!!
>
> 2012/1/5 Sandy Pratt <pr...@adobe.com>
>
> > That size heap doesn't seem like it should cause a 36 second GC (a
> > minor GC even if I remember your logs correctly), so I tend to think
> > that other things are probably going on.
> >
> > This line here:
> >
> > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840
> > secs]
> > 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05
> > sys=0.01,
> > real=36.96 secs]
> >
> > is really mysterious to me.  It seems to indicate that the process was
> > blocked for almost 37 seconds during a minor collection.  Note the CPU
> > times are very low but the wall time is very high.  If it was actually
> > doing GC work, I'd expect to see user time higher than real time, as
> > it is in other parallel collections (see your log snippet).  Were you
> > really so CPU starved that it took 37 seconds to get in 50ms of work?
> > I can't make sense of that.  I'm trying to think of something that
> > would block you for that long while all your threads are stopped for
> > GC, other than being in swap, but I can't come up with anything.  You're
> certain you're not in swap?
> >
> > Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts while
> > you troubleshoot?
> >
> > Why is your new size so small?  This generally means that relatively
> > more objects are being tenured than would be with a larger new size.
> > This could make collections of the old gen worse (GC time is said to
> > be proportional to the number of live objects in the generation, and
> > CMS does indeed cause STW pauses).  A typical new to tenured ratio
> > might be 1:3.  Were the new gen GCs taking too long?  This is probably
> > orthogonal to your immediate issue, though.
> >
> >
> >
> > -----Original Message-----
> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > Sent: Thursday, January 05, 2012 5:33 AM
> > To: user@hbase.apache.org
> > Subject: Re: RegionServer dying every two or three days
> >
> >  St.Ack,
> >
> > I don't have made any attempt in GC tunning, yet.
> > I will read the perf section as suggested.
> > I'm currently using Nagios + JMX to monitor the cluster, but it's
> > currently used for alert only, the perfdata is not been stored, so
> > it's kind of useless right now, but i was thinking in use TSDB to
> > store it, any known case of integration?
> > ---
> >
> > Sandy,
> >
> > Yes, my timeout is 30 seconds:
> >
> > <property>
> >   <name>zookeeper.session.timeout</name>
> >   <value>30000</value>
> > </property>
> >
> > To our application it's a sufferable time to wait in case a
> > RegionServer go offline.
> >
> > My heap is 4GB and my JVM params are:
> >
> > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> >
> > I will try the -XX:+UseParallelOldGC param and post my feedback here.
> > ---
> >
> > Ramkrishna,
> >
> > Seems the GC is the root of all evil in this case.
> > ----
> >
> > Thank you all for the answers. I will try out these valuable advices
> > given here and post my results.
> >
> > Leo Gamas.
> >
> > 2012/1/5 Ramkrishna S Vasudevan <ra...@huawei.com>
> >
> > > Recently we faced a similar problem and it was due to GC config.
> > > Pls check your GC.
> > >
> > > Regards
> > > Ram
> > >
> > > -----Original Message-----
> > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > > Stack
> > > Sent: Thursday, January 05, 2012 2:50 AM
> > > To: user@hbase.apache.org
> > > Subject: Re: RegionServer dying every two or three days
> > >
> > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> > > <le...@jusbrasil.com.br> wrote:
> > > > The third line took 36.96 seconds to execute, can this be causing
> > > > this problem?
> > > >
> > >
> > > Probably.  Have you made any attempt at GC tuning?
> > >
> > >
> > > > Reading the code a little it seems that, even if it's disabled, if
> > > > all files are target in a compaction, it's considered a major
> > > > compaction. Is
> > > it
> > > > right?
> > > >
> > >
> > > That is right.  They get 'upgraded' from minor to major.
> > >
> > > This should be fine though.  What you are avoiding setting major
> > > compactions to 0 is all regions being major compacted on a period, a
> > > heavy weight effective rewrite of all your data (unless already major
> > > compacted).   It looks like you have this disabled which is good until
> > > you've wrestled your cluster into submission.
> > >
> > >
> > > > The machines don't have swap, so the swappiness parameter don't
> > > > seem to apply here. Any other suggestion?
> > > >
> > >
> > > See the perf section of the hbase manual.  It has our current list.
> > >
> > > Are you monitoring your cluster w/ ganglia or tsdb?
> > >
> > >
> > > St.Ack
> > >
> > > > Thanks.
> > > >
> > > > 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
> > > >
> > > >> I will investigate this, thanks for the response.
> > > >>
> > > >>
> > > >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
> > > >>
> > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> > > >>> timed out, have not heard from server in 61103ms for sessionid
> > > >>> 0x23462a4cf93a8fc, closing socket connection and attempting
> > > >>> reconnect
> > > >>>
> > > >>> It looks like the process has been unresponsive for some time,
> > > >>> so ZK
> > > has
> > > >>> terminated the session.  Did you experience a long GC pause
> > > >>> right
> > > before
> > > >>> this?  If you don't have GC logging enabled for the RS, you can
> > > sometimes
> > > >>> tell by noticing a gap in the timestamps of the log statements
> > > >>> leading
> > > up
> > > >>> to the crash.
> > > >>>
> > > >>> If it turns out to be GC, you might want to look at your kernel
> > > >>> swappiness setting (set it to 0) and your JVM params.
> > > >>>
> > > >>> Sandy
> > > >>>
> > > >>>
> > > >>> > -----Original Message-----
> > > >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > >>> > Sent: Thursday, December 29, 2011 07:44
> > > >>> > To: user@hbase.apache.org
> > > >>> > Subject: RegionServer dying every two or three days
> > > >>> >
> > > >>> > Hi,
> > > >>> >
> > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master +
> > > >>> > 3
> > > >>> Slaves),
> > > >>> > running on Amazon EC2. The master is a High-Memory Extra Large
> > > Instance
> > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper.
> > > >>> > The slaves are Extra Large Instances (m1.xlarge) running
> > > >>> > Datanode,
> > > >>> TaskTracker,
> > > >>> > RegionServer and Zookeeper.
> > > >>> >
> > > >>> > From time to time, every two or three days, one of the
> > > >>> > RegionServers processes goes down, but the other processes
> > > >>> > (DataNode, TaskTracker,
> > > >>> > Zookeeper) continue normally.
> > > >>> >
> > > >>> > Reading the logs:
> > > >>> >
> > > >>> > The connection with Zookeeper timed out:
> > > >>> >
> > > >>> > ---------------------------
> > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> > > >>> > timed
> > > out,
> > > >>> have
> > > >>> > not heard from server in 61103ms for sessionid
> > > >>> > 0x23462a4cf93a8fc,
> > > >>> closing
> > > >>> > socket connection and attempting reconnect
> > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> > > >>> > timed
> > > out,
> > > >>> have
> > > >>> > not heard from server in 61205ms for sessionid
> > > >>> > 0x346c561a55953e,
> > > closing
> > > >>> > socket connection and attempting reconnect
> > > >>> > ---------------------------
> > > >>> >
> > > >>> > And the Handlers start to fail:
> > > >>> >
> > > >>> > ---------------------------
> > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder,
> > > >>> > call
> > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf)
> > > >>> > from
> > > >>> > xx.xx.xx.xx:xxxx: output error
> > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81
> > > >>> > on
> > > 60020
> > > >>> > caught: java.nio.channels.ClosedChannelException
> > > >>> >         at
> > > >>> >
> > >
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> > > 13
> > > >>> > 3)
> > > >>> >         at
> > > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > >>> >         at
> > > >>> >
> > >
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > > >>> > 1341)
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> > > >>> > ns
> > > >>> > e(HB
> > > >>> > aseServer.java:727)
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> > > >>> > as
> > > >>> > eSe
> > > >>> > rver.java:792)
> > > >>> >         at
> > > >>> >
> > >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > :1
> > > >>> > 083)
> > > >>> >
> > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder,
> > > >>> > call
> > > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430)
> > > >>> > from
> > > >>> > xx.xx.xx.xx:xxxx: output error
> > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62
> > > >>> > on
> > > 60020
> > > >>> > caught: java.nio.channels.ClosedChannelException
> > > >>> >         at
> > > >>> >
> > >
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:
> > > 13
> > > >>> > 3)
> > > >>> >         at
> > > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > >>> >         at
> > > >>> >
> > >
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > > >>> > 1341)
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespo
> > > >>> > ns
> > > >>> > e(HB
> > > >>> > aseServer.java:727)
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HB
> > > >>> > as
> > > >>> > eSe
> > > >>> > rver.java:792)
> > > >>> >         at
> > > >>> >
> > >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > :1
> > > >>> > 083)
> > > >>> > ---------------------------
> > > >>> >
> > > >>> > And finally the server throws a YouAreDeadException :( :
> > > >>> >
> > > >>> > ---------------------------
> > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> > > connection
> > > >>> to
> > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> > > initiating
> > > >>> session
> > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to
> > > >>> > reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has
> > > >>> > expired, closing
> > > socket
> > > >>> > connection
> > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> > > connection
> > > >>> to
> > > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> > > initiating
> > > >>> session
> > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to
> > > >>> > reconnect to ZooKeeper service, session 0x346c561a55953e has
> > > >>> > expired, closing
> > > socket
> > > >>> > connection
> > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING
> > > >>> > region server
> > > >>> > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> > > >>> > load=(requests=447, regions=206, usedHeap=1584,
> maxHeap=4083):
> > > >>> > Unhandled
> > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException:
> Server
> > > >>> > REPORT rejected; currently processing
> > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > > >>> > rejected; currently processing
> > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > as
> > > >>> > dead server
> > > >>> >         at
> > > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > >>> > Method)
> > > >>> >         at
> > > >>> >
> > > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstruc
> > > to
> > > r
> > > >>> > AccessorImpl.java:39)
> > > >>> >         at
> > > >>> >
> > > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Delegating
> > > Co
> > > n
> > > >>> > structorAccessorImpl.java:27)
> > > >>> >         at
> > > >>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.ipc.RemoteException.instantiateException(Rem
> > > >>> > ot
> > > >>> > eExce
> > > >>> > ption.java:95)
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Re
> > > >>> > mo
> > > >>> > te
> > > >>> > Exception.java:79)
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > > >>> > rv
> > > >>> > erRep
> > > >>> > ort(HRegionServer.java:735)
> > > >>> >         at
> > > >>> >
> > >
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer
> > > .j
> > > >>> > ava:596)
> > > >>> >         at java.lang.Thread.run(Thread.java:662)
> > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
> > > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > > >>> > rejected; currently processing
> > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > > as
> > > >>> > dead server
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(Serve
> > > >>> > rM
> > > >>> > ana
> > > >>> > ger.java:204)
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.hbase.master.ServerManager.regionServerRepor
> > > >>> > t(
> > > >>> > Serv
> > > >>> > erManager.java:262)
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMas
> > > >>> > te
> > > >>> > r.jav
> > > >>> > a:669)
> > > >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> > > Source)
> > > >>> >         at
> > > >>> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMeth
> > > >>> > od
> > > >>> > Acces
> > > >>> > sorImpl.java:25)
> > > >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > > >>> >         at
> > > >>> >
> > >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java
> > > :1
> > > >>> > 039)
> > > >>> >
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.j
> > > >>> > av
> > > >>> > a:257
> > > >>> > )
> > > >>> >         at $Proxy6.regionServerReport(Unknown Source)
> > > >>> >         at
> > > >>> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionSe
> > > >>> > rv
> > > >>> > erRep
> > > >>> > ort(HRegionServer.java:729)
> > > >>> >         ... 2 more
> > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of
> metrics:
> > > >>> > requests=66, regions=206, stores=2078, storefiles=970,
> > > >>> > storefileIndexSize=78, memstoreSize=796,
> > > >>> > compactionQueueSize=0, flushQueueSize=0, usedHeap=1672,
> > > >>> > maxHeap=4083, blockCacheSize=705907552,
> > > >>> > blockCacheFree=150412064, blockCacheCount=10648,
> > > >>> > blockCacheHitCount=79578618, blockCacheMissCount=3036335,
> > > >>> > blockCacheEvictedCount=1401352, blockCacheHitRatio=96,
> > > >>> > blockCacheHitCachingRatio=98
> > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED:
> > > >>> > Unhandled
> > > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException:
> Server
> > > >>> > REPORT rejected; currently processing
> > > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on
> > > >>> > 60020
> > > >>> > ---------------------------
> > > >>> >
> > > >>> > Then i restart the RegionServer and everything is back to normal.
> > > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i don't
> > > >>> > see any abnormality in the same time window.
> > > >>> > I think it was caused by the lost of connection to zookeeper.
> > > >>> > Is it
> > > >>> advisable to
> > > >>> > run zookeeper in the same machines?
> > > >>> > if the RegionServer lost it's connection to Zookeeper, there's
> > > >>> > a way
> > > (a
> > > >>> > configuration perhaps) to re-join the cluster, and not only die?
> > > >>> >
> > > >>> > Any idea what is causing this?? Or to prevent it from happening?
> > > >>> >
> > > >>> > Any help is appreciated.
> > > >>> >
> > > >>> > Best Regards,
> > > >>> >
> > > >>> > --
> > > >>> >
> > > >>> > *Leonardo Gamas*
> > > >>> > Software Engineer
> > > >>> > +557134943514
> > > >>> > +557581347440
> > > >>> > leogamas@jusbrasil.com.br
> > > >>> > www.jusbrasil.com.br
> > > >>>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >>
> > > >> *Leonardo Gamas*
> > > >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> > > >> 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> > > >>
> > > >>
> > > >
> > > >
> > > > --
> > > >
> > > > *Leonardo Gamas*
> > > > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> > > > 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> > >
> > >
> >
> >
> > --
> >
> > *Leonardo Gamas*
> > Software Engineer
> > +557134943514
> > +557581347440
> > leogamas@jusbrasil.com.br
> > www.jusbrasil.com.br
> >
>
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer
> T +55 (71) 3494-3514
> C +55 (75) 8134-7440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br

Re: RegionServer dying every two or three days

Posted by Leonardo Gamas <le...@jusbrasil.com.br>.
I checked the CPU utilization graphs provided by Amazon (they're not very
accurate, since the sample interval is about 5 minutes) and didn't see any
abnormality. I will set up TSDB with Nagios to have a more reliable source
of performance data.

The machines don't have swap space. If I run:

$ swapon -s

to display the swap usage summary, it returns an empty list.
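
For reference, I can also double-check the kernel swappiness setting Sandy
mentioned; just a sketch, assuming the usual Linux sysctl interface:

$ sysctl vm.swappiness            # show the current value
$ sudo sysctl -w vm.swappiness=0  # set it to 0 at runtime
# (adding vm.swappiness=0 to /etc/sysctl.conf makes it persist across reboots)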

I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my next tests.

I haven't paid much attention to the value of the new size param.

Thanks again for the help!!

2012/1/5 Sandy Pratt <pr...@adobe.com>

> That size heap doesn't seem like it should cause a 36 second GC (a minor
> GC even if I remember your logs correctly), so I tend to think that other
> things are probably going on.
>
> This line here:
>
> 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
> 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01,
> real=36.96 secs]
>
> is really mysterious to me.  It seems to indicate that the process was
> blocked for almost 37 seconds during a minor collection.  Note the CPU
> times are very low but the wall time is very high.  If it was actually
> doing GC work, I'd expect to see user time higher than real time, as it is
> in other parallel collections (see your log snippet).  Were you really so
> CPU starved that it took 37 seconds to get in 50ms of work?  I can't make
> sense of that.  I'm trying to think of something that would block you for
> that long while all your threads are stopped for GC, other than being in
> swap, but I can't come up with anything.  You're certain you're not in swap?
>
> Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts while you
> troubleshoot?
>
> Why is your new size so small?  This generally means that relatively more
> objects are being tenured than would be with a larger new size.  This could
> make collections of the old gen worse (GC time is said to be proportional
> to the number of live objects in the generation, and CMS does indeed cause
> STW pauses).  A typical new to tenured ratio might be 1:3.  Were the new
> gen GCs taking too long?  This is probably orthogonal to your immediate
> issue, though.
>
>
>
> -----Original Message-----
> From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> Sent: Thursday, January 05, 2012 5:33 AM
> To: user@hbase.apache.org
> Subject: Re: RegionServer dying every two or three days
>
>  St.Ack,
>
> I don't have made any attempt in GC tunning, yet.
> I will read the perf section as suggested.
> I'm currently using Nagios + JMX to monitor the cluster, but it's
> currently used for alert only, the perfdata is not been stored, so it's
> kind of useless right now, but i was thinking in use TSDB to store it, any
> known case of integration?
> ---
>
> Sandy,
>
> Yes, my timeout is 30 seconds:
>
> <property>
>   <name>zookeeper.session.timeout</name>
>   <value>30000</value>
> </property>
>
> To our application it's a sufferable time to wait in case a RegionServer
> go offline.
>
> My heap is 4GB and my JVM params are:
>
> -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m -XX:MaxNewSize=128m
> -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
>
> I will try the -XX:+UseParallelOldGC param and post my feedback here.
> ---
>
> Ramkrishna,
>
> Seems the GC is the root of all evil in this case.
> ----
>
> Thank you all for the answers. I will try out these valuable advices given
> here and post my results.
>
> Leo Gamas.
>
> 2012/1/5 Ramkrishna S Vasudevan <ra...@huawei.com>
>
> > Recently we faced a similar problem and it was due to GC config.  Pls
> > check your GC.
> >
> > Regards
> > Ram
> >
> > -----Original Message-----
> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> > Stack
> > Sent: Thursday, January 05, 2012 2:50 AM
> > To: user@hbase.apache.org
> > Subject: Re: RegionServer dying every two or three days
> >
> > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> > <le...@jusbrasil.com.br> wrote:
> > > The third line took 36.96 seconds to execute, can this be causing
> > > this problem?
> > >
> >
> > Probably.  Have you made any attempt at GC tuning?
> >
> >
> > > Reading the code a little it seems that, even if it's disabled, if
> > > all files are target in a compaction, it's considered a major
> > > compaction. Is
> > it
> > > right?
> > >
> >
> > That is right.  They get 'upgraded' from minor to major.
> >
> > This should be fine though.  What you are avoiding setting major
> > compactions to 0 is all regions being major compacted on a period, a
> > heavy weight effective rewrite of all your data (unless already major
> > compacted).   It looks like you have this disabled which is good until
> > you've wrestled your cluster into submission.
> >
> >
> > > The machines don't have swap, so the swappiness parameter don't seem
> > > to apply here. Any other suggestion?
> > >
> >
> > See the perf section of the hbase manual.  It has our current list.
> >
> > Are you monitoring your cluster w/ ganglia or tsdb?
> >
> >
> > St.Ack
> >
> > > Thanks.
> > >
> > > 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
> > >
> > >> I will investigate this, thanks for the response.
> > >>
> > >>
> > >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
> > >>
> > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed
> > >>> out, have not heard from server in 61103ms for sessionid
> > >>> 0x23462a4cf93a8fc, closing socket connection and attempting
> > >>> reconnect
> > >>>
> > >>> It looks like the process has been unresponsive for some time, so
> > >>> ZK
> > has
> > >>> terminated the session.  Did you experience a long GC pause right
> > before
> > >>> this?  If you don't have GC logging enabled for the RS, you can
> > sometimes
> > >>> tell by noticing a gap in the timestamps of the log statements
> > >>> leading
> > up
> > >>> to the crash.
> > >>>
> > >>> If it turns out to be GC, you might want to look at your kernel
> > >>> swappiness setting (set it to 0) and your JVM params.
> > >>>
> > >>> Sandy
> > >>>
> > >>>
> > >>> > -----Original Message-----
> > >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > >>> > Sent: Thursday, December 29, 2011 07:44
> > >>> > To: user@hbase.apache.org
> > >>> > Subject: RegionServer dying every two or three days
> > >>> >
> > >>> > Hi,
> > >>> >
> > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
> > >>> Slaves),
> > >>> > running on Amazon EC2. The master is a High-Memory Extra Large
> > Instance
> > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper.
> > >>> > The slaves are Extra Large Instances (m1.xlarge) running
> > >>> > Datanode,
> > >>> TaskTracker,
> > >>> > RegionServer and Zookeeper.
> > >>> >
> > >>> > From time to time, every two or three days, one of the
> > >>> > RegionServers processes goes down, but the other processes
> > >>> > (DataNode, TaskTracker,
> > >>> > Zookeeper) continue normally.
> > >>> >
> > >>> > Reading the logs:
> > >>> >
> > >>> > The connection with Zookeeper timed out:
> > >>> >
> > >>> > ---------------------------
> > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> > >>> > timed
> > out,
> > >>> have
> > >>> > not heard from server in 61103ms for sessionid
> > >>> > 0x23462a4cf93a8fc,
> > >>> closing
> > >>> > socket connection and attempting reconnect
> > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> > >>> > timed
> > out,
> > >>> have
> > >>> > not heard from server in 61205ms for sessionid 0x346c561a55953e,
> > closing
> > >>> > socket connection and attempting reconnect
> > >>> > ---------------------------
> > >>> >
> > >>> > And the Handlers start to fail:
> > >>> >
> > >>> > ---------------------------
> > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder,
> > >>> > call
> > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
> > >>> > xx.xx.xx.xx:xxxx: output error
> > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on
> > 60020
> > >>> > caught: java.nio.channels.ClosedChannelException
> > >>> >         at
> > >>> >
> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
> > >>> > 3)
> > >>> >         at
> > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > >>> >         at
> > >>> >
> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > >>> > 1341)
> > >>> >         at
> > >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespons
> > >>> > e(HB
> > >>> > aseServer.java:727)
> > >>> >         at
> > >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBas
> > >>> > eSe
> > >>> > rver.java:792)
> > >>> >         at
> > >>> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
> > >>> > 083)
> > >>> >
> > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder,
> > >>> > call
> > >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
> > >>> > xx.xx.xx.xx:xxxx: output error
> > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on
> > 60020
> > >>> > caught: java.nio.channels.ClosedChannelException
> > >>> >         at
> > >>> >
> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
> > >>> > 3)
> > >>> >         at
> > >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > >>> >         at
> > >>> >
> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > >>> > 1341)
> > >>> >         at
> > >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespons
> > >>> > e(HB
> > >>> > aseServer.java:727)
> > >>> >         at
> > >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBas
> > >>> > eSe
> > >>> > rver.java:792)
> > >>> >         at
> > >>> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
> > >>> > 083)
> > >>> > ---------------------------
> > >>> >
> > >>> > And finally the server throws a YouAreDeadException :( :
> > >>> >
> > >>> > ---------------------------
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> > connection
> > >>> to
> > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> > initiating
> > >>> session
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect
> > >>> > to ZooKeeper service, session 0x23462a4cf93a8fc has expired,
> > >>> > closing
> > socket
> > >>> > connection
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> > connection
> > >>> to
> > >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> > initiating
> > >>> session
> > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect
> > >>> > to ZooKeeper service, session 0x346c561a55953e has expired,
> > >>> > closing
> > socket
> > >>> > connection
> > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING
> > >>> > region server
> > >>> > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> > >>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
> > >>> > Unhandled
> > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server
> > >>> > REPORT rejected; currently processing
> > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > >>> > org.apache.hadoop.hbase.YouAreDeadException:
> > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > >>> > rejected; currently processing
> > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > as
> > >>> > dead server
> > >>> >         at
> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > >>> > Method)
> > >>> >         at
> > >>> >
> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructo
> > r
> > >>> > AccessorImpl.java:39)
> > >>> >         at
> > >>> >
> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCo
> > n
> > >>> > structorAccessorImpl.java:27)
> > >>> >         at
> > >>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > >>> >         at
> > >>> > org.apache.hadoop.ipc.RemoteException.instantiateException(Remot
> > >>> > eExce
> > >>> > ption.java:95)
> > >>> >         at
> > >>> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Remo
> > >>> > te
> > >>> > Exception.java:79)
> > >>> >         at
> > >>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServ
> > >>> > erRep
> > >>> > ort(HRegionServer.java:735)
> > >>> >         at
> > >>> >
> > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.j
> > >>> > ava:596)
> > >>> >         at java.lang.Thread.run(Thread.java:662)
> > >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
> > >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > >>> > rejected; currently processing
> > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> > as
> > >>> > dead server
> > >>> >         at
> > >>> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerM
> > >>> > ana
> > >>> > ger.java:204)
> > >>> >         at
> > >>> > org.apache.hadoop.hbase.master.ServerManager.regionServerReport(
> > >>> > Serv
> > >>> > erManager.java:262)
> > >>> >         at
> > >>> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaste
> > >>> > r.jav
> > >>> > a:669)
> > >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> > Source)
> > >>> >         at
> > >>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethod
> > >>> > Acces
> > >>> > sorImpl.java:25)
> > >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> > >>> >         at
> > >>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > >>> >         at
> > >>> >
> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
> > >>> > 039)
> > >>> >
> > >>> >         at
> > >>> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > >>> >         at
> > >>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.jav
> > >>> > a:257
> > >>> > )
> > >>> >         at $Proxy6.regionServerReport(Unknown Source)
> > >>> >         at
> > >>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServ
> > >>> > erRep
> > >>> > ort(HRegionServer.java:729)
> > >>> >         ... 2 more
> > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
> > >>> > requests=66, regions=206, stores=2078, storefiles=970,
> > >>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
> > >>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
> > >>> > blockCacheSize=705907552, blockCacheFree=150412064,
> > >>> > blockCacheCount=10648, blockCacheHitCount=79578618,
> > >>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
> > >>> > blockCacheHitRatio=96,
> > >>> > blockCacheHitCachingRatio=98
> > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED:
> > >>> > Unhandled
> > >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server
> > >>> > REPORT rejected; currently processing
> > >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> > >>> > ---------------------------
> > >>> >
> > >>> > Then i restart the RegionServer and everything is back to normal.
> > >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i don't
> > >>> > see any abnormality in the same time window.
> > >>> > I think it was caused by the lost of connection to zookeeper. Is
> > >>> > it
> > >>> advisable to
> > >>> > run zookeeper in the same machines?
> > >>> > if the RegionServer lost it's connection to Zookeeper, there's a
> > >>> > way
> > (a
> > >>> > configuration perhaps) to re-join the cluster, and not only die?
> > >>> >
> > >>> > Any idea what is causing this?? Or to prevent it from happening?
> > >>> >
> > >>> > Any help is appreciated.
> > >>> >
> > >>> > Best Regards,
> > >>> >
> > >>> > --
> > >>> >
> > >>> > *Leonardo Gamas*
> > >>> > Software Engineer
> > >>> > +557134943514
> > >>> > +557581347440
> > >>> > leogamas@jusbrasil.com.br
> > >>> > www.jusbrasil.com.br
> > >>>
> > >>
> > >>
> > >>
> > >> --
> > >>
> > >> *Leonardo Gamas*
> > >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> > >> 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> > >>
> > >>
> > >
> > >
> > > --
> > >
> > > *Leonardo Gamas*
> > > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> > > 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> >
> >
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer
> +557134943514
> +557581347440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br
>



-- 

*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514
C +55 (75) 8134-7440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br

RE: RegionServer dying every two or three days

Posted by Sandy Pratt <pr...@adobe.com>.
That size heap doesn't seem like it should cause a 36-second GC (a minor GC at that, if I remember your logs correctly), so I tend to think that other things are probably going on.

This line here:

14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs] 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01, real=36.96 secs]

is really mysterious to me.  It seems to indicate that the process was blocked for almost 37 seconds during a minor collection.  Note the CPU times are very low but the wall time is very high.  If it was actually doing GC work, I'd expect to see user time higher than real time, as it is in other parallel collections (see your log snippet).  Were you really so CPU starved that it took 37 seconds to get in 50ms of work?  I can't make sense of that.  I'm trying to think of something that would block you for that long while all your threads are stopped for GC, other than being in swap, but I can't come up with anything.  You're certain you're not in swap?
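For what it's worth, a quick way to check whether a pause like that lines up with swapping or hypervisor stalls rather than actual GC work is to scan the GC log for entries whose wall-clock time dwarfs the CPU time, and then look at swap and steal on that box.  Just a sketch, assuming the log path from your -Xloggc flag and the standard procps tools:

# flag GC entries where real time is much larger than user+sys
grep 'real=' /usr/lib/hbase/logs/hbase-regionserver-gc.log | awk -F'real=' '$2+0 > 5'

free -m                      # is any swap configured or in use?
cat /proc/sys/vm/swappiness  # kernel swappiness setting
vmstat 5 5                   # si/so show swap traffic; st shows steal (on kernels that report it)

On EC2, a consistently high st column would point at the hypervisor stealing CPU rather than anything inside the JVM.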

Maybe drop these guys -XX:+DoEscapeAnalysis -XX:+AggressiveOpts while you troubleshoot?

Why is your new size so small?  This generally means that relatively more objects are being tenured than would be with a larger new size.  This could make collections of the old gen worse (GC time is said to be proportional to the number of live objects in the generation, and CMS does indeed cause STW pauses).  A typical new to tenured ratio might be 1:3.  Were the new gen GCs taking too long?  This is probably orthogonal to your immediate issue, though.
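If you do experiment with the new size later, here is roughly what I mean, with those two flags above dropped as well.  The numbers are only an illustration, not something I've tuned for your workload: on a 4 GB heap, -Xmn1024m gives you about a 1:3 new-to-tenured split instead of the 128 MB you have now.

-Xmx4096m -Xmn1024m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log

(-Xmn is just shorthand for setting NewSize and MaxNewSize to the same value.)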



-----Original Message-----
From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
Sent: Thursday, January 05, 2012 5:33 AM
To: user@hbase.apache.org
Subject: Re: RegionServer dying every two or three days

 St.Ack,

I haven't made any attempt at GC tuning yet.
I will read the perf section as suggested.
I'm currently using Nagios + JMX to monitor the cluster, but it's used for alerting only; the perfdata is not being stored, so it's of limited use right now. I was thinking of using TSDB to store it, any known cases of integration?
---

Sandy,

Yes, my timeout is 30 seconds:

<property>
   <name>zookeeper.session.timeout</name>
   <value>30000</value>
</property>

For our application that's an acceptable time to wait in case a RegionServer goes offline.

My heap is 4GB and my JVM params are:

-Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log

I will try the -XX:+UseParallelOldGC param and post my feedback here.
---

Ramkrishna,

Seems the GC is the root of all evil in this case.
----

Thank you all for the answers. I will try out the valuable advice given here and post my results.

Leo Gamas.

2012/1/5 Ramkrishna S Vasudevan <ra...@huawei.com>

> Recently we faced a similar problem and it was due to GC config.  Pls
> check your GC.
>
> Regards
> Ram
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> Stack
> Sent: Thursday, January 05, 2012 2:50 AM
> To: user@hbase.apache.org
> Subject: Re: RegionServer dying every two or three days
>
> On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> <le...@jusbrasil.com.br> wrote:
> > The third line took 36.96 seconds to execute, can this be causing
> > this problem?
> >
>
> Probably.  Have you made any attempt at GC tuning?
>
>
> > Reading the code a little it seems that, even if it's disabled, if
> > all files are target in a compaction, it's considered a major
> > compaction. Is
> it
> > right?
> >
>
> That is right.  They get 'upgraded' from minor to major.
>
> This should be fine though.  What you are avoiding setting major
> compactions to 0 is all regions being major compacted on a period, a
> heavy weight effective rewrite of all your data (unless already major
> compacted).   It looks like you have this disabled which is good until
> you've wrestled your cluster into submission.
>
>
> > The machines don't have swap, so the swappiness parameter don't seem
> > to apply here. Any other suggestion?
> >
>
> See the perf section of the hbase manual.  It has our current list.
>
> Are you monitoring your cluster w/ ganglia or tsdb?
>
>
> St.Ack
>
> > Thanks.
> >
> > 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
> >
> >> I will investigate this, thanks for the response.
> >>
> >>
> >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
> >>
> >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed
> >>> out, have not heard from server in 61103ms for sessionid
> >>> 0x23462a4cf93a8fc, closing socket connection and attempting
> >>> reconnect
> >>>
> >>> It looks like the process has been unresponsive for some time, so
> >>> ZK
> has
> >>> terminated the session.  Did you experience a long GC pause right
> before
> >>> this?  If you don't have GC logging enabled for the RS, you can
> sometimes
> >>> tell by noticing a gap in the timestamps of the log statements
> >>> leading
> up
> >>> to the crash.
> >>>
> >>> If it turns out to be GC, you might want to look at your kernel
> >>> swappiness setting (set it to 0) and your JVM params.
> >>>
> >>> Sandy
> >>>
> >>>
> >>> > -----Original Message-----
> >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> >>> > Sent: Thursday, December 29, 2011 07:44
> >>> > To: user@hbase.apache.org
> >>> > Subject: RegionServer dying every two or three days
> >>> >
> >>> > Hi,
> >>> >
> >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
> >>> Slaves),
> >>> > running on Amazon EC2. The master is a High-Memory Extra Large
> Instance
> >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper.
> >>> > The slaves are Extra Large Instances (m1.xlarge) running
> >>> > Datanode,
> >>> TaskTracker,
> >>> > RegionServer and Zookeeper.
> >>> >
> >>> > From time to time, every two or three days, one of the
> >>> > RegionServers processes goes down, but the other processes
> >>> > (DataNode, TaskTracker,
> >>> > Zookeeper) continue normally.
> >>> >
> >>> > Reading the logs:
> >>> >
> >>> > The connection with Zookeeper timed out:
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> >>> > timed
> out,
> >>> have
> >>> > not heard from server in 61103ms for sessionid
> >>> > 0x23462a4cf93a8fc,
> >>> closing
> >>> > socket connection and attempting reconnect
> >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session
> >>> > timed
> out,
> >>> have
> >>> > not heard from server in 61205ms for sessionid 0x346c561a55953e,
> closing
> >>> > socket connection and attempting reconnect
> >>> > ---------------------------
> >>> >
> >>> > And the Handlers start to fail:
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder,
> >>> > call
> >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
> >>> > xx.xx.xx.xx:xxxx: output error
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on
> 60020
> >>> > caught: java.nio.channels.ClosedChannelException
> >>> >         at
> >>> >
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
> >>> > 3)
> >>> >         at
> >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >>> >         at
> >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> >>> > 1341)
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespons
> >>> > e(HB
> >>> > aseServer.java:727)
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBas
> >>> > eSe
> >>> > rver.java:792)
> >>> >         at
> >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
> >>> > 083)
> >>> >
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder,
> >>> > call
> >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
> >>> > xx.xx.xx.xx:xxxx: output error
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on
> 60020
> >>> > caught: java.nio.channels.ClosedChannelException
> >>> >         at
> >>> >
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
> >>> > 3)
> >>> >         at
> >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >>> >         at
> >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> >>> > 1341)
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processRespons
> >>> > e(HB
> >>> > aseServer.java:727)
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBas
> >>> > eSe
> >>> > rver.java:792)
> >>> >         at
> >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
> >>> > 083)
> >>> > ---------------------------
> >>> >
> >>> > And finally the server throws a YouAreDeadException :( :
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> connection
> >>> to
> >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> initiating
> >>> session
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect
> >>> > to ZooKeeper service, session 0x23462a4cf93a8fc has expired,
> >>> > closing
> socket
> >>> > connection
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> connection
> >>> to
> >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> initiating
> >>> session
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect
> >>> > to ZooKeeper service, session 0x346c561a55953e has expired,
> >>> > closing
> socket
> >>> > connection
> >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING
> >>> > region server
> >>> > serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> >>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
> >>> > Unhandled
> >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server
> >>> > REPORT rejected; currently processing
> >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> > org.apache.hadoop.hbase.YouAreDeadException:
> >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> >>> > rejected; currently processing
> >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> as
> >>> > dead server
> >>> >         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >>> > Method)
> >>> >         at
> >>> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructo
> r
> >>> > AccessorImpl.java:39)
> >>> >         at
> >>> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCo
> n
> >>> > structorAccessorImpl.java:27)
> >>> >         at
> >>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >>> >         at
> >>> > org.apache.hadoop.ipc.RemoteException.instantiateException(Remot
> >>> > eExce
> >>> > ption.java:95)
> >>> >         at
> >>> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Remo
> >>> > te
> >>> > Exception.java:79)
> >>> >         at
> >>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServ
> >>> > erRep
> >>> > ort(HRegionServer.java:735)
> >>> >         at
> >>> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.j
> >>> > ava:596)
> >>> >         at java.lang.Thread.run(Thread.java:662)
> >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
> >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> >>> > rejected; currently processing
> >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> as
> >>> > dead server
> >>> >         at
> >>> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerM
> >>> > ana
> >>> > ger.java:204)
> >>> >         at
> >>> > org.apache.hadoop.hbase.master.ServerManager.regionServerReport(
> >>> > Serv
> >>> > erManager.java:262)
> >>> >         at
> >>> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaste
> >>> > r.jav
> >>> > a:669)
> >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> Source)
> >>> >         at
> >>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethod
> >>> > Acces
> >>> > sorImpl.java:25)
> >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >>> >         at
> >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
> >>> > 039)
> >>> >
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.jav
> >>> > a:257
> >>> > )
> >>> >         at $Proxy6.regionServerReport(Unknown Source)
> >>> >         at
> >>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServ
> >>> > erRep
> >>> > ort(HRegionServer.java:729)
> >>> >         ... 2 more
> >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
> >>> > requests=66, regions=206, stores=2078, storefiles=970,
> >>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
> >>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
> >>> > blockCacheSize=705907552, blockCacheFree=150412064,
> >>> > blockCacheCount=10648, blockCacheHitCount=79578618,
> >>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
> >>> > blockCacheHitRatio=96,
> >>> > blockCacheHitCachingRatio=98
> >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED:
> >>> > Unhandled
> >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server
> >>> > REPORT rejected; currently processing
> >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> >>> > ---------------------------
> >>> >
> >>> > Then i restart the RegionServer and everything is back to normal.
> >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i don't
> >>> > see any abnormality in the same time window.
> >>> > I think it was caused by the lost of connection to zookeeper. Is
> >>> > it
> >>> advisable to
> >>> > run zookeeper in the same machines?
> >>> > if the RegionServer lost it's connection to Zookeeper, there's a
> >>> > way
> (a
> >>> > configuration perhaps) to re-join the cluster, and not only die?
> >>> >
> >>> > Any idea what is causing this?? Or to prevent it from happening?
> >>> >
> >>> > Any help is appreciated.
> >>> >
> >>> > Best Regards,
> >>> >
> >>> > --
> >>> >
> >>> > *Leonardo Gamas*
> >>> > Software Engineer
> >>> > +557134943514
> >>> > +557581347440
> >>> > leogamas@jusbrasil.com.br
> >>> > www.jusbrasil.com.br
> >>>
> >>
> >>
> >>
> >> --
> >>
> >> *Leonardo Gamas*
> >> Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> >> 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> >>
> >>
> >
> >
> > --
> >
> > *Leonardo Gamas*
> > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> > 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
>
>


--

*Leonardo Gamas*
Software Engineer
+557134943514
+557581347440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br

Re: RegionServer dying every two or three days

Posted by Leonardo Gamas <le...@jusbrasil.com.br>.
 St.Ack,

I haven't made any attempt at GC tuning yet.
I will read the perf section as suggested.
I'm currently using Nagios + JMX to monitor the cluster, but it's used for
alerting only; the perfdata is not being stored, so it's of limited use
right now. I was thinking of using TSDB to store it, any known cases of
integration?
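What I have in mind is just exposing the RegionServer metrics over remote JMX so an external collector could poll them, something along these lines in the RegionServer JVM options (a sketch only; the port is a placeholder, and authentication/SSL are disabled here just for testing):

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=10102
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false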
---

Sandy,

Yes, my timeout is 30 seconds:

<property>
   <name>zookeeper.session.timeout</name>
   <value>30000</value>
</property>

For our application that's an acceptable time to wait in case a RegionServer
goes offline.

My heap is 4GB and my JVM params are:

-Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m -XX:MaxNewSize=128m
-XX:+DoEscapeAnalysis -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log

I will try the -XX:+UseParallelOldGC param and post my feedback here.
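If I understood the suggestion correctly, the line would become something like this (just a sketch of what I plan to try; the ParNew/CMS flags are removed because the parallel and concurrent collectors can't be combined, everything else stays the same):

-Xmx4096m -server -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewSize=128m -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log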
---

Ramkrishna,

Seems the GC is the root of all evil in this case.
----

Thank you all for the answers. I will try out the valuable advice given
here and post my results.

Leo Gamas.

2012/1/5 Ramkrishna S Vasudevan <ra...@huawei.com>

> Recently we faced a similar problem and it was due to GC config.  Pls check
> your GC.
>
> Regards
> Ram
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Thursday, January 05, 2012 2:50 AM
> To: user@hbase.apache.org
> Subject: Re: RegionServer dying every two or three days
>
> On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
> <le...@jusbrasil.com.br> wrote:
> > The third line took 36.96 seconds to execute, can this be causing this
> > problem?
> >
>
> Probably.  Have you made any attempt at GC tuning?
>
>
> > Reading the code a little it seems that, even if it's disabled, if all
> > files are target in a compaction, it's considered a major compaction. Is
> it
> > right?
> >
>
> That is right.  They get 'upgraded' from minor to major.
>
> This should be fine though.  What you are avoiding setting major
> compactions to 0 is all regions being major compacted on a period, a
> heavy weight effective rewrite of all your data (unless already major
> compacted).   It looks like you have this disabled which is good until
> you've wrestled your cluster into submission.
>
>
> > The machines don't have swap, so the swappiness parameter don't seem to
> > apply here. Any other suggestion?
> >
>
> See the perf section of the hbase manual.  It has our current list.
>
> Are you monitoring your cluster w/ ganglia or tsdb?
>
>
> St.Ack
>
> > Thanks.
> >
> > 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
> >
> >> I will investigate this, thanks for the response.
> >>
> >>
> >> 2012/1/3 Sandy Pratt <pr...@adobe.com>
> >>
> >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
> >>> have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
> >>> closing socket connection and attempting reconnect
> >>>
> >>> It looks like the process has been unresponsive for some time, so ZK
> has
> >>> terminated the session.  Did you experience a long GC pause right
> before
> >>> this?  If you don't have GC logging enabled for the RS, you can
> sometimes
> >>> tell by noticing a gap in the timestamps of the log statements leading
> up
> >>> to the crash.
> >>>
> >>> If it turns out to be GC, you might want to look at your kernel
> >>> swappiness setting (set it to 0) and your JVM params.
> >>>
> >>> Sandy
> >>>
> >>>
> >>> > -----Original Message-----
> >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> >>> > Sent: Thursday, December 29, 2011 07:44
> >>> > To: user@hbase.apache.org
> >>> > Subject: RegionServer dying every two or three days
> >>> >
> >>> > Hi,
> >>> >
> >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
> >>> Slaves),
> >>> > running on Amazon EC2. The master is a High-Memory Extra Large
> Instance
> >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The
> >>> > slaves are Extra Large Instances (m1.xlarge) running Datanode,
> >>> TaskTracker,
> >>> > RegionServer and Zookeeper.
> >>> >
> >>> > From time to time, every two or three days, one of the RegionServers
> >>> > processes goes down, but the other processes (DataNode, TaskTracker,
> >>> > Zookeeper) continue normally.
> >>> >
> >>> > Reading the logs:
> >>> >
> >>> > The connection with Zookeeper timed out:
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed
> out,
> >>> have
> >>> > not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
> >>> closing
> >>> > socket connection and attempting reconnect
> >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed
> out,
> >>> have
> >>> > not heard from server in 61205ms for sessionid 0x346c561a55953e,
> closing
> >>> > socket connection and attempting reconnect
> >>> > ---------------------------
> >>> >
> >>> > And the Handlers start to fail:
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> >>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
> >>> > xx.xx.xx.xx:xxxx: output error
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on
> 60020
> >>> > caught: java.nio.channels.ClosedChannelException
> >>> >         at
> >>> >
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
> >>> > 3)
> >>> >         at
> >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >>> >         at
> >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> >>> > 1341)
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB
> >>> > aseServer.java:727)
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe
> >>> > rver.java:792)
> >>> >         at
> >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
> >>> > 083)
> >>> >
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> >>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
> >>> > xx.xx.xx.xx:xxxx: output error
> >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on
> 60020
> >>> > caught: java.nio.channels.ClosedChannelException
> >>> >         at
> >>> >
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
> >>> > 3)
> >>> >         at
> >>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >>> >         at
> >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> >>> > 1341)
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB
> >>> > aseServer.java:727)
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe
> >>> > rver.java:792)
> >>> >         at
> >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
> >>> > 083)
> >>> > ---------------------------
> >>> >
> >>> > And finally the server throws a YouAreDeadException :( :
> >>> >
> >>> > ---------------------------
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> connection
> >>> to
> >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> initiating
> >>> session
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> >>> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing
> socket
> >>> > connection
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> connection
> >>> to
> >>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> >>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> initiating
> >>> session
> >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> >>> > ZooKeeper service, session 0x346c561a55953e has expired, closing
> socket
> >>> > connection
> >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region
> >>> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> >>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
> >>> > Unhandled
> >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> >>> > rejected; currently processing
> >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> > org.apache.hadoop.hbase.YouAreDeadException:
> >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> >>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> as
> >>> > dead server
> >>> >         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >>> > Method)
> >>> >         at
> >>> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructor
> >>> > AccessorImpl.java:39)
> >>> >         at
> >>> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCon
> >>> > structorAccessorImpl.java:27)
> >>> >         at
> >>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >>> >         at
> >>> > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteExce
> >>> > ption.java:95)
> >>> >         at
> >>> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Remote
> >>> > Exception.java:79)
> >>> >         at
> >>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep
> >>> > ort(HRegionServer.java:735)
> >>> >         at
> >>> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.j
> >>> > ava:596)
> >>> >         at java.lang.Thread.run(Thread.java:662)
> >>> > Caused by: org.apache.hadoop.ipc.RemoteException:
> >>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> >>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
> as
> >>> > dead server
> >>> >         at
> >>> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerMana
> >>> > ger.java:204)
> >>> >         at
> >>> > org.apache.hadoop.hbase.master.ServerManager.regionServerReport(Serv
> >>> > erManager.java:262)
> >>> >         at
> >>> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.jav
> >>> > a:669)
> >>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> Source)
> >>> >         at
> >>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
> >>> > sorImpl.java:25)
> >>> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >>> >         at
> >>> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
> >>> > 039)
> >>> >
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> >>> >         at
> >>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257
> >>> > )
> >>> >         at $Proxy6.regionServerReport(Unknown Source)
> >>> >         at
> >>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep
> >>> > ort(HRegionServer.java:729)
> >>> >         ... 2 more
> >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
> >>> > requests=66, regions=206, stores=2078, storefiles=970,
> >>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
> >>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
> >>> > blockCacheSize=705907552, blockCacheFree=150412064,
> >>> > blockCacheCount=10648, blockCacheHitCount=79578618,
> >>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
> >>> > blockCacheHitRatio=96,
> >>> > blockCacheHitCachingRatio=98
> >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
> >>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> >>> > rejected; currently processing
> >>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> >>> > ---------------------------
> >>> >
> >>> > Then i restart the RegionServer and everything is back to normal.
> >>> > Reading the DataNode, Zookeeper and TaskTracker logs, i don't see any
> >>> > abnormality in the same time window.
> >>> > I think it was caused by the lost of connection to zookeeper. Is it
> >>> advisable to
> >>> > run zookeeper in the same machines?
> >>> > if the RegionServer lost it's connection to Zookeeper, there's a way
> (a
> >>> > configuration perhaps) to re-join the cluster, and not only die?
> >>> >
> >>> > Any idea what is causing this?? Or to prevent it from happening?
> >>> >
> >>> > Any help is appreciated.
> >>> >
> >>> > Best Regards,
> >>> >
> >>> > --
> >>> >
> >>> > *Leonardo Gamas*
> >>> > Software Engineer
> >>> > +557134943514
> >>> > +557581347440
> >>> > leogamas@jusbrasil.com.br
> >>> > www.jusbrasil.com.br
> >>>
> >>
> >>
> >>
> >> --
> >>
> >> *Leonardo Gamas*
> >> Software Engineer/Chaos Monkey Engineer
> >> T (71) 3494-3514
> >> C (75) 8134-7440
> >> leogamas@jusbrasil.com.br
> >> www.jusbrasil.com.br
> >>
> >>
> >
> >
> > --
> >
> > *Leonardo Gamas*
> > Software Engineer/Chaos Monkey Engineer
> > T (71) 3494-3514
> > C (75) 8134-7440
> > leogamas@jusbrasil.com.br
> > www.jusbrasil.com.br
>
>


-- 

*Leonardo Gamas*
Software Engineer
+557134943514
+557581347440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br

RE: RegionServer dying every two or three days

Posted by Ramkrishna S Vasudevan <ra...@huawei.com>.
Recently we faced a similar problem and it was due to GC config.  Please
check your GC settings.

Regards
Ram

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Thursday, January 05, 2012 2:50 AM
To: user@hbase.apache.org
Subject: Re: RegionServer dying every two or three days

On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
<le...@jusbrasil.com.br> wrote:
> The third line took 36.96 seconds to execute, can this be causing this
> problem?
>

Probably.  Have you made any attempt at GC tuning?


> Reading the code a little it seems that, even if it's disabled, if all
> files are target in a compaction, it's considered a major compaction. Is
it
> right?
>

That is right.  They get 'upgraded' from minor to major.

This should be fine though.  What you are avoiding by setting major
compactions to 0 is all regions being major compacted on a schedule, a
heavyweight, effective rewrite of all your data (unless already major
compacted).  It looks like you have this disabled, which is good until
you've wrestled your cluster into submission.


> The machines don't have swap, so the swappiness parameter don't seem to
> apply here. Any other suggestion?
>

See the perf section of the hbase manual.  It has our current list.

Are you monitoring your cluster w/ ganglia or tsdb?


St.Ack

> Thanks.
>
> 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
>
>> I will investigate this, thanks for the response.
>>
>>
>> 2012/1/3 Sandy Pratt <pr...@adobe.com>
>>
>>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>>> have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
>>> closing socket connection and attempting reconnect
>>>
>>> It looks like the process has been unresponsive for some time, so ZK has
>>> terminated the session.  Did you experience a long GC pause right before
>>> this?  If you don't have GC logging enabled for the RS, you can
sometimes
>>> tell by noticing a gap in the timestamps of the log statements leading
up
>>> to the crash.
>>>
>>> If it turns out to be GC, you might want to look at your kernel
>>> swappiness setting (set it to 0) and your JVM params.
>>>
>>> Sandy
>>>
>>>
>>> > -----Original Message-----
>>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
>>> > Sent: Thursday, December 29, 2011 07:44
>>> > To: user@hbase.apache.org
>>> > Subject: RegionServer dying every two or three days
>>> >
>>> > Hi,
>>> >
>>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
>>> Slaves),
>>> > running on Amazon EC2. The master is a High-Memory Extra Large
Instance
>>> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The
>>> > slaves are Extra Large Instances (m1.xlarge) running Datanode,
>>> TaskTracker,
>>> > RegionServer and Zookeeper.
>>> >
>>> > From time to time, every two or three days, one of the RegionServers
>>> > processes goes down, but the other processes (DataNode, TaskTracker,
>>> > Zookeeper) continue normally.
>>> >
>>> > Reading the logs:
>>> >
>>> > The connection with Zookeeper timed out:
>>> >
>>> > ---------------------------
>>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>>> have
>>> > not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
>>> closing
>>> > socket connection and attempting reconnect
>>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>>> have
>>> > not heard from server in 61205ms for sessionid 0x346c561a55953e,
closing
>>> > socket connection and attempting reconnect
>>> > ---------------------------
>>> >
>>> > And the Handlers start to fail:
>>> >
>>> > ---------------------------
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
>>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
>>> > xx.xx.xx.xx:xxxx: output error
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
>>> > caught: java.nio.channels.ClosedChannelException
>>> >         at
>>> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
>>> > 3)
>>> >         at
>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
>>> > 1341)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB
>>> > aseServer.java:727)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe
>>> > rver.java:792)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
>>> > 083)
>>> >
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
>>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
>>> > xx.xx.xx.xx:xxxx: output error
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
>>> > caught: java.nio.channels.ClosedChannelException
>>> >         at
>>> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
>>> > 3)
>>> >         at
>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
>>> > 1341)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB
>>> > aseServer.java:727)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe
>>> > rver.java:792)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
>>> > 083)
>>> > ---------------------------
>>> >
>>> > And finally the server throws a YouAreDeadException :( :
>>> >
>>> > ---------------------------
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to
>>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
>>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
initiating
>>> session
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing
socket
>>> > connection
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to
>>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
>>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
initiating
>>> session
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>> > ZooKeeper service, session 0x346c561a55953e has expired, closing
socket
>>> > connection
>>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region
>>> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
>>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
>>> > Unhandled
>>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>>> > rejected; currently processing
>>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>>> > org.apache.hadoop.hbase.YouAreDeadException:
>>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
as
>>> > dead server
>>> >         at
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> > Method)
>>> >         at
>>> >
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructor
>>> > AccessorImpl.java:39)
>>> >         at
>>> >
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCon
>>> > structorAccessorImpl.java:27)
>>> >         at
>>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>> >         at
>>> > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteExce
>>> > ption.java:95)
>>> >         at
>>> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Remote
>>> > Exception.java:79)
>>> >         at
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep
>>> > ort(HRegionServer.java:735)
>>> >         at
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.j
>>> > ava:596)
>>> >         at java.lang.Thread.run(Thread.java:662)
>>> > Caused by: org.apache.hadoop.ipc.RemoteException:
>>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741
as
>>> > dead server
>>> >         at
>>> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerMana
>>> > ger.java:204)
>>> >         at
>>> > org.apache.hadoop.hbase.master.ServerManager.regionServerReport(Serv
>>> > erManager.java:262)
>>> >         at
>>> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.jav
>>> > a:669)
>>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>> >         at
>>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
>>> > sorImpl.java:25)
>>> >         at java.lang.reflect.Method.invoke(Method.java:597)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
>>> > 039)
>>> >
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257
>>> > )
>>> >         at $Proxy6.regionServerReport(Unknown Source)
>>> >         at
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep
>>> > ort(HRegionServer.java:729)
>>> >         ... 2 more
>>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
>>> > requests=66, regions=206, stores=2078, storefiles=970,
>>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
>>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
>>> > blockCacheSize=705907552, blockCacheFree=150412064,
>>> > blockCacheCount=10648, blockCacheHitCount=79578618,
>>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
>>> > blockCacheHitRatio=96,
>>> > blockCacheHitCachingRatio=98
>>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
>>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>>> > rejected; currently processing
>>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
>>> > ---------------------------
>>> >
>>> > Then i restart the RegionServer and everything is back to normal.
>>> > Reading the DataNode, Zookeeper and TaskTracker logs, i don't see any
>>> > abnormality in the same time window.
>>> > I think it was caused by the lost of connection to zookeeper. Is it
>>> advisable to
>>> > run zookeeper in the same machines?
>>> > if the RegionServer lost it's connection to Zookeeper, there's a way
(a
>>> > configuration perhaps) to re-join the cluster, and not only die?
>>> >
>>> > Any idea what is causing this?? Or to prevent it from happening?
>>> >
>>> > Any help is appreciated.
>>> >
>>> > Best Regards,
>>> >
>>> > --
>>> >
>>> > *Leonardo Gamas*
>>> > Software Engineer
>>> > +557134943514
>>> > +557581347440
>>> > leogamas@jusbrasil.com.br
>>> > www.jusbrasil.com.br
>>>
>>
>>
>>
>> --
>>
>> *Leonardo Gamas*
>> Software Engineer/Chaos Monkey Engineer
>> T (71) 3494-3514
>> C (75) 8134-7440
>> leogamas@jusbrasil.com.br
>> www.jusbrasil.com.br
>>
>>
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer/Chaos Monkey Engineer
> T (71) 3494-3514
> C (75) 8134-7440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br


Re: RegionServer dying every two or three days

Posted by Stack <st...@duboce.net>.
On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas
<le...@jusbrasil.com.br> wrote:
> The third line took 36.96 seconds to execute, can this be causing this
> problem?
>

Probably.  Have you made any attempt at GC tuning?


> Reading the code a little it seems that, even if it's disabled, if all
> files are target in a compaction, it's considered a major compaction. Is it
> right?
>

That is right.  They get 'upgraded' from minor to major.

This should be fine though.  What you are avoiding by setting major
compactions to 0 is all regions being major compacted on a schedule, a
heavyweight, effective rewrite of all your data (unless already major
compacted).  It looks like you have this disabled, which is good until
you've wrestled your cluster into submission.


> The machines don't have swap, so the swappiness parameter don't seem to
> apply here. Any other suggestion?
>

See the perf section of the hbase manual.  It has our current list.

Are you monitoring your cluster w/ ganglia or tsdb?


St.Ack

> Thanks.
>
> 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
>
>> I will investigate this, thanks for the response.
>>
>>
>> 2012/1/3 Sandy Pratt <pr...@adobe.com>
>>
>>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>>> have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
>>> closing socket connection and attempting reconnect
>>>
>>> It looks like the process has been unresponsive for some time, so ZK has
>>> terminated the session.  Did you experience a long GC pause right before
>>> this?  If you don't have GC logging enabled for the RS, you can sometimes
>>> tell by noticing a gap in the timestamps of the log statements leading up
>>> to the crash.
>>>
>>> If it turns out to be GC, you might want to look at your kernel
>>> swappiness setting (set it to 0) and your JVM params.
>>>
>>> Sandy
>>>
>>>
>>> > -----Original Message-----
>>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
>>> > Sent: Thursday, December 29, 2011 07:44
>>> > To: user@hbase.apache.org
>>> > Subject: RegionServer dying every two or three days
>>> >
>>> > Hi,
>>> >
>>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
>>> Slaves),
>>> > running on Amazon EC2. The master is a High-Memory Extra Large Instance
>>> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The
>>> > slaves are Extra Large Instances (m1.xlarge) running Datanode,
>>> TaskTracker,
>>> > RegionServer and Zookeeper.
>>> >
>>> > From time to time, every two or three days, one of the RegionServers
>>> > processes goes down, but the other processes (DataNode, TaskTracker,
>>> > Zookeeper) continue normally.
>>> >
>>> > Reading the logs:
>>> >
>>> > The connection with Zookeeper timed out:
>>> >
>>> > ---------------------------
>>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>>> have
>>> > not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
>>> closing
>>> > socket connection and attempting reconnect
>>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>>> have
>>> > not heard from server in 61205ms for sessionid 0x346c561a55953e, closing
>>> > socket connection and attempting reconnect
>>> > ---------------------------
>>> >
>>> > And the Handlers start to fail:
>>> >
>>> > ---------------------------
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
>>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
>>> > xx.xx.xx.xx:xxxx: output error
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
>>> > caught: java.nio.channels.ClosedChannelException
>>> >         at
>>> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
>>> > 3)
>>> >         at
>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
>>> > 1341)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB
>>> > aseServer.java:727)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe
>>> > rver.java:792)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
>>> > 083)
>>> >
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
>>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
>>> > xx.xx.xx.xx:xxxx: output error
>>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
>>> > caught: java.nio.channels.ClosedChannelException
>>> >         at
>>> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
>>> > 3)
>>> >         at
>>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
>>> > 1341)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB
>>> > aseServer.java:727)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe
>>> > rver.java:792)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
>>> > 083)
>>> > ---------------------------
>>> >
>>> > And finally the server throws a YouAreDeadException :( :
>>> >
>>> > ---------------------------
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to
>>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
>>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating
>>> session
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket
>>> > connection
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection
>>> to
>>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
>>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating
>>> session
>>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
>>> > ZooKeeper service, session 0x346c561a55953e has expired, closing socket
>>> > connection
>>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region
>>> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
>>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
>>> > Unhandled
>>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>>> > rejected; currently processing
>>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>>> > org.apache.hadoop.hbase.YouAreDeadException:
>>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
>>> > dead server
>>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>>> > Method)
>>> >         at
>>> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructor
>>> > AccessorImpl.java:39)
>>> >         at
>>> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCon
>>> > structorAccessorImpl.java:27)
>>> >         at
>>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>> >         at
>>> > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteExce
>>> > ption.java:95)
>>> >         at
>>> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Remote
>>> > Exception.java:79)
>>> >         at
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep
>>> > ort(HRegionServer.java:735)
>>> >         at
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.j
>>> > ava:596)
>>> >         at java.lang.Thread.run(Thread.java:662)
>>> > Caused by: org.apache.hadoop.ipc.RemoteException:
>>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
>>> > dead server
>>> >         at
>>> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerMana
>>> > ger.java:204)
>>> >         at
>>> > org.apache.hadoop.hbase.master.ServerManager.regionServerReport(Serv
>>> > erManager.java:262)
>>> >         at
>>> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.jav
>>> > a:669)
>>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>> >         at
>>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
>>> > sorImpl.java:25)
>>> >         at java.lang.reflect.Method.invoke(Method.java:597)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
>>> > 039)
>>> >
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>>> >         at
>>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257
>>> > )
>>> >         at $Proxy6.regionServerReport(Unknown Source)
>>> >         at
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep
>>> > ort(HRegionServer.java:729)
>>> >         ... 2 more
>>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
>>> > requests=66, regions=206, stores=2078, storefiles=970,
>>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
>>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
>>> > blockCacheSize=705907552, blockCacheFree=150412064,
>>> > blockCacheCount=10648, blockCacheHitCount=79578618,
>>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
>>> > blockCacheHitRatio=96,
>>> > blockCacheHitCachingRatio=98
>>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
>>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>>> > rejected; currently processing
>>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
>>> > ---------------------------
>>> >
>>> > Then i restart the RegionServer and everything is back to normal.
>>> > Reading the DataNode, Zookeeper and TaskTracker logs, i don't see any
>>> > abnormality in the same time window.
>>> > I think it was caused by the lost of connection to zookeeper. Is it
>>> advisable to
>>> > run zookeeper in the same machines?
>>> > if the RegionServer lost it's connection to Zookeeper, there's a way (a
>>> > configuration perhaps) to re-join the cluster, and not only die?
>>> >
>>> > Any idea what is causing this?? Or to prevent it from happening?
>>> >
>>> > Any help is appreciated.
>>> >
>>> > Best Regards,
>>> >
>>> > --
>>> >
>>> > *Leonardo Gamas*
>>> > Software Engineer
>>> > +557134943514
>>> > +557581347440
>>> > leogamas@jusbrasil.com.br
>>> > www.jusbrasil.com.br
>>>
>>
>>
>>
>> --
>>
>> *Leonardo Gamas*
>> Software Engineer/Chaos Monkey Engineer
>> T (71) 3494-3514
>> C (75) 8134-7440
>> leogamas@jusbrasil.com.br
>> www.jusbrasil.com.br
>>
>>
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer/Chaos Monkey Engineer
> T (71) 3494-3514
> C (75) 8134-7440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br

RE: RegionServer dying every two or three days

Posted by Sandy Pratt <pr...@adobe.com>.
It seems like that 36-second GC could do it if it was a stop-the-world pause.  Is your ZK session timeout less than 36 seconds?

How large is your heap?  What JVM params are you running with?

If you're not collecting swapped-out memory, then is it the case that concurrent mode failures are just too expensive on your nodes?  When the concurrent collector fails, it falls back to a serial old gen collection, which can be pretty slow.  I've heard 10 sec per GB of memory as a rule of thumb.  You could try the parallel compacting old gen collector and see what you get (-XX:+UseParallelOldGC).  That should make your worst case GC much less expensive at the cost of more frequent latency spikes.  I usually see about 1-3 sec per GB of memory with that collector (which makes me think the 10 sec per GB number for the serial collector is wrong, but whatever).  With some young gen tuning, you can wind up with infrequent full GCs which you know you can survive, as opposed to infrequent but often deadly concurrent mode failures.
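If you want to put a number on the worst case while comparing collectors, one more logging flag can help: HotSpot can log the total time application threads were stopped, not just the collection time itself.  A sketch, assuming a Sun HotSpot 6 JVM and whatever GC log path you already use:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/path/to/gc.log

Pauses then show up as 'Total time for which application threads were stopped' lines, which makes it easy to see whether you ever stop for longer than your ZK session timeout.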

Another factor to consider is the role full GCs play in cleaning up the finalization queue, which I think can interact with off-heap direct buffer allocations in a bad way.  I don't pretend to be an expert here, but here are some clues:

http://bugs.sun.com/bugdatabase/view_bug.do;jsessionid=ae283c11508fb97ede5fe27a1554b?bug_id=4469299
https://groups.google.com/group/asynchbase/browse_thread/thread/c45bc7ba788b2357?pli=1
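Not something I've verified on HBase specifically, but if direct buffers turn out to be part of the picture, remember that the native memory behind an unreferenced DirectByteBuffer is only released once the buffer object itself gets collected, so rare full GCs can let it pile up.  Capping it makes that failure mode visible instead of silent; the size below is only a placeholder:

-XX:MaxDirectMemorySize=256m

With a cap in place you get an OutOfMemoryError for direct buffer memory rather than unexplained memory pressure when the off-heap side grows too large.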

On some of my app servers, I've noticed a curious situation where stage (with low traffic) would be gigs into swap for no good reason while prod (with high traffic) would be running stably with a reasonably sized heap (same config and hardware on both envs).  Switching to the parallel old gen collector, which in turn causes full GC to be much more frequent, corrected this problem in an immediate and obvious way.  The time of the change was instantly apparent in the cacti graphs, for example.  I can't prove it, but it's my hunch that NIO and off-heap byte buffer allocations (sometimes behind the scenes, see above) have made the CMS collector very risky for some long-running java applications.  For those that use it for HBase, I'd be curious to hear how frequent your full GCs are.
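If you don't already track it, jstat against the RegionServer process will show the full GC count and time accumulating while the server runs; a sketch, with the PID and sample interval as placeholders:

jstat -gcutil <regionserver-pid> 10s

The FGC and FGCT columns are the number of full collections and the total time spent in them, which should answer the frequency question directly.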

I think that's far enough afield for now!  Hope this helps somewhat.

Sandy

> -----Original Message-----
> From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> Sent: Wednesday, January 04, 2012 12:17
> To: user@hbase.apache.org
> Subject: Re: RegionServer dying every two or three days
>
> Sandy,
>
> It happened again:
>
> -----
> 12/01/04 14:51:41 INFO zookeeper.ClientCnxn: Client session timed out, have
> not heard from server in 46571ms for sessionid 0x334a8c09b080033, closing
> socket connection and attempting reconnect
> 12/01/04 14:51:41 INFO zookeeper.ClientCnxn: Client session timed out, have
> not heard from server in 46513ms for sessionid 0x334a8c09b080034, closing
> socket connection and attempting reconnect
> -----
>
> Comparing with the lines in the gc log, i found these lines around the time
> client session expired:
>
> -----
> 14248.678: [GC 14248.679: [ParNew: 105291K->457K(118016K), 0.0221210
> secs]
> 954217K->849457K(1705776K), 0.0222870 secs] [Times: user=0.05 sys=0.00,
> real=0.03 secs]
> 14249.959: [GC 14249.960: [ParNew: 105417K->392K(118016K), 0.0087140
> secs]
> 954417K->849428K(1705776K), 0.0089260 secs] [Times: user=0.01 sys=0.01,
> real=0.01 secs]
> 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840
> secs]
> 954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01,
> real=36.96 secs]
> 14296.604: [GC 14296.604: [ParNew: 105369K->523K(118016K), 0.0119440
> secs]
> 954434K->849708K(1705776K), 0.0121650 secs] [Times: user=0.03 sys=0.00,
> real=0.01 secs]
> -----
>
> The third line took 36.96 seconds of real time to execute; could this be
> causing the problem?
>
> I also noticed some major compactions happening, even after I disabled
> them with:
>
> <property>
>   <name>hbase.hregion.majorcompaction</name>
>   <value>0</value>
> </property>
>
> Reading the code a little, it seems that even if it's disabled, a compaction
> that selects all the store files is still considered a major compaction. Is
> that right?
>
> The machines don't have swap, so the swappiness parameter doesn't seem to
> apply here. Any other suggestions?
>
> Thanks.
>
> 2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>
>
> > I will investigate this, thanks for the response.
> >
> >
> > 2012/1/3 Sandy Pratt <pr...@adobe.com>
> >
> >> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed
> >> out, have not heard from server in 61103ms for sessionid
> >> 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> >>
> >> It looks like the process has been unresponsive for some time, so ZK
> >> has terminated the session.  Did you experience a long GC pause right
> >> before this?  If you don't have GC logging enabled for the RS, you
> >> can sometimes tell by noticing a gap in the timestamps of the log
> >> statements leading up to the crash.
> >>
> >> If it turns out to be GC, you might want to look at your kernel
> >> swappiness setting (set it to 0) and your JVM params.
> >>
> >> Sandy
> >>
> >>
> >> > -----Original Message-----
> >> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> >> > Sent: Thursday, December 29, 2011 07:44
> >> > To: user@hbase.apache.org
> >> > Subject: RegionServer dying every two or three days
> >> >
> >> > Hi,
> >> >
> >> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
> >> Slaves),
> >> > running on Amazon EC2. The master is a High-Memory Extra Large
> >> > Instance
> >> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The
> >> > slaves are Extra Large Instances (m1.xlarge) running Datanode,
> >> TaskTracker,
> >> > RegionServer and Zookeeper.
> >> >
> >> > From time to time, every two or three days, one of the
> >> > RegionServers processes goes down, but the other processes
> >> > (DataNode, TaskTracker,
> >> > Zookeeper) continue normally.
> >> >
> >> > Reading the logs:
> >> >
> >> > The connection with Zookeeper timed out:
> >> >
> >> > ---------------------------
> >> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed
> >> > out,
> >> have
> >> > not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
> >> closing
> >> > socket connection and attempting reconnect
> >> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed
> >> > out,
> >> have
> >> > not heard from server in 61205ms for sessionid 0x346c561a55953e,
> >> > closing socket connection and attempting reconnect
> >> > ---------------------------
> >> >
> >> > And the Handlers start to fail:
> >> >
> >> > ---------------------------
> >> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> >> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
> >> > xx.xx.xx.xx:xxxx: output error
> >> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on
> >> > 60020
> >> > caught: java.nio.channels.ClosedChannelException
> >> >         at
> >> >
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java
> >> > :13
> >> > 3)
> >> >         at
> >> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >> >         at
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> >> > 1341)
> >> >         at
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(H
> >> > B
> >> > aseServer.java:727)
> >> >         at
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe
> >> > rver.java:792)
> >> >         at
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.jav
> >> > a:1
> >> > 083)
> >> >
> >> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> >> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
> >> > xx.xx.xx.xx:xxxx: output error
> >> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on
> >> > 60020
> >> > caught: java.nio.channels.ClosedChannelException
> >> >         at
> >> >
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java
> >> > :13
> >> > 3)
> >> >         at
> >> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >> >         at
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> >> > 1341)
> >> >         at
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(H
> >> > B
> >> > aseServer.java:727)
> >> >         at
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe
> >> > rver.java:792)
> >> >         at
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.jav
> >> > a:1
> >> > 083)
> >> > ---------------------------
> >> >
> >> > And finally the server throws a YouAreDeadException :( :
> >> >
> >> > ---------------------------
> >> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> >> > connection
> >> to
> >> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> >> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> >> > initiating
> >> session
> >> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> >> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing
> >> > socket connection
> >> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket
> >> > connection
> >> to
> >> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> >> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> >> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181,
> >> > initiating
> >> session
> >> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> >> > ZooKeeper service, session 0x346c561a55953e has expired, closing
> >> > socket connection
> >> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region
> >> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> >> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
> >> > Unhandled
> >> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server
> >> > REPORT rejected; currently processing
> >> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >> > org.apache.hadoop.hbase.YouAreDeadException:
> >> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> >> > rejected; currently processing
> >> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >> >         at
> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >> > Method)
> >> >         at
> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstru
> >> > ctor
> >> > AccessorImpl.java:39)
> >> >         at
> >> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Delegatin
> >> > gCon
> >> > structorAccessorImpl.java:27)
> >> >         at
> >> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >> >         at
> >> >
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteEx
> >> > ce
> >> > ption.java:95)
> >> >         at
> >> >
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Remote
> >> > Exception.java:79)
> >> >         at
> >> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerR
> >> > ep
> >> > ort(HRegionServer.java:735)
> >> >         at
> >> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServe
> >> > r.j
> >> > ava:596)
> >> >         at java.lang.Thread.run(Thread.java:662)
> >> > Caused by: org.apache.hadoop.ipc.RemoteException:
> >> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> >> > rejected; currently processing
> >> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >> >         at
> >> >
> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerMana
> >> > ger.java:204)
> >> >         at
> >> >
> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(Ser
> >> > v
> >> > erManager.java:262)
> >> >         at
> >> >
> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.j
> >> > av
> >> > a:669)
> >> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown
> Source)
> >> >         at
> >> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcc
> >> > es
> >> > sorImpl.java:25)
> >> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >> >         at
> >> >
> org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >> >         at
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.jav
> >> > a:1
> >> > 039)
> >> >
> >> >         at
> >> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> >> >         at
> >> >
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:2
> >> > 57
> >> > )
> >> >         at $Proxy6.regionServerReport(Unknown Source)
> >> >         at
> >> >
> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerR
> >> > ep
> >> > ort(HRegionServer.java:729)
> >> >         ... 2 more
> >> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
> >> > requests=66, regions=206, stores=2078, storefiles=970,
> >> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
> >> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
> >> > blockCacheSize=705907552, blockCacheFree=150412064,
> >> > blockCacheCount=10648, blockCacheHitCount=79578618,
> >> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
> >> > blockCacheHitRatio=96,
> >> > blockCacheHitCachingRatio=98
> >> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED:
> >> > Unhandled
> >> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server
> >> > REPORT rejected; currently processing
> >> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> >> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> >> > ---------------------------
> >> >
> >> > Then I restart the RegionServer and everything is back to normal.
> >> > Reading the DataNode, ZooKeeper and TaskTracker logs, I don't see
> >> > any abnormality in the same time window.
> >> > I think it was caused by the loss of connection to ZooKeeper. Is it
> >> > advisable to run ZooKeeper on the same machines?
> >> > If the RegionServer loses its connection to ZooKeeper, is there a
> >> > way (a configuration, perhaps) to rejoin the cluster instead of just dying?
> >> >
> >> > Any idea what is causing this?? Or to prevent it from happening?
> >> >
> >> > Any help is appreciated.
> >> >
> >> > Best Regards,
> >> >
> >> > --
> >> >
> >> > *Leonardo Gamas*
> >> > Software Engineer
> >> > +557134943514
> >> > +557581347440
> >> > leogamas@jusbrasil.com.br
> >> > www.jusbrasil.com.br
> >>
> >
> >
> >
> > --
> >
> > *Leonardo Gamas*
> > Software Engineer/Chaos Monkey Engineer T (71) 3494-3514 C (75)
> > 8134-7440 leogamas@jusbrasil.com.br www.jusbrasil.com.br
> >
> >
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer/Chaos Monkey Engineer
> T (71) 3494-3514
> C (75) 8134-7440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br

Re: RegionServer dying every two or three days

Posted by Leonardo Gamas <le...@jusbrasil.com.br>.
Sandy,

It happened again:

-----
12/01/04 14:51:41 INFO zookeeper.ClientCnxn: Client session timed out, have
not heard from server in 46571ms for sessionid 0x334a8c09b080033, closing
socket connection and attempting reconnect
12/01/04 14:51:41 INFO zookeeper.ClientCnxn: Client session timed out, have
not heard from server in 46513ms for sessionid 0x334a8c09b080034, closing
socket connection and attempting reconnect
-----

Comparing with the lines in the GC log, I found these lines around the time
the client session expired:

-----
14248.678: [GC 14248.679: [ParNew: 105291K->457K(118016K), 0.0221210 secs]
954217K->849457K(1705776K), 0.0222870 secs] [Times: user=0.05 sys=0.00,
real=0.03 secs]
14249.959: [GC 14249.960: [ParNew: 105417K->392K(118016K), 0.0087140 secs]
954417K->849428K(1705776K), 0.0089260 secs] [Times: user=0.01 sys=0.01,
real=0.01 secs]
14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K), 0.0361840 secs]
954388K->849478K(1705776K), 0.0364200 secs] [Times: user=0.05 sys=0.01,
real=36.96 secs]
14296.604: [GC 14296.604: [ParNew: 105369K->523K(118016K), 0.0119440 secs]
954434K->849708K(1705776K), 0.0121650 secs] [Times: user=0.03 sys=0.00,
real=0.01 secs]
-----

The third line took 36.96 seconds of real time to execute; could this be
causing the problem?
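
For reference, I understand the knob on the HBase side is zookeeper.session.timeout in hbase-site.xml -- this is just a sketch, the value below is an example rather than what we run, and I believe the effective timeout is also capped by the ZooKeeper server's own maxSessionTimeout:

<property>
  <name>zookeeper.session.timeout</name>
  <!-- example value only: it would need to exceed the worst GC pause -->
  <value>90000</value>
</property>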

I also noticed some major compactions happening, even after I disabled them
with:

<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>

Reading the code a little, it seems that even if it's disabled, a compaction
that selects all the store files is still considered a major compaction. Is
that right?
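
For reference, these look like the knobs that bound how many files a single compaction will select -- just a sketch with example values, not our settings, so the names and defaults are worth double-checking against your version:

<property>
  <name>hbase.hstore.compactionThreshold</name>
  <!-- example: minimum number of store files before a compaction is considered -->
  <value>3</value>
</property>
<property>
  <name>hbase.hstore.compaction.max</name>
  <!-- example: maximum number of store files selected for one compaction -->
  <value>7</value>
</property>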

The machines don't have swap, so the swappiness parameter doesn't seem to
apply here. Any other suggestions?

Thanks.

2012/1/4 Leonardo Gamas <le...@jusbrasil.com.br>

> I will investigate this, thanks for the response.
>
>
> 2012/1/3 Sandy Pratt <pr...@adobe.com>
>
>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>> have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
>> closing socket connection and attempting reconnect
>>
>> It looks like the process has been unresponsive for some time, so ZK has
>> terminated the session.  Did you experience a long GC pause right before
>> this?  If you don't have GC logging enabled for the RS, you can sometimes
>> tell by noticing a gap in the timestamps of the log statements leading up
>> to the crash.
>>
>> If it turns out to be GC, you might want to look at your kernel
>> swappiness setting (set it to 0) and your JVM params.
>>
>> Sandy
>>
>>
>> > -----Original Message-----
>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
>> > Sent: Thursday, December 29, 2011 07:44
>> > To: user@hbase.apache.org
>> > Subject: RegionServer dying every two or three days
>> >
>> > Hi,
>> >
>> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3
>> Slaves),
>> > running on Amazon EC2. The master is a High-Memory Extra Large Instance
>> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The
>> > slaves are Extra Large Instances (m1.xlarge) running Datanode,
>> TaskTracker,
>> > RegionServer and Zookeeper.
>> >
>> > From time to time, every two or three days, one of the RegionServers
>> > processes goes down, but the other processes (DataNode, TaskTracker,
>> > Zookeeper) continue normally.
>> >
>> > Reading the logs:
>> >
>> > The connection with Zookeeper timed out:
>> >
>> > ---------------------------
>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>> have
>> > not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
>> closing
>> > socket connection and attempting reconnect
>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
>> have
>> > not heard from server in 61205ms for sessionid 0x346c561a55953e, closing
>> > socket connection and attempting reconnect
>> > ---------------------------
>> >
>> > And the Handlers start to fail:
>> >
>> > ---------------------------
>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
>> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
>> > xx.xx.xx.xx:xxxx: output error
>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
>> > caught: java.nio.channels.ClosedChannelException
>> >         at
>> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
>> > 3)
>> >         at
>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>> >         at
>> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
>> > 1341)
>> >         at
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB
>> > aseServer.java:727)
>> >         at
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe
>> > rver.java:792)
>> >         at
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
>> > 083)
>> >
>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
>> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
>> > xx.xx.xx.xx:xxxx: output error
>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
>> > caught: java.nio.channels.ClosedChannelException
>> >         at
>> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
>> > 3)
>> >         at
>> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>> >         at
>> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
>> > 1341)
>> >         at
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB
>> > aseServer.java:727)
>> >         at
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe
>> > rver.java:792)
>> >         at
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
>> > 083)
>> > ---------------------------
>> >
>> > And finally the server throws a YouAreDeadException :( :
>> >
>> > ---------------------------
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection
>> to
>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating
>> session
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
>> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket
>> > connection
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection
>> to
>> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
>> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating
>> session
>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
>> > ZooKeeper service, session 0x346c561a55953e has expired, closing socket
>> > connection
>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region
>> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
>> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
>> > Unhandled
>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>> > rejected; currently processing
>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> > org.apache.hadoop.hbase.YouAreDeadException:
>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
>> > dead server
>> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> > Method)
>> >         at
>> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructor
>> > AccessorImpl.java:39)
>> >         at
>> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCon
>> > structorAccessorImpl.java:27)
>> >         at
>> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>> >         at
>> > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteExce
>> > ption.java:95)
>> >         at
>> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Remote
>> > Exception.java:79)
>> >         at
>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep
>> > ort(HRegionServer.java:735)
>> >         at
>> > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.j
>> > ava:596)
>> >         at java.lang.Thread.run(Thread.java:662)
>> > Caused by: org.apache.hadoop.ipc.RemoteException:
>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
>> > dead server
>> >         at
>> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerMana
>> > ger.java:204)
>> >         at
>> > org.apache.hadoop.hbase.master.ServerManager.regionServerReport(Serv
>> > erManager.java:262)
>> >         at
>> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.jav
>> > a:669)
>> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>> >         at
>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
>> > sorImpl.java:25)
>> >         at java.lang.reflect.Method.invoke(Method.java:597)
>> >         at
>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> >         at
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
>> > 039)
>> >
>> >         at
>> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>> >         at
>> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257
>> > )
>> >         at $Proxy6.regionServerReport(Unknown Source)
>> >         at
>> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep
>> > ort(HRegionServer.java:729)
>> >         ... 2 more
>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
>> > requests=66, regions=206, stores=2078, storefiles=970,
>> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
>> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
>> > blockCacheSize=705907552, blockCacheFree=150412064,
>> > blockCacheCount=10648, blockCacheHitCount=79578618,
>> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
>> > blockCacheHitRatio=96,
>> > blockCacheHitCachingRatio=98
>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
>> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>> > rejected; currently processing
>> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
>> > ---------------------------
>> >
>> > Then I restart the RegionServer and everything is back to normal.
>> > Reading the DataNode, ZooKeeper and TaskTracker logs, I don't see any
>> > abnormality in the same time window.
>> > I think it was caused by the loss of connection to ZooKeeper. Is it
>> > advisable to run ZooKeeper on the same machines?
>> > If the RegionServer loses its connection to ZooKeeper, is there a way (a
>> > configuration, perhaps) to rejoin the cluster instead of just dying?
>> >
>> > Any idea what is causing this?? Or to prevent it from happening?
>> >
>> > Any help is appreciated.
>> >
>> > Best Regards,
>> >
>> > --
>> >
>> > *Leonardo Gamas*
>> > Software Engineer
>> > +557134943514
>> > +557581347440
>> > leogamas@jusbrasil.com.br
>> > www.jusbrasil.com.br
>>
>
>
>
> --
>
> *Leonardo Gamas*
> Software Engineer/Chaos Monkey Engineer
> T (71) 3494-3514
> C (75) 8134-7440
> leogamas@jusbrasil.com.br
> www.jusbrasil.com.br
>
>


-- 

*Leonardo Gamas*
Software Engineer/Chaos Monkey Engineer
T (71) 3494-3514
C (75) 8134-7440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br

Re: RegionServer dying every two or three days

Posted by Leonardo Gamas <le...@jusbrasil.com.br>.
I will investigate this, thanks for the response.

2012/1/3 Sandy Pratt <pr...@adobe.com>

> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
> have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc,
> closing socket connection and attempting reconnect
>
> It looks like the process has been unresponsive for some time, so ZK has
> terminated the session.  Did you experience a long GC pause right before
> this?  If you don't have GC logging enabled for the RS, you can sometimes
> tell by noticing a gap in the timestamps of the log statements leading up
> to the crash.
>
> If it turns out to be GC, you might want to look at your kernel swappiness
> setting (set it to 0) and your JVM params.
>
> Sandy
>
>
> > -----Original Message-----
> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > Sent: Thursday, December 29, 2011 07:44
> > To: user@hbase.apache.org
> > Subject: RegionServer dying every two or three days
> >
> > Hi,
> >
> > I have a HBase/Hadoop CDH3 cluster with 4 machines (1 Master + 3 Slaves),
> > running on Amazon EC2. The master is a High-Memory Extra Large Instance
> > (m2.xlarge) with NameNode, JobTracker, HMaster and Zookeeper. The
> > slaves are Extra Large Instances (m1.xlarge) running Datanode,
> TaskTracker,
> > RegionServer and Zookeeper.
> >
> > From time to time, every two or three days, one of the RegionServers
> > processes goes down, but the other processes (DataNode, TaskTracker,
> > Zookeeper) continue normally.
> >
> > Reading the logs:
> >
> > The connection with Zookeeper timed out:
> >
> > ---------------------------
> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
> have
> > not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing
> > socket connection and attempting reconnect
> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out,
> have
> > not heard from server in 61205ms for sessionid 0x346c561a55953e, closing
> > socket connection and attempting reconnect
> > ---------------------------
> >
> > And the Handlers start to fail:
> >
> > ---------------------------
> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> > multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from
> > xx.xx.xx.xx:xxxx: output error
> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020
> > caught: java.nio.channels.ClosedChannelException
> >         at
> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
> > 3)
> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >         at
> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > 1341)
> >         at
> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB
> > aseServer.java:727)
> >         at
> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe
> > rver.java:792)
> >         at
> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
> > 083)
> >
> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call
> > multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from
> > xx.xx.xx.xx:xxxx: output error
> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020
> > caught: java.nio.channels.ClosedChannelException
> >         at
> > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:13
> > 3)
> >         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> >         at
> > org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:
> > 1341)
> >         at
> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HB
> > aseServer.java:727)
> >         at
> > org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseSe
> > rver.java:792)
> >         at
> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
> > 083)
> > ---------------------------
> >
> > And finally the server throws a YouAreDeadException :( :
> >
> > ---------------------------
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating
> session
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> > ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket
> > connection
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to
> > server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection
> > established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating
> session
> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to
> > ZooKeeper service, session 0x346c561a55953e has expired, closing socket
> > connection
> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region
> > server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741,
> > load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083):
> > Unhandled
> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > rejected; currently processing
> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > org.apache.hadoop.hbase.YouAreDeadException:
> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
> > dead server
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > Method)
> >         at
> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructor
> > AccessorImpl.java:39)
> >         at
> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingCon
> > structorAccessorImpl.java:27)
> >         at
> java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> >         at
> > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteExce
> > ption.java:95)
> >         at
> > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(Remote
> > Exception.java:79)
> >         at
> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep
> > ort(HRegionServer.java:735)
> >         at
> > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.j
> > ava:596)
> >         at java.lang.Thread.run(Thread.java:662)
> > Caused by: org.apache.hadoop.ipc.RemoteException:
> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> > currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as
> > dead server
> >         at
> > org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerMana
> > ger.java:204)
> >         at
> > org.apache.hadoop.hbase.master.ServerManager.regionServerReport(Serv
> > erManager.java:262)
> >         at
> > org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.jav
> > a:669)
> >         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> >         at
> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
> > sorImpl.java:25)
> >         at java.lang.reflect.Method.invoke(Method.java:597)
> >         at
> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >         at
> > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1
> > 039)
> >
> >         at
> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> >         at
> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257
> > )
> >         at $Proxy6.regionServerReport(Unknown Source)
> >         at
> > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerRep
> > ort(HRegionServer.java:729)
> >         ... 2 more
> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics:
> > requests=66, regions=206, stores=2078, storefiles=970,
> > storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0,
> > flushQueueSize=0, usedHeap=1672, maxHeap=4083,
> > blockCacheSize=705907552, blockCacheFree=150412064,
> > blockCacheCount=10648, blockCacheHitCount=79578618,
> > blockCacheMissCount=3036335, blockCacheEvictedCount=1401352,
> > blockCacheHitRatio=96,
> > blockCacheHitCachingRatio=98
> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled
> > exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
> > rejected; currently processing
> > ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> > ---------------------------
> >
> > Then I restart the RegionServer and everything is back to normal.
> > Reading the DataNode, ZooKeeper and TaskTracker logs, I don't see any
> > abnormality in the same time window.
> > I think it was caused by the loss of connection to ZooKeeper. Is it
> > advisable to run ZooKeeper on the same machines?
> > If the RegionServer loses its connection to ZooKeeper, is there a way (a
> > configuration, perhaps) to rejoin the cluster instead of just dying?
> >
> > Any idea what is causing this?? Or to prevent it from happening?
> >
> > Any help is appreciated.
> >
> > Best Regards,
> >
> > --
> >
> > *Leonardo Gamas*
> > Software Engineer
> > +557134943514
> > +557581347440
> > leogamas@jusbrasil.com.br
> > www.jusbrasil.com.br
>



-- 

*Leonardo Gamas*
Software Engineer/Chaos Monkey Engineer
T (71) 3494-3514
C (75) 8134-7440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br