Posted to user@hbase.apache.org by ChingShen <ch...@gmail.com> on 2011/01/06 08:44:32 UTC

Long GC pause question

Hi all,

   I've run into a problem where a long GC pause prevents the region
server's local ZooKeeper client from sending heartbeats, so the session
times out. But I want to know: why does the HBase master send a
MSG_REGIONSERVER_STOP op to the region server to stop its services, rather
than reinitializing a new ZooKeeper client or restarting the region server?

  There are 3 RS/DN/TT nodes and 1 MS/NN/JT node in my cluster
(Hadoop 0.20.2, HBase 0.20.6), and vm.swappiness is set to zero.
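
For context, the expiry window in question is governed by the
zookeeper.session.timeout property. Raising it buys headroom while the GC is
being tuned (it does not fix the underlying pause); a minimal hbase-site.xml
sketch, with the value purely illustrative:

```xml
<!-- hbase-site.xml: widen the ZK session timeout while GC is tuned.
     Verify the property name against your version's hbase-default.xml. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value> <!-- milliseconds -->
</property>
```

The common companion change in this era was enabling the concurrent
collector via HBASE_OPTS in hbase-env.sh (e.g. -XX:+UseConcMarkSweepGC),
so pauses stay short enough to fit inside the session timeout.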

hbase-ites-master-clusterPC1.log
2011-01-06 13:10:57,003 INFO
org.apache.hadoop.hbase.master.ServerManager:
*clusterPC4,60020,1294280765301
znode expired*
2011-01-06 13:10:57,004 DEBUG org.apache.hadoop.hbase.master.HMaster:
Processing todo: ProcessServerShutdown of
ites-clusterPC4,60020,1294280765301
2011-01-06 13:10:57,004 INFO
org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of
server clusterPC4,60020,1294280765301: logSplit: false, rootRescanned:
false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
2011-01-06 13:10:57,007 INFO org.apache.hadoop.hbase.regionserver.HLog:
Splitting 1 hlog(s) in
hdfs://clusterPC1:54001/hbase20_6/.logs/ites-clusterPC4,60020,1294280765301
2011-01-06 13:10:57,007 DEBUG org.apache.hadoop.hbase.regionserver.HLog:
Splitting hlog 1 of 1:
hdfs://clusterPC1:54001/hbase20_6/.logs/ites-clusterPC4,60020,1294280765301/hlog.dat.1294280765667,
length=0
.............

hbase-ites-regionserver-clusterPC4.log:
2011-01-06 12:21:15,773 DEBUG
org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms
elapsed
*2011-01-06 13:11:03,849 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x12d59208f560000 to sun.nio.ch.SelectionKeyImpl@402f0df1
java.io.IOException: TIMED OUT
        at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2011-01-06 13:11:09,628 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGIONSERVER_STOP*
2011-01-06 13:11:31,491 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 60020
............

Please see the attached files.
Thanks.

Shen

Re: Long GC pause question

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Your MR job is likely generating a lot of IO and possibly starving
HBase while it's running (it would take some monitoring on your end to
figure that out). Fewer tasks per machine will leave more breathing
room; there aren't many other ways to unload overloaded machines.
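
As a concrete example of the kind of monitoring suggested above, here is a
small sketch that scans a HotSpot GC log (enabled with -verbose:gc
-Xloggc:<file>) for pauses long enough to blow the ZK session timeout; the
exact log-line format is an assumption and varies by JVM version and flags:

```python
import re

# Match HotSpot GC log lines of the (assumed) shape:
#   1234.567: [Full GC 900K->800K(1024K), 61.5 secs]
# capturing the pause duration in seconds.
PAUSE_RE = re.compile(r"\[(?:Full )?GC.*?,\s*([0-9.]+)\s*secs\]")

def long_pauses(lines, threshold_secs=60.0):
    """Return pause durations (seconds) exceeding threshold_secs."""
    hits = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            pause = float(m.group(1))
            if pause > threshold_secs:
                hits.append(pause)
    return hits
```

Run against the region server's GC log with the threshold set to the ZK
session timeout, any hit is a pause that could have expired the session.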

J-D

On Fri, Jan 7, 2011 at 12:29 AM, ChingShen <ch...@gmail.com> wrote:
> [...]

Re: Long GC pause question

Posted by ChingShen <ch...@gmail.com>.
Hi J-D,

  Yes, I run an MR job on my cluster, and when I set the MR configs as
below, the long GC pause occurs.
  MR config: (4-core CPU per RS/DN/TT node)
  mapred.tasktracker.reduce.tasks.maximum = 3
  mapred.tasktracker.map.tasks.maximum = 4
  mapred.reduce.slowstart.completed.maps = 0.05

  So I tried adjusting some configs as below and ran the same job, and
found the error occurs in either the map or the reduce phase.
  MR config:
  mapred.tasktracker.reduce.tasks.maximum = 7
  mapred.tasktracker.map.tasks.maximum = 7
  mapred.reduce.slowstart.completed.maps = 1

  Finally, I found that the long GC pause occurs in the reduce phase, and
my CPU usage is almost always at 100% during that phase. Does this mean
that high CPU usage causes the long GC pause? Besides decreasing
tasks.maximum, what else can I do to avoid the problem, and is there any
criterion? When I decrease tasks.maximum and run the same job, it's fine.
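
For what it's worth, a conservative layout for a 4-core node that also
hosts RS/DN/TT daemons might look like the mapred-site.xml sketch below;
the numbers are illustrative, not a recommendation:

```xml
<!-- mapred-site.xml: leave CPU headroom for the RegionServer and
     DataNode on a 4-core box. Values are illustrative only. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <!-- start reduces late so map and reduce tasks overlap less -->
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
```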

Shen

On Fri, Jan 7, 2011 at 3:20 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> [...]

Re: Long GC pause question

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Shen,

It's a design decision: we have historically preferred to let cluster
managers decide whether they want to restart a process that died, or
investigate why it died and then decide what to do. You can easily write
tools that will restart the region servers if they die, but the fact that
they die in the first place is the real issue.

Looking at your logs, I cannot tell exactly why your region server died
(also, the master log you gave refers to the death of PC3 one hour later,
not PC4). I do see that the ZooKeeper server expired the session almost a
whole minute before the region server figured it out, but the RS is
really quiet... is there anything else running on that cluster that
doesn't touch HBase but could affect it, like MR jobs that don't use
HBase?
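
The restart tooling mentioned above can be sketched as a tiny external
watchdog; everything here (the pidfile path, the start command) is a
hypothetical illustration, not an HBase default:

```shell
#!/bin/sh
# Hypothetical watchdog sketch. PIDFILE and START_CMD are illustrative
# assumptions; substitute your real paths and launcher.
PIDFILE="${PIDFILE:-/tmp/hbase-regionserver.pid}"
START_CMD="${START_CMD:-/opt/hbase/bin/hbase-daemon.sh start regionserver}"

# Succeed iff the given pid refers to a live process.
check_alive() {
  [ -n "$1" ] && kill -0 "$1" 2>/dev/null
}

# One watchdog iteration: restart the daemon if its process is gone.
watch_once() {
  pid=$(cat "$PIDFILE" 2>/dev/null)
  if ! check_alive "$pid"; then
    echo "region server down, restarting"
    $START_CMD
  fi
}

# In production this would run in a loop, e.g.:
#   while true; do watch_once; sleep 10; done
```

Note that such a tool only masks the symptom; the GC pause that killed the
process is still the thing to fix.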

J-D

On Wed, Jan 5, 2011 at 11:44 PM, ChingShen <ch...@gmail.com> wrote:
> [...]