Posted to user@hbase.apache.org by 聪聪 <17...@qq.com> on 2015/11/29 01:44:35 UTC
How to optimize the GC in HBase
hi,all:
This morning a regionserver shut down. By analyzing the log, I guess it is related to GC. So, how can I optimize GC in HBase?
The HBase version I use is hbase-0.98.6-cdh5.2.0. I look forward to your help with this.
JVM configuration is as follows:
export HBASE_REGIONSERVER_OPTS="-Xms92g -Xmx92g -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+UseG1GC -server -XX:+DisableExplicitGC -XX:+UseFastAccessorMethods -XX:SoftRefLRUPolicyMSPerMB=0 -XX:G1ReservePercent=15 -XX:InitiatingHeapOccupancyPercent=40 -XX:ConcGCThreads=18 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB -XX:ParallelGCThreads=18 -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -Xloggc:/home/q/hbase/hbase-0.98.6-cdh5.2.0/logs/gc-$(hostname)-hbase.log"
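The flags above (-XX:+PrintGCDetails together with -Xloggc) already produce per-collection timing lines, so the GC log is the first place to look for the 70-second stall. As a rough first pass, pauses longer than a few seconds can be pulled out with awk; the sample lines below are fabricated in the general shape of PrintGCDetails output, and the exact format varies by JDK build:

```shell
# Fabricated sample lines in the general shape of -XX:+PrintGCDetails output;
# real logs from this JVM will differ in detail.
cat > /tmp/gc-sample.log <<'EOF'
2015-11-29T00:48:01.123+0800: 12345.6: [GC pause (young), 0.0456 secs] [Times: user=0.30 sys=0.01, real=0.05 secs]
2015-11-29T00:49:46.900+0800: 12450.0: [Full GC 90G->40G(92G), 70.5873 secs] [Times: user=500.0 sys=2.0, real=70.59 secs]
EOF

# Print any collection whose wall-clock ("real=") time exceeded 5 seconds.
awk 'match($0, /real=[0-9.]+/) {
       t = substr($0, RSTART + 5, RLENGTH - 5)
       if (t + 0 > 5) print
     }' /tmp/gc-sample.log
```

Only the second sample line (real=70.59 secs) survives the filter, which is the kind of entry to correlate with the 00:49:46 warnings below.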
Regionserver on the log is as follows:
2015-11-29 00:48:30,521 DEBUG [regionserver60020-smallCompactions-1447234955800] compactions.Compactor: Compacting hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/2becc4de83bc4102802a4f69adfba433, keycount=1244806, bloomtype=ROW, size=462.2 M, encoding=NONE, seqNum=22878
2015-11-29 00:48:30,521 DEBUG [regionserver60020-smallCompactions-1447234955800] compactions.Compactor: Compacting hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/52669d6cf3a34dc89af6faf23ea03fa0, keycount=613700, bloomtype=ROW, size=236.4 M, encoding=NONE, seqNum=22885
2015-11-29 00:48:30,521 DEBUG [regionserver60020-smallCompactions-1447234955800] compactions.Compactor: Compacting hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/1ff32b168134412ea3f6942126623ef6, keycount=373741, bloomtype=ROW, size=143.9 M, encoding=NONE, seqNum=22889
2015-11-29 00:48:30,521 DEBUG [regionserver60020-smallCompactions-1447234955800] compactions.Compactor: Compacting hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/b993875f0f384597900d2a2d3e69fde8, keycount=136626, bloomtype=ROW, size=49.6 M, encoding=NONE, seqNum=22891
2015-11-29 00:49:46,912 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 71814ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-11-29 00:49:46,912 INFO [SplitLogWorker-l-hbase27.dba.cn2.qunar.com,60020,1447234715888-SendThread(l-zk2.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 77297ms for sessionid 0x150ad3cf038d6e6, closing socket connection and attempting reconnect
2015-11-29 00:49:46,912 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 74029ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-11-29 00:49:46,912 INFO [regionserver60020-SendThread(l-zk1.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 74953ms for sessionid 0x34f16f641cc2e77, closing socket connection and attempting reconnect
2015-11-29 00:49:46,913 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 70587ms
No GCs detected
2015-11-29 00:49:46,913 WARN [regionserver60020] util.Sleeper: We slept 72658ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(l-zk4.dba.cn8.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 81433ms for sessionid 0x54f16f4aa3363fa, closing socket connection and attempting reconnect
2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(l-zk2.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 79798ms for sessionid 0x54f16f4aa3363fe, closing socket connection and attempting reconnect
2015-11-29 00:49:46,922 WARN [ResponseProcessor for block BP-329468547-10.90.18.19-1444480193809:blk_1078012346_4271549] hdfs.DFSClient: Slow ReadProcessor read fields took 72827ms (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: [10.90.18.47:50010, 10.90.18.60:50010, 10.90.18.63:50010]
2015-11-29 00:49:46,921 WARN [ResponseProcessor for block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
2015-11-29 00:49:46,914 DEBUG [LruStats #0] hfile.LruBlockCache: Total=20.98 MB, free=18.38 GB, max=18.40 GB, blocks=19756849152, accesses=564477317, hits=166122280, hitRatio=29.43%, , cachingAccesses=166131865, cachingHits=166015128, cachingHitsRatio=99.93%, evictions=149453, evicted=116562, evictedPerRun=0.7799240946769714
2015-11-29 00:49:46,929 WARN [DataStreamer for file /hbase/tc_qtrace/WALs/l-hbase27.dba.cn2.qunar.com,60020,1447234715888/l-hbase27.dba.cn2.qunar.com%2C60020%2C1447234715888.1448726130837.meta block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823] hdfs.DFSClient: Error Recovery for block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823 in pipeline 10.90.18.47:50010, 10.90.18.61:50010, 10.90.18.53:50010: bad datanode 10.90.18.47:50010
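The "No GCs detected" line above comes from HBase's JvmPauseMonitor, which sleeps a fixed interval, measures how much longer the sleep actually took, and checks whether the JVM's GC counters advanced during the gap. "No GCs detected" therefore means the ~70s stall happened outside garbage collection; host swapping, CPU starvation, or a hypervisor pause are typical causes. A minimal shell sketch of the same drift-detection idea (not the actual HBase code; it assumes GNU date's millisecond format):

```shell
# Sleep a fixed interval and report if the sleep overran by more than a
# threshold -- the same drift-detection trick JvmPauseMonitor uses.
interval_ms=500
threshold_ms=1000
start=$(date +%s%3N)   # GNU date: epoch milliseconds
sleep 0.5
end=$(date +%s%3N)
drift=$(( end - start - interval_ms ))
if [ "$drift" -gt "$threshold_ms" ]; then
  echo "Detected pause of approximately ${drift}ms"
else
  echo "no pause"
fi
```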
Re: How to optimize the GC in HBase
Posted by 伍照坤 <to...@gmail.com>.
I met a similar problem before. I suggest you dig out the GC log and the CPU
usage log.
My experience:
#1 Saw GC taking too long, which caused a ZooKeeper timeout; the region server
then determined itself to be a dead node and shut itself down.
#2 Optimized the GC and monitored the GC log. It still happened again.
#3 Finally found the reason: the region server was too busy and CPU usage was
too high, which caused all the processes to slow down or block.
Hope this is useful.
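The timeout in step #1 is governed by zookeeper.session.timeout on the HBase side, bounded by the ZooKeeper server's own session limits. Raising it only buys headroom while the pauses themselves are being fixed. An illustrative hbase-site.xml fragment, value for example only:

```xml
<!-- hbase-site.xml: illustrative value only; the effective timeout is also
     capped by the ZooKeeper server's maxSessionTimeout setting. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
</property>
```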
On Monday, November 30, 2015, Stack <st...@duboce.net> wrote:
> On Sat, Nov 28, 2015 at 4:44 PM, 聪聪 <175998806@qq.com <javascript:;>>
> wrote:
>
> > hi,all:
> > This morning regionserver shutdown.By analyzing the log, I guess it's
> > about GC.Then,How to optimize the GC in HBase.
> > I use the HBase version is hbase-0.98.6-cdh5.2.0.I look for your help in
> > this.
> >
> >
> > JVM configuration is as follows:
> > export HBASE_REGIONSERVER_OPTS="-Xms92g -Xmx92g -XX:PermSize=256m
> > -XX:MaxPermSize=256m -XX:+UseG1GC -server -XX:+DisableExplicitGC
> > -XX:+UseFastAccessorMethods -XX:SoftRefLRUPolicyMSPerMB=0
> > -XX:G1ReservePercent=15 -XX:InitiatingHeapOccupancyPercent=40
> > -XX:ConcGCThreads=18 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB
> > -XX:ParallelGCThreads=18 -XX:+PrintClassHistogram -XX:+PrintGCDetails
> > -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
> >
> -Xloggc:/home/q/hbase/hbase-0.98.6-cdh5.2.0/logs/gc-$(hostname)-hbase.log"
> >
> >
> What the others have said.
>
> How did you arrive at the above configurations?
>
> Looks like you are keeping the gc log in the logs dir at
> gc-HOSTNAME-hbase.log. Looks like your GC output should be rich with lots
> of helpful detail.
>
> What is your loading like? All reads? All writes? Big objects? A mix?
>
> St.Ack
>
>
> >
> > Regionserver on the log is as follows:
> > 2015-11-29 00:48:30,521 DEBUG
> > [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
> > Compacting
> >
> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/2becc4de83bc4102802a4f69adfba433,
> > keycount=1244806, bloomtype=ROW, size=462.2 M, encoding=NONE,
> seqNum=22878
> > 2015-11-29 00:48:30,521 DEBUG
> > [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
> > Compacting
> >
> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/52669d6cf3a34dc89af6faf23ea03fa0,
> > keycount=613700, bloomtype=ROW, size=236.4 M, encoding=NONE, seqNum=22885
> > 2015-11-29 00:48:30,521 DEBUG
> > [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
> > Compacting
> >
> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/1ff32b168134412ea3f6942126623ef6,
> > keycount=373741, bloomtype=ROW, size=143.9 M, encoding=NONE, seqNum=22889
> > 2015-11-29 00:48:30,521 DEBUG
> > [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
> > Compacting
> >
> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/b993875f0f384597900d2a2d3e69fde8,
> > keycount=136626, bloomtype=ROW, size=49.6 M, encoding=NONE, seqNum=22891
> > 2015-11-29 00:49:46,912 WARN [regionserver60020.periodicFlusher]
> > util.Sleeper: We slept 71814ms instead of 10000ms, this is likely due to
> a
> > long garbage collecting pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2015-11-29 00:49:46,912 INFO [
> SplitLogWorker-l-hbase27.dba.cn2.qunar.com
> > ,60020,1447234715888-SendThread(l-zk2.dba.cn6.qunar.com:2181)]
> > zookeeper.ClientCnxn: Client session timed out, have not heard from
> server
> > in 77297ms for sessionid 0x150ad3cf038d6e6, closing socket connection and
> > attempting reconnect
> > 2015-11-29 00:49:46,912 WARN [regionserver60020.compactionChecker]
> > util.Sleeper: We slept 74029ms instead of 10000ms, this is likely due to
> a
> > long garbage collecting pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2015-11-29 00:49:46,912 INFO [regionserver60020-SendThread(
> > l-zk1.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session
> timed
> > out, have not heard from server in 74953ms for sessionid
> 0x34f16f641cc2e77,
> > closing socket connection and attempting reconnect
> > 2015-11-29 00:49:46,913 WARN [JvmPauseMonitor] util.JvmPauseMonitor:
> > Detected pause in JVM or host machine (eg GC): pause of approximately
> > 70587ms
> > No GCs detected
> > 2015-11-29 00:49:46,913 WARN [regionserver60020] util.Sleeper: We slept
> > 72658ms instead of 3000ms, this is likely due to a long garbage
> collecting
> > pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(
> > l-zk4.dba.cn8.qunar.com:2181)] zookeeper.ClientCnxn: Client session
> timed
> > out, have not heard from server in 81433ms for sessionid
> 0x54f16f4aa3363fa,
> > closing socket connection and attempting reconnect
> > 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(
> > l-zk2.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session
> timed
> > out, have not heard from server in 79798ms for sessionid
> 0x54f16f4aa3363fe,
> > closing socket connection and attempting reconnect
> > 2015-11-29 00:49:46,922 WARN [ResponseProcessor for block
> > BP-329468547-10.90.18.19-1444480193809:blk_1078012346_4271549]
> > hdfs.DFSClient: Slow ReadProcessor read fields took 72827ms
> > (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR
> > downstreamAckTimeNanos: 0, targets: [10.90.18.47:50010,
> 10.90.18.60:50010,
> > 10.90.18.63:50010]
> > 2015-11-29 00:49:46,921 WARN [ResponseProcessor for block
> > BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823]
> > hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block
> > BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823
> > java.io.EOFException: Premature EOF: no length prefix available
> > at
> >
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
> > at
> >
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
> > at
> >
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
> > 2015-11-29 00:49:46,914 DEBUG [LruStats #0] hfile.LruBlockCache:
> > Total=20.98 MB, free=18.38 GB, max=18.40 GB, blocks=19756849152,
> > accesses=564477317, hits=166122280, hitRatio=29.43%, ,
> > cachingAccesses=166131865, cachingHits=166015128,
> cachingHitsRatio=99.93%,
> > evictions=149453, evicted=116562, evictedPerRun=0.7799240946769714
> > 2015-11-29 00:49:46,929 WARN [DataStreamer for file
> /hbase/tc_qtrace/WALs/
> > l-hbase27.dba.cn2.qunar.com,60020,1447234715888/
> > l-hbase27.dba.cn2.qunar.com%2C60020%2C1447234715888.1448726130837.meta
> > block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823]
> > hdfs.DFSClient: Error Recovery for block
> > BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823 in pipeline
> > 10.90.18.47:50010, 10.90.18.61:50010, 10.90.18.53:50010: bad datanode
> > 10.90.18.47:50010
>
--
Sincerely.
伍 涛 | Tony Wu
Re: How to optimize the GC in HBase
Posted by Stack <st...@duboce.net>.
On Sat, Nov 28, 2015 at 4:44 PM, 聪聪 <17...@qq.com> wrote:
> hi,all:
> This morning regionserver shutdown.By analyzing the log, I guess it's
> about GC.Then,How to optimize the GC in HBase.
> I use the HBase version is hbase-0.98.6-cdh5.2.0.I look for your help in
> this.
>
>
> JVM configuration is as follows:
> export HBASE_REGIONSERVER_OPTS="-Xms92g -Xmx92g -XX:PermSize=256m
> -XX:MaxPermSize=256m -XX:+UseG1GC -server -XX:+DisableExplicitGC
> -XX:+UseFastAccessorMethods -XX:SoftRefLRUPolicyMSPerMB=0
> -XX:G1ReservePercent=15 -XX:InitiatingHeapOccupancyPercent=40
> -XX:ConcGCThreads=18 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB
> -XX:ParallelGCThreads=18 -XX:+PrintClassHistogram -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
> -Xloggc:/home/q/hbase/hbase-0.98.6-cdh5.2.0/logs/gc-$(hostname)-hbase.log"
>
>
What the others have said.
How did you arrive at the above configurations?
Looks like you are keeping the gc log in the logs dir at
gc-HOSTNAME-hbase.log. Looks like your GC output should be rich with lots
of helpful detail.
What is your loading like? All reads? All writes? Big objects? A mix?
St.Ack
>
> Regionserver on the log is as follows:
> 2015-11-29 00:48:30,521 DEBUG
> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
> Compacting
> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/2becc4de83bc4102802a4f69adfba433,
> keycount=1244806, bloomtype=ROW, size=462.2 M, encoding=NONE, seqNum=22878
> 2015-11-29 00:48:30,521 DEBUG
> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
> Compacting
> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/52669d6cf3a34dc89af6faf23ea03fa0,
> keycount=613700, bloomtype=ROW, size=236.4 M, encoding=NONE, seqNum=22885
> 2015-11-29 00:48:30,521 DEBUG
> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
> Compacting
> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/1ff32b168134412ea3f6942126623ef6,
> keycount=373741, bloomtype=ROW, size=143.9 M, encoding=NONE, seqNum=22889
> 2015-11-29 00:48:30,521 DEBUG
> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
> Compacting
> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/b993875f0f384597900d2a2d3e69fde8,
> keycount=136626, bloomtype=ROW, size=49.6 M, encoding=NONE, seqNum=22891
> 2015-11-29 00:49:46,912 WARN [regionserver60020.periodicFlusher]
> util.Sleeper: We slept 71814ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-11-29 00:49:46,912 INFO [SplitLogWorker-l-hbase27.dba.cn2.qunar.com
> ,60020,1447234715888-SendThread(l-zk2.dba.cn6.qunar.com:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server
> in 77297ms for sessionid 0x150ad3cf038d6e6, closing socket connection and
> attempting reconnect
> 2015-11-29 00:49:46,912 WARN [regionserver60020.compactionChecker]
> util.Sleeper: We slept 74029ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-11-29 00:49:46,912 INFO [regionserver60020-SendThread(
> l-zk1.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed
> out, have not heard from server in 74953ms for sessionid 0x34f16f641cc2e77,
> closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,913 WARN [JvmPauseMonitor] util.JvmPauseMonitor:
> Detected pause in JVM or host machine (eg GC): pause of approximately
> 70587ms
> No GCs detected
> 2015-11-29 00:49:46,913 WARN [regionserver60020] util.Sleeper: We slept
> 72658ms instead of 3000ms, this is likely due to a long garbage collecting
> pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(
> l-zk4.dba.cn8.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed
> out, have not heard from server in 81433ms for sessionid 0x54f16f4aa3363fa,
> closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(
> l-zk2.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed
> out, have not heard from server in 79798ms for sessionid 0x54f16f4aa3363fe,
> closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,922 WARN [ResponseProcessor for block
> BP-329468547-10.90.18.19-1444480193809:blk_1078012346_4271549]
> hdfs.DFSClient: Slow ReadProcessor read fields took 72827ms
> (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR
> downstreamAckTimeNanos: 0, targets: [10.90.18.47:50010, 10.90.18.60:50010,
> 10.90.18.63:50010]
> 2015-11-29 00:49:46,921 WARN [ResponseProcessor for block
> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823]
> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block
> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823
> java.io.EOFException: Premature EOF: no length prefix available
> at
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
> 2015-11-29 00:49:46,914 DEBUG [LruStats #0] hfile.LruBlockCache:
> Total=20.98 MB, free=18.38 GB, max=18.40 GB, blocks=19756849152,
> accesses=564477317, hits=166122280, hitRatio=29.43%, ,
> cachingAccesses=166131865, cachingHits=166015128, cachingHitsRatio=99.93%,
> evictions=149453, evicted=116562, evictedPerRun=0.7799240946769714
> 2015-11-29 00:49:46,929 WARN [DataStreamer for file /hbase/tc_qtrace/WALs/
> l-hbase27.dba.cn2.qunar.com,60020,1447234715888/
> l-hbase27.dba.cn2.qunar.com%2C60020%2C1447234715888.1448726130837.meta
> block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823]
> hdfs.DFSClient: Error Recovery for block
> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823 in pipeline
> 10.90.18.47:50010, 10.90.18.61:50010, 10.90.18.53:50010: bad datanode
> 10.90.18.47:50010
Re: How to optimize the GC in HBase
Posted by Samir Ahmic <ah...@gmail.com>.
Hi,
What sort of load did you have on the regionserver when the server died? I see
the exceptions happen around compaction and flush operations. Did you check the
server's CPU, disks, memory, network...? If the pause was not caused by GC, the
root cause may be somewhere else. There are also HDFS exceptions in the logs,
so try checking HDFS too.
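When the JVM reports a pause but "No GCs detected", host-level checks like these are the next step. A couple of illustrative Linux probes (assuming /proc is available; swap-in activity during the pause window is the classic culprit for a multi-second stall on a 92g heap):

```shell
# Is any swap configured or in use? A 92g heap that touches swap can stall
# for tens of seconds at a time.
grep -E '^Swap(Total|Free):' /proc/meminfo

# Run-queue pressure: sustained load far above the core count means GC and
# application threads are fighting for CPU.
cat /proc/loadavg
```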
Best regards
Samir
On Nov 30, 2015 11:43 AM, "吴国泉" <wg...@qunar.com> wrote:
> hi:
> I also met this problem; cong,pu is my partner.
> The HBase regionserver is configured with 92G of RAM.
> Sometimes the regionserver would crash. I checked the log, but no GC
> had happened, which is very strange.
> Here is part of the regionserver log:
>
>
>
> 2015-11-29 00:48:30,521 DEBUG
> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
> Compacting
> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/b993875f0f384597900d2a2d3e69fde8,
> keycount=136626, bloomtype=ROW, size=49.6 M, encoding=NONE, seqNum=22891
> 2015-11-29 00:49:46,912 WARN [regionserver60020.periodicFlusher]
> util.Sleeper: We slept 71814ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-11-29 00:49:46,912 INFO [SplitLogWorker-l-hbase27.dba.cn2.qunar.com
> ,60020,1447234715888-SendThread(l-zk2.dba.cn6.qunar.com:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server
> in 77297ms for sessionid 0x150ad3cf038d6e6, closing socket connection and
> attempting reconnect
> 2015-11-29 00:49:46,912 WARN [regionserver60020.compactionChecker]
> util.Sleeper: We slept 74029ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-11-29 00:49:46,912 INFO [regionserver60020-SendThread(
> l-zk1.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed
> out, have not heard from server in 74953ms for sessionid 0x34f16f641cc2e77,
> closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,913 WARN [JvmPauseMonitor] util.JvmPauseMonitor:
> Detected pause in JVM or host machine (eg GC): pause of approximately
> 70587ms
> No GCs detected
> 2015-11-29 00:49:46,913 WARN [regionserver60020] util.Sleeper: We slept
> 72658ms instead of 3000ms, this is likely due to a long garbage collecting
> pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(
> l-zk4.dba.cn8.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed
> out, have not heard from server in 81433ms for sessionid 0x54f16f4aa3363fa,
> closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(
> l-zk2.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed
> out, have not heard from server in 79798ms for sessionid 0x54f16f4aa3363fe,
> closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,922 WARN [ResponseProcessor for block
> BP-329468547-10.90.18.19-1444480193809:blk_1078012346_4271549]
> hdfs.DFSClient: Slow ReadProcessor read fields took 72827ms
> (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR
> downstreamAckTimeNanos: 0, targets: [10.90.18.47:50010, 10.90.18.60:50010,
> 10.90.18.63:50010]
> 2015-11-29 00:49:46,921 WARN [ResponseProcessor for block
> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823]
> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block
> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823
> java.io.EOFException: Premature EOF: no length prefix available
> at
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
>
>
> As we can see, the log says "We slept 74029ms instead of
> 10000ms, this is likely due to a long garbage collecting pause", but it is
> followed by "No GCs detected";
> this confused me. Can you give me some advice?
> I use Java 7 with the G1 garbage collector.
>
> On November 29, 2015 at 08:44, 聪聪 wrote:
>
>> hi,all:
>> This morning regionserver shutdown.By analyzing the log, I guess it's
>> about GC.Then,How to optimize the GC in HBase.
>> I use the HBase version is hbase-0.98.6-cdh5.2.0.I look for your help in
>> this.
>>
>>
>> JVM configuration is as follows:
>> export HBASE_REGIONSERVER_OPTS="-Xms92g -Xmx92g -XX:PermSize=256m
>> -XX:MaxPermSize=256m -XX:+UseG1GC -server -XX:+DisableExplicitGC
>> -XX:+UseFastAccessorMethods -XX:SoftRefLRUPolicyMSPerMB=0
>> -XX:G1ReservePercent=15 -XX:InitiatingHeapOccupancyPercent=40
>> -XX:ConcGCThreads=18 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB
>> -XX:ParallelGCThreads=18 -XX:+PrintClassHistogram -XX:+PrintGCDetails
>> -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
>> -Xloggc:/home/q/hbase/hbase-0.98.6-cdh5.2.0/logs/gc-$(hostname)-hbase.log"
>>
>>
>> Regionserver on the log is as follows:
>> 2015-11-29 00:48:30,521 DEBUG
>> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
>> Compacting
>> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/2becc4de83bc4102802a4f69adfba433,
>> keycount=1244806, bloomtype=ROW, size=462.2 M, encoding=NONE, seqNum=22878
>> 2015-11-29 00:48:30,521 DEBUG
>> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
>> Compacting
>> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/52669d6cf3a34dc89af6faf23ea03fa0,
>> keycount=613700, bloomtype=ROW, size=236.4 M, encoding=NONE, seqNum=22885
>> 2015-11-29 00:48:30,521 DEBUG
>> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
>> Compacting
>> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/1ff32b168134412ea3f6942126623ef6,
>> keycount=373741, bloomtype=ROW, size=143.9 M, encoding=NONE, seqNum=22889
>> 2015-11-29 00:48:30,521 DEBUG
>> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
>> Compacting
>> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/b993875f0f384597900d2a2d3e69fde8,
>> keycount=136626, bloomtype=ROW, size=49.6 M, encoding=NONE, seqNum=22891
>> 2015-11-29 00:49:46,912 WARN [regionserver60020.periodicFlusher]
>> util.Sleeper: We slept 71814ms instead of 10000ms, this is likely due to a
>> long garbage collecting pause and it's usually bad, see
>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> 2015-11-29 00:49:46,912 INFO [SplitLogWorker-l-hbase27.dba.cn2.qunar.com
>> ,60020,1447234715888-SendThread(l-zk2.dba.cn6.qunar.com:2181)]
>> zookeeper.ClientCnxn: Client session timed out, have not heard from server
>> in 77297ms for sessionid 0x150ad3cf038d6e6, closing socket connection and
>> attempting reconnect
>> 2015-11-29 00:49:46,912 WARN [regionserver60020.compactionChecker]
>> util.Sleeper: We slept 74029ms instead of 10000ms, this is likely due to a
>> long garbage collecting pause and it's usually bad, see
>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> 2015-11-29 00:49:46,912 INFO [regionserver60020-SendThread(
>> l-zk1.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session
>> timed out, have not heard from server in 74953ms for sessionid
>> 0x34f16f641cc2e77, closing socket connection and attempting reconnect
>> 2015-11-29 00:49:46,913 WARN [JvmPauseMonitor] util.JvmPauseMonitor:
>> Detected pause in JVM or host machine (eg GC): pause of approximately
>> 70587ms
>> No GCs detected
>> 2015-11-29 00:49:46,913 WARN [regionserver60020] util.Sleeper: We slept
>> 72658ms instead of 3000ms, this is likely due to a long garbage collecting
>> pause and it's usually bad, see
>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(
>> l-zk4.dba.cn8.qunar.com:2181)] zookeeper.ClientCnxn: Client session
>> timed out, have not heard from server in 81433ms for sessionid
>> 0x54f16f4aa3363fa, closing socket connection and attempting reconnect
>> 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(
>> l-zk2.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session
>> timed out, have not heard from server in 79798ms for sessionid
>> 0x54f16f4aa3363fe, closing socket connection and attempting reconnect
>> 2015-11-29 00:49:46,922 WARN [ResponseProcessor for block
>> BP-329468547-10.90.18.19-1444480193809:blk_1078012346_4271549]
>> hdfs.DFSClient: Slow ReadProcessor read fields took 72827ms
>> (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR
>> downstreamAckTimeNanos: 0, targets: [10.90.18.47:50010, 10.90.18.60:50010,
>> 10.90.18.63:50010]
>> 2015-11-29 00:49:46,921 WARN [ResponseProcessor for block
>> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823]
>> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block
>> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823
>> java.io.EOFException: Premature EOF: no length prefix available
>> at
>> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
>> at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
>> 2015-11-29 00:49:46,914 DEBUG [LruStats #0] hfile.LruBlockCache:
>> Total=20.98 MB, free=18.38 GB, max=18.40 GB, blocks=19756849152,
>> accesses=564477317, hits=166122280, hitRatio=29.43%, ,
>> cachingAccesses=166131865, cachingHits=166015128, cachingHitsRatio=99.93%,
>> evictions=149453, evicted=116562, evictedPerRun=0.7799240946769714
>> 2015-11-29 00:49:46,929 WARN [DataStreamer for file
>> /hbase/tc_qtrace/WALs/l-hbase27.dba.cn2.qunar.com,60020,1447234715888/
>> l-hbase27.dba.cn2.qunar.com%2C60020%2C1447234715888.1448726130837.meta
>> block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823]
>> hdfs.DFSClient: Error Recovery for block
>> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823 in pipeline
>> 10.90.18.47:50010, 10.90.18.61:50010, 10.90.18.53:50010: bad datanode
>> 10.90.18.47:50010
>>
>>
>> Security notice: this email did not originate inside QUNAR. Please take care to protect personal and company information; if the sender asks for account passwords or anything else suspicious, please report it by email to secteam.
>>
>
> --
> *吴国泉 wgq.wu*
> ----------------------------------------------------
> *Position*: DBA-HBase, Technology Department
> *Email*:wgq.wu@qunar.com
> *Mob*:13036134151
> *Location*: Room 1708, China Electronics Building, 3 Danling Street, Haidian District, Beijing
> *Post Code*:100080
>
Re: How to optimize the GC in HBase
Posted by 张铎 <pa...@gmail.com>.
2015-11-29 00:48:30,521 DEBUG
2015-11-29 00:49:46,912 WARN
There seems to be no log entry for about 76 seconds, so I think there was a
Full GC. Have you turned on the GC log? Please check the GC log file (maybe
xxx.out) to see if there was a long-paused GC.
Also, what are your GC options along with the G1 garbage collector? I once ran
into a problem called "Humongous Objects" which has a bad impact on the
performance of G1 GC. So if you are not experts on G1 GC, I suggest you switch
back to CMS in the production environment first.
Thanks.
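On the humongous-object point: G1 places any object of at least half a region size directly into contiguous "humongous" regions, and with a 92g heap G1 will already be at its maximum 32m region size, so allocations of 16m or more qualify. When the GC cause is included in the log, such pauses are greppable; an illustrative check against a fabricated sample line:

```shell
# Fabricated sample line; real logs name the cause when the collection was
# triggered by a humongous allocation.
cat > /tmp/g1-sample.log <<'EOF'
2015-11-29T00:49:40.000+0800: 12443.2: [GC pause (G1 Humongous Allocation) (young) (initial-mark), 0.2345 secs]
EOF

# Count pauses attributed to humongous allocations.
grep -c 'G1 Humongous Allocation' /tmp/g1-sample.log
```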
2015-11-30 18:42 GMT+08:00 吴国泉 <wg...@qunar.com>:
> hi:
> I also met this problem; cong,pu is my partner.
> The HBase regionserver is configured with 92G of RAM.
> Sometimes the regionserver would crash. I checked the log, but no GC
> had happened, which is very strange.
> Here is part of the regionserver log:
>
>
>
> 2015-11-29 00:48:30,521 DEBUG
> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
> Compacting
> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/b993875f0f384597900d2a2d3e69fde8,
> keycount=136626, bloomtype=ROW, size=49.6 M, encoding=NONE, seqNum=22891
> 2015-11-29 00:49:46,912 WARN [regionserver60020.periodicFlusher]
> util.Sleeper: We slept 71814ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-11-29 00:49:46,912 INFO [SplitLogWorker-l-hbase27.dba.cn2.qunar.com
> ,60020,1447234715888-SendThread(l-zk2.dba.cn6.qunar.com:2181)]
> zookeeper.ClientCnxn: Client session timed out, have not heard from server
> in 77297ms for sessionid 0x150ad3cf038d6e6, closing socket connection and
> attempting reconnect
> 2015-11-29 00:49:46,912 WARN [regionserver60020.compactionChecker]
> util.Sleeper: We slept 74029ms instead of 10000ms, this is likely due to a
> long garbage collecting pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-11-29 00:49:46,912 INFO [regionserver60020-SendThread(
> l-zk1.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed
> out, have not heard from server in 74953ms for sessionid 0x34f16f641cc2e77,
> closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,913 WARN [JvmPauseMonitor] util.JvmPauseMonitor:
> Detected pause in JVM or host machine (eg GC): pause of approximately
> 70587ms
> No GCs detected
> 2015-11-29 00:49:46,913 WARN [regionserver60020] util.Sleeper: We slept
> 72658ms instead of 3000ms, this is likely due to a long garbage collecting
> pause and it's usually bad, see
> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(
> l-zk4.dba.cn8.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed
> out, have not heard from server in 81433ms for sessionid 0x54f16f4aa3363fa,
> closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(
> l-zk2.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed
> out, have not heard from server in 79798ms for sessionid 0x54f16f4aa3363fe,
> closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,922 WARN [ResponseProcessor for block
> BP-329468547-10.90.18.19-1444480193809:blk_1078012346_4271549]
> hdfs.DFSClient: Slow ReadProcessor read fields took 72827ms
> (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR
> downstreamAckTimeNanos: 0, targets: [10.90.18.47:50010, 10.90.18.60:50010,
> 10.90.18.63:50010]
> 2015-11-29 00:49:46,921 WARN [ResponseProcessor for block
> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823]
> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block
> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823
> java.io.EOFException: Premature EOF: no length prefix available
> at
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
> at
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
>
>
> As we can see, the log says "We slept 74029ms instead of 10000ms, this
> is likely due to a long garbage collecting pause", but right after it
> says "No GCs detected".
> This confuses me. Can you give me some advice?
> I use Java 7 with the G1 garbage collector.
>
>
>
> On 2015-11-29 08:44, 聪聪 wrote:
>
>> hi,all:
>> This morning a regionserver shut down. From the log, I guess it is
>> GC-related, so how can I optimize GC in HBase?
>> The HBase version I use is hbase-0.98.6-cdh5.2.0. I am looking for
>> your help with this.
>>
>>
>> JVM configuration is as follows:
>> export HBASE_REGIONSERVER_OPTS="-Xms92g -Xmx92g -XX:PermSize=256m
>> -XX:MaxPermSize=256m -XX:+UseG1GC -server -XX:+DisableExplicitGC
>> -XX:+UseFastAccessorMethods -XX:SoftRefLRUPolicyMSPerMB=0
>> -XX:G1ReservePercent=15 -XX:InitiatingHeapOccupancyPercent=40
>> -XX:ConcGCThreads=18 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB
>> -XX:ParallelGCThreads=18 -XX:+PrintClassHistogram -XX:+PrintGCDetails
>> -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC
>> -Xloggc:/home/q/hbase/hbase-0.98.6-cdh5.2.0/logs/gc-$(hostname)-hbase.log"
>>
>>
>> Regionserver on the log is as follows:
>> 2015-11-29 00:48:30,521 DEBUG
>> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
>> Compacting
>> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/2becc4de83bc4102802a4f69adfba433,
>> keycount=1244806, bloomtype=ROW, size=462.2 M, encoding=NONE, seqNum=22878
>> 2015-11-29 00:48:30,521 DEBUG
>> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
>> Compacting
>> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/52669d6cf3a34dc89af6faf23ea03fa0,
>> keycount=613700, bloomtype=ROW, size=236.4 M, encoding=NONE, seqNum=22885
>> 2015-11-29 00:48:30,521 DEBUG
>> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
>> Compacting
>> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/1ff32b168134412ea3f6942126623ef6,
>> keycount=373741, bloomtype=ROW, size=143.9 M, encoding=NONE, seqNum=22889
>> 2015-11-29 00:48:30,521 DEBUG
>> [regionserver60020-smallCompactions-1447234955800] compactions.Compactor:
>> Compacting
>> hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/b993875f0f384597900d2a2d3e69fde8,
>> keycount=136626, bloomtype=ROW, size=49.6 M, encoding=NONE, seqNum=22891
>> 2015-11-29 00:49:46,912 WARN [regionserver60020.periodicFlusher]
>> util.Sleeper: We slept 71814ms instead of 10000ms, this is likely due to a
>> long garbage collecting pause and it's usually bad, see
>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> 2015-11-29 00:49:46,912 INFO [SplitLogWorker-l-hbase27.dba.cn2.qunar.com
>> ,60020,1447234715888-SendThread(l-zk2.dba.cn6.qunar.com:2181)]
>> zookeeper.ClientCnxn: Client session timed out, have not heard from server
>> in 77297ms for sessionid 0x150ad3cf038d6e6, closing socket connection and
>> attempting reconnect
>> 2015-11-29 00:49:46,912 WARN [regionserver60020.compactionChecker]
>> util.Sleeper: We slept 74029ms instead of 10000ms, this is likely due to a
>> long garbage collecting pause and it's usually bad, see
>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> 2015-11-29 00:49:46,912 INFO [regionserver60020-SendThread(
>> l-zk1.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session
>> timed out, have not heard from server in 74953ms for sessionid
>> 0x34f16f641cc2e77, closing socket connection and attempting reconnect
>> 2015-11-29 00:49:46,913 WARN [JvmPauseMonitor] util.JvmPauseMonitor:
>> Detected pause in JVM or host machine (eg GC): pause of approximately
>> 70587ms
>> No GCs detected
>> 2015-11-29 00:49:46,913 WARN [regionserver60020] util.Sleeper: We slept
>> 72658ms instead of 3000ms, this is likely due to a long garbage collecting
>> pause and it's usually bad, see
>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>> 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(
>> l-zk4.dba.cn8.qunar.com:2181)] zookeeper.ClientCnxn: Client session
>> timed out, have not heard from server in 81433ms for sessionid
>> 0x54f16f4aa3363fa, closing socket connection and attempting reconnect
>> 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(
>> l-zk2.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session
>> timed out, have not heard from server in 79798ms for sessionid
>> 0x54f16f4aa3363fe, closing socket connection and attempting reconnect
>> 2015-11-29 00:49:46,922 WARN [ResponseProcessor for block
>> BP-329468547-10.90.18.19-1444480193809:blk_1078012346_4271549]
>> hdfs.DFSClient: Slow ReadProcessor read fields took 72827ms
>> (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR
>> downstreamAckTimeNanos: 0, targets: [10.90.18.47:50010, 10.90.18.60:50010,
>> 10.90.18.63:50010]
>> 2015-11-29 00:49:46,921 WARN [ResponseProcessor for block
>> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823]
>> hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block
>> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823
>> java.io.EOFException: Premature EOF: no length prefix available
>> at
>> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
>> at
>> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
>> at
>> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
>> 2015-11-29 00:49:46,914 DEBUG [LruStats #0] hfile.LruBlockCache:
>> Total=20.98 MB, free=18.38 GB, max=18.40 GB, blocks=19756849152,
>> accesses=564477317, hits=166122280, hitRatio=29.43%, ,
>> cachingAccesses=166131865, cachingHits=166015128, cachingHitsRatio=99.93%,
>> evictions=149453, evicted=116562, evictedPerRun=0.7799240946769714
>> 2015-11-29 00:49:46,929 WARN [DataStreamer for file
>> /hbase/tc_qtrace/WALs/l-hbase27.dba.cn2.qunar.com,60020,1447234715888/
>> l-hbase27.dba.cn2.qunar.com%2C60020%2C1447234715888.1448726130837.meta
>> block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823]
>> hdfs.DFSClient: Error Recovery for block
>> BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823 in pipeline
>> 10.90.18.47:50010, 10.90.18.61:50010, 10.90.18.53:50010: bad datanode
>> 10.90.18.47:50010
>>
>
Re: How to optimize the GC in HBase
Posted by 吴国泉 <wg...@qunar.com>.
hi:
I met this problem too; cong,pu is my colleague.
Our HBase regionserver is configured with 92 GB of RAM.
Sometimes the regionserver crashes; I checked the log, but no GC was
recorded, which is very strange.
Here is part of the regionserver log:
2015-11-29 00:48:30,521 DEBUG
[regionserver60020-smallCompactions-1447234955800]
compactions.Compactor: Compacting
hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/b993875f0f384597900d2a2d3e69fde8,
keycount=136626, bloomtype=ROW, size=49.6 M, encoding=NONE, seqNum=22891
2015-11-29 00:49:46,912 WARN [regionserver60020.periodicFlusher]
util.Sleeper: We slept 71814ms instead of 10000ms, this is likely due to
a long garbage collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-11-29 00:49:46,912 INFO
[SplitLogWorker-l-hbase27.dba.cn2.qunar.com,60020,1447234715888-SendThread(l-zk2.dba.cn6.qunar.com:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from
server in 77297ms for sessionid 0x150ad3cf038d6e6, closing socket
connection and attempting reconnect
2015-11-29 00:49:46,912 WARN [regionserver60020.compactionChecker]
util.Sleeper: We slept 74029ms instead of 10000ms, this is likely due to
a long garbage collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-11-29 00:49:46,912 INFO
[regionserver60020-SendThread(l-zk1.dba.cn6.qunar.com:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from
server in 74953ms for sessionid 0x34f16f641cc2e77, closing socket
connection and attempting reconnect
2015-11-29 00:49:46,913 WARN [JvmPauseMonitor] util.JvmPauseMonitor:
Detected pause in JVM or host machine (eg GC): pause of approximately
70587ms
No GCs detected
2015-11-29 00:49:46,913 WARN [regionserver60020] util.Sleeper: We slept
72658ms instead of 3000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-11-29 00:49:46,913 INFO
[regionserver60020-SendThread(l-zk4.dba.cn8.qunar.com:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from
server in 81433ms for sessionid 0x54f16f4aa3363fa, closing socket
connection and attempting reconnect
2015-11-29 00:49:46,913 INFO
[regionserver60020-SendThread(l-zk2.dba.cn6.qunar.com:2181)]
zookeeper.ClientCnxn: Client session timed out, have not heard from
server in 79798ms for sessionid 0x54f16f4aa3363fe, closing socket
connection and attempting reconnect
2015-11-29 00:49:46,922 WARN [ResponseProcessor for block
BP-329468547-10.90.18.19-1444480193809:blk_1078012346_4271549]
hdfs.DFSClient: Slow ReadProcessor read fields took 72827ms
(threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR
downstreamAckTimeNanos: 0, targets: [10.90.18.47:50010,
10.90.18.60:50010, 10.90.18.63:50010]
2015-11-29 00:49:46,921 WARN [ResponseProcessor for block
BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823]
hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block
BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823
java.io.EOFException: Premature EOF: no length prefix available
at
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
at
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
As we can see, the log says "We slept 74029ms instead of 10000ms, this
is likely due to a long garbage collecting pause", but right after it
says "No GCs detected".
This confuses me. Can you give me some advice?
I use Java 7 with the G1 garbage collector.
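The combination in the JvmPauseMonitor warning is actually informative: the monitor thread sleeps for a short interval, compares expected against actual elapsed time, and checks whether the JVM's GC counters advanced across the overshoot. "No GCs detected" therefore suggests the ~70 s stall happened outside the collector (host-level: swapping, a frozen VM, heavy I/O) rather than inside G1. A rough sketch of that classification logic, not HBase's actual implementation (all names here are illustrative):

```python
def classify_pause(expected_ms, actual_ms, gc_count_before, gc_count_after,
                   warn_threshold_ms=10_000):
    """Classify one monitoring interval the way a JVM pause monitor would:
    a large overshoot with unchanged GC counters points at a host-level
    stall (swapping, frozen VM) rather than a garbage-collection pause."""
    extra_ms = actual_ms - expected_ms
    if extra_ms <= warn_threshold_ms:
        return None                      # nothing worth warning about
    if gc_count_after > gc_count_before:
        return "pause of approximately %dms (GC pause)" % extra_ms
    return "pause of approximately %dms; No GCs detected" % extra_ms

# The interval from this log: ~70587 ms overshoot, GC counters unchanged.
print(classify_pause(500, 71087, gc_count_before=42, gc_count_after=42))
# → pause of approximately 70587ms; No GCs detected
```

Given that, it may be worth checking swap activity and transparent-huge-page settings on the host around the pause, not only GC tuning.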
On 2015-11-29 08:44, 聪聪 wrote:
> hi,all:
> This morning a regionserver shut down. From the log, I guess it is GC-related, so how can I optimize GC in HBase?
> The HBase version I use is hbase-0.98.6-cdh5.2.0. I am looking for your help with this.
>
>
> JVM configuration is as follows:
> export HBASE_REGIONSERVER_OPTS="-Xms92g -Xmx92g -XX:PermSize=256m -XX:MaxPermSize=256m -XX:+UseG1GC -server -XX:+DisableExplicitGC -XX:+UseFastAccessorMethods -XX:SoftRefLRUPolicyMSPerMB=0 -XX:G1ReservePercent=15 -XX:InitiatingHeapOccupancyPercent=40 -XX:ConcGCThreads=18 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB -XX:ParallelGCThreads=18 -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -Xloggc:/home/q/hbase/hbase-0.98.6-cdh5.2.0/logs/gc-$(hostname)-hbase.log"
>
>
> Regionserver on the log is as follows:
> 2015-11-29 00:48:30,521 DEBUG [regionserver60020-smallCompactions-1447234955800] compactions.Compactor: Compacting hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/2becc4de83bc4102802a4f69adfba433, keycount=1244806, bloomtype=ROW, size=462.2 M, encoding=NONE, seqNum=22878
> 2015-11-29 00:48:30,521 DEBUG [regionserver60020-smallCompactions-1447234955800] compactions.Compactor: Compacting hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/52669d6cf3a34dc89af6faf23ea03fa0, keycount=613700, bloomtype=ROW, size=236.4 M, encoding=NONE, seqNum=22885
> 2015-11-29 00:48:30,521 DEBUG [regionserver60020-smallCompactions-1447234955800] compactions.Compactor: Compacting hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/1ff32b168134412ea3f6942126623ef6, keycount=373741, bloomtype=ROW, size=143.9 M, encoding=NONE, seqNum=22889
> 2015-11-29 00:48:30,521 DEBUG [regionserver60020-smallCompactions-1447234955800] compactions.Compactor: Compacting hdfs://cluster-tc-qtrace:8020/hbase/tc_qtrace/data/default/qtrace/3449dc007b05ec535d5a7e3cf3cafc81/t/b993875f0f384597900d2a2d3e69fde8, keycount=136626, bloomtype=ROW, size=49.6 M, encoding=NONE, seqNum=22891
> 2015-11-29 00:49:46,912 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 71814ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-11-29 00:49:46,912 INFO [SplitLogWorker-l-hbase27.dba.cn2.qunar.com,60020,1447234715888-SendThread(l-zk2.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 77297ms for sessionid 0x150ad3cf038d6e6, closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,912 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 74029ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-11-29 00:49:46,912 INFO [regionserver60020-SendThread(l-zk1.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 74953ms for sessionid 0x34f16f641cc2e77, closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,913 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 70587ms
> No GCs detected
> 2015-11-29 00:49:46,913 WARN [regionserver60020] util.Sleeper: We slept 72658ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(l-zk4.dba.cn8.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 81433ms for sessionid 0x54f16f4aa3363fa, closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,913 INFO [regionserver60020-SendThread(l-zk2.dba.cn6.qunar.com:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 79798ms for sessionid 0x54f16f4aa3363fe, closing socket connection and attempting reconnect
> 2015-11-29 00:49:46,922 WARN [ResponseProcessor for block BP-329468547-10.90.18.19-1444480193809:blk_1078012346_4271549] hdfs.DFSClient: Slow ReadProcessor read fields took 72827ms (threshold=30000ms); ack: seqno: -2 status: SUCCESS status: ERROR downstreamAckTimeNanos: 0, targets: [10.90.18.47:50010, 10.90.18.60:50010, 10.90.18.63:50010]
> 2015-11-29 00:49:46,921 WARN [ResponseProcessor for block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823
> java.io.EOFException: Premature EOF: no length prefix available
> at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
> at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
> 2015-11-29 00:49:46,914 DEBUG [LruStats #0] hfile.LruBlockCache: Total=20.98 MB, free=18.38 GB, max=18.40 GB, blocks=19756849152, accesses=564477317, hits=166122280, hitRatio=29.43%, , cachingAccesses=166131865, cachingHits=166015128, cachingHitsRatio=99.93%, evictions=149453, evicted=116562, evictedPerRun=0.7799240946769714
> 2015-11-29 00:49:46,929 WARN [DataStreamer for file /hbase/tc_qtrace/WALs/l-hbase27.dba.cn2.qunar.com,60020,1447234715888/l-hbase27.dba.cn2.qunar.com%2C60020%2C1447234715888.1448726130837.meta block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823] hdfs.DFSClient: Error Recovery for block BP-329468547-10.90.18.19-1444480193809:blk_1078010620_4269823 in pipeline 10.90.18.47:50010, 10.90.18.61:50010, 10.90.18.53:50010: bad datanode 10.90.18.47:50010
>