Posted to user@hbase.apache.org by Rural Hunter <ru...@gmail.com> on 2014/07/08 08:06:23 UTC

Region server not accept connections intermittently

Hi,

I'm using hbase-0.96.2. I've seen that sometimes my region servers don't 
accept connections from clients. This can last from 10 minutes to half an 
hour. I was not able to connect to port 60020 even with the telnet command 
when it happened. After a while, the problem disappeared and the region 
server behaved normally. My region server is configured with an 8G heap 
and the usage always stays around 3-5G. I checked the log and don't see 
any errors. The log seems normal:

2014-07-08 10:13:26,192 INFO  [MemStoreFlusher.0] 
regionserver.DefaultStoreFlusher: Flushed, sequenceid=1450565470, 
memsize=937.9 K, hasBloomFilter=true, into tmp file 
hdfs://master:9000/hbase/data/default/tbdoc/f7635647bcae9291971843131c4add0b/.tmp/d1a4e8fa945444958141dcdaa07320ab
2014-07-08 10:13:26,208 INFO  [MemStoreFlusher.0] regionserver.HStore: 
Added 
hdfs://master:9000/hbase/data/default/tbdoc/f7635647bcae9291971843131c4add0b/cf1/d1a4e8fa945444958141dcdaa07320ab, 
entries=4494, sequenceid=1450565470, filesize=146.8 K
2014-07-08 10:13:26,209 INFO  [MemStoreFlusher.0] regionserver.HRegion: 
Finished memstore flush of ~957.3 K/980232, currentsize=0/0 for region 
tbdoc,,1390820890834.f7635647bcae9291971843131c4add0b. in 40ms, 
sequenceid=1450565470, compaction requested=true
2014-07-08 10:13:26,209 INFO 
[regionserver60020-smallCompactions-1397552664633] regionserver.HRegion: 
Starting compaction on cf1 in region 
tbdoc,,1390820890834.f7635647bcae9291971843131c4add0b.
2014-07-08 10:13:26,209 INFO 
[regionserver60020-smallCompactions-1397552664633] regionserver.HStore: 
Starting compaction of 3 file(s) in cf1 of 
tbdoc,,1390820890834.f7635647bcae9291971843131c4add0b. into 
tmpdir=hdfs://master:9000/hbase/data/default/tbdoc/f7635647bcae9291971843131c4add0b/.tmp, 
totalSize=705.6 K
2014-07-08 10:13:26,390 INFO 
[regionserver60020-smallCompactions-1397552664633] regionserver.HStore: 
Completed compaction of 3 file(s) in cf1 of 
tbdoc,,1390820890834.f7635647bcae9291971843131c4add0b. into 
69b7c19f59994dffbb236faa03dc1090(size=693.8 K), total size for store is 
1.1 G. This selection was in queue for 0sec, and took 0sec to execute.
2014-07-08 10:13:26,390 INFO 
[regionserver60020-smallCompactions-1397552664633] 
regionserver.CompactSplitThread: Completed compaction: Request = 
regionName=tbdoc,,1390820890834.f7635647bcae9291971843131c4add0b., 
storeName=cf1, fileCount=3, fileSize=705.6 K, priority=6, 
time=47043081604802310; duration=0sec
2014-07-08 10:29:06,383 INFO  [regionserver60020.periodicFlusher] 
regionserver.HRegionServer: regionserver60020.periodicFlusher requesting 
flush for region 
tbguid,781e82be88ce94de7bb2dee9f3436e97,1392692827380.ded050ffeacd83613a0e3db7baaee4f0. 
after a delay of 10113
2014-07-08 10:29:16,383 INFO  [regionserver60020.periodicFlusher] 
regionserver.HRegionServer: regionserver60020.periodicFlusher requesting 
flush for region 
tbguid,781e82be88ce94de7bb2dee9f3436e97,1392692827380.ded050ffeacd83613a0e3db7baaee4f0. 
after a delay of 3456
2014-07-08 10:29:16,537 INFO  [MemStoreFlusher.0] 
regionserver.DefaultStoreFlusher: Flushed, sequenceid=1450574203, 
memsize=425.0 K, hasBloomFilter=true, into tmp file 
hdfs://master:9000/hbase/data/default/tbguid/ded050ffeacd83613a0e3db7baaee4f0/.tmp/cabc231951cb473695c3bac045456cee
2014-07-08 10:29:16,549 INFO  [MemStoreFlusher.0] regionserver.HStore: 
Added 
hdfs://master:9000/hbase/data/default/tbguid/ded050ffeacd83613a0e3db7baaee4f0/cf1/cabc231951cb473695c3bac045456cee, 
entries=1984, sequenceid=1450574203, filesize=83.0 K
2014-07-08 10:29:16,549 INFO  [MemStoreFlusher.0] regionserver.HRegion: 
Finished memstore flush of ~429 K/439296, currentsize=0/0 for region 
tbguid,781e82be88ce94de7bb2dee9f3436e97,1392692827380.ded050ffeacd83613a0e3db7baaee4f0. 
in 53ms, sequenceid=1450574203, compaction requested=true
2014-07-08 10:29:16,550 INFO 
[regionserver60020-smallCompactions-1397552664633] regionserver.HRegion: 
Starting compaction on cf1 in region 
tbguid,781e82be88ce94de7bb2dee9f3436e97,1392692827380.ded050ffeacd83613a0e3db7baaee4f0.
2014-07-08 10:29:16,550 INFO 
[regionserver60020-smallCompactions-1397552664633] regionserver.HStore: 
Starting compaction of 3 file(s) in cf1 of 
tbguid,781e82be88ce94de7bb2dee9f3436e97,1392692827380.ded050ffeacd83613a0e3db7baaee4f0. 
into 
tmpdir=hdfs://master:9000/hbase/data/default/tbguid/ded050ffeacd83613a0e3db7baaee4f0/.tmp, 
totalSize=9.0 M
2014-07-08 10:29:17,053 INFO 
[regionserver60020-smallCompactions-1397552664633] regionserver.HStore: 
Completed compaction of 3 file(s) in cf1 of 
tbguid,781e82be88ce94de7bb2dee9f3436e97,1392692827380.ded050ffeacd83613a0e3db7baaee4f0. 
into e923e5ad17c54bf4934ef07e3d0b5496(size=9.0 M), total size for store 
is 582.3 M. This selection was in queue for 0sec, and took 0sec to execute.

I suspected a GC problem, so I checked the GC log and found it kept 
printing entries like this (both when the problem happens and when it doesn't):
2014-07-08T10:26:45.273+0800: 7233757.674: [GC 7233757.674: [ParNew: 
112969K->8702K(118016K), 0.0652040 secs] 5262052K->5160850K(8375552K), 
0.0653760 secs] [Times: user=0.00 sys=0.00, real=0.07 secs]
2014-07-08T10:26:46.836+0800: 7233759.237: [GC 7233759.237: [ParNew: 
113662K->6949K(118016K), 0.0342530 secs] 5265810K->5163641K(8375552K), 
0.0343870 secs] [Times: user=0.00 sys=0.00, real=0.03 secs]
2014-07-08T10:26:48.422+0800: 7233760.823: [GC 7233760.823: [ParNew: 
111909K->4855K(118016K), 0.0269810 secs] 5268601K->5165318K(8375552K), 
0.0271120 secs] [Times: user=0.00 sys=0.00, real=0.03 secs]
2014-07-08T10:26:49.896+0800: 7233762.296: [GC 7233762.296: [ParNew: 
109815K->7721K(118016K), 0.0251810 secs] 5270278K->5168184K(8375552K), 
0.0253300 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:26:51.334+0800: 7233763.734: [GC 7233763.735: [ParNew: 
112681K->8670K(118016K), 0.0233590 secs] 5273144K->5171935K(8375552K), 
0.0237200 secs] [Times: user=0.00 sys=0.00, real=0.03 secs]
2014-07-08T10:26:52.701+0800: 7233765.102: [GC 7233765.102: [ParNew: 
113390K->7880K(118016K), 0.0266700 secs] 5276656K->5175725K(8375552K), 
0.0268250 secs] [Times: user=0.00 sys=0.00, real=0.03 secs]
2014-07-08T10:26:54.166+0800: 7233766.566: [GC 7233766.566: [ParNew: 
112840K->7022K(118016K), 0.0249410 secs] 5280685K->5178575K(8375552K), 
0.0250850 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:26:55.586+0800: 7233767.986: [GC 7233767.987: [ParNew: 
111982K->6807K(118016K), 0.0317690 secs] 5283535K->5182134K(8375552K), 
0.0319390 secs] [Times: user=0.00 sys=0.00, real=0.03 secs]
2014-07-08T10:26:57.006+0800: 7233769.407: [GC 7233769.407: [ParNew: 
111767K->7417K(118016K), 0.0199740 secs] 5287094K->5185610K(8375552K), 
0.0201160 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:26:58.520+0800: 7233770.921: [GC 7233770.921: [ParNew: 
112377K->6285K(118016K), 0.0219640 secs] 5290570K->5188051K(8375552K), 
0.0220970 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:27:00.077+0800: 7233772.478: [GC 7233772.478: [ParNew: 
111245K->8191K(118016K), 0.0215280 secs] 5293011K->5189957K(8375552K), 
0.0216740 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:27:01.571+0800: 7233773.971: [GC 7233773.972: [ParNew: 
113151K->8510K(118016K), 0.0185730 secs] 5294917K->5193731K(8375552K), 
0.0187220 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:27:02.958+0800: 7233775.359: [GC 7233775.359: [ParNew: 
113470K->9384K(118016K), 0.0252470 secs] 5298691K->5197072K(8375552K), 
0.0253900 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:27:04.416+0800: 7233776.817: [GC 7233776.817: [ParNew: 
114344K->6721K(118016K), 0.0203040 secs] 5302032K->5200076K(8375552K), 
0.0204510 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:27:05.891+0800: 7233778.291: [GC 7233778.291: [ParNew: 
111681K->6341K(118016K), 0.0434460 secs] 5305036K->5203021K(8375552K), 
0.0436020 secs] [Times: user=0.00 sys=0.00, real=0.05 secs]
2014-07-08T10:27:07.497+0800: 7233779.898: [GC 7233779.898: [ParNew: 
111301K->9838K(118016K), 0.0177230 secs] 5307981K->5206518K(8375552K), 
0.0178770 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:27:08.904+0800: 7233781.305: [GC 7233781.305: [ParNew: 
114788K->7575K(118016K), 0.0231210 secs] 5311469K->5209993K(8375552K), 
0.0233670 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:27:10.305+0800: 7233782.706: [GC 7233782.706: [ParNew: 
112535K->7616K(118016K), 0.0226150 secs] 5314953K->5213487K(8375552K), 
0.0227620 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:27:11.942+0800: 7233784.343: [GC 7233784.343: [ParNew: 
112576K->7835K(118016K), 0.0244290 secs] 5318447K->5217156K(8375552K), 
0.0245700 secs] [Times: user=0.00 sys=0.00, real=0.03 secs]
2014-07-08T10:27:13.565+0800: 7233785.966: [GC 7233785.966: [ParNew: 
112795K->7254K(118016K), 0.0206670 secs] 5322116K->5220093K(8375552K), 
0.0207760 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:27:15.107+0800: 7233787.507: [GC 7233787.508: [ParNew: 
112101K->5551K(118016K), 0.0213290 secs] 5324940K->5222345K(8375552K), 
0.0214360 secs] [Times: user=0.00 sys=0.00, real=0.02 secs]
2014-07-08T10:27:16.818+0800: 7233789.218: [GC 7233789.218: [ParNew: 
110511K->8559K(118016K), 0.0326800 secs] 5327305K->5225353K(8375552K), 
0.0328540 secs] [Times: user=0.00 sys=0.00, real=0.03 secs]

I also analyzed the GC log and found the max GC pause is only around 
2 seconds, so I don't see any GC problem. Can anyone point out the 
possible cause of this problem?
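
For reference, a crude way to pull the largest real pause times out of a 
log in the format above (a sketch; 'gc.log' stands in for the actual GC 
log file):

    # five largest stop-the-world pauses reported in the 'real=' field
    grep -oE 'real=[0-9.]+ secs' gc.log | sort -t= -k2 -n | tail -5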


Re: Region server not accept connections intermittently

Posted by Rural Hunter <ru...@gmail.com>.
No, I didn't touch ipc.server.listen.queue.size. Anyway, my change 
mitigated the problem, as I stated in another thread: from what I observed 
in the 2 days after the action was taken, the frequency of the problem 
has been reduced. The big improvement is that even when the problem 
happens, the RS can recover in less than 1 minute, compared with 30-60 
minutes before.

On 2014/7/14 12:32, Esteban Gutierrez wrote:
> Hello Rural,
>
> That's interesting. Unless you have changed ipc.server.listen.queue.size in
> the HBase Region Server (and other Hadoop daemons) to a value higher than
> 128, you might have worked around the issue by increasing the listen queue
> (globally) for a service that doesn't explicitly set the queue size. See
> the notes section in http://linux.die.net/man/2/listen
>
> cheers,
> esteban.
>
>
>
> --
> Cloudera, Inc.
>
>
>


Re: Region server not accept connections intermittently

Posted by Rural Hunter <ru...@gmail.com>.
I don't know. I think the other parameter is more important:
net.core.somaxconn=1024 (original 128)
net.ipv4.tcp_synack_retries=2 (original 5)

Since I found many connections in the SYN_RECV state, my purpose in 
changing these 2 parameters was:
net.ipv4.tcp_synack_retries: reduce the waiting time in the SYN_RECV state 
so the queue drains more quickly. I believe that after this change, 
clients will fail faster.
net.core.somaxconn: increase the number of connections allowed in the 
listen queue.

After I made the changes, I kept tracking the netstat output and found 
there were no more SYN_RECV peaks. There were usually only a few 
connections in SYN_RECV.
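
For reference, this is roughly how such a change can be applied and made 
persistent on a typical Linux host (a sketch only; the values are the ones 
mentioned above, and the sysctl.conf path assumes a standard setup):

    # apply immediately on the region server host
    sysctl -w net.core.somaxconn=1024
    sysctl -w net.ipv4.tcp_synack_retries=2

    # persist across reboots
    echo 'net.core.somaxconn = 1024' >> /etc/sysctl.conf
    echo 'net.ipv4.tcp_synack_retries = 2' >> /etc/sysctl.conf
    sysctl -p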

On 2014/7/22 15:45, Esteban Gutierrez wrote:
> Thanks for keeping us updated Rural!
>
> I'm still curious why changing net.core.somaxconn in the kernel helped here
> if you didn't change ipc.server.listen.queue.size. Perhaps that property is
> in hdfs-site.xml or core-site.xml with a higher value?
>
> cheers,
> esteban.
>
>


Re: Region server not accept connections intermittently

Posted by Esteban Gutierrez <es...@cloudera.com>.
Thanks for keeping us updated Rural!

I'm still curious why changing net.core.somaxconn in the kernel helped here
if you didn't change ipc.server.listen.queue.size. Perhaps that property is
in hdfs-site.xml or core-site.xml with a higher value?
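
One quick way to check whether that property is set anywhere (a sketch; 
the config directories are only examples and depend on the installation):

    # search the Hadoop and HBase config directories for the property
    grep -r 'ipc.server.listen.queue.size' /etc/hadoop/conf /etc/hbase/conf 2>/dev/null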

cheers,
esteban.


--
Cloudera, Inc.



On Mon, Jul 21, 2014 at 9:07 PM, Rural Hunter <ru...@gmail.com> wrote:

> Just an update on my result: since HBASE-11277 was applied, I have not seen any
> connection problem for a week. Before, the connection problem occurred almost
> every day.
>
>

Re: Region server not accept connections intermittently

Posted by Rural Hunter <ru...@gmail.com>.
Just an update on my result: since HBASE-11277 was applied, I have not 
seen any connection problem for a week. Before, the connection problem 
occurred almost every day.


Re: Region server not accept connections intermittently

Posted by Esteban Gutierrez <es...@cloudera.com>.
Hello Rural,

That's interesting. Unless you have changed ipc.server.listen.queue.size in
the HBase Region Server (and other Hadoop daemons) to a value higher than
128, you might have worked around the issue by increasing the listen queue
(globally) for a service that doesn't explicitly set the queue size. See
the notes section in http://linux.die.net/man/2/listen

cheers,
esteban.



--
Cloudera, Inc.



On Fri, Jul 11, 2014 at 6:40 PM, Rural Hunter <ru...@gmail.com> wrote:

> The max number of open files has already been set to 32768 for the user running
> hbase/hadoop. I think there would be errors in the log if it were a file
> descriptor problem. The count of connections in the SYN_RECV state is about 100.
> I also checked the source of those connections, and they are from the hosts of my
> hbase clients. Anyway, I just changed these kernel settings:
> net.core.somaxconn=1024 (original 128)
> net.ipv4.tcp_synack_retries=2 (original 5)
>
> I will wait and see if this will happen again.
>
> On 2014/7/12 1:57, Esteban Gutierrez wrote:
>
>  For how long have you noticed those connections? When you say "many", do you
>> mean 1000s? The problem with having too many SYN_RECV connections is that you
>> could end up running out of file descriptors, which makes me wonder what the
>> maximum number of open files you have configured for the RS process is
>> (see the whole root->hbase user->hbase proc chain of ulimits), and whether there
>> could be a rogue client connecting to 60020 (maybe a port scanner?)
>> triggering that problem.
>>
>> cheers,
>> esteban.
>>
>>
>>
>>
>> --
>> Cloudera, Inc.
>>
>>
>>
>> On Fri, Jul 11, 2014 at 3:02 AM, Rural Hunter <ru...@gmail.com>
>> wrote:
>>
>>  One additional piece of info: I ran 'netstat -an |grep 60020' when the problem
>>> happened and saw many connections from remote hosts to local port 60020 in the
>>> "SYN_RECV" state. Not sure if that indicates anything.
>>>
>>>
>

Re: Region server not accept connections intermittently

Posted by Rural Hunter <ru...@gmail.com>.
The max number of open files has already been set to 32768 for the user 
running hbase/hadoop. I think there would be errors in the log if it were 
a file descriptor problem. The count of connections in the SYN_RECV state 
is about 100. I also checked the source of those connections, and they 
are from the hosts of my hbase clients. Anyway, I just changed these 
kernel settings:
net.core.somaxconn=1024 (original 128)
net.ipv4.tcp_synack_retries=2 (original 5)

I will wait and see if this will happen again.

On 2014/7/12 1:57, Esteban Gutierrez wrote:
> For how long have you noticed those connections? When you say "many", do you mean
> 1000s? The problem with having too many SYN_RECV connections is that you could
> end up running out of file descriptors, which makes me wonder what the maximum
> number of open files you have configured for the RS process is
> (see the whole root->hbase user->hbase proc chain of ulimits), and whether there
> could be a rogue client connecting to 60020 (maybe a port scanner?)
> triggering that problem.
>
> cheers,
> esteban.
>
>
>
>
> --
> Cloudera, Inc.
>
>
>
> On Fri, Jul 11, 2014 at 3:02 AM, Rural Hunter <ru...@gmail.com> wrote:
>
>> One additional piece of info: I ran 'netstat -an |grep 60020' when the problem
>> happened and saw many connections from remote hosts to local port 60020 in the
>> "SYN_RECV" state. Not sure if that indicates anything.
>>


Re: Region server not accept connections intermittently

Posted by Esteban Gutierrez <es...@cloudera.com>.
For how long have you noticed those connections? When you say "many", do you mean
1000s? The problem with having too many SYN_RECV connections is that you could
end up running out of file descriptors, which makes me wonder what the maximum
number of open files you have configured for the RS process is
(see the whole root->hbase user->hbase proc chain of ulimits), and whether there
could be a rogue client connecting to 60020 (maybe a port scanner?)
triggering that problem.
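
A quick way to verify the limit actually in effect for the running RS 
process, rather than for a login shell (a sketch; <PID> is the region 
server pid):

    # limits of the live process, not the shell
    grep 'open files' /proc/<PID>/limits
    # number of descriptors currently in use
    ls /proc/<PID>/fd | wc -l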

cheers,
esteban.




--
Cloudera, Inc.



On Fri, Jul 11, 2014 at 3:02 AM, Rural Hunter <ru...@gmail.com> wrote:

> One additional piece of info: I ran 'netstat -an |grep 60020' when the problem
> happened and saw many connections from remote hosts to local port 60020 in the
> "SYN_RECV" state. Not sure if that indicates anything.
>

Re: Region server not accept connections intermittently

Posted by Rural Hunter <ru...@gmail.com>.
One additional piece of info: I ran 'netstat -an |grep 60020' when the problem 
happened and saw many connections from remote hosts to local port 60020 in the 
"SYN_RECV" state. Not sure if that indicates anything.

Re: Region server not accept connections intermittently

Posted by Rural Hunter <ru...@gmail.com>.
Yes, I can take more if needed when it happens next time.

On 2014/7/10 17:11, Ted Yu wrote:
> I noticed the blockSeek() call in HFileReaderV2.
>
> Did you take only one dump during the 20 minute hang ?
>
> Cheers
>
>


Re: Region server not accept connections intermittently

Posted by Ted Yu <yu...@gmail.com>.
I noticed the blockSeek() call in HFileReaderV2. 

Did you take only one dump during the 20 minute hang ?

Cheers

On Jul 10, 2014, at 1:54 AM, Rural Hunter <ru...@gmail.com> wrote:

> I got the dump of the problematic RS from the web UI: http://pastebin.com/4hfhkDUw
> output of "top -H -p <PID>": http://pastebin.com/LtzkScYY
> I also got the output of jstack, but I believe it's already in the dump, so I won't paste it again. This time the hang lasted about 20 minutes.
> 
> 于 2014/7/9 12:48, Esteban Gutierrez 写道:
>> Hi Rural,
>> 
>> Thats interesting. Since you are passing
>> hbase.zookeeper.property.maxClientCnxns does it means that ZK is managed by
>> HBase? If you experience the issue again, can you try to obtain a jstack
>> (as the user that started the hbase process or try from the RS UI if
>> responsive rs:port/dump) as Ted suggested? the output of "top -H -p <PID>"
>> might be useful too where <PID> is the pid of the RS. If you have some
>> metrics monitoring it would be interesting to see how callQueueLength and
>> the blocked threads change over time.
>> 
>> cheers,
>> esteban.
>> 
>> 
>> --
>> Cloudera, Inc.
> 

Re: Region server not accept connections intermittently

Posted by Rural Hunter <ru...@gmail.com>.
I got the dump of the problematic RS from the web UI: 
http://pastebin.com/4hfhkDUw
output of "top -H -p <PID>": http://pastebin.com/LtzkScYY
I also got the output of jstack, but I believe it's already in the dump, 
so I won't paste it again. This time the hang lasted about 20 minutes.

On 2014/7/9 12:48, Esteban Gutierrez wrote:
> Hi Rural,
>
> That's interesting. Since you are passing
> hbase.zookeeper.property.maxClientCnxns, does that mean ZK is managed by
> HBase? If you experience the issue again, can you try to obtain a jstack
> (as the user that started the hbase process, or from the RS UI at
> rs:port/dump if it is responsive) as Ted suggested? The output of
> "top -H -p <PID>" might be useful too, where <PID> is the pid of the RS.
> If you have metrics monitoring, it would be interesting to see how
> callQueueLength and the blocked threads change over time.
>
> cheers,
> esteban.
>
>
> --
> Cloudera, Inc.
>
>


Re: Region server not accept connections intermittently

Posted by Rural Hunter <ru...@gmail.com>.
Hi Esteban,

Yes, I use the ZK managed by HBase. I will try to get the jstack and the 
other info when this happens again.
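
A minimal sketch of what could be captured next time (assumptions: <PID> 
is the region server pid and the RS info port is the default 60030):

    jstack <PID> > rs-jstack.txt            # run as the user that started HBase
    top -b -H -p <PID> -n 1 > rs-top.txt    # one per-thread CPU snapshot
    curl http://localhost:60030/dump > rs-dump.txt   # RS debug dump, if the UI still responds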

On 2014/7/9 12:48, Esteban Gutierrez wrote:
> Hi Rural,
>
> That's interesting. Since you are passing
> hbase.zookeeper.property.maxClientCnxns, does that mean ZK is managed by
> HBase? If you experience the issue again, can you try to obtain a jstack
> (as the user that started the hbase process, or from the RS UI at
> rs:port/dump if it is responsive) as Ted suggested? The output of
> "top -H -p <PID>" might be useful too, where <PID> is the pid of the RS.
> If you have metrics monitoring, it would be interesting to see how
> callQueueLength and the blocked threads change over time.
>
> cheers,
> esteban.
>
>
> --
> Cloudera, Inc.
>


Re: Region server not accept connections intermittently

Posted by Esteban Gutierrez <es...@cloudera.com>.
Hi Rural,

That's interesting. Since you are passing
hbase.zookeeper.property.maxClientCnxns, does that mean ZK is managed by
HBase? If you experience the issue again, can you try to obtain a jstack
(as the user that started the hbase process, or from the RS UI at
rs:port/dump if it is responsive) as Ted suggested? The output of
"top -H -p <PID>" might be useful too, where <PID> is the pid of the RS.
If you have metrics monitoring, it would be interesting to see how
callQueueLength and the blocked threads change over time.

cheers,
esteban.


--
Cloudera, Inc.



On Tue, Jul 8, 2014 at 6:58 PM, Rural Hunter <ru...@gmail.com> wrote:

> No, I used the standard log4j file and there is no network problem
> from the client. I checked the web admin UI and the master still shows the
> slave as working; just the request count is very small (about 10 while
> others are several hundred). I sshed to the slave server and I can see that
> port 60020 is open with the netstat command, but I am not able to telnet to
> the port even on the server itself: it just times out. The situation is the
> same for clients on other servers. After it recovered automatically, I could
> telnet to port 60020 from both the slave server and the other servers.
>
> This is my server configuration: http://pastebin.com/Ks4cCiaE
>
> Client configuration:
>         myConf.set("hbase.zookeeper.quorum", hbaseQuorum);
>         myConf.set("hbase.client.retries.number", "3");
>         myConf.set("hbase.client.pause", "1000");
>         myConf.set("hbase.client.max.perserver.tasks", "10");
>         myConf.set("hbase.client.max.perregion.tasks", "10");
>         myConf.set("hbase.client.ipc.pool.size", "5");
>         myConf.set("zookeeper.recovery.retry", "1");
>
> The error of the client:
> Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException:
> Failed after attempts=3, exceptions:
> Mon Jul 07 19:10:35 CST 2014, org.apache.hadoop.hbase.client.RpcRetryingCaller@69eb9518,
> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for
> channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
> remote=slave2/192.168.2.88:60020]
> Mon Jul 07 19:10:58 CST 2014, org.apache.hadoop.hbase.client.RpcRetryingCaller@69eb9518,
> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for
> channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
> remote=slave2/192.168.2.88:60020]
> Mon Jul 07 19:11:23 CST 2014, org.apache.hadoop.hbase.client.RpcRetryingCaller@69eb9518,
> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for
> channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
> remote=slave2/192.168.2.88:60020]
>
>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:134)
>     at org.apache.hadoop.hbase.client.HTable.delete(HTable.java:831)
>
> On 2014/7/9 1:02, Esteban Gutierrez wrote:
>
>  Hello Rural,
>>
>> It doesn't seem to be a problem with the region server from what I can
>> tell. The RS is not showing any message in the logs about a long pause
>> (unless you have a non-standard log4j.properties file), and if the RS
>> had been in a very long pause due to GC or any other issue, the master
>> should have considered this region server dead, and from the logs it
>> doesn't look like that happened. Have you double-checked from the client
>> side for any connectivity issue to the RS? Can you pastebin the client
>> and the HBase cluster confs?
>>
>> cheers,
>> esteban.
>>
>>
>> --
>> Cloudera, Inc.
>>
>>
>>
>> On Tue, Jul 8, 2014 at 2:14 AM, Rural Hunter <ru...@gmail.com>
>> wrote:
>>
>>  OK, I will try to do that when it happens again. Thanks.
>>>
>>> On 2014/7/8 17:06, Ted Yu wrote:
>>>
>>>   Next time this happens, can you take jstack of the region server and
>>>
>>>> pastebin it ?
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>
>

Re: Region server not accept connections intermittently

Posted by Rural Hunter <ru...@gmail.com>.
No, I used the standard log4j file and there is no network problem 
from the client. I checked the web admin UI and the master still shows 
the slave as working; just the request count is very small (about 10 
while others are several hundred). I sshed to the slave server and I can 
see that port 60020 is open with the netstat command, but I am not able 
to telnet to the port even on the server itself: it just times out. The 
situation is the same for clients on other servers. After it recovered 
automatically, I could telnet to port 60020 from both the slave server 
and the other servers.
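
For the next occurrence, a scriptable variant of the telnet test (a 
sketch; host and port taken from the client error below):

    # exits 0 if the RPC port accepts a connection within 5 seconds
    nc -z -w 5 192.168.2.88 60020 && echo open || echo 'timed out or refused'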

This is my server configuration: http://pastebin.com/Ks4cCiaE

Client configuration:
         myConf.set("hbase.zookeeper.quorum", hbaseQuorum);
         myConf.set("hbase.client.retries.number", "3");
         myConf.set("hbase.client.pause", "1000");
         myConf.set("hbase.client.max.perserver.tasks", "10");
         myConf.set("hbase.client.max.perregion.tasks", "10");
         myConf.set("hbase.client.ipc.pool.size", "5");
         myConf.set("zookeeper.recovery.retry", "1");

The error of the client:
Exception in thread "main" 
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=3, exceptions:
Mon Jul 07 19:10:35 CST 2014, 
org.apache.hadoop.hbase.client.RpcRetryingCaller@69eb9518, 
org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout 
while waiting for channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending 
remote=slave2/192.168.2.88:60020]
Mon Jul 07 19:10:58 CST 2014, 
org.apache.hadoop.hbase.client.RpcRetryingCaller@69eb9518, 
org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout 
while waiting for channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending 
remote=slave2/192.168.2.88:60020]
Mon Jul 07 19:11:23 CST 2014, 
org.apache.hadoop.hbase.client.RpcRetryingCaller@69eb9518, 
org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout 
while waiting for channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending 
remote=slave2/192.168.2.88:60020]

     at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:134)
     at org.apache.hadoop.hbase.client.HTable.delete(HTable.java:831)

On 2014/7/9 1:02, Esteban Gutierrez wrote:
> Hello Rural,
>
> It doesn't seem to be a problem with the region server from what I can
> tell. The RS is not showing any message in the logs about a long pause
> (unless you have a non-standard log4j.properties file), and if the RS
> had been in a very long pause due to GC or any other issue, the master
> should have considered this region server dead, and from the logs it
> doesn't look like that happened. Have you double-checked from the client
> side for any connectivity issue to the RS? Can you pastebin the client
> and the HBase cluster confs?
>
> cheers,
> esteban.
>
>
> --
> Cloudera, Inc.
>
>
>
> On Tue, Jul 8, 2014 at 2:14 AM, Rural Hunter <ru...@gmail.com> wrote:
>
>> OK, I will try to do that when it happens again. Thanks.
>>
>> On 2014/7/8 17:06, Ted Yu wrote:
>>
>>   Next time this happens, can you take jstack of the region server and
>>> pastebin it ?
>>>
>>> Thanks
>>>
>>>
>>>


Re: Region server not accept connections intermittently

Posted by Esteban Gutierrez <es...@cloudera.com>.
Hello Rural,

It doesn't seem to be a problem with the region server from what I can
tell. The RS is not showing any message in the logs about a long pause
(unless you have a non-standard log4j.properties file), and if the RS
had been in a very long pause due to GC or any other issue, the master
should have considered this region server dead, and from the logs it
doesn't look like that happened. Have you double-checked from the client
side for any connectivity issue to the RS? Can you pastebin the client
and the HBase cluster confs?

cheers,
esteban.


--
Cloudera, Inc.



On Tue, Jul 8, 2014 at 2:14 AM, Rural Hunter <ru...@gmail.com> wrote:

> OK, I will try to do that when it happens again. Thanks.
>
> On 2014/7/8 17:06, Ted Yu wrote:
>
>  Next time this happens, can you take jstack of the region server and
>> pastebin it ?
>>
>> Thanks
>>
>>
>>
>

Re: Region server not accept connections intermittently

Posted by Rural Hunter <ru...@gmail.com>.
OK, I will try to do that when it happens again. Thanks.

On 2014/7/8 17:06, Ted Yu wrote:
> Next time this happens, can you take jstack of the region server and pastebin it ?
>
> Thanks
>
>


Re: Region server not accept connections intermittently

Posted by Ted Yu <yu...@gmail.com>.
Next time this happens, can you take jstack of the region server and pastebin it ?

Thanks


Re: Region server not accept connections intermittently

Posted by Rural Hunter <ru...@gmail.com>.
I checked the parameter and it seems to be another GC parameter that 
prints the total time of "stop the world" pauses. So will it help to get 
info about hangs that were not caused by GC?

On 2014/7/8 14:28, 谢良 wrote:
> Could you try the "-XX:+PrintGCApplicationStoppedTime" VM parameter?
> A hang on the VM side is not always caused by GC.
>
> Thanks
>


Re: Region server not accept connections intermittently

Posted by 谢良 <xi...@xiaomi.com>.
Could you try the "-XX:+PrintGCApplicationStoppedTime" VM parameter?
A hang on the VM side is not always caused by GC.
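
For example, a minimal sketch of enabling it (assuming JVM options are 
passed through HBASE_OPTS in conf/hbase-env.sh, followed by a region 
server restart):

    # conf/hbase-env.sh
    export HBASE_OPTS="$HBASE_OPTS -XX:+PrintGCApplicationStoppedTime"

This prints the total time application threads were stopped at each 
safepoint, so pauses that are not caused by GC should also show up in 
the GC log.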

Thanks,
________________________________________
From: Rural Hunter [ruralhunter@gmail.com]
Sent: 2014-07-08 14:06
To: user@hbase.apache.org
Subject: Region server not accept connections intermittently
