Posted to dev@hbase.apache.org by Gaojinchao <ga...@huawei.com> on 2011/05/16 08:45:19 UTC
Thread leak // Hmaster has some warn logs.
While trying to reproduce the issue where sockets stay in CLOSE_WAIT, I found a new issue: a thread leak in the HMaster process.
jstack 7983:
2011-05-16 13:25:42
Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.1-b03 mixed mode):
"IPC Server handler 4 on 60000.logSyncer" daemon prio=10 tid=0x00007f42f065b800 nid=0x502e waiting on condition [0x00007f42c3358000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:962)
count:
jstack 7983 | grep 60000.logSyncer| wc -l
411
In this case, it seems the LogSyncer thread is never stopped:
the HLog created inside the createHRegion API is never closed.
public static HRegion createHRegion(final HRegionInfo info, final Path rootDir,
    final Configuration conf)
throws IOException {
  Path tableDir =
      HTableDescriptor.getTableDir(rootDir, info.getTableDesc().getName());
  Path regionDir = HRegion.getRegionDir(tableDir, info.getEncodedName());
  FileSystem fs = FileSystem.get(conf);
  fs.mkdirs(regionDir);
  HRegion region = HRegion.newHRegion(tableDir,
      new HLog(fs, new Path(regionDir, HConstants.HREGION_LOGDIR_NAME), // this HLog is never closed by createHRegion
          new Path(regionDir, HConstants.HREGION_OLDLOGDIR_NAME), conf),
      fs, conf, info, null);
  region.initialize();
  return region;
}
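A minimal caller-side workaround sketch, assuming the 0.90 API (that HRegion.getLog() returns the HLog built above, and that HLog.closeAndDelete() interrupts and joins the LogSyncer thread):

HRegion region = HRegion.createHRegion(info, rootDir, conf);
try {
  // ... use the region ...
} finally {
  region.close();                   // flush and close the region's stores
  region.getLog().closeAndDelete(); // stop the LogSyncer thread and clean up the log dir
}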
I also found some sockets in CLOSE_WAIT, but I am not sure what is going on yet.
I need to keep digging.
tcp 227 0 157.5.100.1:60000 157.5.100.1:52780 CLOSE_WAIT
tcp 1 0 157.5.100.1:60000 157.5.100.5:56633 CLOSE_WAIT
tcp 233 0 157.5.100.1:60000 157.5.100.4:50772 CLOSE_WAIT
tcp 227 0 157.5.100.1:60000 157.5.100.1:59921 CLOSE_WAIT
tcp 233 0 157.5.100.1:60000 157.5.100.3:37877 CLOSE_WAIT
tcp 233 0 157.5.100.1:60000 157.5.100.5:57556 CLOSE_WAIT
tcp 233 0 157.5.100.1:60000 157.5.100.5:44967 CLOSE_WAIT
tcp 233 0 157.5.100.1:60000 157.5.100.3:45944 CLOSE_WAIT
C4C1:/opt/hbasetools # netstat -an |grep 60000 | grep CLOSE_WAIT | wc -l
140
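For reference, CLOSE_WAIT means the peer has already closed its end but our process has not yet called close() on its socket, so the leak is on the master's side. A minimal self-contained illustration (my own demo code, not HBase code):

import java.net.ServerSocket;
import java.net.Socket;

public class CloseWaitDemo {
  public static void main(String[] args) throws Exception {
    ServerSocket server = new ServerSocket(60000);
    Socket client = new Socket("localhost", 60000);
    Socket accepted = server.accept();
    client.close(); // the peer sends its FIN
    // accepted.close() is never called; while we sleep,
    // "netstat -an | grep 60000" shows the accepted socket in CLOSE_WAIT.
    Thread.sleep(60000);
  }
}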
-----Original Message-----
From: Gaojinchao
Sent: May 13, 2011 16:59
To: user@hbase.apache.org
Subject: Re: Hmaster has some warn logs.
Hi,
I have found a hint: all the sockets are in CLOSE_WAIT.
But I am not sure what is going on; I need to keep digging.
If anyone has experience with this, please share it. Thanks.
20020 (region server port)
# netstat -an | grep 20020
tcp 13 0 157.5.100.13:20020 :::* LISTEN
tcp 178 0 157.5.100.13:20020 157.5.100.11:56169 CLOSE_WAIT
tcp 212 0 157.5.100.13:20020 157.5.100.11:36908 CLOSE_WAIT
tcp 212 0 157.5.100.13:20020 157.5.100.11:38372 CLOSE_WAIT
tcp 1537168 0 157.5.100.13:20020 157.5.100.6:45643 ESTABLISHED
tcp 178 0 157.5.100.13:20020 157.5.100.11:52667 CLOSE_WAIT
tcp 178 0 157.5.100.13:20020 157.5.100.11:48926 CLOSE_WAIT
tcp 212 0 157.5.100.13:20020 157.5.100.11:37150 CLOSE_WAIT
tcp 212 0 157.5.100.13:20020 157.5.100.11:36665 CLOSE_WAIT
tcp 212 0 157.5.100.13:20020 157.5.100.11:36432 CLOSE_WAIT
tcp 178 0 157.5.100.13:20020 157.5.100.11:46395 CLOSE_WAIT
tcp 178 0 157.5.100.13:20020 157.5.100.11:57346 CLOSE_WAIT
tcp 178 0 157.5.100.13:20020 157.5.100.11:47115 CLOSE_WAIT
tcp 178 0 157.5.100.13:20020 157.5.100.11:55164 CLOSE_WAIT
tcp 178 0 157.5.100.13:20020 157.5.100.11:50981 CLOSE_WAIT
tcp 250 0 157.5.100.13:20020 157.5.100.11:36214 CLOSE_WAIT
-----Original Message-----
From: Gaojinchao [mailto:gaojinchao@huawei.com]
Sent: May 13, 2011 8:33
To: user@hbase.apache.org
Subject: Re: Hmaster has some warn logs.
I can't reproduce it; I need to dig through the logs and the code.
-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] on behalf of Jean-Daniel Cryans
Sent: May 13, 2011 1:08
To: user@hbase.apache.org
Subject: Re: Hmaster has some warn logs.
Seems the master had lots of problems talking to that node... can you
repro? If you jstack, are all the handlers filled?
J-D
2011/5/12 Gaojinchao <ga...@huawei.com>:
> Thank you for the reply.
>
> It seems the master had some problems, but I am not sure what is going on.
> I am not familiar with the RPC code and need to keep digging.
>
>
> // the master connection was OK
> 2011-05-04 04:47:06,267 INFO org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 378 catalog row(s) and gc'd 9 unreferenced parent region(s)
> 2011-05-04 04:52:05,833 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing. servers=3 regions=371 average=123.666664 mostloaded=124 leastloaded=124
>
> // then the connection had problems
> 2011-05-04 04:53:05,858 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
> java.net.SocketTimeoutException: Call to /10.85.129.177:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.85.129.177:42736 remote=/10.85.129.177:60020]
> at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
> at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> at $Proxy6.getRegionInfo(Unknown Source)
>
> // this call also failed
> 2011-05-04 09:52:38,425 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
> java.net.SocketTimeoutException: Call to /10.85.129.177:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.85.129.177:50964 remote=linux-88.site/10.85.129.177:60020]
> at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
> at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> at $Proxy6.getRegionInfo(Unknown Source)
>
> // the region server was idle: the LRU stats stop changing after 04:52
> 2011-05-04 04:43:08,888 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/linux-88.site,60020,1304408517081/linux-88.site%3A60020.1304451798487, entries=1473, filesize=124253101. New hlog /hbase/.logs/linux-88.site,60020,1304408517081/linux-88.site%3A60020.1304455388698
> 2011-05-04 04:47:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526799, hits=37164, hitRatio=7.05%%, cachingAccesses=37279, cachingHits=37164, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2011-05-04 04:52:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2011-05-04 04:57:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2011-05-04 05:02:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] on behalf of Jean-Daniel Cryans
> Sent: May 10, 2011 6:17
> To: user@hbase.apache.org
> Subject: Re: Hmaster has some warn logs.
>
> Anything closer to 05:13:05 in the region server? That's an hour and a
> half before.
>
> J-D
>
> On Thu, May 5, 2011 at 6:03 PM, Gaojinchao <ga...@huawei.com> wrote:
>> Thanks for your reply.
>>
>> I got it. But the master and the region server are on the same machine, and the cluster is idle.
>>
>> In my scenario:
>> 1/ Before 2011-05-04 03:42, bulk data was being put and some IPC handlers were blocked:
>>
>> 2011-05-04 03:42:39,155 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 1 on 60020' on region test1,2010010100#8613880051110#0000,1304451639350.dfbdc1cf590bfffb50ace4c7d116c524.: memstore size 128.1m is >= than blocking 128.0m size
>> 2011-05-04 03:42:39,194 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 6 on 60020' on region test1,2010010100#8613880051110#0000,1304451639350.dfbdc1cf590bfffb50ace4c7d116c524.: memstore size 128.4m is >= than blocking 128.0m size
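>>
>> For reference, the 128.0m blocking threshold is the memstore flush size times the block multiplier; a sketch assuming the 0.90 defaults and property names:
>>
>> Configuration conf = HBaseConfiguration.create();
>> // assumed defaults: 64 MB flush size x 2 multiplier = 128 MB blocking size
>> conf.setLong("hbase.hregion.memstore.flush.size", 64 * 1024 * 1024);
>> conf.setInt("hbase.hregion.memstore.block.multiplier", 2);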
>>
>>
>> 2/ By 2011-05-04 05:13 the HBase client had exited and the cluster was idle:
>>
>> 2011-05-04 04:47:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526799, hits=37164, hitRatio=7.05%%, cachingAccesses=37279, cachingHits=37164, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 04:52:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 04:57:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 05:02:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 05:07:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 05:12:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 05:17:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 05:22:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>>
>> 3/ The master had some warning logs, but the region server showed nothing abnormal.
>>
>> -----Original Message-----
>> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] on behalf of Jean-Daniel Cryans
>> Sent: May 6, 2011 1:27
>> To: user@hbase.apache.org
>> Subject: Re: Hmaster has some warn logs.
>>
>> The master was trying to talk to the region server at 10.85.129.177
>> but it took more than 60 seconds to get an answer.
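>>
>> The 60000 ms in the stack traces is the client-side RPC timeout. A sketch of raising it, assuming the 0.90-era property name:
>>
>> Configuration conf = HBaseConfiguration.create();
>> conf.setInt("hbase.rpc.timeout", 120000); // assumed default: 60000 ms
>>
>> That only helps if the server is slow rather than stuck, though.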
>>
>> J-D
>>
>> On Thu, May 5, 2011 at 5:40 AM, Gaojinchao <ga...@huawei.com> wrote:
>>> In our test cluster there are a lot of socket timeout logs.
>>> I tried to dig into it, but found nothing.
>>> Does anyone have a clue? Thanks.
>>>
>>> HBase version: 0.90.2
>>>
>>> Logs:
>>> 2011-05-04 05:13:05,856 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
>>> java.net.SocketTimeoutException: Call to /10.85.129.177:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.85.129.177:54980 remote=/10.85.129.177:60020]
>>> at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
>>> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
>>> at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>>> at $Proxy6.getRegionInfo(Unknown Source)
>>> at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:424)
>>> at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:272)
>>> at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331)
>>> at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:364)
>>> at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:255)
>>> at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:237)
>>> at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:116)
>>> at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:85)
>>> at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>>> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.85.129.177:54980 remote=/10.85.129.177:60020]
>>> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
>>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>> at java.io.FilterInputStream.read(FilterInputStream.java:116)
>>> at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
>>> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>> at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>> at java.io.DataInputStream.readInt(DataInputStream.java:370)
>>> at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
>>> at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
>>>
>>
>
Re: Thread leak // Hmaster has some warn logs.
Posted by Stack <st...@duboce.net>.
2011/5/15 Gaojinchao <ga...@huawei.com>:
> While trying to reproduce the issue where sockets stay in CLOSE_WAIT, I found a new issue: a thread leak in the HMaster process.
>
We are leaking threads? Which thread? (Can you diff thread dumps and
see what the name of the new thread is? That would be a clue.)
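Something like this (an untested sketch, reusing the pid from your mail) would show it:

jstack 7983 | grep '^"' | sort > before.txt
(wait a few minutes)
jstack 7983 | grep '^"' | sort > after.txt
diff before.txt after.txt

New threads show up as added lines, names included.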
> In this case, it seems the LogSyncer thread is never stopped:
> the HLog created inside the createHRegion API is never closed.
>
Why not? It looks like it's a daemon thread. Is it stuck in a wait?
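Being a daemon thread only means it won't block JVM shutdown; each leaked
one still pins a stack for the life of the master. A hypothetical sketch of
the pattern (not HLog's exact code):

Thread syncer = new Thread(new Runnable() {
  public void run() {
    try {
      while (true) {
        Thread.sleep(1000); // matches the TIMED_WAITING (sleeping) state in your dump
      }
    } catch (InterruptedException e) {
      // exits only if something interrupts it, e.g. an HLog.close() call
    }
  }
}, "logSyncer");
syncer.setDaemon(true);
syncer.start();

If nothing ever closes the HLog, nothing ever interrupts the thread.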
> I also found some sockets in CLOSE_WAIT, but I am not sure what is going on yet.
> I need to keep digging.
>
> tcp 227 0 157.5.100.1:60000 157.5.100.1:52780 CLOSE_WAIT
> tcp 1 0 157.5.100.1:60000 157.5.100.5:56633 CLOSE_WAIT
> tcp 233 0 157.5.100.1:60000 157.5.100.4:50772 CLOSE_WAIT
> tcp 227 0 157.5.100.1:60000 157.5.100.1:59921 CLOSE_WAIT
> tcp 233 0 157.5.100.1:60000 157.5.100.3:37877 CLOSE_WAIT
> tcp 233 0 157.5.100.1:60000 157.5.100.5:57556 CLOSE_WAIT
> tcp 233 0 157.5.100.1:60000 157.5.100.5:44967 CLOSE_WAIT
> tcp 233 0 157.5.100.1:60000 157.5.100.3:45944 CLOSE_WAIT
> C4C1:/opt/hbasetools # netstat -an |grep 60000 | grep CLOSE_WAIT | wc -l
> 140
>
Do please keep digging.
St.Ack