Posted to dev@hbase.apache.org by Gaojinchao <ga...@huawei.com> on 2011/05/16 08:45:19 UTC
Thread leak // Hmaster has some warn logs.
While trying to reproduce the issue where sockets stay in CLOSE_WAIT, I found a new issue: a thread leak in the HMaster process.
jstack 7983:
2011-05-16 13:25:42
Full thread dump Java HotSpot(TM) 64-Bit Server VM (17.1-b03 mixed mode):
"IPC Server handler 4 on 60000.logSyncer" daemon prio=10 tid=0x00007f42f065b800 nid=0x502e waiting on condition [0x00007f42c3358000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:962)
count:
jstack 7983 | grep 60000.logSyncer| wc -l
411
In this case, it seems the LogSyncer thread is never stopped:
the HLog created inside the createHRegion API is never closed.
public static HRegion createHRegion(final HRegionInfo info, final Path rootDir,
    final Configuration conf)
throws IOException {
  Path tableDir =
      HTableDescriptor.getTableDir(rootDir, info.getTableDesc().getName());
  Path regionDir = HRegion.getRegionDir(tableDir, info.getEncodedName());
  FileSystem fs = FileSystem.get(conf);
  fs.mkdirs(regionDir);
  HRegion region = HRegion.newHRegion(tableDir,
      new HLog(fs, new Path(regionDir, HConstants.HREGION_LOGDIR_NAME), // this HLog is never closed by createHRegion
          new Path(regionDir, HConstants.HREGION_OLDLOGDIR_NAME), conf),
      fs, conf, info, null);
  region.initialize();
  return region;
}
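A minimal caller-side workaround sketch, assuming the 0.90 API (that HRegion.getLog() returns the HLog built above, and that HLog.closeAndDelete() interrupts and joins the LogSyncer thread):

HRegion region = HRegion.createHRegion(info, rootDir, conf);
try {
  // ... use the region ...
} finally {
  region.close();                   // flush and close the region's stores
  region.getLog().closeAndDelete(); // stop the LogSyncer thread and clean up the log dir
}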
I also found some sockets in CLOSE_WAIT, but I am not sure what is going on yet.
I need to keep digging.
tcp 227 0 157.5.100.1:60000 157.5.100.1:52780 CLOSE_WAIT
tcp 1 0 157.5.100.1:60000 157.5.100.5:56633 CLOSE_WAIT
tcp 233 0 157.5.100.1:60000 157.5.100.4:50772 CLOSE_WAIT
tcp 227 0 157.5.100.1:60000 157.5.100.1:59921 CLOSE_WAIT
tcp 233 0 157.5.100.1:60000 157.5.100.3:37877 CLOSE_WAIT
tcp 233 0 157.5.100.1:60000 157.5.100.5:57556 CLOSE_WAIT
tcp 233 0 157.5.100.1:60000 157.5.100.5:44967 CLOSE_WAIT
tcp 233 0 157.5.100.1:60000 157.5.100.3:45944 CLOSE_WAIT
C4C1:/opt/hbasetools # netstat -an |grep 60000 | grep CLOSE_WAIT | wc -l
140
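For reference, CLOSE_WAIT means the peer has already closed its end but our process has not yet called close() on its socket, so the leak is on the master's side. A minimal self-contained illustration (my own demo code, not HBase code):

import java.net.ServerSocket;
import java.net.Socket;

public class CloseWaitDemo {
  public static void main(String[] args) throws Exception {
    ServerSocket server = new ServerSocket(60000);
    Socket client = new Socket("localhost", 60000);
    Socket accepted = server.accept();
    client.close(); // the peer sends its FIN
    // accepted.close() is never called; while we sleep,
    // "netstat -an | grep 60000" shows the accepted socket in CLOSE_WAIT.
    Thread.sleep(60000);
  }
}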
-----Original Message-----
From: Gaojinchao
Sent: May 13, 2011 16:59
To: user@hbase.apache.org
Subject: Re: Hmaster has some warn logs.
Hi,
I have found a hint: all the sockets are in CLOSE_WAIT.
But I am not sure what is going on; I need to keep digging.
If anyone has experience with this, please share it. Thanks.
20020 (region server port)
# netstat -an | grep 20020
tcp 13 0 157.5.100.13:20020 :::* LISTEN
tcp 178 0 157.5.100.13:20020 157.5.100.11:56169 CLOSE_WAIT
tcp 212 0 157.5.100.13:20020 157.5.100.11:36908 CLOSE_WAIT
tcp 212 0 157.5.100.13:20020 157.5.100.11:38372 CLOSE_WAIT
tcp 1537168 0 157.5.100.13:20020 157.5.100.6:45643 ESTABLISHED
tcp 178 0 157.5.100.13:20020 157.5.100.11:52667 CLOSE_WAIT
tcp 178 0 157.5.100.13:20020 157.5.100.11:48926 CLOSE_WAIT
tcp 212 0 157.5.100.13:20020 157.5.100.11:37150 CLOSE_WAIT
tcp 212 0 157.5.100.13:20020 157.5.100.11:36665 CLOSE_WAIT
tcp 212 0 157.5.100.13:20020 157.5.100.11:36432 CLOSE_WAIT
tcp 178 0 157.5.100.13:20020 157.5.100.11:46395 CLOSE_WAIT
tcp 178 0 157.5.100.13:20020 157.5.100.11:57346 CLOSE_WAIT
tcp 178 0 157.5.100.13:20020 157.5.100.11:47115 CLOSE_WAIT
tcp 178 0 157.5.100.13:20020 157.5.100.11:55164 CLOSE_WAIT
tcp 178 0 157.5.100.13:20020 157.5.100.11:50981 CLOSE_WAIT
tcp 250 0 157.5.100.13:20020 157.5.100.11:36214 CLOSE_WAIT
-----Original Message-----
From: Gaojinchao [mailto:gaojinchao@huawei.com]
Sent: May 13, 2011 8:33
To: user@hbase.apache.org
Subject: Re: Hmaster has some warn logs.
I can't reproduce it; I need to dig through the logs and the code.
-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] on behalf of Jean-Daniel Cryans
Sent: May 13, 2011 1:08
To: user@hbase.apache.org
Subject: Re: Hmaster has some warn logs.
Seems the master had lots of problems talking to that node... can you
repro? If you jstack, are all the handlers filled?
J-D
2011/5/12 Gaojinchao <ga...@huawei.com>:
> Thank you for the reply.
>
> It seems the master had some problems, but I am not sure what is going on.
> I am not familiar with the RPC code and need to keep digging.
>
>
> // the master connection was OK
> 2011-05-04 04:47:06,267 INFO org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 378 catalog row(s) and gc'd 9 unreferenced parent region(s)
> 2011-05-04 04:52:05,833 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing. servers=3 regions=371 average=123.666664 mostloaded=124 leastloaded=124
>
> // then the connection had problems
> 2011-05-04 04:53:05,858 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
> java.net.SocketTimeoutException: Call to /10.85.129.177:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.85.129.177:42736 remote=/10.85.129.177:60020]
> at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
> at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> at $Proxy6.getRegionInfo(Unknown Source)
>
> // this call also failed
> 2011-05-04 09:52:38,425 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
> java.net.SocketTimeoutException: Call to /10.85.129.177:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.85.129.177:50964 remote=linux-88.site/10.85.129.177:60020]
> at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
> at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> at $Proxy6.getRegionInfo(Unknown Source)
>
> // the region server was idle: the LRU stats stop changing after 04:52
> 2011-05-04 04:43:08,888 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/linux-88.site,60020,1304408517081/linux-88.site%3A60020.1304451798487, entries=1473, filesize=124253101. New hlog /hbase/.logs/linux-88.site,60020,1304408517081/linux-88.site%3A60020.1304455388698
> 2011-05-04 04:47:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526799, hits=37164, hitRatio=7.05%%, cachingAccesses=37279, cachingHits=37164, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2011-05-04 04:52:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2011-05-04 04:57:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2011-05-04 05:02:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] on behalf of Jean-Daniel Cryans
> Sent: May 10, 2011 6:17
> To: user@hbase.apache.org
> Subject: Re: Hmaster has some warn logs.
>
> Anything closer to 05:13:05 in the region server? That's an hour and a
> half before.
>
> J-D
>
> On Thu, May 5, 2011 at 6:03 PM, Gaojinchao <ga...@huawei.com> wrote:
>> Thanks for your reply.
>>
>> I got it. But the master and the region server are on the same machine, and the cluster is idle.
>>
>> In my scenario:
>> 1/ Before 2011-05-04 03:42, bulk data was being put and some IPC handlers were blocked:
>>
>> 2011-05-04 03:42:39,155 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 1 on 60020' on region test1,2010010100#8613880051110#0000,1304451639350.dfbdc1cf590bfffb50ace4c7d116c524.: memstore size 128.1m is >= than blocking 128.0m size
>> 2011-05-04 03:42:39,194 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 6 on 60020' on region test1,2010010100#8613880051110#0000,1304451639350.dfbdc1cf590bfffb50ace4c7d116c524.: memstore size 128.4m is >= than blocking 128.0m size
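>>
>> For reference, the 128.0m blocking threshold is the memstore flush size times the block multiplier; a sketch assuming the 0.90 defaults and property names:
>>
>> Configuration conf = HBaseConfiguration.create();
>> // assumed defaults: 64 MB flush size x 2 multiplier = 128 MB blocking size
>> conf.setLong("hbase.hregion.memstore.flush.size", 64 * 1024 * 1024);
>> conf.setInt("hbase.hregion.memstore.block.multiplier", 2);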
>>
>>
>> 2/ By 2011-05-04 05:13 the HBase client had exited and the cluster was idle:
>>
>> 2011-05-04 04:47:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526799, hits=37164, hitRatio=7.05%%, cachingAccesses=37279, cachingHits=37164, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 04:52:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 04:57:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 05:02:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 05:07:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 05:12:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 05:17:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>> 2011-05-04 05:22:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115, accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393, cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0, evictedPerRun=NaN
>>
>> 3/ The master had some warning logs, but the region server showed nothing abnormal.
>>
>> -----Original Message-----
>> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] on behalf of Jean-Daniel Cryans
>> Sent: May 6, 2011 1:27
>> To: user@hbase.apache.org
>> Subject: Re: Hmaster has some warn logs.
>>
>> The master was trying to talk to the region server at 10.85.129.177
>> but it took more than 60 seconds to get an answer.
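>>
>> The 60000 ms in the stack traces is the client-side RPC timeout. A sketch of raising it, assuming the 0.90-era property name:
>>
>> Configuration conf = HBaseConfiguration.create();
>> conf.setInt("hbase.rpc.timeout", 120000); // assumed default: 60000 ms
>>
>> That only helps if the server is slow rather than stuck, though.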
>>
>> J-D
>>
>> On Thu, May 5, 2011 at 5:40 AM, Gaojinchao <ga...@huawei.com> wrote:
>>> In our test cluster there are a lot of socket timeout logs.
>>> I tried to dig into it, but found nothing.
>>> Does anyone have a clue? Thanks.
>>>
>>> HBase version: 0.90.2
>>>
>>> Logs:
>>> 2011-05-04 05:13:05,856 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
>>> java.net.SocketTimeoutException: Call to /10.85.129.177:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.85.129.177:54980 remote=/10.85.129.177:60020]
>>> at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
>>> at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
>>> at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>>> at $Proxy6.getRegionInfo(Unknown Source)
>>> at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:424)
>>> at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:272)
>>> at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331)
>>> at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:364)
>>> at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:255)
>>> at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:237)
>>> at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:116)
>>> at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:85)
>>> at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>>> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.85.129.177:54980 remote=/10.85.129.177:60020]
>>> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
>>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>> at java.io.FilterInputStream.read(FilterInputStream.java:116)
>>> at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
>>> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>> at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>> at java.io.DataInputStream.readInt(DataInputStream.java:370)
>>> at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
>>> at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
>>>
>>
>
Re: Thread leak // Hmaster has some warn logs.
Posted by Stack <st...@duboce.net>.
2011/5/15 Gaojinchao <ga...@huawei.com>:
> While trying to reproduce the issue where sockets stay in CLOSE_WAIT, I found a new issue: a thread leak in the HMaster process.
>
We are leaking threads? Which thread? (Can you diff thread dumps and
see what the name of the new thread is? That would be a clue.)
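Something like this (an untested sketch, reusing the pid from your mail) would show it:

jstack 7983 | grep '^"' | sort > before.txt
(wait a few minutes)
jstack 7983 | grep '^"' | sort > after.txt
diff before.txt after.txt

New threads show up as added lines, names included.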
> In this case, it seems the LogSyncer thread is never stopped:
> the HLog created inside the createHRegion API is never closed.
>
Why not? It looks like it's a daemon thread. Is it stuck in a wait?
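Being a daemon thread only means it won't block JVM shutdown; each leaked
one still pins a stack for the life of the master. A hypothetical sketch of
the pattern (not HLog's exact code):

Thread syncer = new Thread(new Runnable() {
  public void run() {
    try {
      while (true) {
        Thread.sleep(1000); // matches the TIMED_WAITING (sleeping) state in your dump
      }
    } catch (InterruptedException e) {
      // exits only if something interrupts it, e.g. an HLog.close() call
    }
  }
}, "logSyncer");
syncer.setDaemon(true);
syncer.start();

If nothing ever closes the HLog, nothing ever interrupts the thread.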
> I also found some sockets in CLOSE_WAIT, but I am not sure what is going on yet.
> I need to keep digging.
>
> tcp 227 0 157.5.100.1:60000 157.5.100.1:52780 CLOSE_WAIT
> tcp 1 0 157.5.100.1:60000 157.5.100.5:56633 CLOSE_WAIT
> tcp 233 0 157.5.100.1:60000 157.5.100.4:50772 CLOSE_WAIT
> tcp 227 0 157.5.100.1:60000 157.5.100.1:59921 CLOSE_WAIT
> tcp 233 0 157.5.100.1:60000 157.5.100.3:37877 CLOSE_WAIT
> tcp 233 0 157.5.100.1:60000 157.5.100.5:57556 CLOSE_WAIT
> tcp 233 0 157.5.100.1:60000 157.5.100.5:44967 CLOSE_WAIT
> tcp 233 0 157.5.100.1:60000 157.5.100.3:45944 CLOSE_WAIT
> C4C1:/opt/hbasetools # netstat -an |grep 60000 | grep CLOSE_WAIT | wc -l
> 140
>
Do please keep digging.
St.Ack