Posted to user@hbase.apache.org by Galed Friedmann <ga...@onavo.com> on 2012/01/30 15:39:06 UTC

Thrift "hang ups" with no apparent reason

Hi,
I have an HBase cluster which consists of 1 master server (running the
NameNode, ZooKeeper, and the HBase Master) and 3 region servers (each
running a DataNode and a Region Server).
I also have a Thrift server running on the master.
I have some Hadoop MR jobs running on a separate Hadoop cluster (using
JRuby), and some other processes that use Thrift as the endpoint to HBase.
All of this runs on EC2.

Lately we've been having weird issues with Thrift: after several hours the
Thrift server "hangs" - the scripts that use it to access HBase get
connection timeouts, and the Ruby on Rails apps we run on Heroku that use
Thrift simply get stuck. Only restarting the Thrift process brings
everything back to normal.

I've tried tweaking everything I could. Increasing the heap size of the
Thrift process (to 4GB) only delayed the hang-ups (from around 4-5 hours
after startup to 9-10 hours) but did not fix the problem. ZooKeeper and the
HBase Master also have 4GB heaps.
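
In case it matters, we set the Thrift heap via HBASE_THRIFT_OPTS in
conf/hbase-env.sh on the gateway, roughly like this (a sketch from memory,
assuming a stock install):

  export HBASE_THRIFT_OPTS="-Xmx4g -Xms4g"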

The Thrift log files show nothing; the only things I see in the logs are
the connections being established when I bring Thrift up (a few hours
before the hang-ups) and again when I restart it.

Looking at the different log files, this is what I see around the time the
hang-ups start:

ZooKeeper log at the time of the hang-ups, looking at the Thrift process
session IDs (0x1352a393d180008 and 0x1352a393d180009):
2012-01-30 10:51:36,721 WARN org.apache.zookeeper.server.NIOServerCnxn:
EndOfStreamException: Unable to read additional data from client sessionid
0x1352a393d180008, likely client has closed socket
2012-01-30 10:51:36,721 INFO org.apache.zookeeper.server.NIOServerCnxn:
Closed socket connection for client /10.217.55.193:53475 which had
sessionid 0x1352a393d180008
2012-01-30 10:51:36,721 WARN org.apache.zookeeper.server.NIOServerCnxn:
EndOfStreamException: Unable to read additional data from client sessionid
0x1352a393d180009, likely client has closed socket
2012-01-30 10:51:36,722 INFO org.apache.zookeeper.server.NIOServerCnxn:
Closed socket connection for client /10.217.55.193:53477 which had
sessionid 0x1352a393d180009
2012-01-30 10:52:00,001 INFO org.apache.zookeeper.server.ZooKeeperServer:
Expiring session 0x1352a393d18051c, timeout of 90000ms exceeded
2012-01-30 10:52:00,001 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Processed session
termination for sessionid: 0x1352a393d18051c
2012-01-30 10:52:06,040 INFO org.apache.zookeeper.server.NIOServerCnxn:
Accepted socket connection from /10.217.55.193:35937
2012-01-30 10:52:06,043 INFO org.apache.zookeeper.server.NIOServerCnxn:
Client attempting to establish new session at /10.217.55.193:35937
2012-01-30 10:52:06,044 INFO org.apache.zookeeper.server.NIOServerCnxn:
Established session 0x1352a393d18051d with negotiated timeout 90000 for
client /10.217.55.193:35937
2012-01-30 10:52:08,820 INFO org.apache.zookeeper.server.NIOServerCnxn:
Accepted socket connection from /10.217.55.193:35940
2012-01-30 10:52:08,821 INFO org.apache.zookeeper.server.NIOServerCnxn:
Client attempting to establish new session at /10.217.55.193:35940
2012-01-30 10:52:08,823 INFO org.apache.zookeeper.server.NIOServerCnxn:
Established session 0x1352a393d18051e with negotiated timeout 90000 for
client /10.217.55.193:35940
2012-01-30 10:52:28,001 INFO org.apache.zookeeper.server.ZooKeeperServer:
Expiring session 0x1352a393d18051b, timeout of 90000ms exceeded
2012-01-30 10:52:28,001 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Processed session
termination for sessionid: 0x1352a393d18051b
2012-01-30 10:52:50,844 INFO org.apache.zookeeper.server.NIOServerCnxn:
Accepted socket connection from /10.64.165.124:47983
2012-01-30 10:52:50,856 INFO org.apache.zookeeper.server.NIOServerCnxn:
Client attempting to establish new session at /10.64.165.124:47983
2012-01-30 10:52:50,858 INFO org.apache.zookeeper.server.NIOServerCnxn:
Established session 0x1352a393d18051f with negotiated timeout 90000 for
client /10.64.165.124:47983
2012-01-30 10:52:54,243 WARN org.apache.zookeeper.server.NIOServerCnxn:
EndOfStreamException: Unable to read additional data from client sessionid
0x1352a393d18051f, likely client has closed socket
2012-01-30 10:52:54,244 INFO org.apache.zookeeper.server.NIOServerCnxn:
Closed socket connection for client /10.64.165.124:47983 which had
sessionid 0x1352a393d18051f
2012-01-30 10:52:56,001 INFO org.apache.zookeeper.server.ZooKeeperServer:
Expiring session 0x1352a393d180009, timeout of 90000ms exceeded
2012-01-30 10:52:56,001 INFO org.apache.zookeeper.server.ZooKeeperServer:
Expiring session 0x1352a393d180008, timeout of 90000ms exceeded
2012-01-30 10:52:56,001 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Processed session
termination for sessionid: 0x1352a393d180009
2012-01-30 10:52:56,001 INFO
org.apache.zookeeper.server.PrepRequestProcessor: Processed session
termination for sessionid: 0x1352a393d180008
In addition to that, on one of the Region Servers I found this exception
at the time of the hang-up:
2012-01-30 10:46:23,854 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
8801271291968240625 lease expired
2012-01-30 10:46:23,854 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
4523402662192609713 lease expired
2012-01-30 10:46:23,854 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-3235593536276390176 lease expired
2012-01-30 10:46:35,034 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-8329379051383952775 lease expired
2012-01-30 10:51:36,382 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
listener on 60020: readAndProcess threw exception java.io.IOException:
Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
        at sun.nio.ch.IOUtil.read(IOUtil.java:210)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        at
org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
2012-01-30 10:52:24,016 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-4511393305838866925 lease expired
2012-01-30 10:52:24,016 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-5818959718437063034 lease expired
2012-01-30 10:52:24,016 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-1408921590864341720 lease expired

I would really appreciate the help - I'm kind of losing my mind over this.
The cluster worked perfectly for a long time, and only recently have we
started having these problems.

Thanks a lot!
Galed.

Re: Thrift "hang ups" with no apparent reason

Posted by Jean-Daniel Cryans <jd...@apache.org>.
It seems like the Thrift servers are doing "something": I see they are
reading input from your application, and one is scanning.

Earlier you mentioned that setting bigger heaps only delayed the
issue, so it seems there's a memory leak.

Which HBase version are you using? Early 0.90 versions had some issues
with connection handling, but we've been running Thrift in production for
ages and haven't seen this issue (we migrated from the 0.89 series to
0.90.2, so maybe you are using something older?).
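
(If you're not sure, running "bin/hbase version" on one of the nodes will
print it.)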

Did you enable GC logging? When it "gets stuck", I'm pretty sure the GC
log will be filled with multi-second full GCs.
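
If it isn't on yet, something like this in the Thrift server's JVM options
(e.g. in HBASE_THRIFT_OPTS) will turn it on - the log path is just an
example:

  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -Xloggc:/var/log/hbase/gc-thrift.log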

Finally, if you are using a recent version and do see heavy GCing, try to
get a heap dump and see where the memory is allocated. jvisualvm can help
you do that if you don't feel like paying for JProfiler.
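
For example, with a JDK on the gateway box (the pid and file path are
placeholders):

  jmap -dump:live,format=b,file=/tmp/thrift.hprof <thrift-pid>

Then open the .hprof in jvisualvm and look at what is retaining the
memory.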

Hope this helps,

J-D

On Thu, Feb 2, 2012 at 1:33 AM, Galed Friedmann
<ga...@onavo.com> wrote:
> Uploaded to pastebin:
> http://pastebin.com/YAHLyEMV
> http://pastebin.com/HuAhypsU
> http://pastebin.com/BupMyENi
>
> Thanks
>
> On Thu, Feb 2, 2012 at 10:57 AM, <yu...@gmail.com> wrote:
>
>> I don't think the attachments went through.
>> Can you find some other place to upload the files ?
>>
>> Thanks
>>
>>
>>
>> On Feb 1, 2012, at 11:59 PM, Galed Friedmann <ga...@onavo.com>
>> wrote:
>>
>> > Hi again,
>> > Moved one of the services to another Thrift gateway and still got
>> timeouts from that service, something tells me even load balancing won't
>> help.
>> >
>> > I'm attaching 3 dumps from the Thrift servers, thrift2.dump is the
>> additional server we brought up, the other 2 files are from the Thrift that
>> is running on the HMaster.
>> >
>> > Thanks again for the help and patience.
>> >
>> > On Wed, Feb 1, 2012 at 7:23 PM, Stack <st...@duboce.net> wrote:
>> > On Wed, Feb 1, 2012 at 9:08 AM, Galed Friedmann
>> > <ga...@onavo.com> wrote:
>> > > Can you explain how to take the dump from the Thrift server? I couldn't
>> > > find how to do that.
>> > >
>> >
>> > Try this:
>> http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx
>> >
>> > > At the moment we have only 1 Thrift gateway, I'm going to add some more
>> > > with load balancing.
>> > >
>> >
>> > At a minimum, it might put off the hang.
>> > St.Ack
>> >
>>

Re: Thrift "hang ups" with no apparent reason

Posted by Galed Friedmann <ga...@onavo.com>.
Uploaded to pastebin:
http://pastebin.com/YAHLyEMV
http://pastebin.com/HuAhypsU
http://pastebin.com/BupMyENi

Thanks

On Thu, Feb 2, 2012 at 10:57 AM, <yu...@gmail.com> wrote:

> I don't think the attachments went through.
> Can you find some other place to upload the files ?
>
> Thanks
>
>
>
> On Feb 1, 2012, at 11:59 PM, Galed Friedmann <ga...@onavo.com>
> wrote:
>
> > Hi again,
> > Moved one of the services to another Thrift gateway and still got
> timeouts from that service, something tells me even load balancing won't
> help.
> >
> > I'm attaching 3 dumps from the Thrift servers, thrift2.dump is the
> additional server we brought up, the other 2 files are from the Thrift that
> is running on the HMaster.
> >
> > Thanks again for the help and patience.
> >
> > On Wed, Feb 1, 2012 at 7:23 PM, Stack <st...@duboce.net> wrote:
> > On Wed, Feb 1, 2012 at 9:08 AM, Galed Friedmann
> > <ga...@onavo.com> wrote:
> > > Can you explain how to take the dump from the Thrift server? I couldn't
> > > find how to do that.
> > >
> >
> > Try this:
> http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx
> >
> > > At the moment we have only 1 Thrift gateway, I'm going to add some more
> > > with load balancing.
> > >
> >
> > At a minimum, it might put off the hang.
> > St.Ack
> >
>

Re: Thrift "hang ups" with no apparent reason

Posted by yu...@gmail.com.
I don't think the attachments went through.
Can you find some other place to upload the files?

Thanks



On Feb 1, 2012, at 11:59 PM, Galed Friedmann <ga...@onavo.com> wrote:

> Hi again,
> Moved one of the services to another Thrift gateway and still got timeouts from that service, something tells me even load balancing won't help.
> 
> I'm attaching 3 dumps from the Thrift servers, thrift2.dump is the additional server we brought up, the other 2 files are from the Thrift that is running on the HMaster.
> 
> Thanks again for the help and patience.
> 
> On Wed, Feb 1, 2012 at 7:23 PM, Stack <st...@duboce.net> wrote:
> On Wed, Feb 1, 2012 at 9:08 AM, Galed Friedmann
> <ga...@onavo.com> wrote:
> > Can you explain how to take the dump from the Thrift server? I couldn't
> > find how to do that.
> >
> 
> Try this: http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx
> 
> > At the moment we have only 1 Thrift gateway, I'm going to add some more
> > with load balancing.
> >
> 
> At a minimum, it might put off the hang.
> St.Ack
> 

Re: Thrift "hang ups" with no apparent reason

Posted by Galed Friedmann <ga...@onavo.com>.
Hi again,
Moved one of the services to another Thrift gateway and still got timeouts
from that service; something tells me even load balancing won't help.

I'm attaching 3 dumps from the Thrift servers: thrift2.dump is from the
additional server we brought up, and the other 2 files are from the Thrift
server that is running on the HMaster.

Thanks again for the help and patience.

On Wed, Feb 1, 2012 at 7:23 PM, Stack <st...@duboce.net> wrote:

> On Wed, Feb 1, 2012 at 9:08 AM, Galed Friedmann
> <ga...@onavo.com> wrote:
> > Can you explain how to take the dump from the Thrift server? I couldn't
> > find how to do that.
> >
>
> Try this:
> http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx
>
> > At the moment we have only 1 Thrift gateway, I'm going to add some more
> > with load balancing.
> >
>
> At a minimum, it might put off the hang.
> St.Ack
>

Re: Thrift "hang ups" with no apparent reason

Posted by Stack <st...@duboce.net>.
On Wed, Feb 1, 2012 at 9:08 AM, Galed Friedmann
<ga...@onavo.com> wrote:
> Can you explain how to take the dump from the Thrift server? I couldn't
> find how to do that.
>

Try this: http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx
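
The short version (pid is the Thrift server's process id; kill -3 writes
the thread dump to the process's stdout/.out file):

  kill -3 <thrift-pid>
  # or, with a JDK installed:
  jstack <thrift-pid> > thrift.dump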

> At the moment we have only 1 Thrift gateway, I'm going to add some more
> with load balancing.
>

At a minimum, it might put off the hang.
St.Ack

Re: Thrift "hang ups" with no apparent reason

Posted by Galed Friedmann <ga...@onavo.com>.
Hi,
It doesn't look like the servers are loaded; we're not passing that much
traffic through the cluster at the moment.
Can you explain how to take the dump from the Thrift server? I couldn't
find how to do that.

At the moment we have only 1 Thrift gateway, I'm going to add some more
with load balancing.
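
The plan is to start extra gateways on separate nodes and balance the
clients across them, something like this on each node (assuming the stock
scripts and the default port, 9090):

  bin/hbase-daemon.sh start thrift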

Thanks again.

On Wed, Feb 1, 2012 at 6:57 PM, Stack <st...@duboce.net> wrote:

> On Wed, Feb 1, 2012 at 1:00 AM, Galed Friedmann
> <ga...@onavo.com> wrote:
> > 1. I've taken a dump from the HMaster when we felt some timeouts, I hope
> > that's what you're looking for, attached.
>
> I was looking for dumps of the hung up thrift server.
>
> The master dump shows it idle.
>
> > 2. The timeout occurs around 10-12 hours after the ZK established the
> > connection with the Thrift server so it's not immediate. On the Thrift
> logs
> > you see that nothing happened and only see the timeouts on the ZK logs.
> > Actually we hadn't had errors in the last 15 hours nor ZK timeouts for
> > Thrift but it'll happen again I'm sure..
>
> OK.  Thread dump it when its hung up.    Thrift is getting stuck going
> against the cluster it seems.  How many gateways are you running?  Run
> more?
>
> > 3. The lease expiration happens all the time, we're using mostly JRuby
> > scripts and closing the scans when we're done.
> >
>
> Could it be the client is taking a long time to get back to the
> server?  Or maybe the server is taking long time to respond because
> its heavily loaded (is it?).
>
> St.Ack
>
> > Thanks again,
> > Galed.
> >
> >
> > On Tue, Jan 31, 2012 at 10:51 PM, Stack <st...@duboce.net> wrote:
> >>
> >> On Mon, Jan 30, 2012 at 6:39 AM, Galed Friedmann
> >> <ga...@onavo.com> wrote:
> >> > Lately we're having weird issues with Thrift, after several hours the
> >> > Thrift server "hangs" - the scripts that are using it to access HBase
> >> > get
> >> > connection timeouts, we're also using Heroku and ruby on rails apps
> that
> >> > use Thrift and they simply get stuck. Only when restarting the Thrift
> >> > process everything goes back to normal.
> >> >
> >>
> >> Can you thread dump the thrift server when its all hung up?
> >>
> >> Have you enabled
> >>
> >>
> >> > 2012-01-30 10:52:08,823 INFO
> org.apache.zookeeper.server.NIOServerCnxn:
> >> > Established session 0x1352a393d18051e with negotiated timeout 90000
> for
> >> > client /10.217.55.193:35940
> >> > 2012-01-30 10:52:28,001 INFO
> >> > org.apache.zookeeper.server.ZooKeeperServer:
> >> > Expiring session 0x1352a393d18051b, timeout of 90000ms exceeded
> >> > 2012-01-30 10:52:28,001 INFO
> >> > org.apache.zookeeper.server.PrepRequestProcessor: Processed session
> >> > termination for sessionid: 0x1352a393d18051b
> >>
> >> ZK is establishing a session w/ 90second timeout and then timing out
> >> immediately?
> >>
> >>
> >>
> >>
> >> > 2012-01-30 10:51:36,382 WARN org.apache.hadoop.ipc.HBaseServer: IPC
> >> > Server
> >> > listener on 60020: readAndProcess threw exception java.io.IOException:
> >> > Connection reset by peer. Count of bytes read: 0
> >> > java.io.IOException: Connection reset by peer
> >> >        at sun.nio.ch.FileDispatcher.read0(Native Method)
> >> >        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> >> >        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
> >> >        at sun.nio.ch.IOUtil.read(IOUtil.java:210)
> >> >        at
> sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
> >> >        at
> >> >
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
> >> >        at
> >> >
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
> >> >        at
> >> >
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
> >> >        at
> >> >
> >> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
> >> >        at
> >> >
> >> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >> >        at
> >> >
> >> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >> >        at java.lang.Thread.run(Thread.java:619)
> >> > 2012-01-30 10:52:24,016 INFO
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> > -4511393305838866925 lease expired
> >> > 2012-01-30 10:52:24,016 INFO
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> > -5818959718437063034 lease expired
> >> > 2012-01-30 10:52:24,016 INFO
> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> >> > -1408921590864341720 lease expired
> >> >
> >>
> >> Client went away?  All the lease expireds happen always or just around
> >> time of the hangup (You are closing scanners when done?)
> >>
> >> St.Ack
> >
> >
>

Re: Thrift "hang ups" with no apparent reason

Posted by Stack <st...@duboce.net>.
On Wed, Feb 1, 2012 at 1:00 AM, Galed Friedmann
<ga...@onavo.com> wrote:
> 1. I've taken a dump from the HMaster when we felt some timeouts, I hope
> that's what you're looking for, attached.

I was looking for dumps of the hung up thrift server.

The master dump shows it idle.

> 2. The timeout occurs around 10-12 hours after the ZK established the
> connection with the Thrift server so it's not immediate. On the Thrift logs
> you see that nothing happened and only see the timeouts on the ZK logs.
> Actually we hadn't had errors in the last 15 hours nor ZK timeouts for
> Thrift but it'll happen again I'm sure..

OK.  Thread dump it when it's hung up.  Thrift seems to be getting stuck
going against the cluster.  How many gateways are you running?  Can you
run more?

> 3. The lease expiration happens all the time, we're using mostly JRuby
> scripts and closing the scans when we're done.
>

Could it be that the client is taking a long time to get back to the
server?  Or maybe the server is taking a long time to respond because it's
heavily loaded (is it?).

St.Ack

> Thanks again,
> Galed.
>
>
> On Tue, Jan 31, 2012 at 10:51 PM, Stack <st...@duboce.net> wrote:
>>
>> On Mon, Jan 30, 2012 at 6:39 AM, Galed Friedmann
>> <ga...@onavo.com> wrote:
>> > Lately we're having weird issues with Thrift, after several hours the
>> > Thrift server "hangs" - the scripts that are using it to access HBase
>> > get
>> > connection timeouts, we're also using Heroku and ruby on rails apps that
>> > use Thrift and they simply get stuck. Only when restarting the Thrift
>> > process everything goes back to normal.
>> >
>>
>> Can you thread dump the thrift server when its all hung up?
>>
>> Have you enabled
>>
>>
>> > 2012-01-30 10:52:08,823 INFO org.apache.zookeeper.server.NIOServerCnxn:
>> > Established session 0x1352a393d18051e with negotiated timeout 90000 for
>> > client /10.217.55.193:35940
>> > 2012-01-30 10:52:28,001 INFO
>> > org.apache.zookeeper.server.ZooKeeperServer:
>> > Expiring session 0x1352a393d18051b, timeout of 90000ms exceeded
>> > 2012-01-30 10:52:28,001 INFO
>> > org.apache.zookeeper.server.PrepRequestProcessor: Processed session
>> > termination for sessionid: 0x1352a393d18051b
>>
>> ZK is establishing a session w/ 90second timeout and then timing out
>> immediately?
>>
>>
>>
>>
>> > 2012-01-30 10:51:36,382 WARN org.apache.hadoop.ipc.HBaseServer: IPC
>> > Server
>> > listener on 60020: readAndProcess threw exception java.io.IOException:
>> > Connection reset by peer. Count of bytes read: 0
>> > java.io.IOException: Connection reset by peer
>> >        at sun.nio.ch.FileDispatcher.read0(Native Method)
>> >        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>> >        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
>> >        at sun.nio.ch.IOUtil.read(IOUtil.java:210)
>> >        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>> >        at
>> >
>> > org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
>> >        at
>> >
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
>> >        at
>> >
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
>> >        at
>> >
>> > org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
>> >        at
>> >
>> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> >        at
>> >
>> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> >        at java.lang.Thread.run(Thread.java:619)
>> > 2012-01-30 10:52:24,016 INFO
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
>> > -4511393305838866925 lease expired
>> > 2012-01-30 10:52:24,016 INFO
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
>> > -5818959718437063034 lease expired
>> > 2012-01-30 10:52:24,016 INFO
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
>> > -1408921590864341720 lease expired
>> >
>>
>> Client went away?  All the lease expireds happen always or just around
>> time of the hangup (You are closing scanners when done?)
>>
>> St.Ack
>
>

Re: Thrift "hang ups" with no apparent reason

Posted by Galed Friedmann <ga...@onavo.com>.
Hi,
Thanks for replying!

Answers to your questions:
1. I've taken a dump from the HMaster while we were seeing some timeouts;
I hope that's what you're looking for (attached).
2. The timeout occurs around 10-12 hours after ZK established the
connection with the Thrift server, so it's not immediate. The Thrift logs
show that nothing happened; we only see the timeouts in the ZK logs.
Actually, we haven't had errors or ZK timeouts for Thrift in the last 15
hours, but it'll happen again, I'm sure.
3. The lease expirations happen all the time; we're using mostly JRuby
scripts, and we close the scanners when we're done.
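
For what it's worth, this is roughly the pattern our scripts follow (a
minimal sketch; the table variable and the processing are placeholders,
not our actual code):

  # JRuby, using the HBase Java client API (hbase jars on the classpath)
  require 'java'

  scan = org.apache.hadoop.hbase.client.Scan.new
  scanner = table.getScanner(scan)  # table is an HTable we already hold
  begin
    while (result = scanner.next)
      # ... process the org.apache.hadoop.hbase.client.Result ...
    end
  ensure
    scanner.close  # releases the server-side scanner lease, even on error
  end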

Thanks again,
Galed.

On Tue, Jan 31, 2012 at 10:51 PM, Stack <st...@duboce.net> wrote:

> On Mon, Jan 30, 2012 at 6:39 AM, Galed Friedmann
> <ga...@onavo.com> wrote:
> > Lately we're having weird issues with Thrift, after several hours the
> > Thrift server "hangs" - the scripts that are using it to access HBase get
> > connection timeouts, we're also using Heroku and ruby on rails apps that
> > use Thrift and they simply get stuck. Only when restarting the Thrift
> > process everything goes back to normal.
> >
>
> Can you thread dump the thrift server when its all hung up?
>
> Have you enabled
>
>
> > 2012-01-30 10:52:08,823 INFO org.apache.zookeeper.server.NIOServerCnxn:
> > Established session 0x1352a393d18051e with negotiated timeout 90000 for
> > client /10.217.55.193:35940
> > 2012-01-30 10:52:28,001 INFO org.apache.zookeeper.server.ZooKeeperServer:
> > Expiring session 0x1352a393d18051b, timeout of 90000ms exceeded
> > 2012-01-30 10:52:28,001 INFO
> > org.apache.zookeeper.server.PrepRequestProcessor: Processed session
> > termination for sessionid: 0x1352a393d18051b
>
> ZK is establishing a session w/ 90second timeout and then timing out
> immediately?
>
>
>
>
> > 2012-01-30 10:51:36,382 WARN org.apache.hadoop.ipc.HBaseServer: IPC
> Server
> > listener on 60020: readAndProcess threw exception java.io.IOException:
> > Connection reset by peer. Count of bytes read: 0
> > java.io.IOException: Connection reset by peer
> >        at sun.nio.ch.FileDispatcher.read0(Native Method)
> >        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
> >        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
> >        at sun.nio.ch.IOUtil.read(IOUtil.java:210)
> >        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
> >        at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >        at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >        at java.lang.Thread.run(Thread.java:619)
> > 2012-01-30 10:52:24,016 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> > -4511393305838866925 lease expired
> > 2012-01-30 10:52:24,016 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> > -5818959718437063034 lease expired
> > 2012-01-30 10:52:24,016 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> > -1408921590864341720 lease expired
> >
>
> Client went away?  All the lease expireds happen always or just around
> time of the hangup (You are closing scanners when done?)
>
> St.Ack
>

Re: Thrift "hang ups" with no apparent reason

Posted by Stack <st...@duboce.net>.
On Mon, Jan 30, 2012 at 6:39 AM, Galed Friedmann
<ga...@onavo.com> wrote:
> Lately we're having weird issues with Thrift, after several hours the
> Thrift server "hangs" - the scripts that are using it to access HBase get
> connection timeouts, we're also using Heroku and ruby on rails apps that
> use Thrift and they simply get stuck. Only when restarting the Thrift
> process everything goes back to normal.
>

Can you thread dump the thrift server when it's all hung up?

Have you enabled


> 2012-01-30 10:52:08,823 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Established session 0x1352a393d18051e with negotiated timeout 90000 for
> client /10.217.55.193:35940
> 2012-01-30 10:52:28,001 INFO org.apache.zookeeper.server.ZooKeeperServer:
> Expiring session 0x1352a393d18051b, timeout of 90000ms exceeded
> 2012-01-30 10:52:28,001 INFO
> org.apache.zookeeper.server.PrepRequestProcessor: Processed session
> termination for sessionid: 0x1352a393d18051b

ZK is establishing a session w/ a 90-second timeout and then timing out
immediately?




> 2012-01-30 10:51:36,382 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
> listener on 60020: readAndProcess threw exception java.io.IOException:
> Connection reset by peer. Count of bytes read: 0
> java.io.IOException: Connection reset by peer
>        at sun.nio.ch.FileDispatcher.read0(Native Method)
>        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:237)
>        at sun.nio.ch.IOUtil.read(IOUtil.java:210)
>        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1359)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:900)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:522)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:316)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:619)
> 2012-01-30 10:52:24,016 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> -4511393305838866925 lease expired
> 2012-01-30 10:52:24,016 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> -5818959718437063034 lease expired
> 2012-01-30 10:52:24,016 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
> -1408921590864341720 lease expired
>

Client went away?  Do the lease expirations happen all the time, or just
around the time of the hangup?  (You are closing scanners when done?)

St.Ack