Posted to user@hbase.apache.org by Walter King <wa...@adroll.com> on 2014/08/07 17:42:36 UTC

IPC Queue Size

Hi, we're running HBase 0.98.3, and it seems like the IPC queueSize is not
always being reduced.  Graphing the metric from JMX, at the start it stays
at zero, spikes up, and returns to zero.  At a certain point it stops
returning to zero and instead settles at something slightly higher.
Eventually, as the reported size approached a gigabyte, we ran into
IPC-queue-full messages being returned to clients trying large batch puts.
Some of our servers, when idle and not processing any requests, report a
queue size of hundreds of megabytes.  I have a heap dump of one claiming to
have over a hundred MB in the queue, and as far as I can tell the call
objects take up maybe a single MB of heap.  We do have a sizable number of
scans that time out without being closed, lots of increments, and a few
puts.

Graphs: http://imgur.com/9qDHgRd  The middle graph is what I'd expect to
see, the bottom graph seems to stop working around 15:30, and the top graph
never drops below 10MB.  Read/write request counts are relatively balanced
amongst the servers.

Is this normal?
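
(A minimal sketch of polling that queueSize attribute over JMX to produce
graphs like the ones above. The MBean name, attribute name, host, and port
are assumptions based on the 0.98-style metrics naming, not taken from this
thread, and will likely need adjusting for a real deployment.)

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class QueueSizeProbe {
  public static void main(String[] args) throws Exception {
    // Assumes the regionserver was started with remote JMX enabled on port 10102.
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://regionserver-host:10102/jmxrmi");
    try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
      MBeanServerConnection mbean = connector.getMBeanServerConnection();
      // Hypothetical 0.98-style IPC metrics bean; verify the exact name against the /jmx servlet.
      ObjectName ipc = new ObjectName("Hadoop:service=HBase,name=RegionServer,sub=IPC");
      System.out.println("queueSize = " + mbean.getAttribute(ipc, "queueSize"));
    }
  }
}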

Re: IPC Queue Size

Posted by Qiang Tian <ti...@gmail.com>.
PS: there are a lot of "Connection reset by peer" errors; why is your client
resetting the connection? :-)


On Fri, Aug 8, 2014 at 4:56 PM, Qiang Tian <ti...@gmail.com> wrote:

> good point.  that is a big suspect.
>
> I check your log, ClosedChannelException should be triggered by
>  call.sendResponseIfReady()(it is the only request in the queue, so handler
> send response directly), but at that point the callqueueSize has been
> decremented.
>
> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> RpcServer.respondercallId: 118504 service: ClientService methodName: Multi
> size: 141.9 K connection: 10.248.134.67:55347: output error
> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
> ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=57,port=60020:
> caught a ClosedChannelException, this means that the server was processing
> a request but the client went away. The error message was: null
>
> it looks you have got the fix, would you file a jira?
> thanks.
>
>
> On Fri, Aug 8, 2014 at 2:41 PM, Walter King <wa...@adroll.com> wrote:
>
>> I've only looked at the code a little, and likely missed something, but
>> does this if block decrement the call queue, if the client already closed
>> the connection?
>>
>>
>> https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
>>
>>
>>
>> On Thu, Aug 7, 2014 at 11:32 PM, Walter King <wa...@adroll.com> wrote:
>>
>> > Yes, sorry, CallQueueTooBigException. but that value never returns to
>> > zero, even when number of requests goes to zero.  The call queue too big
>> > happens if any regionserver is up for a long enough period of time, so I
>> > have to periodically restart them.  Also at that 15:30 time I wasn't
>> > seeing that exception, but it seems like that is one time in which a
>> call
>> > didnt properly decrement the callqueuesize because it was at zero before
>> > and has never hit zero again - today the minimum is even higher.
>> >
>> >
>> > On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <ti...@gmail.com> wrote:
>> >
>> >> bq. "Eventually we ran into ipc queue size full messages being
>> returned to
>> >> clients trying large batch puts, as it approaches a gigabyte."
>> >>
>> >> Do you mean CallQueueTooBigException? it looks not the queue size, but
>> the
>> >> data size that client sends..configured by
>> >> "hbase.ipc.server.max.callqueue.size".
>> >>
>> >> I guess when you client got the exception, it closed the exception and
>> >> causing other shared connection RPC failed.
>> >>
>> >>
>> >> 2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020]
>> >> ipc.RpcServer
>> >> (RpcServer.java:doRead(794)) - RpcServer.listener,port=60020: count of
>> >> bytes read: 0
>> >> java.io.IOException: Connection reset by peer
>> >> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>> >> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>> >> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>> >> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>> >> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>> >> at
>> org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
>> >> at
>> >>
>> >>
>> org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
>> >> at
>> >>
>> org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
>> >> at
>> >>
>> >>
>> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
>> >> at
>> >>
>> >>
>> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
>> >> at
>> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> >> at
>> >>
>> >>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> >> at java.lang.Thread.run(Thread.java:744)
>> >> 2014-08-06 22:27:57,257 WARN  [RpcServer.handler=18,port=60020]
>> >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>> >> RpcServer.respondercallId: 84968 service: ClientService methodName:
>> Multi
>> >> size: 17.7 K connection: 10.248.130.152:49780: output error
>> >> 2014-08-06 22:27:57,258 WARN  [RpcServer.handler=18,port=60020]
>> >> ipc.RpcServer (CallRunner.java:run(135)) -
>> >> RpcServer.handler=18,port=60020:
>> >> caught a ClosedChannelException, this means that the server was
>> processing
>> >> a request but the client went away. The error message was: null
>> >> 2014-08-06 22:27:57,260 WARN  [RpcServer.handler=61,port=60020]
>> >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>> >> RpcServer.respondercallId: 83907 service: ClientService methodName:
>> Multi
>> >> size: 17.1 K connection: 10.248.1.56:53615: output error
>> >> 2014-08-06 22:27:57,263 WARN  [RpcServer.handler=61,port=60020]
>> >> ipc.RpcServer (CallRunner.java:run(135)) -
>> >> RpcServer.handler=61,port=60020:
>> >> caught a ClosedChannelException, this means that the server was
>> processing
>> >> a request but the client went away. The error message was: null
>> >>
>> >>
>> >>
>> >> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <wa...@adroll.com> wrote:
>> >>
>> >> >
>> >>
>> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
>> >> >
>> >> >
>> http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
>> >> >
>> >> > These are logs from that particular server, and the debug dump from
>> >> now(no
>> >> > restart in between).  The times in the graph are pacific, so it
>> should
>> >> be
>> >> > around 2014-08-06 22:25:00.  I do see some exceptions around there.
>> >> >
>> >>
>> >
>> >
>>
>
>

Re: IPC Queue Size

Posted by Walter King <wa...@adroll.com>.
Excellent.  I updated versions, and the rate of these errors has dropped
significantly.  Thanks!


On Mon, Aug 11, 2014 at 2:03 AM, Qiang Tian <ti...@gmail.com> wrote:

> update - it has been fixed in 0.98.4, HBASE-11374
>
>
> On Sat, Aug 9, 2014 at 5:53 PM, Qiang Tian <ti...@gmail.com> wrote:
>
> > please see HBASE-11714
> >
> >
> > On Sat, Aug 9, 2014 at 4:27 PM, Qiang Tian <ti...@gmail.com> wrote:
> >
> >> took a quick look. it looks a bug, will dig more and update later.
> >>
> >>
> >>
> >> On Sat, Aug 9, 2014 at 1:45 AM, Walter King <wa...@adroll.com> wrote:
> >>
> >>> awesome, thanks!
> >>> the resets and such were actually what I've been trying to investigate.
> >>>  Recently switched from 0.94 and 0.98, and finding that periodically
> >>> things
> >>> are having issues - lots of retry exceptions.  Don't have much
> >>> information,
> >>> still looking into it when saw this.
> >>>
> >>> By timestamp, this seems to be the corresponding client exception:
> >>>
> >>> 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
> >>> table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500
> >>> ops,
> >>> last exception: java.net.SocketTimeoutException: Call to
> >>> ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
> >>> because java.net.SocketTimeoutException: 2000 millis timeout while
> >>> waiting
> >>> for channel to be ready for read. ch :
> >>> java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
> >>> remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020
> ]
> >>> on
> >>> ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651,
> tracking
> >>> started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay
> 500
> >>> ops.
> >>>
> >>> 2014-08-08 17:22:53 o.a.h.h.c.AsyncProcess [INFO] #105158, waiting for
> >>> some
> >>> tasks to finish. Expected max=0, tasksSent=11, tasksDone=10,
> >>> currentTasksDone=10, retries=10 hasError=false,
> >>> tableName=rt_global_monthly_campaign_deliveries
> >>>
> >>>
> >>> On Fri, Aug 8, 2014 at 2:15 AM, Esteban Gutierrez <
> esteban@cloudera.com>
> >>> wrote:
> >>>
> >>> > The RequestContext seems to be ok since it hasn't been set. So just
> >>> > decrementing the call size should be enough.
> >>> >
> >>> > cheers,
> >>> > esteban.
> >>> >
> >>> >
> >>> > --
> >>> > Cloudera, Inc.
> >>> >
> >>> >
> >>> >
> >>> > On Fri, Aug 8, 2014 at 2:10 AM, Esteban Gutierrez <
> >>> esteban@cloudera.com>
> >>> > wrote:
> >>> >
> >>> > >
> >>> > > Looks like after HBASE-10506 we didn't do what CallRunner used to
> do
> >>> > > before: clear the RequestContext and decrement callQueueSize.
> >>> > >
> >>> > > Created HBASE-11705 to address this issue.
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > --
> >>> > > Cloudera, Inc.
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Fri, Aug 8, 2014 at 1:56 AM, Qiang Tian <ti...@gmail.com>
> >>> wrote:
> >>> > >
> >>> > >> good point.  that is a big suspect.
> >>> > >>
> >>> > >> I check your log, ClosedChannelException should be triggered by
> >>> > >>  call.sendResponseIfReady()(it is the only request in the queue,
> so
> >>> > >> handler
> >>> > >> send response directly), but at that point the callqueueSize has
> >>> been
> >>> > >> decremented.
> >>> > >>
> >>> > >> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
> >>> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> >>> > >> RpcServer.respondercallId: 118504 service: ClientService
> methodName:
> >>> > Multi
> >>> > >> size: 141.9 K connection: 10.248.134.67:55347: output error
> >>> > >> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
> >>> > >> ipc.RpcServer (CallRunner.java:run(135)) -
> >>> > >> RpcServer.handler=57,port=60020:
> >>> > >> caught a ClosedChannelException, this means that the server was
> >>> > processing
> >>> > >> a request but the client went away. The error message was: null
> >>> > >>
> >>> > >> it looks you have got the fix, would you file a jira?
> >>> > >> thanks.
> >>> > >>
> >>> > >>
> >>> > >> On Fri, Aug 8, 2014 at 2:41 PM, Walter King <wa...@adroll.com>
> >>> wrote:
> >>> > >>
> >>> > >> > I've only looked at the code a little, and likely missed
> >>> something,
> >>> > but
> >>> > >> > does this if block decrement the call queue, if the client
> already
> >>> > >> closed
> >>> > >> > the connection?
> >>> > >> >
> >>> > >> >
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
> >>> > >> >
> >>> > >> >
> >>> > >> >
> >>> > >> > On Thu, Aug 7, 2014 at 11:32 PM, Walter King <walter@adroll.com
> >
> >>> > wrote:
> >>> > >> >
> >>> > >> > > Yes, sorry, CallQueueTooBigException. but that value never
> >>> returns
> >>> > to
> >>> > >> > > zero, even when number of requests goes to zero.  The call
> >>> queue too
> >>> > >> big
> >>> > >> > > happens if any regionserver is up for a long enough period of
> >>> time,
> >>> > >> so I
> >>> > >> > > have to periodically restart them.  Also at that 15:30 time I
> >>> wasn't
> >>> > >> > > seeing that exception, but it seems like that is one time in
> >>> which a
> >>> > >> call
> >>> > >> > > didnt properly decrement the callqueuesize because it was at
> >>> zero
> >>> > >> before
> >>> > >> > > and has never hit zero again - today the minimum is even
> higher.
> >>> > >> > >
> >>> > >> > >
> >>> > >> > > On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <tianq01@gmail.com
> >
> >>> > wrote:
> >>> > >> > >
> >>> > >> > >> bq. "Eventually we ran into ipc queue size full messages
> being
> >>> > >> returned
> >>> > >> > to
> >>> > >> > >> clients trying large batch puts, as it approaches a
> gigabyte."
> >>> > >> > >>
> >>> > >> > >> Do you mean CallQueueTooBigException? it looks not the queue
> >>> size,
> >>> > >> but
> >>> > >> > the
> >>> > >> > >> data size that client sends..configured by
> >>> > >> > >> "hbase.ipc.server.max.callqueue.size".
> >>> > >> > >>
> >>> > >> > >> I guess when you client got the exception, it closed the
> >>> exception
> >>> > >> and
> >>> > >> > >> causing other shared connection RPC failed.
> >>> > >> > >>
> >>> > >> > >>
> >>> > >> > >> 2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020]
> >>> > >> > >> ipc.RpcServer
> >>> > >> > >> (RpcServer.java:doRead(794)) - RpcServer.listener,port=60020:
> >>> count
> >>> > >> of
> >>> > >> > >> bytes read: 0
> >>> > >> > >> java.io.IOException: Connection reset by peer
> >>> > >> > >> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> >>> > >> > >> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >>> > >> > >> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> >>> > >> > >> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> >>> > >> > >> at
> >>> sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> >>> > >> > >> at
> >>> > >> >
> >>> org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
> >>> > >> > >> at
> >>> > >> > >>
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
> >>> > >> > >> at
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
> >>> > >> > >> at
> >>> > >> > >>
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
> >>> > >> > >> at
> >>> > >> > >>
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
> >>> > >> > >> at
> >>> > >> > >>
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >>> > >> > >> at
> >>> > >> > >>
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>> > >> > >> at java.lang.Thread.run(Thread.java:744)
> >>> > >> > >> 2014-08-06 22:27:57,257 WARN
>  [RpcServer.handler=18,port=60020]
> >>> > >> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> >>> > >> > >> RpcServer.respondercallId: 84968 service: ClientService
> >>> methodName:
> >>> > >> > Multi
> >>> > >> > >> size: 17.7 K connection: 10.248.130.152:49780: output error
> >>> > >> > >> 2014-08-06 22:27:57,258 WARN
>  [RpcServer.handler=18,port=60020]
> >>> > >> > >> ipc.RpcServer (CallRunner.java:run(135)) -
> >>> > >> > >> RpcServer.handler=18,port=60020:
> >>> > >> > >> caught a ClosedChannelException, this means that the server
> was
> >>> > >> > processing
> >>> > >> > >> a request but the client went away. The error message was:
> null
> >>> > >> > >> 2014-08-06 22:27:57,260 WARN
>  [RpcServer.handler=61,port=60020]
> >>> > >> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> >>> > >> > >> RpcServer.respondercallId: 83907 service: ClientService
> >>> methodName:
> >>> > >> > Multi
> >>> > >> > >> size: 17.1 K connection: 10.248.1.56:53615: output error
> >>> > >> > >> 2014-08-06 22:27:57,263 WARN
>  [RpcServer.handler=61,port=60020]
> >>> > >> > >> ipc.RpcServer (CallRunner.java:run(135)) -
> >>> > >> > >> RpcServer.handler=61,port=60020:
> >>> > >> > >> caught a ClosedChannelException, this means that the server
> was
> >>> > >> > processing
> >>> > >> > >> a request but the client went away. The error message was:
> null
> >>> > >> > >>
> >>> > >> > >>
> >>> > >> > >>
> >>> > >> > >> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <
> walter@adroll.com
> >>> >
> >>> > >> wrote:
> >>> > >> > >>
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> >
> >>> > >>
> >>> >
> >>>
> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
> >>> > >> > >> >
> >>> > >> > >> >
> >>> > >> >
> >>> >
> http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
> >>> > >> > >> >
> >>> > >> > >> > These are logs from that particular server, and the debug
> >>> dump
> >>> > from
> >>> > >> > >> now(no
> >>> > >> > >> > restart in between).  The times in the graph are pacific,
> so
> >>> it
> >>> > >> should
> >>> > >> > >> be
> >>> > >> > >> > around 2014-08-06 22:25:00.  I do see some exceptions
> around
> >>> > there.
> >>> > >> > >> >
> >>> > >> > >>
> >>> > >> > >
> >>> > >> > >
> >>> > >> >
> >>> > >>
> >>> > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>

Re: IPC Queue Size

Posted by Qiang Tian <ti...@gmail.com>.
update - it has been fixed in 0.98.4, HBASE-11374


On Sat, Aug 9, 2014 at 5:53 PM, Qiang Tian <ti...@gmail.com> wrote:

> please see HBASE-11714
>
>
> On Sat, Aug 9, 2014 at 4:27 PM, Qiang Tian <ti...@gmail.com> wrote:
>
>> took a quick look. it looks a bug, will dig more and update later.
>>
>>
>>
>> On Sat, Aug 9, 2014 at 1:45 AM, Walter King <wa...@adroll.com> wrote:
>>
>>> awesome, thanks!
>>> the resets and such were actually what I've been trying to investigate.
>>>  Recently switched from 0.94 and 0.98, and finding that periodically
>>> things
>>> are having issues - lots of retry exceptions.  Don't have much
>>> information,
>>> still looking into it when saw this.
>>>
>>> By timestamp, this seems to be the corresponding client exception:
>>>
>>> 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
>>> table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500
>>> ops,
>>> last exception: java.net.SocketTimeoutException: Call to
>>> ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
>>> because java.net.SocketTimeoutException: 2000 millis timeout while
>>> waiting
>>> for channel to be ready for read. ch :
>>> java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
>>> remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020]
>>> on
>>> ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
>>> started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
>>> ops.
>>>
>>> 2014-08-08 17:22:53 o.a.h.h.c.AsyncProcess [INFO] #105158, waiting for
>>> some
>>> tasks to finish. Expected max=0, tasksSent=11, tasksDone=10,
>>> currentTasksDone=10, retries=10 hasError=false,
>>> tableName=rt_global_monthly_campaign_deliveries
>>>
>>>
>>> On Fri, Aug 8, 2014 at 2:15 AM, Esteban Gutierrez <es...@cloudera.com>
>>> wrote:
>>>
>>> > The RequestContext seems to be ok since it hasn't been set. So just
>>> > decrementing the call size should be enough.
>>> >
>>> > cheers,
>>> > esteban.
>>> >
>>> >
>>> > --
>>> > Cloudera, Inc.
>>> >
>>> >
>>> >
>>> > On Fri, Aug 8, 2014 at 2:10 AM, Esteban Gutierrez <
>>> esteban@cloudera.com>
>>> > wrote:
>>> >
>>> > >
>>> > > Looks like after HBASE-10506 we didn't do what CallRunner used to do
>>> > > before: clear the RequestContext and decrement callQueueSize.
>>> > >
>>> > > Created HBASE-11705 to address this issue.
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Cloudera, Inc.
>>> > >
>>> > >
>>> > >
>>> > > On Fri, Aug 8, 2014 at 1:56 AM, Qiang Tian <ti...@gmail.com>
>>> wrote:
>>> > >
>>> > >> good point.  that is a big suspect.
>>> > >>
>>> > >> I check your log, ClosedChannelException should be triggered by
>>> > >>  call.sendResponseIfReady()(it is the only request in the queue, so
>>> > >> handler
>>> > >> send response directly), but at that point the callqueueSize has
>>> been
>>> > >> decremented.
>>> > >>
>>> > >> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
>>> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>>> > >> RpcServer.respondercallId: 118504 service: ClientService methodName:
>>> > Multi
>>> > >> size: 141.9 K connection: 10.248.134.67:55347: output error
>>> > >> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
>>> > >> ipc.RpcServer (CallRunner.java:run(135)) -
>>> > >> RpcServer.handler=57,port=60020:
>>> > >> caught a ClosedChannelException, this means that the server was
>>> > processing
>>> > >> a request but the client went away. The error message was: null
>>> > >>
>>> > >> it looks you have got the fix, would you file a jira?
>>> > >> thanks.
>>> > >>
>>> > >>
>>> > >> On Fri, Aug 8, 2014 at 2:41 PM, Walter King <wa...@adroll.com>
>>> wrote:
>>> > >>
>>> > >> > I've only looked at the code a little, and likely missed
>>> something,
>>> > but
>>> > >> > does this if block decrement the call queue, if the client already
>>> > >> closed
>>> > >> > the connection?
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >>
>>> >
>>> https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > On Thu, Aug 7, 2014 at 11:32 PM, Walter King <wa...@adroll.com>
>>> > wrote:
>>> > >> >
>>> > >> > > Yes, sorry, CallQueueTooBigException. but that value never
>>> returns
>>> > to
>>> > >> > > zero, even when number of requests goes to zero.  The call
>>> queue too
>>> > >> big
>>> > >> > > happens if any regionserver is up for a long enough period of
>>> time,
>>> > >> so I
>>> > >> > > have to periodically restart them.  Also at that 15:30 time I
>>> wasn't
>>> > >> > > seeing that exception, but it seems like that is one time in
>>> which a
>>> > >> call
>>> > >> > > didnt properly decrement the callqueuesize because it was at
>>> zero
>>> > >> before
>>> > >> > > and has never hit zero again - today the minimum is even higher.
>>> > >> > >
>>> > >> > >
>>> > >> > > On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <ti...@gmail.com>
>>> > wrote:
>>> > >> > >
>>> > >> > >> bq. "Eventually we ran into ipc queue size full messages being
>>> > >> returned
>>> > >> > to
>>> > >> > >> clients trying large batch puts, as it approaches a gigabyte."
>>> > >> > >>
>>> > >> > >> Do you mean CallQueueTooBigException? it looks not the queue
>>> size,
>>> > >> but
>>> > >> > the
>>> > >> > >> data size that client sends..configured by
>>> > >> > >> "hbase.ipc.server.max.callqueue.size".
>>> > >> > >>
>>> > >> > >> I guess when you client got the exception, it closed the
>>> exception
>>> > >> and
>>> > >> > >> causing other shared connection RPC failed.
>>> > >> > >>
>>> > >> > >>
>>> > >> > >> 2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020]
>>> > >> > >> ipc.RpcServer
>>> > >> > >> (RpcServer.java:doRead(794)) - RpcServer.listener,port=60020:
>>> count
>>> > >> of
>>> > >> > >> bytes read: 0
>>> > >> > >> java.io.IOException: Connection reset by peer
>>> > >> > >> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>> > >> > >> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>> > >> > >> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>> > >> > >> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>> > >> > >> at
>>> sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>> > >> > >> at
>>> > >> >
>>> org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
>>> > >> > >> at
>>> > >> > >>
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
>>> > >> > >> at
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
>>> > >> > >> at
>>> > >> > >>
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
>>> > >> > >> at
>>> > >> > >>
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
>>> > >> > >> at
>>> > >> > >>
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> > >> > >> at
>>> > >> > >>
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> > >> > >> at java.lang.Thread.run(Thread.java:744)
>>> > >> > >> 2014-08-06 22:27:57,257 WARN  [RpcServer.handler=18,port=60020]
>>> > >> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>>> > >> > >> RpcServer.respondercallId: 84968 service: ClientService
>>> methodName:
>>> > >> > Multi
>>> > >> > >> size: 17.7 K connection: 10.248.130.152:49780: output error
>>> > >> > >> 2014-08-06 22:27:57,258 WARN  [RpcServer.handler=18,port=60020]
>>> > >> > >> ipc.RpcServer (CallRunner.java:run(135)) -
>>> > >> > >> RpcServer.handler=18,port=60020:
>>> > >> > >> caught a ClosedChannelException, this means that the server was
>>> > >> > processing
>>> > >> > >> a request but the client went away. The error message was: null
>>> > >> > >> 2014-08-06 22:27:57,260 WARN  [RpcServer.handler=61,port=60020]
>>> > >> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>>> > >> > >> RpcServer.respondercallId: 83907 service: ClientService
>>> methodName:
>>> > >> > Multi
>>> > >> > >> size: 17.1 K connection: 10.248.1.56:53615: output error
>>> > >> > >> 2014-08-06 22:27:57,263 WARN  [RpcServer.handler=61,port=60020]
>>> > >> > >> ipc.RpcServer (CallRunner.java:run(135)) -
>>> > >> > >> RpcServer.handler=61,port=60020:
>>> > >> > >> caught a ClosedChannelException, this means that the server was
>>> > >> > processing
>>> > >> > >> a request but the client went away. The error message was: null
>>> > >> > >>
>>> > >> > >>
>>> > >> > >>
>>> > >> > >> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <walter@adroll.com
>>> >
>>> > >> wrote:
>>> > >> > >>
>>> > >> > >> >
>>> > >> > >>
>>> > >> >
>>> > >>
>>> >
>>> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
>>> > >> > >> >
>>> > >> > >> >
>>> > >> >
>>> > http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
>>> > >> > >> >
>>> > >> > >> > These are logs from that particular server, and the debug
>>> dump
>>> > from
>>> > >> > >> now(no
>>> > >> > >> > restart in between).  The times in the graph are pacific, so
>>> it
>>> > >> should
>>> > >> > >> be
>>> > >> > >> > around 2014-08-06 22:25:00.  I do see some exceptions around
>>> > there.
>>> > >> > >> >
>>> > >> > >>
>>> > >> > >
>>> > >> > >
>>> > >> >
>>> > >>
>>> > >
>>> > >
>>> >
>>>
>>
>>
>

Re: IPC Queue Size

Posted by Qiang Tian <ti...@gmail.com>.
please see HBASE-11714


On Sat, Aug 9, 2014 at 4:27 PM, Qiang Tian <ti...@gmail.com> wrote:

> took a quick look. it looks a bug, will dig more and update later.
>
>
>
> On Sat, Aug 9, 2014 at 1:45 AM, Walter King <wa...@adroll.com> wrote:
>
>> awesome, thanks!
>> the resets and such were actually what I've been trying to investigate.
>>  Recently switched from 0.94 and 0.98, and finding that periodically
>> things
>> are having issues - lots of retry exceptions.  Don't have much
>> information,
>> still looking into it when saw this.
>>
>> By timestamp, this seems to be the corresponding client exception:
>>
>> 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
>> table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
>> last exception: java.net.SocketTimeoutException: Call to
>> ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
>> because java.net.SocketTimeoutException: 2000 millis timeout while waiting
>> for channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
>> remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020]
>> on
>> ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
>> started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
>> ops.
>>
>> 2014-08-08 17:22:53 o.a.h.h.c.AsyncProcess [INFO] #105158, waiting for
>> some
>> tasks to finish. Expected max=0, tasksSent=11, tasksDone=10,
>> currentTasksDone=10, retries=10 hasError=false,
>> tableName=rt_global_monthly_campaign_deliveries
>>
>>
>> On Fri, Aug 8, 2014 at 2:15 AM, Esteban Gutierrez <es...@cloudera.com>
>> wrote:
>>
>> > The RequestContext seems to be ok since it hasn't been set. So just
>> > decrementing the call size should be enough.
>> >
>> > cheers,
>> > esteban.
>> >
>> >
>> > --
>> > Cloudera, Inc.
>> >
>> >
>> >
>> > On Fri, Aug 8, 2014 at 2:10 AM, Esteban Gutierrez <esteban@cloudera.com
>> >
>> > wrote:
>> >
>> > >
>> > > Looks like after HBASE-10506 we didn't do what CallRunner used to do
>> > > before: clear the RequestContext and decrement callQueueSize.
>> > >
>> > > Created HBASE-11705 to address this issue.
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Cloudera, Inc.
>> > >
>> > >
>> > >
>> > > On Fri, Aug 8, 2014 at 1:56 AM, Qiang Tian <ti...@gmail.com> wrote:
>> > >
>> > >> good point.  that is a big suspect.
>> > >>
>> > >> I check your log, ClosedChannelException should be triggered by
>> > >>  call.sendResponseIfReady()(it is the only request in the queue, so
>> > >> handler
>> > >> send response directly), but at that point the callqueueSize has been
>> > >> decremented.
>> > >>
>> > >> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
>> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>> > >> RpcServer.respondercallId: 118504 service: ClientService methodName:
>> > Multi
>> > >> size: 141.9 K connection: 10.248.134.67:55347: output error
>> > >> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
>> > >> ipc.RpcServer (CallRunner.java:run(135)) -
>> > >> RpcServer.handler=57,port=60020:
>> > >> caught a ClosedChannelException, this means that the server was
>> > processing
>> > >> a request but the client went away. The error message was: null
>> > >>
>> > >> it looks you have got the fix, would you file a jira?
>> > >> thanks.
>> > >>
>> > >>
>> > >> On Fri, Aug 8, 2014 at 2:41 PM, Walter King <wa...@adroll.com>
>> wrote:
>> > >>
>> > >> > I've only looked at the code a little, and likely missed something,
>> > but
>> > >> > does this if block decrement the call queue, if the client already
>> > >> closed
>> > >> > the connection?
>> > >> >
>> > >> >
>> > >> >
>> > >>
>> >
>> https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Thu, Aug 7, 2014 at 11:32 PM, Walter King <wa...@adroll.com>
>> > wrote:
>> > >> >
>> > >> > > Yes, sorry, CallQueueTooBigException. but that value never
>> returns
>> > to
>> > >> > > zero, even when number of requests goes to zero.  The call queue
>> too
>> > >> big
>> > >> > > happens if any regionserver is up for a long enough period of
>> time,
>> > >> so I
>> > >> > > have to periodically restart them.  Also at that 15:30 time I
>> wasn't
>> > >> > > seeing that exception, but it seems like that is one time in
>> which a
>> > >> call
>> > >> > > didnt properly decrement the callqueuesize because it was at zero
>> > >> before
>> > >> > > and has never hit zero again - today the minimum is even higher.
>> > >> > >
>> > >> > >
>> > >> > > On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <ti...@gmail.com>
>> > wrote:
>> > >> > >
>> > >> > >> bq. "Eventually we ran into ipc queue size full messages being
>> > >> returned
>> > >> > to
>> > >> > >> clients trying large batch puts, as it approaches a gigabyte."
>> > >> > >>
>> > >> > >> Do you mean CallQueueTooBigException? it looks not the queue
>> size,
>> > >> but
>> > >> > the
>> > >> > >> data size that client sends..configured by
>> > >> > >> "hbase.ipc.server.max.callqueue.size".
>> > >> > >>
>> > >> > >> I guess when you client got the exception, it closed the
>> exception
>> > >> and
>> > >> > >> causing other shared connection RPC failed.
>> > >> > >>
>> > >> > >>
>> > >> > >> 2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020]
>> > >> > >> ipc.RpcServer
>> > >> > >> (RpcServer.java:doRead(794)) - RpcServer.listener,port=60020:
>> count
>> > >> of
>> > >> > >> bytes read: 0
>> > >> > >> java.io.IOException: Connection reset by peer
>> > >> > >> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>> > >> > >> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>> > >> > >> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>> > >> > >> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>> > >> > >> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>> > >> > >> at
>> > >> >
>> org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
>> > >> > >> at
>> > >> > >>
>> > >> > >>
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
>> > >> > >> at
>> > >> > >>
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
>> > >> > >> at
>> > >> > >>
>> > >> > >>
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
>> > >> > >> at
>> > >> > >>
>> > >> > >>
>> > >> >
>> > >>
>> >
>> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
>> > >> > >> at
>> > >> > >>
>> > >> > >>
>> > >> >
>> > >>
>> >
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> > >> > >> at
>> > >> > >>
>> > >> > >>
>> > >> >
>> > >>
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> > >> > >> at java.lang.Thread.run(Thread.java:744)
>> > >> > >> 2014-08-06 22:27:57,257 WARN  [RpcServer.handler=18,port=60020]
>> > >> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>> > >> > >> RpcServer.respondercallId: 84968 service: ClientService
>> methodName:
>> > >> > Multi
>> > >> > >> size: 17.7 K connection: 10.248.130.152:49780: output error
>> > >> > >> 2014-08-06 22:27:57,258 WARN  [RpcServer.handler=18,port=60020]
>> > >> > >> ipc.RpcServer (CallRunner.java:run(135)) -
>> > >> > >> RpcServer.handler=18,port=60020:
>> > >> > >> caught a ClosedChannelException, this means that the server was
>> > >> > processing
>> > >> > >> a request but the client went away. The error message was: null
>> > >> > >> 2014-08-06 22:27:57,260 WARN  [RpcServer.handler=61,port=60020]
>> > >> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>> > >> > >> RpcServer.respondercallId: 83907 service: ClientService
>> methodName:
>> > >> > Multi
>> > >> > >> size: 17.1 K connection: 10.248.1.56:53615: output error
>> > >> > >> 2014-08-06 22:27:57,263 WARN  [RpcServer.handler=61,port=60020]
>> > >> > >> ipc.RpcServer (CallRunner.java:run(135)) -
>> > >> > >> RpcServer.handler=61,port=60020:
>> > >> > >> caught a ClosedChannelException, this means that the server was
>> > >> > processing
>> > >> > >> a request but the client went away. The error message was: null
>> > >> > >>
>> > >> > >>
>> > >> > >>
>> > >> > >> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <wa...@adroll.com>
>> > >> wrote:
>> > >> > >>
>> > >> > >> >
>> > >> > >>
>> > >> >
>> > >>
>> >
>> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
>> > >> > >> >
>> > >> > >> >
>> > >> >
>> > http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
>> > >> > >> >
>> > >> > >> > These are logs from that particular server, and the debug dump
>> > from
>> > >> > >> now(no
>> > >> > >> > restart in between).  The times in the graph are pacific, so
>> it
>> > >> should
>> > >> > >> be
>> > >> > >> > around 2014-08-06 22:25:00.  I do see some exceptions around
>> > there.
>> > >> > >> >
>> > >> > >>
>> > >> > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Re: IPC Queue Size

Posted by Qiang Tian <ti...@gmail.com>.
Took a quick look. It looks like a bug; will dig more and update later.



On Sat, Aug 9, 2014 at 1:45 AM, Walter King <wa...@adroll.com> wrote:

> awesome, thanks!
> the resets and such were actually what I've been trying to investigate.
>  Recently switched from 0.94 and 0.98, and finding that periodically things
> are having issues - lots of retry exceptions.  Don't have much information,
> still looking into it when saw this.
>
> By timestamp, this seems to be the corresponding client exception:
>
> 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
> table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
> last exception: java.net.SocketTimeoutException: Call to
> ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
> because java.net.SocketTimeoutException: 2000 millis timeout while waiting
> for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
> remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
> ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
> started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
> ops.
>
> 2014-08-08 17:22:53 o.a.h.h.c.AsyncProcess [INFO] #105158, waiting for some
> tasks to finish. Expected max=0, tasksSent=11, tasksDone=10,
> currentTasksDone=10, retries=10 hasError=false,
> tableName=rt_global_monthly_campaign_deliveries
>
>
> On Fri, Aug 8, 2014 at 2:15 AM, Esteban Gutierrez <es...@cloudera.com>
> wrote:
>
> > The RequestContext seems to be ok since it hasn't been set. So just
> > decrementing the call size should be enough.
> >
> > cheers,
> > esteban.
> >
> >
> > --
> > Cloudera, Inc.
> >
> >
> >
> > On Fri, Aug 8, 2014 at 2:10 AM, Esteban Gutierrez <es...@cloudera.com>
> > wrote:
> >
> > >
> > > Looks like after HBASE-10506 we didn't do what CallRunner used to do
> > > before: clear the RequestContext and decrement callQueueSize.
> > >
> > > Created HBASE-11705 to address this issue.
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Cloudera, Inc.
> > >
> > >
> > >
> > > On Fri, Aug 8, 2014 at 1:56 AM, Qiang Tian <ti...@gmail.com> wrote:
> > >
> > >> good point.  that is a big suspect.
> > >>
> > >> I check your log, ClosedChannelException should be triggered by
> > >>  call.sendResponseIfReady()(it is the only request in the queue, so
> > >> handler
> > >> send response directly), but at that point the callqueueSize has been
> > >> decremented.
> > >>
> > >> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> > >> RpcServer.respondercallId: 118504 service: ClientService methodName:
> > Multi
> > >> size: 141.9 K connection: 10.248.134.67:55347: output error
> > >> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
> > >> ipc.RpcServer (CallRunner.java:run(135)) -
> > >> RpcServer.handler=57,port=60020:
> > >> caught a ClosedChannelException, this means that the server was
> > processing
> > >> a request but the client went away. The error message was: null
> > >>
> > >> it looks you have got the fix, would you file a jira?
> > >> thanks.
> > >>
> > >>
> > >> On Fri, Aug 8, 2014 at 2:41 PM, Walter King <wa...@adroll.com>
> wrote:
> > >>
> > >> > I've only looked at the code a little, and likely missed something,
> > but
> > >> > does this if block decrement the call queue, if the client already
> > >> closed
> > >> > the connection?
> > >> >
> > >> >
> > >> >
> > >>
> >
> https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
> > >> >
> > >> >
> > >> >
> > >> > On Thu, Aug 7, 2014 at 11:32 PM, Walter King <wa...@adroll.com>
> > wrote:
> > >> >
> > >> > > Yes, sorry, CallQueueTooBigException. but that value never returns
> > to
> > >> > > zero, even when number of requests goes to zero.  The call queue
> too
> > >> big
> > >> > > happens if any regionserver is up for a long enough period of
> time,
> > >> so I
> > >> > > have to periodically restart them.  Also at that 15:30 time I
> wasn't
> > >> > > seeing that exception, but it seems like that is one time in
> which a
> > >> call
> > >> > > didnt properly decrement the callqueuesize because it was at zero
> > >> before
> > >> > > and has never hit zero again - today the minimum is even higher.
> > >> > >
> > >> > >
> > >> > > On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <ti...@gmail.com>
> > wrote:
> > >> > >
> > >> > >> bq. "Eventually we ran into ipc queue size full messages being
> > >> returned
> > >> > to
> > >> > >> clients trying large batch puts, as it approaches a gigabyte."
> > >> > >>
> > >> > >> Do you mean CallQueueTooBigException? it looks not the queue
> size,
> > >> but
> > >> > the
> > >> > >> data size that client sends..configured by
> > >> > >> "hbase.ipc.server.max.callqueue.size".
> > >> > >>
> > >> > >> I guess when you client got the exception, it closed the
> exception
> > >> and
> > >> > >> causing other shared connection RPC failed.
> > >> > >>
> > >> > >>
> > >> > >> 2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020]
> > >> > >> ipc.RpcServer
> > >> > >> (RpcServer.java:doRead(794)) - RpcServer.listener,port=60020:
> count
> > >> of
> > >> > >> bytes read: 0
> > >> > >> java.io.IOException: Connection reset by peer
> > >> > >> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> > >> > >> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> > >> > >> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> > >> > >> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> > >> > >> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> > >> > >> at
> > >> >
> org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
> > >> > >> at
> > >> > >>
> > >> > >>
> > >> >
> > >>
> >
> org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
> > >> > >> at
> > >> > >>
> > >> >
> > >>
> > org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
> > >> > >> at
> > >> > >>
> > >> > >>
> > >> >
> > >>
> >
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
> > >> > >> at
> > >> > >>
> > >> > >>
> > >> >
> > >>
> >
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
> > >> > >> at
> > >> > >>
> > >> > >>
> > >> >
> > >>
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > >> > >> at
> > >> > >>
> > >> > >>
> > >> >
> > >>
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > >> > >> at java.lang.Thread.run(Thread.java:744)
> > >> > >> 2014-08-06 22:27:57,257 WARN  [RpcServer.handler=18,port=60020]
> > >> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> > >> > >> RpcServer.respondercallId: 84968 service: ClientService
> methodName:
> > >> > Multi
> > >> > >> size: 17.7 K connection: 10.248.130.152:49780: output error
> > >> > >> 2014-08-06 22:27:57,258 WARN  [RpcServer.handler=18,port=60020]
> > >> > >> ipc.RpcServer (CallRunner.java:run(135)) -
> > >> > >> RpcServer.handler=18,port=60020:
> > >> > >> caught a ClosedChannelException, this means that the server was
> > >> > processing
> > >> > >> a request but the client went away. The error message was: null
> > >> > >> 2014-08-06 22:27:57,260 WARN  [RpcServer.handler=61,port=60020]
> > >> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> > >> > >> RpcServer.respondercallId: 83907 service: ClientService
> methodName:
> > >> > Multi
> > >> > >> size: 17.1 K connection: 10.248.1.56:53615: output error
> > >> > >> 2014-08-06 22:27:57,263 WARN  [RpcServer.handler=61,port=60020]
> > >> > >> ipc.RpcServer (CallRunner.java:run(135)) -
> > >> > >> RpcServer.handler=61,port=60020:
> > >> > >> caught a ClosedChannelException, this means that the server was
> > >> > processing
> > >> > >> a request but the client went away. The error message was: null
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <wa...@adroll.com>
> > >> wrote:
> > >> > >>
> > >> > >> >
> > >> > >>
> > >> >
> > >>
> >
> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
> > >> > >> >
> > >> > >> >
> > >> >
> > http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
> > >> > >> >
> > >> > >> > These are logs from that particular server, and the debug dump
> > from
> > >> > >> now(no
> > >> > >> > restart in between).  The times in the graph are pacific, so it
> > >> should
> > >> > >> be
> > >> > >> > around 2014-08-06 22:25:00.  I do see some exceptions around
> > there.
> > >> > >> >
> > >> > >>
> > >> > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: IPC Queue Size

Posted by Walter King <wa...@adroll.com>.
awesome, thanks!
The resets and such were actually what I've been trying to investigate.
We recently switched from 0.94 to 0.98 and are finding that periodically
things are having issues - lots of retry exceptions.  I don't have much
information yet; I was still looking into it when I saw this.

By timestamp, this seems to be the corresponding client exception:

2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
last exception: java.net.SocketTimeoutException: Call to
ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
because java.net.SocketTimeoutException: 2000 millis timeout while waiting
for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
ops.

2014-08-08 17:22:53 o.a.h.h.c.AsyncProcess [INFO] #105158, waiting for some
tasks to finish. Expected max=0, tasksSent=11, tasksDone=10,
currentTasksDone=10, retries=10 hasError=false,
tableName=rt_global_monthly_campaign_deliveries
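
(For context on the timeouts and retries in the log above, a minimal sketch
of the client-side knobs that govern them. The values are illustrative
assumptions, not recommendations from this thread, and the 0.98-era
connection API shown here is deprecated in later releases.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;

public class ClientRpcTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Per-RPC timeout (illustrative value; defaults vary by release).
    conf.setInt("hbase.rpc.timeout", 60000);
    // Client retry count; compare the attempt=10/35 counter in the AsyncProcess log line.
    conf.setInt("hbase.client.retries.number", 35);
    // Base pause used when backing off between retries.
    conf.setLong("hbase.client.pause", 100);

    HConnection connection = HConnectionManager.createConnection(conf);
    try {
      HTableInterface table = connection.getTable("rt_global_monthly_campaign_deliveries");
      try {
        // Large batch puts (table.put(List<Put>)) would run against the tuned connection here.
      } finally {
        table.close();
      }
    } finally {
      connection.close();
    }
  }
}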


On Fri, Aug 8, 2014 at 2:15 AM, Esteban Gutierrez <es...@cloudera.com>
wrote:

> The RequestContext seems to be ok since it hasn't been set. So just
> decrementing the call size should be enough.
>
> cheers,
> esteban.
>
>
> --
> Cloudera, Inc.
>
>
>
> On Fri, Aug 8, 2014 at 2:10 AM, Esteban Gutierrez <es...@cloudera.com>
> wrote:
>
> >
> > Looks like after HBASE-10506 we didn't do what CallRunner used to do
> > before: clear the RequestContext and decrement callQueueSize.
> >
> > Created HBASE-11705 to address this issue.
> >
> >
> >
> >
> >
> >
> > --
> > Cloudera, Inc.
> >
> >
> >
> > On Fri, Aug 8, 2014 at 1:56 AM, Qiang Tian <ti...@gmail.com> wrote:
> >
> >> good point.  that is a big suspect.
> >>
> >> I check your log, ClosedChannelException should be triggered by
> >>  call.sendResponseIfReady()(it is the only request in the queue, so
> >> handler
> >> send response directly), but at that point the callqueueSize has been
> >> decremented.
> >>
> >> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
> >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> >> RpcServer.respondercallId: 118504 service: ClientService methodName:
> Multi
> >> size: 141.9 K connection: 10.248.134.67:55347: output error
> >> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
> >> ipc.RpcServer (CallRunner.java:run(135)) -
> >> RpcServer.handler=57,port=60020:
> >> caught a ClosedChannelException, this means that the server was
> processing
> >> a request but the client went away. The error message was: null
> >>
> >> it looks you have got the fix, would you file a jira?
> >> thanks.
> >>
> >>
> >> On Fri, Aug 8, 2014 at 2:41 PM, Walter King <wa...@adroll.com> wrote:
> >>
> >> > I've only looked at the code a little, and likely missed something,
> but
> >> > does this if block decrement the call queue, if the client already
> >> closed
> >> > the connection?
> >> >
> >> >
> >> >
> >>
> https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
> >> >
> >> >
> >> >
> >> > On Thu, Aug 7, 2014 at 11:32 PM, Walter King <wa...@adroll.com>
> wrote:
> >> >
> >> > > Yes, sorry, CallQueueTooBigException. but that value never returns
> to
> >> > > zero, even when number of requests goes to zero.  The call queue too
> >> big
> >> > > happens if any regionserver is up for a long enough period of time,
> >> so I
> >> > > have to periodically restart them.  Also at that 15:30 time I wasn't
> >> > > seeing that exception, but it seems like that is one time in which a
> >> call
> >> > > didnt properly decrement the callqueuesize because it was at zero
> >> before
> >> > > and has never hit zero again - today the minimum is even higher.
> >> > >
> >> > >
> >> > > On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <ti...@gmail.com>
> wrote:
> >> > >
> >> > >> bq. "Eventually we ran into ipc queue size full messages being
> >> returned
> >> > to
> >> > >> clients trying large batch puts, as it approaches a gigabyte."
> >> > >>
> >> > >> Do you mean CallQueueTooBigException? it looks not the queue size,
> >> but
> >> > the
> >> > >> data size that client sends..configured by
> >> > >> "hbase.ipc.server.max.callqueue.size".
> >> > >>
> >> > >> I guess when you client got the exception, it closed the exception
> >> and
> >> > >> causing other shared connection RPC failed.
> >> > >>
> >> > >>
> >> > >> 2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020]
> >> > >> ipc.RpcServer
> >> > >> (RpcServer.java:doRead(794)) - RpcServer.listener,port=60020: count
> >> of
> >> > >> bytes read: 0
> >> > >> java.io.IOException: Connection reset by peer
> >> > >> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> >> > >> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >> > >> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> >> > >> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> >> > >> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> >> > >> at
> >> > org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
> >> > >> at
> >> > >>
> >> > >>
> >> >
> >>
> org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
> >> > >> at
> >> > >>
> >> >
> >>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
> >> > >> at
> >> > >>
> >> > >>
> >> >
> >>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
> >> > >> at
> >> > >>
> >> > >>
> >> >
> >>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
> >> > >> at
> >> > >>
> >> > >>
> >> >
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> > >> at
> >> > >>
> >> > >>
> >> >
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> > >> at java.lang.Thread.run(Thread.java:744)
> >> > >> 2014-08-06 22:27:57,257 WARN  [RpcServer.handler=18,port=60020]
> >> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> >> > >> RpcServer.respondercallId: 84968 service: ClientService methodName:
> >> > Multi
> >> > >> size: 17.7 K connection: 10.248.130.152:49780: output error
> >> > >> 2014-08-06 22:27:57,258 WARN  [RpcServer.handler=18,port=60020]
> >> > >> ipc.RpcServer (CallRunner.java:run(135)) -
> >> > >> RpcServer.handler=18,port=60020:
> >> > >> caught a ClosedChannelException, this means that the server was
> >> > processing
> >> > >> a request but the client went away. The error message was: null
> >> > >> 2014-08-06 22:27:57,260 WARN  [RpcServer.handler=61,port=60020]
> >> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> >> > >> RpcServer.respondercallId: 83907 service: ClientService methodName:
> >> > Multi
> >> > >> size: 17.1 K connection: 10.248.1.56:53615: output error
> >> > >> 2014-08-06 22:27:57,263 WARN  [RpcServer.handler=61,port=60020]
> >> > >> ipc.RpcServer (CallRunner.java:run(135)) -
> >> > >> RpcServer.handler=61,port=60020:
> >> > >> caught a ClosedChannelException, this means that the server was
> >> > processing
> >> > >> a request but the client went away. The error message was: null
> >> > >>
> >> > >>
> >> > >>
> >> > >> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <wa...@adroll.com>
> >> wrote:
> >> > >>
> >> > >> >
> >> > >>
> >> >
> >>
> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
> >> > >> >
> >> > >> >
> >> >
> http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
> >> > >> >
> >> > >> > These are logs from that particular server, and the debug dump
> from
> >> > >> now(no
> >> > >> > restart in between).  The times in the graph are pacific, so it
> >> should
> >> > >> be
> >> > >> > around 2014-08-06 22:25:00.  I do see some exceptions around
> there.
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>

Re: IPC Queue Size

Posted by Esteban Gutierrez <es...@cloudera.com>.
The RequestContext seems to be ok since it hasn't been set. So just
decrementing the call size should be enough.

cheers,
esteban.


--
Cloudera, Inc.



On Fri, Aug 8, 2014 at 2:10 AM, Esteban Gutierrez <es...@cloudera.com>
wrote:

>
> Looks like after HBASE-10506 we didn't do what CallRunner used to do
> before: clear the RequestContext and decrement callQueueSize.
>
> Created HBASE-11705 to address this issue.
>
>
>
>
>
>
> --
> Cloudera, Inc.
>
>
>
> On Fri, Aug 8, 2014 at 1:56 AM, Qiang Tian <ti...@gmail.com> wrote:
>
>> good point.  that is a big suspect.
>>
>> I check your log, ClosedChannelException should be triggered by
>>  call.sendResponseIfReady()(it is the only request in the queue, so
>> handler
>> send response directly), but at that point the callqueueSize has been
>> decremented.
>>
>> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
>> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>> RpcServer.respondercallId: 118504 service: ClientService methodName: Multi
>> size: 141.9 K connection: 10.248.134.67:55347: output error
>> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
>> ipc.RpcServer (CallRunner.java:run(135)) -
>> RpcServer.handler=57,port=60020:
>> caught a ClosedChannelException, this means that the server was processing
>> a request but the client went away. The error message was: null
>>
>> it looks you have got the fix, would you file a jira?
>> thanks.
>>
>>
>> On Fri, Aug 8, 2014 at 2:41 PM, Walter King <wa...@adroll.com> wrote:
>>
>> > I've only looked at the code a little, and likely missed something, but
>> > does this if block decrement the call queue, if the client already
>> closed
>> > the connection?
>> >
>> >
>> >
>> https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
>> >
>> >
>> >
>> > On Thu, Aug 7, 2014 at 11:32 PM, Walter King <wa...@adroll.com> wrote:
>> >
>> > > Yes, sorry, CallQueueTooBigException. but that value never returns to
>> > > zero, even when number of requests goes to zero.  The call queue too
>> big
>> > > happens if any regionserver is up for a long enough period of time,
>> so I
>> > > have to periodically restart them.  Also at that 15:30 time I wasn't
>> > > seeing that exception, but it seems like that is one time in which a
>> call
>> > > didnt properly decrement the callqueuesize because it was at zero
>> before
>> > > and has never hit zero again - today the minimum is even higher.
>> > >
>> > >
>> > > On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <ti...@gmail.com> wrote:
>> > >
>> > >> bq. "Eventually we ran into ipc queue size full messages being
>> returned
>> > to
>> > >> clients trying large batch puts, as it approaches a gigabyte."
>> > >>
>> > >> Do you mean CallQueueTooBigException? it looks not the queue size,
>> but
>> > the
>> > >> data size that client sends..configured by
>> > >> "hbase.ipc.server.max.callqueue.size".
>> > >>
>> > >> I guess when you client got the exception, it closed the exception
>> and
>> > >> causing other shared connection RPC failed.
>> > >>
>> > >>
>> > >> 2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020]
>> > >> ipc.RpcServer
>> > >> (RpcServer.java:doRead(794)) - RpcServer.listener,port=60020: count
>> of
>> > >> bytes read: 0
>> > >> java.io.IOException: Connection reset by peer
>> > >> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>> > >> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>> > >> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>> > >> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>> > >> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>> > >> at
>> > org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
>> > >> at
>> > >>
>> > >>
>> >
>> org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
>> > >> at
>> > >>
>> >
>> org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
>> > >> at
>> > >>
>> > >>
>> >
>> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
>> > >> at
>> > >>
>> > >>
>> >
>> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
>> > >> at
>> > >>
>> > >>
>> >
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> > >> at
>> > >>
>> > >>
>> >
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> > >> at java.lang.Thread.run(Thread.java:744)
>> > >> 2014-08-06 22:27:57,257 WARN  [RpcServer.handler=18,port=60020]
>> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>> > >> RpcServer.respondercallId: 84968 service: ClientService methodName:
>> > Multi
>> > >> size: 17.7 K connection: 10.248.130.152:49780: output error
>> > >> 2014-08-06 22:27:57,258 WARN  [RpcServer.handler=18,port=60020]
>> > >> ipc.RpcServer (CallRunner.java:run(135)) -
>> > >> RpcServer.handler=18,port=60020:
>> > >> caught a ClosedChannelException, this means that the server was
>> > processing
>> > >> a request but the client went away. The error message was: null
>> > >> 2014-08-06 22:27:57,260 WARN  [RpcServer.handler=61,port=60020]
>> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>> > >> RpcServer.respondercallId: 83907 service: ClientService methodName:
>> > Multi
>> > >> size: 17.1 K connection: 10.248.1.56:53615: output error
>> > >> 2014-08-06 22:27:57,263 WARN  [RpcServer.handler=61,port=60020]
>> > >> ipc.RpcServer (CallRunner.java:run(135)) -
>> > >> RpcServer.handler=61,port=60020:
>> > >> caught a ClosedChannelException, this means that the server was
>> > processing
>> > >> a request but the client went away. The error message was: null
>> > >>
>> > >>
>> > >>
>> > >> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <wa...@adroll.com>
>> wrote:
>> > >>
>> > >> >
>> > >>
>> >
>> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
>> > >> >
>> > >> >
>> > http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
>> > >> >
>> > >> > These are logs from that particular server, and the debug dump from
>> > >> now(no
>> > >> > restart in between).  The times in the graph are pacific, so it
>> should
>> > >> be
>> > >> > around 2014-08-06 22:25:00.  I do see some exceptions around there.
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Re: IPC Queue Size

Posted by Esteban Gutierrez <es...@cloudera.com>.
Looks like after HBASE-10506 we didn't do what CallRunner used to do
before: clear the RequestContext and decrement callQueueSize.

Created HBASE-11705 to address this issue.






--
Cloudera, Inc.



On Fri, Aug 8, 2014 at 1:56 AM, Qiang Tian <ti...@gmail.com> wrote:

> good point.  that is a big suspect.
>
> I check your log, ClosedChannelException should be triggered by
>  call.sendResponseIfReady()(it is the only request in the queue, so handler
> send response directly), but at that point the callqueueSize has been
> decremented.
>
> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> RpcServer.respondercallId: 118504 service: ClientService methodName: Multi
> size: 141.9 K connection: 10.248.134.67:55347: output error
> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
> ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=57,port=60020:
> caught a ClosedChannelException, this means that the server was processing
> a request but the client went away. The error message was: null
>
> it looks you have got the fix, would you file a jira?
> thanks.
>
>
> On Fri, Aug 8, 2014 at 2:41 PM, Walter King <wa...@adroll.com> wrote:
>
> > I've only looked at the code a little, and likely missed something, but
> > does this if block decrement the call queue, if the client already closed
> > the connection?
> >
> >
> >
> https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
> >
> >
> >
> > On Thu, Aug 7, 2014 at 11:32 PM, Walter King <wa...@adroll.com> wrote:
> >
> > > Yes, sorry, CallQueueTooBigException. but that value never returns to
> > > zero, even when number of requests goes to zero.  The call queue too
> big
> > > happens if any regionserver is up for a long enough period of time, so
> I
> > > have to periodically restart them.  Also at that 15:30 time I wasn't
> > > seeing that exception, but it seems like that is one time in which a
> call
> > > didnt properly decrement the callqueuesize because it was at zero
> before
> > > and has never hit zero again - today the minimum is even higher.
> > >
> > >
> > > On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <ti...@gmail.com> wrote:
> > >
> > >> bq. "Eventually we ran into ipc queue size full messages being
> returned
> > to
> > >> clients trying large batch puts, as it approaches a gigabyte."
> > >>
> > >> Do you mean CallQueueTooBigException? it looks not the queue size, but
> > the
> > >> data size that client sends..configured by
> > >> "hbase.ipc.server.max.callqueue.size".
> > >>
> > >> I guess when you client got the exception, it closed the exception and
> > >> causing other shared connection RPC failed.
> > >>
> > >>
> > >> 2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020]
> > >> ipc.RpcServer
> > >> (RpcServer.java:doRead(794)) - RpcServer.listener,port=60020: count of
> > >> bytes read: 0
> > >> java.io.IOException: Connection reset by peer
> > >> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> > >> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> > >> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> > >> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> > >> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> > >> at
> > org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
> > >> at
> > >>
> > >>
> >
> org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
> > >> at
> > >>
> > org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
> > >> at
> > >>
> > >>
> >
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
> > >> at
> > >>
> > >>
> >
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
> > >> at
> > >>
> > >>
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > >> at
> > >>
> > >>
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > >> at java.lang.Thread.run(Thread.java:744)
> > >> 2014-08-06 22:27:57,257 WARN  [RpcServer.handler=18,port=60020]
> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> > >> RpcServer.respondercallId: 84968 service: ClientService methodName:
> > Multi
> > >> size: 17.7 K connection: 10.248.130.152:49780: output error
> > >> 2014-08-06 22:27:57,258 WARN  [RpcServer.handler=18,port=60020]
> > >> ipc.RpcServer (CallRunner.java:run(135)) -
> > >> RpcServer.handler=18,port=60020:
> > >> caught a ClosedChannelException, this means that the server was
> > processing
> > >> a request but the client went away. The error message was: null
> > >> 2014-08-06 22:27:57,260 WARN  [RpcServer.handler=61,port=60020]
> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> > >> RpcServer.respondercallId: 83907 service: ClientService methodName:
> > Multi
> > >> size: 17.1 K connection: 10.248.1.56:53615: output error
> > >> 2014-08-06 22:27:57,263 WARN  [RpcServer.handler=61,port=60020]
> > >> ipc.RpcServer (CallRunner.java:run(135)) -
> > >> RpcServer.handler=61,port=60020:
> > >> caught a ClosedChannelException, this means that the server was
> > processing
> > >> a request but the client went away. The error message was: null
> > >>
> > >>
> > >>
> > >> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <wa...@adroll.com>
> wrote:
> > >>
> > >> >
> > >>
> >
> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
> > >> >
> > >> >
> > http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
> > >> >
> > >> > These are logs from that particular server, and the debug dump from
> > >> now(no
> > >> > restart in between).  The times in the graph are pacific, so it
> should
> > >> be
> > >> > around 2014-08-06 22:25:00.  I do see some exceptions around there.
> > >> >
> > >>
> > >
> > >
> >
>

Re: IPC Queue Size

Posted by Qiang Tian <ti...@gmail.com>.
Good point - that is a big suspect.

I checked your log; the ClosedChannelException should be triggered by
call.sendResponseIfReady() (it is the only request in the queue, so the
handler sends the response directly), but at that point the callQueueSize
has already been decremented.

2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
ipc.RpcServer (RpcServer.java:processResponse(1041)) -
RpcServer.respondercallId: 118504 service: ClientService methodName: Multi
size: 141.9 K connection: 10.248.134.67:55347: output error
2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=57,port=60020:
caught a ClosedChannelException, this means that the server was processing
a request but the client went away. The error message was: null

It looks like you have found the fix - would you file a JIRA?
Thanks.


On Fri, Aug 8, 2014 at 2:41 PM, Walter King <wa...@adroll.com> wrote:

> I've only looked at the code a little, and likely missed something, but
> does this if block decrement the call queue, if the client already closed
> the connection?
>
>
> https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
>
>
>
> On Thu, Aug 7, 2014 at 11:32 PM, Walter King <wa...@adroll.com> wrote:
>
> > Yes, sorry, CallQueueTooBigException. but that value never returns to
> > zero, even when number of requests goes to zero.  The call queue too big
> > happens if any regionserver is up for a long enough period of time, so I
> > have to periodically restart them.  Also at that 15:30 time I wasn't
> > seeing that exception, but it seems like that is one time in which a call
> > didnt properly decrement the callqueuesize because it was at zero before
> > and has never hit zero again - today the minimum is even higher.
> >
> >
> > On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <ti...@gmail.com> wrote:
> >
> >> bq. "Eventually we ran into ipc queue size full messages being returned
> to
> >> clients trying large batch puts, as it approaches a gigabyte."
> >>
> >> Do you mean CallQueueTooBigException? it looks not the queue size, but
> the
> >> data size that client sends..configured by
> >> "hbase.ipc.server.max.callqueue.size".
> >>
> >> I guess when you client got the exception, it closed the exception and
> >> causing other shared connection RPC failed.
> >>
> >>
> >> 2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020]
> >> ipc.RpcServer
> >> (RpcServer.java:doRead(794)) - RpcServer.listener,port=60020: count of
> >> bytes read: 0
> >> java.io.IOException: Connection reset by peer
> >> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> >> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> >> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> >> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> >> at
> org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
> >> at
> >>
> >>
> org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
> >> at
> >>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
> >> at
> >>
> >>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
> >> at
> >>
> >>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
> >> at
> >>
> >>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >> at
> >>
> >>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >> at java.lang.Thread.run(Thread.java:744)
> >> 2014-08-06 22:27:57,257 WARN  [RpcServer.handler=18,port=60020]
> >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> >> RpcServer.respondercallId: 84968 service: ClientService methodName:
> Multi
> >> size: 17.7 K connection: 10.248.130.152:49780: output error
> >> 2014-08-06 22:27:57,258 WARN  [RpcServer.handler=18,port=60020]
> >> ipc.RpcServer (CallRunner.java:run(135)) -
> >> RpcServer.handler=18,port=60020:
> >> caught a ClosedChannelException, this means that the server was
> processing
> >> a request but the client went away. The error message was: null
> >> 2014-08-06 22:27:57,260 WARN  [RpcServer.handler=61,port=60020]
> >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> >> RpcServer.respondercallId: 83907 service: ClientService methodName:
> Multi
> >> size: 17.1 K connection: 10.248.1.56:53615: output error
> >> 2014-08-06 22:27:57,263 WARN  [RpcServer.handler=61,port=60020]
> >> ipc.RpcServer (CallRunner.java:run(135)) -
> >> RpcServer.handler=61,port=60020:
> >> caught a ClosedChannelException, this means that the server was
> processing
> >> a request but the client went away. The error message was: null
> >>
> >>
> >>
> >> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <wa...@adroll.com> wrote:
> >>
> >> >
> >>
> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
> >> >
> >> >
> http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
> >> >
> >> > These are logs from that particular server, and the debug dump from
> >> now(no
> >> > restart in between).  The times in the graph are pacific, so it should
> >> be
> >> > around 2014-08-06 22:25:00.  I do see some exceptions around there.
> >> >
> >>
> >
> >
>

Re: IPC Queue Size

Posted by Walter King <wa...@adroll.com>.
I've only looked at the code a little and likely missed something, but
does this if block ever decrement the call queue size if the client has
already closed the connection?

https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74



On Thu, Aug 7, 2014 at 11:32 PM, Walter King <wa...@adroll.com> wrote:

> Yes, sorry, CallQueueTooBigException. but that value never returns to
> zero, even when number of requests goes to zero.  The call queue too big
> happens if any regionserver is up for a long enough period of time, so I
> have to periodically restart them.  Also at that 15:30 time I wasn't
> seeing that exception, but it seems like that is one time in which a call
> didnt properly decrement the callqueuesize because it was at zero before
> and has never hit zero again - today the minimum is even higher.
>
>
> On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <ti...@gmail.com> wrote:
>
>> bq. "Eventually we ran into ipc queue size full messages being returned to
>> clients trying large batch puts, as it approaches a gigabyte."
>>
>> Do you mean CallQueueTooBigException? it looks not the queue size, but the
>> data size that client sends..configured by
>> "hbase.ipc.server.max.callqueue.size".
>>
>> I guess when you client got the exception, it closed the exception and
>> causing other shared connection RPC failed.
>>
>>
>> 2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020]
>> ipc.RpcServer
>> (RpcServer.java:doRead(794)) - RpcServer.listener,port=60020: count of
>> bytes read: 0
>> java.io.IOException: Connection reset by peer
>> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>> at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
>> at
>>
>> org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
>> at
>> org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
>> at
>>
>> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
>> at
>>
>> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>> 2014-08-06 22:27:57,257 WARN  [RpcServer.handler=18,port=60020]
>> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>> RpcServer.respondercallId: 84968 service: ClientService methodName: Multi
>> size: 17.7 K connection: 10.248.130.152:49780: output error
>> 2014-08-06 22:27:57,258 WARN  [RpcServer.handler=18,port=60020]
>> ipc.RpcServer (CallRunner.java:run(135)) -
>> RpcServer.handler=18,port=60020:
>> caught a ClosedChannelException, this means that the server was processing
>> a request but the client went away. The error message was: null
>> 2014-08-06 22:27:57,260 WARN  [RpcServer.handler=61,port=60020]
>> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>> RpcServer.respondercallId: 83907 service: ClientService methodName: Multi
>> size: 17.1 K connection: 10.248.1.56:53615: output error
>> 2014-08-06 22:27:57,263 WARN  [RpcServer.handler=61,port=60020]
>> ipc.RpcServer (CallRunner.java:run(135)) -
>> RpcServer.handler=61,port=60020:
>> caught a ClosedChannelException, this means that the server was processing
>> a request but the client went away. The error message was: null
>>
>>
>>
>> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <wa...@adroll.com> wrote:
>>
>> >
>> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
>> >
>> > http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
>> >
>> > These are logs from that particular server, and the debug dump from
>> now(no
>> > restart in between).  The times in the graph are pacific, so it should
>> be
>> > around 2014-08-06 22:25:00.  I do see some exceptions around there.
>> >
>>
>
>

Re: IPC Queue Size

Posted by Walter King <wa...@adroll.com>.
Yes, sorry, CallQueueTooBigException - but that value never returns to zero,
even when the number of requests goes to zero.  The call-queue-too-big
condition happens once any regionserver has been up long enough, so I have
to periodically restart them.  Also, at that 15:30 time I wasn't seeing that
exception, but it seems like that is one time when a call didn't properly
decrement the callQueueSize, because it was at zero before and has never hit
zero again - today the minimum is even higher.
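
A rough way to keep an eye on that value outside of the dashboards is to
poll it over JMX.  This is only a sketch: the port, the MBean object name
(Hadoop:service=HBase,name=RegionServer,sub=IPC) and the attribute name
(queueSize) are guesses for a 0.98 regionserver, so check them against what
jconsole shows before trusting the numbers.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class QueueSizeProbe {
  public static void main(String[] args) throws Exception {
    // Host and port are assumptions; point them at a regionserver with
    // JMX remote enabled.
    String host = args.length > 0 ? args[0] : "localhost";
    String url = "service:jmx:rmi:///jndi/rmi://" + host + ":10102/jmxrmi";

    JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url));
    try {
      MBeanServerConnection mbs = connector.getMBeanServerConnection();
      // Guessed bean name - verify against what jconsole shows.
      ObjectName ipcBean =
          new ObjectName("Hadoop:service=HBase,name=RegionServer,sub=IPC");

      // Sample for ten minutes; an idle server whose value never drops back
      // to zero is showing the behavior described in this thread.
      for (int i = 0; i < 60; i++) {
        Object queueSize = mbs.getAttribute(ipcBean, "queueSize");
        System.out.println(System.currentTimeMillis() + " queueSize=" + queueSize);
        Thread.sleep(10_000L);
      }
    } finally {
      connector.close();
    }
  }
}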


On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <ti...@gmail.com> wrote:

> bq. "Eventually we ran into ipc queue size full messages being returned to
> clients trying large batch puts, as it approaches a gigabyte."
>
> Do you mean CallQueueTooBigException? it looks not the queue size, but the
> data size that client sends..configured by
> "hbase.ipc.server.max.callqueue.size".
>
> I guess when you client got the exception, it closed the exception and
> causing other shared connection RPC failed.
>
>
> 2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020] ipc.RpcServer
> (RpcServer.java:doRead(794)) - RpcServer.listener,port=60020: count of
> bytes read: 0
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
> at
>
> org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
> at
> org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
> at
>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
> at
>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> 2014-08-06 22:27:57,257 WARN  [RpcServer.handler=18,port=60020]
> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> RpcServer.respondercallId: 84968 service: ClientService methodName: Multi
> size: 17.7 K connection: 10.248.130.152:49780: output error
> 2014-08-06 22:27:57,258 WARN  [RpcServer.handler=18,port=60020]
> ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=18,port=60020:
> caught a ClosedChannelException, this means that the server was processing
> a request but the client went away. The error message was: null
> 2014-08-06 22:27:57,260 WARN  [RpcServer.handler=61,port=60020]
> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
> RpcServer.respondercallId: 83907 service: ClientService methodName: Multi
> size: 17.1 K connection: 10.248.1.56:53615: output error
> 2014-08-06 22:27:57,263 WARN  [RpcServer.handler=61,port=60020]
> ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=61,port=60020:
> caught a ClosedChannelException, this means that the server was processing
> a request but the client went away. The error message was: null
>
>
>
> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <wa...@adroll.com> wrote:
>
> >
> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
> >
> > http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
> >
> > These are logs from that particular server, and the debug dump from
> now(no
> > restart in between).  The times in the graph are pacific, so it should be
> > around 2014-08-06 22:25:00.  I do see some exceptions around there.
> >
>

Re: IPC Queue Size

Posted by Qiang Tian <ti...@gmail.com>.
bq. "Eventually we ran into ipc queue size full messages being returned to
clients trying large batch puts, as it approaches a gigabyte."

Do you mean CallQueueTooBigException? That limit is not the number of queued
calls but the total size of the data that clients send, configured by
"hbase.ipc.server.max.callqueue.size".

I guess that when your client got the exception, it closed the connection,
causing the other RPCs sharing that connection to fail.


2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020] ipc.RpcServer
(RpcServer.java:doRead(794)) - RpcServer.listener,port=60020: count of
bytes read: 0
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
at
org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
at
org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
at
org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2014-08-06 22:27:57,257 WARN  [RpcServer.handler=18,port=60020]
ipc.RpcServer (RpcServer.java:processResponse(1041)) -
RpcServer.respondercallId: 84968 service: ClientService methodName: Multi
size: 17.7 K connection: 10.248.130.152:49780: output error
2014-08-06 22:27:57,258 WARN  [RpcServer.handler=18,port=60020]
ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=18,port=60020:
caught a ClosedChannelException, this means that the server was processing
a request but the client went away. The error message was: null
2014-08-06 22:27:57,260 WARN  [RpcServer.handler=61,port=60020]
ipc.RpcServer (RpcServer.java:processResponse(1041)) -
RpcServer.respondercallId: 83907 service: ClientService methodName: Multi
size: 17.1 K connection: 10.248.1.56:53615: output error
2014-08-06 22:27:57,263 WARN  [RpcServer.handler=61,port=60020]
ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=61,port=60020:
caught a ClosedChannelException, this means that the server was processing
a request but the client went away. The error message was: null
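
For what it is worth, a quick way to confirm which limit a server is
actually running with is to read the key back from the configuration the
regionserver loads.  A small sketch, assuming the HBase jars and the
cluster's hbase-site.xml are on the classpath; the 1 GB fallback passed to
getLong is only my guess at the default, not a value taken from your logs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CallQueueLimitCheck {
  public static void main(String[] args) {
    // Loads hbase-default.xml and hbase-site.xml from the classpath.
    Configuration conf = HBaseConfiguration.create();

    // Total bytes of queued call data the server accepts before it starts
    // rejecting requests with CallQueueTooBigException.
    long maxCallQueueSize =
        conf.getLong("hbase.ipc.server.max.callqueue.size", 1024L * 1024 * 1024);

    System.out.println("hbase.ipc.server.max.callqueue.size = " + maxCallQueueSize);
  }
}

Raising that value only buys time if the accounted size is leaking as
discussed above; the counter will just climb back to whatever limit is set.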



On Fri, Aug 8, 2014 at 2:57 AM, Walter King <wa...@adroll.com> wrote:

> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
>
> http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
>
> These are logs from that particular server, and the debug dump from now(no
> restart in between).  The times in the graph are pacific, so it should be
> around 2014-08-06 22:25:00.  I do see some exceptions around there.
>

Re: IPC Queue Size

Posted by Walter King <wa...@adroll.com>.
https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt

http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz

These are logs from that particular server, and the debug dump from now (no
restart in between).  The times in the graph are Pacific, so it should be
around 2014-08-06 22:25:00.  I do see some exceptions around there.

Re: IPC Queue Size

Posted by Esteban Gutierrez <es...@cloudera.com>.
Hello Walter,

Do you have the RS logs and the debug dump (rs:port/dump) from this RS that
you could share? The only reason I can think of for that behavior is that
this Region Server might have struggled with a single large call whose size
was never decremented due to an exception. With the default setting,
ipc.client.call.purge.timeout should have purged that call from the queue,
unless that timeout is configured with a very large value.
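
A small sketch for reading the effective purge timeout back from the loaded
configuration; the -1 fallback is a sentinel of mine rather than the real
default, so -1 in the output only means the key is unset and the server is
using its built-in default.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PurgeTimeoutCheck {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();

    // -1 only means "not set in hbase-site.xml"; the server substitutes its
    // own default when the key is absent.
    long purgeTimeoutMs = conf.getLong("ipc.client.call.purge.timeout", -1L);

    System.out.println("ipc.client.call.purge.timeout = "
        + (purgeTimeoutMs < 0 ? "unset (server default)" : purgeTimeoutMs + " ms"));
  }
}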

--
Cloudera, Inc.



On Thu, Aug 7, 2014 at 8:42 AM, Walter King <wa...@adroll.com> wrote:

> Hi, we're running 0.98.3 hbase, and it seems like the ipc queueSize is not
> always being reduced.  Graphing the metric from jmx, at the start it stays
> at zero and spikes up and returns back to zero.  At a certain point it
> stops returning to zero, and returns to something slightly higher.
>  Eventually we ran into ipc queue size full messages being returned to
> clients trying large batch puts, as it approaches a gigabyte.  Some of our
> servers, when idle and not processing any requests, say the queue size is
> hundreds of megabytes.  I have a heap dump of one claiming to have over a
> hundred mb in the queue, and as far as i can tell the call objects are
> maybe a single mb in heap space.  We do have a sizable number of scans that
> timeout without closing, lots of increments, and a few puts.
>
> Graphs: http://imgur.com/9qDHgRd  The middle graph is what I'd expect to
> see, the bottom graph seems to stop working around 15:30, and the top graph
> never drops below 10MB.  Read/write request counts are relatively balanced
> amongst the servers.
>
> Is this normal?
>