Posted to user@hbase.apache.org by Ivan Tretyakov <it...@griddynamics.com> on 2013/11/08 19:18:50 UTC

Hbase region servers shuts down unexpectedly

Hello!

We have the following issue on our cluster running HBase 0.92.1-cdh4.1.1.
When we start a full scan of a table, some of the region servers shut down
unexpectedly with the following lines in the log:

2013-11-07 21:19:12,173 WARN org.apache.hadoop.ipc.HBaseServer:
(responseTooLarge):
{"processingtimems":6723,"call":"next(-3171672497308828151, 1000), rpc
version=1, client version=29, methodsFingerPrint=1891768260","client":"
10.0.241.99:43063
","starttimems":1383859145449,"queuetimems":0,"class":"HRegionServer","responsesize":1059073884,"method":"next"}
2013-11-07 21:19:33,009 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
20545ms instead of 3000ms, this is likely due to a long garbage collecting
pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-11-07 21:19:41,651 INFO org.apache.hadoop.hbase.util.VersionInfo:
HBase 0.92.1-cdh4.1.1

or one more example:

2013-11-07 22:07:02,587 WARN org.apache.hadoop.ipc.HBaseServer:
(responseTooLarge):
{"processingtimems":12540,"call":"next(8031108008798991209, 1000), rpc
version=1, client version=29, methodsFingerPrint=1891768260","client":"
10.0.240.211:33538
","starttimems":1383862010045,"queuetimems":14955,"class":"HRegionServer","responsesize":1322737704,"method":"next"}
2013-11-07 22:08:00,413 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception for block
BP-1892992341-10.10.122.111-1352825964285:blk_-2134516062062022634_68425527
java.io.EOFException: Premature EOF: no length prefix available
        at
org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
        at
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:114)
        at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:670)
2013-11-07 22:08:09,394 INFO org.apache.hadoop.hbase.util.VersionInfo:
HBase 0.92.1-cdh4.1.1

The last line, 'HBase 0.92.1-cdh4.1.1', indicates a freshly started region
server instance. Every time, I see the 'responseTooLarge' message before the
shutdown.
The job runs with the '-caching' option set to 1000.

My current assumption is that the problem is caused by memory shortage on the
RS and a long GC pause, which causes the ZK session to expire and the server
to shut down (-Xmx for the RS is 8GB). Cloudera Manager then restarts it.

I've tried running the job with '-caching' set to 1; no servers restarted,
but the job didn't finish within a reasonable amount of time. I understand
that decreasing the caching value can mitigate the problem, but it doesn't
look like the right way to me, because the number of regions per server may
increase in the future and we would hit a similar problem again. It will also
slow down the job.
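For a rough sense of the numbers involved, here is a back-of-envelope sketch in plain Java. The constants come from the first log line above; the per-row figure is an estimate derived from them, not a measured value:

```java
public class ScanPayloadEstimate {
    // From the first log line: one next(..., 1000) call returned ~1 GB.
    static final long RESPONSE_SIZE_BYTES = 1_059_073_884L;
    static final int CACHING = 1_000;

    // Rough bytes per row implied by the log: ~1 MB.
    static long avgBytesPerRow() {
        return RESPONSE_SIZE_BYTES / CACHING;
    }

    // Estimated payload of a single next() RPC for a given caching value.
    static long payloadForCaching(int caching) {
        return avgBytesPerRow() * caching;
    }

    public static void main(String[] args) {
        System.out.println("~bytes/row:            " + avgBytesPerRow());
        System.out.println("payload, caching=1000: " + payloadForCaching(1000));
        System.out.println("payload, caching=1:    " + payloadForCaching(1));
    }
}
```

So with caching=1000 each next() call materializes on the order of 1 GB on the RS heap (an eighth of the 8GB -Xmx), while caching=1 keeps each response near 1 MB but multiplies the RPC count by 1000, which would explain both the GC pressure and the slowdown.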

Do you think the problem is caused by the reasons I assume?
Is this a known issue?
What could be the ways to resolve it?
Is there an option to send the response once it becomes too large,
independent of the caching value?

Thanks in advance for your answers.
I'm ready to provide any additional information you may need to help me
with this issue.

-- 
Best Regards
Ivan Tretyakov

Re: Hbase region servers shuts down unexpectedly

Posted by Ivan Tretyakov <it...@griddynamics.com>.
Thank you for the answer Ted.

We were able to fix the issue by tuning the
hbase.client.scanner.max.result.size parameter.
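For reference, the setting can be applied in the client-side hbase-site.xml; the 2 MB value below is illustrative only, not the value we actually used:

```xml
<!-- Cap the size of a single scanner next() response (bytes).
     Illustrative value: 2 MB. -->
<property>
  <name>hbase.client.scanner.max.result.size</name>
  <value>2097152</value>
</property>
```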

P.S. "The HBase development team has affectionately dubbed this scenario a
Juliet Pause — the master (Romeo) presumes the region server (Juliet) is
dead when it’s really just sleeping, and thus takes some drastic action
(recovery). When the server wakes up, it sees that a great mistake has been
made and takes its own life. Makes for a good play, but a pretty awful
failure scenario!"





-- 
Best Regards
Ivan Tretyakov

Deployment Engineer
Grid Dynamics
+7 812 640 38 76
Skype: ivan.v.tretyakov
www.griddynamics.com
itretyakov@griddynamics.com

Re: Hbase region servers shuts down unexpectedly

Posted by Ted Yu <yu...@gmail.com>.
Have you tried using setBatch() to limit the number of columns returned?

See code example in 9.4.4.3. of
http://hbase.apache.org/book.html#client.filter.kvm
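To sketch what setBatch() buys you: with Scan.setBatch(n), a wide row is split into several partial Results of at most n cells each, instead of arriving as one huge Result. The arithmetic can be modeled in plain Java (no HBase dependency; this simulates the sizing, it is not the HBase API itself):

```java
public class BatchMath {
    // With Scan.setBatch(batch), a row with `cols` columns comes back as
    // ceil(cols / batch) partial Results rather than a single large one.
    static int resultsPerRow(int cols, int batch) {
        return (cols + batch - 1) / batch;
    }

    // Largest number of cells any single Result can then carry.
    static int maxCellsPerResult(int cols, int batch) {
        return Math.min(cols, batch);
    }

    public static void main(String[] args) {
        // A 10,000-column row with batch=100 arrives as 100 partial
        // Results of at most 100 cells each, bounding per-RPC memory.
        System.out.println(resultsPerRow(10_000, 100));
        System.out.println(maxCellsPerResult(10_000, 100));
    }
}
```

Combined with a modest setCaching() value, this bounds the cell count per RPC even for very wide rows, which is what the responseTooLarge warnings point at.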

