Posted to user@hbase.apache.org by "Buckley,Ron" <bu...@oclc.org> on 2014/05/30 17:24:46 UTC

Region Server hung during shutdown after StackOverflow error

An interesting case happened on our dev HBase cluster overnight. (We're running HBase 0.94.15 from CDH 4.6.0.)

A region server hit a StackOverflowError, apparently during a minor compaction.

The region server is trying to abort with a FATAL error, but it is now hung during shutdown.

The particularly troublesome thing is that the RS is alive enough to keep ZooKeeper happy.

So the regions aren't moving off, but our apps can't get to them because the RS is mostly dead.

I put some of the details on pastebin.

JStack -> http://pastebin.com/hnLtaG54
Outfile -> http://pastebin.com/5F1UcGjg
Logfile -> http://pastebin.com/TBL1YSZM

Re: Region Server hung during shutdown after StackOverflow error

Posted by Ted Yu <yu...@gmail.com>.

There was a 47-second gap in the region server log (which may be where the
calls to subList() happened):


   1. 2014-05-29 19:09:02,257 INFO
   org.apache.hadoop.hbase.regionserver.compactions.CompactSelection: Deleting
   the expired store file by compaction:
   hdfs://cluster/hbase/IngestProcessing/bf754ed8764ca705a2acc0058e13b69c/data/22b41ad9388f488cb672cca3de0614e9
   whose maxTimeStamp is -1 while the max expired timestamp is 1401318542257
   2. 2014-05-29 19:09:49,324 INFO
   org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
   -6708632874853984071 lease expired on region
   WorldcatCrossref,4333961705,1334582131683.90c82e6c71dd99f21a18df41df28e5b0.


A better practice, instead of assigning the result of subList() back to the
same member variable, is to clear the sublist that is no longer needed.
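
To illustrate the two patterns, here is a minimal sketch. It is not taken from
the HBase source; the class and method names (SubListTrim,
dropFirstByReassign, dropFirstInPlace) are made up for this example.
List.subList() returns a view backed by the list it was called on, so
repeatedly reassigning the result can stack view upon view, and depending on
the JDK and list implementation, later calls may recurse through that whole
chain. Clearing the unwanted range keeps a single flat list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from HBase.
final class SubListTrim {

    // Risky pattern: the returned list is a view over `items`. Doing
    // `items = dropFirstByReassign(items, n)` in a loop nests one view
    // inside another on every pass.
    static List<Integer> dropFirstByReassign(List<Integer> items, int n) {
        return items.subList(n, items.size());
    }

    // Safer pattern: remove the unwanted range from the backing list
    // itself. No view outlives this call, so nothing accumulates.
    static void dropFirstInPlace(List<Integer> items, int n) {
        items.subList(0, n).clear();
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            items.add(i);
        }
        // Trim one element at a time, many times over; `items` is still
        // the same plain ArrayList afterwards.
        for (int i = 0; i < 990; i++) {
            dropFirstInPlace(items, 1);
        }
        System.out.println(items.size() + " elements left, first is " + items.get(0));
    }
}
```

The in-place version costs an array shift per call, so it trades some copying
for a bounded call depth.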

Cheers


Re: Region Server hung during shutdown after StackOverflow error

Posted by Andrew Purtell <ap...@apache.org>.

Maybe we can kill the ZooKeeper connection in the abort handler.
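
A rough sketch of that idea, under stated assumptions: this is not HBase's
actual abort path, and AbortSketch, coordinationSession, and
remainingShutdown are hypothetical names. The session is modeled as a plain
java.io.Closeable standing in for the ZooKeeper connection. The point is only
the ordering: drop the session first, so the rest of the cluster can notice
the server is gone even if the remaining shutdown work hangs.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical sketch; not the real HRegionServer abort logic.
final class AbortSketch {
    // Stand-in for the ZooKeeper connection (assumption for illustration).
    private final Closeable coordinationSession;
    private volatile boolean aborted;

    AbortSketch(Closeable coordinationSession) {
        this.coordinationSession = coordinationSession;
    }

    boolean isAborted() {
        return aborted;
    }

    void abort(String reason, Runnable remainingShutdown) {
        aborted = true;
        System.err.println("ABORTING: " + reason);
        // Close the session before anything that might block, so the
        // server stops looking healthy as early as possible.
        try {
            coordinationSession.close();
        } catch (IOException | RuntimeException e) {
            System.err.println("Could not close session cleanly: " + e);
        }
        // Whatever happens from here on (including a hang or an
        // exception), the session has already been released.
        remainingShutdown.run();
    }
}
```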





-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

RE: Region Server hung during shutdown after StackOverflow error

Posted by "Buckley,Ron" <bu...@oclc.org>.

Thanks, Ted. I should have seen that.

I finally had to 'kill -9' the RS, as I couldn't get it to shut down any other way.

It seems like the Region Server shouldn't have kept telling ZooKeeper that all was well while it was trying to abort with a fatal error.


Re: Region Server hung during shutdown after StackOverflow error

Posted by Ted Yu <yu...@gmail.com>.

Looking at the StackOverflowError in pastebin, the cause was too many calls
to subList().
J-D fixed a similar bug in HBASE-10312.

I searched for '\.subList(' in the 0.94 codebase but haven't pinpointed which
class was the source of such calls.

Will dig deeper when I have time.

Cheers
