Posted to issues@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2014/08/23 21:46:10 UTC
[jira] [Updated] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Purtell updated HBASE-11813:
-----------------------------------
Description:
On user@hbase, johannes.schaback@visual-meta.com reported:
{quote}
we face a serious issue with our HBase production cluster for two days now. Every couple minutes, a random RegionServer gets stuck and does not process any requests. In addition this causes the other RegionServers to freeze within a minute which brings down the entire cluster. Stopping the affected RegionServer unblocks the cluster and everything comes back to normal.
{quote}
Subsequent troubleshooting reveals that RPC is getting stuck because we are losing RPC handlers. In the .out files we have this:
{noformat}
Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
java.lang.StackOverflowError
at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
[...]
Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
java.lang.StackOverflowError
Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
java.lang.StackOverflowError
{noformat}
That is the anonymous CellScanner instance we create from CellUtil#createCellScanner:
{code}
return new CellScanner() {
  private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
  private CellScanner cellScanner = null;

  @Override
  public Cell current() {
    return this.cellScanner != null? this.cellScanner.current(): null;
  }

  @Override
  public boolean advance() throws IOException {
    if (this.cellScanner == null) {
      if (!this.iterator.hasNext()) return false;
      this.cellScanner = this.iterator.next().cellScanner();
    }
    if (this.cellScanner.advance()) return true;
    this.cellScanner = null;
---> return advance();
  }
};
{code}
That final return statement is the immediate problem.
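Each exhausted or empty CellScannable costs one more stack frame in the recursive form, so a long enough run of them can exhaust the handler's stack. One possible shape for a fix is to turn the tail call into a loop. The following is only a sketch under assumptions, not the committed patch: SketchScanner, SketchScannable and IterativeAdvanceSketch are hypothetical stand-ins for the HBase types so the example compiles on its own, and cells are plain strings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for the HBase types; the real CellScanner and
// CellScannable interfaces live in org.apache.hadoop.hbase.
interface SketchScanner {
  String current();
  boolean advance() throws IOException;
}

interface SketchScannable {
  SketchScanner scanner();
}

final class IterativeAdvanceSketch {

  /** Scannable over a fixed list of cells (plain strings in this sketch). */
  static SketchScannable of(String... cells) {
    final List<String> list = Arrays.asList(cells);
    return () -> new SketchScanner() {
      private final Iterator<String> it = list.iterator();
      private String cur = null;
      @Override public String current() { return cur; }
      @Override public boolean advance() {
        if (!it.hasNext()) { cur = null; return false; }
        cur = it.next();
        return true;
      }
    };
  }

  /** Concatenating scanner whose advance() loops instead of recursing. */
  static SketchScanner concat(final Iterable<? extends SketchScannable> scannables) {
    return new SketchScanner() {
      private final Iterator<? extends SketchScannable> iterator = scannables.iterator();
      private SketchScanner scanner = null;

      @Override public String current() {
        return this.scanner != null ? this.scanner.current() : null;
      }

      @Override public boolean advance() throws IOException {
        while (true) {
          if (this.scanner == null) {
            if (!this.iterator.hasNext()) return false;
            this.scanner = this.iterator.next().scanner();
          }
          if (this.scanner.advance()) return true;
          // Current scanner is exhausted: move to the next one in the same
          // stack frame rather than calling advance() again.
          this.scanner = null;
        }
      }
    };
  }

  /** Counts the cells the concatenating scanner yields. */
  static int countCells(Iterable<? extends SketchScannable> scannables) {
    SketchScanner s = concat(scannables);
    int n = 0;
    try {
      while (s.advance()) n++;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return n;
  }

  public static void main(String[] args) {
    // A long run of empty scannables followed by one with two cells.
    List<SketchScannable> input = new ArrayList<>(Collections.nCopies(1_000_000, of()));
    input.add(of("a", "b"));
    System.out.println(countCells(input)); // prints 2
  }
}
```

The stack depth of the loop version stays constant no matter how many empty scannables precede the next cell, which is the property the recursive version lacks.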
We should also fix this so the RegionServer aborts if it loses a handler to an Error.
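As a rough illustration of the second point, a handler's run loop could be wrapped so that an Error is escalated instead of silently ending the thread. This is a sketch under assumptions and not the RpcServer code: AbortHook, HandlerAbortSketch and guarded() are hypothetical names, and the StackOverflowError is simulated with an explicit throw.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a server-wide abort hook.
interface AbortHook {
  void abort(String why, Throwable cause);
}

final class HandlerAbortSketch {

  /**
   * Wraps a handler loop so that an Error reaches the abort hook
   * instead of only terminating the handler thread.
   */
  static Runnable guarded(final Runnable handlerLoop, final AbortHook hook) {
    return () -> {
      try {
        handlerLoop.run();
      } catch (Error e) {
        // The handler is lost at this point; report it so the server can
        // shut down rather than keep running with fewer handlers.
        hook.abort("RPC handler died: " + Thread.currentThread().getName(), e);
      }
    };
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicReference<Throwable> seen = new AtomicReference<>();
    Runnable failing = () -> { throw new StackOverflowError("simulated"); };
    Thread t = new Thread(guarded(failing, (why, cause) -> seen.set(cause)), "handler-0");
    t.start();
    t.join();
    System.out.println(seen.get() instanceof StackOverflowError); // prints true
  }
}
```

In a real server the hook would trigger the abort path; here it only records the cause so the behavior can be observed.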
> CellScanner#advance may infinitely recurse
> ------------------------------------------
>
> Key: HBASE-11813
> URL: https://issues.apache.org/jira/browse/HBASE-11813
> Project: HBase
> Issue Type: Bug
> Reporter: Andrew Purtell
> Priority: Blocker
> Fix For: 0.99.0, 2.0.0, 0.98.6
>
>
> On user@hbase, johannes.schaback@visual-meta.com reported:
> {quote}
> we face a serious issue with our HBase production cluster for two days now. Every couple minutes, a random RegionServer gets stuck and does not process any requests. In addition this causes the other RegionServers to freeze within a minute which brings down the entire cluster. Stopping the affected RegionServer unblocks the cluster and everything comes back to normal.
> {quote}
> Subsequent troubleshooting reveals that RPC is getting stuck because we are losing RPC handlers. In the .out files we have this:
> {noformat}
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
> at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
> [...]
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
> java.lang.StackOverflowError
> {noformat}
> That is the anonymous CellScanner instance we create from CellUtil#createCellScanner:
> {code}
> return new CellScanner() {
>   private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator();
>   private CellScanner cellScanner = null;
>
>   @Override
>   public Cell current() {
>     return this.cellScanner != null? this.cellScanner.current(): null;
>   }
>
>   @Override
>   public boolean advance() throws IOException {
>     if (this.cellScanner == null) {
>       if (!this.iterator.hasNext()) return false;
>       this.cellScanner = this.iterator.next().cellScanner();
>     }
>     if (this.cellScanner.advance()) return true;
>     this.cellScanner = null;
> ---> return advance();
>   }
> };
> {code}
> That final return statement is the immediate problem.
> We should also fix this so the RegionServer aborts if it loses a handler to an Error.