Posted to user@hbase.apache.org by Jim Abramson <ji...@magnetic.com> on 2013/01/28 21:14:58 UTC

RegionServer crashes silently under heavy RowKey scans

Hi,

We are testing HBase for some read-heavy batch operations and are encountering frequent, silent RegionServer crashes.  The application performs many thousands of very selective row-key scans, via Thrift, against a dataset containing several hundred million rows (less than 200GB overall).
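
To make the access pattern concrete, each lookup is roughly equivalent to the sketch below.  It uses the plain Java client rather than our actual Thrift code, and the table name and row key format are made up:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SelectiveScanSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "events");                // hypothetical table name
          Scan scan = new Scan(Bytes.toBytes("user123|20130127"),   // start row (made-up key format)
                               Bytes.toBytes("user123|20130128"));  // stop row keeps the scan narrow
          scan.addFamily(Bytes.toBytes("d"));                       // only the one family we need
          ResultScanner scanner = table.getScanner(scan);
          try {
              for (Result result : scanner) {
                  // each lookup touches only a handful of rows
              }
          } finally {
              scanner.close();
              table.close();
          }
      }
  }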

We have encountered this behavior both on a physical cluster running CDH4 and in an Amazon EMR HBase cluster.  HBase, Thrift, HDFS, etc. are all running with very minimal customization of the default configurations (in EMR's case, effectively no customization at all).

The logs contain some clues, but nothing that points directly at the root cause of the failure.  As such it is somewhat scary, and we'd really appreciate any advice or suggestions on cause, prevention, etc.


The post-mortem pattern goes like this:


1) RegionServer logs a WARN to the effect of the following (some quick math on that response size follows step 5 below):

 org.apache.hadoop.ipc.HBaseServer 
 (responseTooLarge): {"processingtimems":2997,"call":"next(-6754667650407290823, 1000), rpc version=1, client version=29, methodsFingerPrint=1891768260","client":"somehost:55253","starttimems":1359399938999,"queuetimems":0,"class":"HRegionServer","responsesize":231379184,"method":"next"}

2) RegionServer logs one or a few of these at DEBUG, in the second or two before it goes silent:

 org.apache.hadoop.hdfs.DFSClient
 
 Error making BlockReader. Closing stale Socket[addr=/thishost,port=50010,localport=51345]
 java.io.EOFException: Premature EOF: no length prefix available
	at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
	at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:407)
	at org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:105)
	at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:937)
	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:455)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:645)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:689)
	at java.io.DataInputStream.read(DataInputStream.java:132)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:454)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1062)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1318)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:266)
	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:209)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.seekTo(HFileReaderV2.java:519)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.seekTo(HFileReaderV2.java:534)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:178)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:111)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:83)
	at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1779)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:2915)
	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1453)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1445)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1421)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2079)
	at sun.reflect.GeneratedMethodAccessor210.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1345)


3) HDFS logs this at INFO:

 org.apache.hadoop.hdfs.server.datanode.DataNode 
 Exception in receiveBlock for BP-1555962259-thisIP-1358284735183:blk_-5783891561030721738_125644 
 java.io.IOException: Premature EOF from inputStream
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
	at java.lang.Thread.run(Thread.java:662)

4) HDFS logs this as ERROR:

 org.apache.hadoop.hdfs.server.datanode.DataNode 	

 thishost:50010:DataXceiver error processing WRITE_BLOCK operation  src: /thishost:50620 dest: /thishost:50010
 java.io.IOException: Premature EOF from inputStream
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
	at java.lang.Thread.run(Thread.java:662)

5) By this point the RegionServer is stopped, usually taking the Thrift server down with it (clients get connection refused).
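
As promised under step 1, here is my back-of-the-envelope on that responsesize value.  It is just arithmetic on the numbers in the WARN line; the ~100MB default for hbase.ipc.warn.response.size is an assumption on my part from skimming HBaseServer, so please correct me if that is off:

  public class ResponseSizeMath {
      public static void main(String[] args) {
          long responseSize = 231379184L;               // "responsesize" from the WARN, in bytes
          long rowsRequested = 1000L;                   // from the next(<scannerId>, 1000) call
          long presumedWarnLimit = 100L * 1024 * 1024;  // assumed hbase.ipc.warn.response.size default
          System.out.printf("response ~%.0f MB, ~%d KB per row if all %d rows came back, warn limit %d MB%n",
                  responseSize / (1024.0 * 1024.0),
                  responseSize / rowsRequested / 1024,
                  rowsRequested,
                  presumedWarnLimit / (1024 * 1024));
          // prints: response ~221 MB, ~225 KB per row if all 1000 rows came back, warn limit 100 MB
      }
  }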



One final note: we found a bug in our code that was occasionally performing long, unbounded table scans.  After fixing that, the crashes became much less frequent, but they still happen.
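
For anyone interested, this is the kind of bounding I mean.  Again it is sketched with the plain Java client rather than our actual Thrift code, and the caching/batch values are arbitrary illustrations rather than what we actually run with:

  import org.apache.hadoop.hbase.client.Scan;

  public class BoundedScanSketch {
      // Build a scan that cannot run away: an explicit stop row bounds the key range,
      // caching limits rows per next() RPC, and batch splits very wide rows across
      // multiple Results so a single response can't balloon.
      public static Scan boundedScan(byte[] startRow, byte[] stopRow) {
          Scan scan = new Scan(startRow, stopRow);
          scan.setCaching(50);    // rows fetched per next() call
          scan.setBatch(1000);    // max KeyValues per Result
          return scan;
      }
  }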


Thanks,
Jim


Re: RegionServer crashes silently under heavy RowKey scans

Posted by Stack <st...@duboce.net>.
On Mon, Jan 28, 2013 at 12:14 PM, Jim Abramson <ji...@magnetic.com> wrote:

> Hi,
>
> We are testing HBase for some read-heavy batch operations, and
> encountering frequent, silent RegionServer crashes.



'Silent' is interesting.  Which files did you check?  .log and the .out?
 Nothing in the latter?


> The post-mortem pattern goes like this:
>
>
> 1) RegionServer logs a WARN to the effect of:
>
>  org.apache.hadoop.ipc.HBaseServer
>  (responseTooLarge):
> {"processingtimems":2997,"call":"next(-6754667650407290823, 1000), rpc
> version=1, client version=29,
> methodsFingerPrint=1891768260","client":"somehost:55253","starttimems":1359399938999,"queuetimems":0,"class":"HRegionServer","responsesize":231379184,"method":"next"}
>
>

Seems like we are trying to return a > 100MB cell.  See around line #386 in
WritableRpcEngine.


Can you figure out which region it is when we get this big cell?  We could look
at the hfiles under it using the hfile tool.  It reports big cells:
http://hbase.apache.org/book.html#hfile_tool2

St.Ack