Posted to issues@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2014/11/06 01:27:34 UTC

[jira] [Comment Edited] (HBASE-12419) "Partial cell read caused by EOF" ERRORs on replication source during replication

    [ https://issues.apache.org/jira/browse/HBASE-12419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199447#comment-14199447 ] 

Andrew Purtell edited comment on HBASE-12419 at 11/6/14 12:27 AM:
------------------------------------------------------------------

I built an ingest test (attached) to catch this under a debugger. It's a relatively rare occurrence.

{noformat}
Daemon Thread [RS:0;apurtell-ltm1:55971-EventThread.replicationSource,1] (Suspended (breakpoint at line 66 in BaseDecoder))	
	KeyValueCodec$KeyValueDecoder(BaseDecoder).rethrowEofException(IOException) line: 66	
	KeyValueCodec$KeyValueDecoder(BaseDecoder).advance() line: 53
          in = org.apache.hadoop.hdfs.client.HdfsDataInputStream
            .in = org.apache.hadoop.hdfs.DFSInputStream
              .blockEnd = 17967301
              .lastBlockBeingWrittenLength = 17967302
              .lastLocatedBlock.b.block.numBytes = 17967302
              .locatedBlocks.blocks.size = 1
              .locatedBlocks.isLastBlockComplete = false
              .pos = 17967302              
	WALEdit.readFromCells(Codec$Decoder, int) line: 248
          cellDecoder = org.apache.hadoop.hbase.codec.KeyValueCodec$KeyValueDecoder	
          expectedCount = 1
	ProtobufLogReader.readNext(HLog$Entry) line: 317
          expectedCount = 1
          posBefore = 17967298
	ProtobufLogReader(ReaderBase).next(HLog$Entry) line: 106	
	ProtobufLogReader(ReaderBase).next() line: 91	
	ReplicationHLogReaderManager.readNextAndSetPosition() line: 86	
	ReplicationSource.readAllEntriesToReplicateOrNextFile(boolean, List<Entry>) line: 441	
	ReplicationSource.run() line: 328	
{noformat}

This is a short read from an incomplete file. We then seek back to the last known good position in the file at ProtobufLogReader.readNext:346:

"Encountered a malformed edit, seeking back to last good position in file"

I conclude that the reread after seeking back is ultimately successful, because I am replicating rows with only one cell per row and the row counts are correct at the end of testing.
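
For reference, the seek-back-and-retry pattern in play looks roughly like the following (a minimal sketch with made-up class and method names; this is not the actual ProtobufLogReader code):

{noformat}
import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

// Minimal sketch of "remember position, read, and on a premature EOF seek back".
// Names are illustrative only.
public class SeekBackOnPartialRead {
  private final FSDataInputStream in;

  public SeekBackOnPartialRead(FSDataInputStream in) {
    this.in = in;
  }

  /** Returns one decoded entry, or null if the tail of the file is still incomplete. */
  public byte[] tryReadNext() throws IOException {
    long posBefore = in.getPos();      // last known good position
    try {
      return decodeOneEntry();         // may hit EOF while the writer is still appending
    } catch (EOFException partialRead) {
      in.seek(posBefore);              // rewind so the same entry can be re-read later
      return null;
    }
  }

  private byte[] decodeOneEntry() throws IOException {
    byte[] buf = new byte[64];         // stand-in for the real cell decoding
    in.readFully(buf);
    return buf;
  }
}
{noformat}

On a later attempt the same entry decodes cleanly once more data is visible, which would explain why the row counts come out correct.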

Is there more to see here?

Should we downgrade the message at BaseDecoder.rethrowEofException:66 from ERROR to TRACE level?
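
Roughly what I have in mind, as a standalone sketch using commons-logging rather than a patch against BaseDecoder itself:

{noformat}
import java.io.EOFException;
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Sketch only: surface the expected short read at TRACE instead of ERROR,
// while still rethrowing an EOFException so callers seek back and retry.
public class PartialReadLogging {
  private static final Log LOG = LogFactory.getLog(PartialReadLogging.class);

  static EOFException partialCellRead(IOException cause) {
    if (LOG.isTraceEnabled()) {                          // currently logged at ERROR
      LOG.trace("Partial cell read caused by EOF", cause);
    }
    EOFException eof = new EOFException("Partial cell read");
    eof.initCause(cause);
    return eof;
  }
}
{noformat}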



> "Partial cell read caused by EOF" ERRORs on replication source during replication
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-12419
>                 URL: https://issues.apache.org/jira/browse/HBASE-12419
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.7
>            Reporter: Andrew Purtell
>             Fix For: 2.0.0, 0.98.8, 0.99.2
>
>         Attachments: TestReplicationIngest.patch
>
>
> We are seeing exceptions like these on the replication sources when replication is active:
> {noformat}
> 2014-11-04 01:20:19,738 ERROR [regionserver8120-EventThread.replicationSource,1] codec.BaseDecoder:
> Partial cell read caused by EOF: java.io.IOException: Premature EOF from inputStream
> {noformat}
> HBase 0.98.8-SNAPSHOT, Hadoop 2.4.1.
> Happens both with and without short circuit reads on the source cluster.
> I'm able to reproduce this reliably:
> # Set up two clusters. Can be single slave.
> # Enable replication in configuration
> # Use LoadTestTool -init_only on both clusters
> # On source cluster via shell: alter 'cluster_test',{NAME=>'test_cf',REPLICATION_SCOPE=>1}
> # On source cluster via shell: add_peer 'remote:port:/hbase'
> # On source cluster, LoadTestTool -skip_init -write 1:1024:10 -num_keys 1000000
> # Wait for LoadTestTool to complete
> # Use the shell to verify 1M rows are in 'cluster_test' on the target cluster.
> All 1M rows will replicate without data loss, but I'll see 5-15 instances of "Partial cell read caused by EOF" messages logged from codec.BaseDecoder at ERROR level on the replication source. 


