Posted to issues@hbase.apache.org by "Sean Busbey (JIRA)" <ji...@apache.org> on 2016/06/07 20:53:21 UTC
[jira] [Commented] (HBASE-15983) Replication improperly discards data from end-of-wal in some cases.

    [ https://issues.apache.org/jira/browse/HBASE-15983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319379#comment-15319379 ] 

Sean Busbey commented on HBASE-15983:
-------------------------------------

The proximal failure that brought this to my attention is an error in handling offsets (but I don't know exactly what the root cause is yet). Here's a summary:

During our attempts to tail an in-progress WAL, at some point we mishandle some underlying error condition and end up at a (saved) offset that is not the valid beginning of a message. The Reader properly gets and propagates an InvalidProtobufException, and the ReplicationSource effectively treats this as "something happened at the end of the file, rewind." The problem is that the saved offset is bad, so rewinding just puts us back at the same location. We loop indefinitely for as long as the WAL is the active one; then, once it rolls, we treat this failure as an end of file and drop the remainder of the file. In the particular deployment where this happened the result was 40-60% row loss.
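
To make the loop concrete, here is a minimal sketch of the failure mode. It is not HBase code: the class TailLoopSketch, the one-length-byte record format, and the retry budget that stands in for "the WAL is still active" are all assumptions made for illustration.

```java
// Illustrative sketch only; none of these names exist in HBase.
final class TailLoopSketch {

    // Hypothetical record format: one length byte followed by that many payload bytes.
    // Returns the offset of the next record, or throws if 'offset' is not a record start.
    static int parseRecord(byte[] wal, int offset) {
        int len = wal[offset];
        if (len <= 0 || offset + 1 + len > wal.length) {
            throw new IllegalStateException("not a valid record start at offset " + offset);
        }
        return offset + 1 + len;
    }

    // Tails 'wal' starting at 'savedOffset' and returns how many records were shipped.
    // 'retriesWhileActive' stands in for the period before the WAL rolls (an assumption).
    static int tail(byte[] wal, int savedOffset, int retriesWhileActive) {
        int shipped = 0;
        int offset = savedOffset;
        int retries = 0;
        while (offset < wal.length) {
            try {
                offset = parseRecord(wal, offset);
                savedOffset = offset; // only advanced after a successful parse
                shipped++;
            } catch (IllegalStateException e) {
                if (retries++ < retriesWhileActive) {
                    // "Something happened at the end of the file, rewind": if the saved
                    // offset itself is bad, this lands on the same bad location again.
                    offset = savedOffset;
                } else {
                    // The WAL has rolled: the failure is treated as end of file and the
                    // remaining bytes are never parsed.
                    break;
                }
            }
        }
        return shipped;
    }
}
```

With a nine-byte file holding three records that start at offsets 0, 3 and 7, tail(wal, 0, 5) ships all three, while tail(wal, 1, 5) retries at offset 1 until the budget runs out and then ships nothing.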

I don't have a root cause yet, but I have a general workaround that doesn't violate our current promises for replication (though it does make them more pronounced and more likely to be noticed). I plan to handle this in three subtasks:

# the workaround to ensure that in the case of a cleanly closed WAL file we are parsing all of the bytes that we expect to be present
# a docs update that makes our promises around replication more precise (namely that we are at-least-once delivery, with no order guarantees)
# solving the proximal error on parsing while tailing the end of the active WAL
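
A rough sketch of the idea behind the first subtask, under stated assumptions: the class ClosedWalCheckSketch, the Decision enum, and the notion that the reader can report its last successfully parsed offset alongside a known file length are all hypothetical, and the eventual patch may look quite different.

```java
// Illustrative sketch only; none of these names exist in HBase.
final class ClosedWalCheckSketch {

    enum Decision { FINISHED, UNPARSED_BYTES_REMAIN }

    // Called when the reader reports that it has stopped (EOF or a parse failure).
    // For a cleanly closed file the length is known, so stopping short of it means
    // bytes we expect to be present have not been parsed yet.
    static Decision onReaderStopped(boolean cleanlyClosed, long lastParsedOffset, long knownLength) {
        if (cleanlyClosed && lastParsedOffset < knownLength) {
            return Decision.UNPARSED_BYTES_REMAIN; // do not move on to the next WAL
        }
        // Assumption: in every other case the sketch keeps the existing behavior.
        return Decision.FINISHED;
    }
}
```

The point of the check is only that "the reader stopped" and "the file is finished" become separate questions for a cleanly closed WAL; what to do when bytes remain (retry, reopen, alert) is left open here.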

> Replication improperly discards data from end-of-wal in some cases.
> -------------------------------------------------------------------
>
>                 Key: HBASE-15983
>                 URL: https://issues.apache.org/jira/browse/HBASE-15983
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.98.0, 1.0.0, 1.1.0, 1.2.0
>            Reporter: Sean Busbey
>            Assignee: Sean Busbey
>            Priority: Critical
>             Fix For: 2.0.0, 1.3.0, 1.0.4, 1.4.0, 1.2.2, 0.98.20, 1.1.6
>
>
> In some particular deployments, the Replication code believes it has
> reached EOF for a WAL prior to successfully parsing all bytes known to
> exist in a cleanly closed file.
> The underlying issue is that several different underlying problems with a WAL reader are all treated as end-of-file by the code in ReplicationSource that decides if a given WAL is completed or needs to be retried.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)