Posted to issues@hbase.apache.org by "Duo Zhang (JIRA)" <ji...@apache.org> on 2016/02/11 05:17:18 UTC

[jira] [Commented] (HBASE-15252) Data loss when replaying wal if HDFS timeout

    [ https://issues.apache.org/jira/browse/HBASE-15252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142227#comment-15142227 ] 

Duo Zhang commented on HBASE-15252:
-----------------------------------

Changing the exception type back to IPBE can solve the problem (which causes openHRegion to fail with an IOException), but I want to revisit the readNext method because I'm a little confused about how we deal with {{EOFException}}.

{code:title=ProtobufLogReader.java}
      } catch (EOFException eof) {
        LOG.trace("Encountered a malformed edit, seeking back to last good position in file", eof);
        // If originalPosition is < 0, it is rubbish and we cannot use it (probably local fs)
        if (originalPosition < 0) throw eof;
        // Else restore our position to original location in hope that next time through we will
        // read successfully.
        seekOnFs(originalPosition);
        return false;
      }
{code}

Here we seek to the last good position, but we call "return false" instead of "continue". This causes the {{next}} method of {{ReaderBase}} to return null, which makes the upper layer think it has reached the end of file and close the current log reader. So what is the purpose of the seek here? And in fact, if the {{EOFException}} really means end of file, I do not think we could read a valid wal entry successfully when retrying...
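To make the distinction concrete, here is a minimal, self-contained sketch of the behaviour being argued for: only a genuine parse failure (likely a partial trailing record) should be converted to {{EOFException}}, while any other {{IOException}} (e.g. an HDFS timeout) should propagate so the caller does not mistake it for end of file. This is an illustration, not the actual HBase code; {{ParseError}} and {{readNext}} here are hypothetical stand-ins for {{InvalidProtocolBufferException}} and the real reader method.

```java
import java.io.EOFException;
import java.io.IOException;

public class ReadNextSketch {
    // Hypothetical stand-in for protobuf's InvalidProtocolBufferException.
    static class ParseError extends IOException {
        ParseError(String msg) { super(msg); }
    }

    // Sketch of the narrowed catch: a parse failure becomes EOFException
    // (treat as end of file); any other IOException falls through uncaught,
    // so the caller fails the replay instead of silently losing data.
    static boolean readNext(IOException simulated) throws IOException {
        try {
            if (simulated != null) throw simulated;
            return true; // a WAL entry was read successfully
        } catch (ParseError pe) {
            throw (EOFException) new EOFException("Invalid PB, EOF?").initCause(pe);
        }
    }

    public static void main(String[] args) {
        try {
            readNext(new ParseError("truncated trailing record"));
        } catch (EOFException eof) {
            System.out.println("EOF: treat as normal end of file");
        } catch (IOException ioe) {
            System.out.println("unexpected: " + ioe);
        }
        try {
            readNext(new IOException("HDFS read timed out"));
        } catch (EOFException eof) {
            System.out.println("unexpected EOF");
        } catch (IOException ioe) {
            System.out.println("IOException propagates: fail the replay, no data loss");
        }
    }
}
```

With this shape, a timeout during replay surfaces as an {{IOException}} and aborts the region open, rather than being swallowed as a normal end of file.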

Thanks. 

> Data loss when replaying wal if HDFS timeout
> --------------------------------------------
>
>                 Key: HBASE-15252
>                 URL: https://issues.apache.org/jira/browse/HBASE-15252
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>         Attachments: HBASE-15252-testcase.patch
>
>
> This is a problem introduced by HBASE-13825, where we changed the exception type in the catch block of the {{readNext}} method of {{ProtobufLogReader}}.
> {code:title=ProtobufLogReader.java}
>       try {
>           ......
>           ProtobufUtil.mergeFrom(builder, new LimitInputStream(this.inputStream, size),
>             (int)size);
>         } catch (IOException ipbe) { // <------ used to be InvalidProtocolBufferException
>           throw (EOFException) new EOFException("Invalid PB, EOF? Ignoring; originalPosition=" +
>             originalPosition + ", currentPosition=" + this.inputStream.getPos() +
>             ", messageSize=" + size + ", currentAvailable=" + available).initCause(ipbe);
>         }
> {code}
> Here, if the {{inputStream}} throws an {{IOException}} due to a timeout or something similar, we just convert it to an {{EOFException}}, and at the bottom of this method we ignore the {{EOFException}} and return false. This causes the upper layer to think we have reached the end of file. So when replaying, we will treat the HDFS timeout error as a normal end of file, which causes data loss.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)