Posted to common-dev@hadoop.apache.org by "Joe Ellis (JIRA)" <ji...@apache.org> on 2016/04/26 23:13:12 UTC

[jira] [Created] (HADOOP-13064) LineReader reports incorrect number of bytes read resulting in correctness issues using LineRecordReader

Joe Ellis created HADOOP-13064:
----------------------------------

             Summary: LineReader reports incorrect number of bytes read resulting in correctness issues using LineRecordReader
                 Key: HADOOP-13064
                 URL: https://issues.apache.org/jira/browse/HADOOP-13064
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 2.7.1
            Reporter: Joe Ellis
            Priority: Critical


The specific issue we saw with LineReader is that when '\r\n' is passed in as the line delimiter, the number of bytes it claims to have read is less than the number it actually read. We narrowed this down to the case where the delimiter is split across the internal buffer boundary: if fillBuffer fills the buffer with "row\r" and the next call fills it with "\n", the number of bytes reported is 4 rather than 5.
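
To make the expected accounting concrete, here is a minimal, self-contained sketch of a buffered reader that counts delimiter bytes correctly across a buffer refill. It is not the Hadoop LineReader code: the class SplitDelimiterSketch and its methods are hypothetical names used only for illustration, and it assumes ASCII data and a delimiter that does not overlap with itself (which holds for "\r\n").

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Hadoop's LineReader.
class SplitDelimiterSketch {
    private final InputStream in;
    private final byte[] buffer;
    private final byte[] delimiter;
    private int bufferLength = 0; // number of valid bytes in buffer
    private int bufferPos = 0;    // next unread byte in buffer

    SplitDelimiterSketch(InputStream in, int bufferSize, byte[] delimiter) {
        this.in = in;
        this.buffer = new byte[bufferSize];
        this.delimiter = delimiter;
    }

    /** Reads one record into out and returns the bytes consumed, delimiter included (0 at EOF). */
    int readRecord(StringBuilder out) throws IOException {
        out.setLength(0);
        int consumed = 0;
        int matched = 0; // delimiter bytes matched so far; must survive a refill
        while (true) {
            if (bufferPos >= bufferLength) {
                bufferLength = in.read(buffer);
                bufferPos = 0;
                if (bufferLength <= 0) {
                    // EOF: a partial delimiter match is ordinary record data.
                    for (int i = 0; i < matched; i++) out.append((char) delimiter[i]);
                    bufferLength = 0;
                    return consumed;
                }
            }
            byte b = buffer[bufferPos++];
            consumed++; // every byte taken from the buffer is counted exactly once
            if (b == delimiter[matched]) {
                if (++matched == delimiter.length) return consumed;
            } else {
                // False start: the matched prefix belongs to the record after all.
                for (int i = 0; i < matched; i++) out.append((char) delimiter[i]);
                if (b == delimiter[0]) {
                    matched = 1;
                } else {
                    matched = 0;
                    out.append((char) b);
                }
            }
        }
    }

    /** Convenience wrapper: bytes consumed for each record in data, using "\r\n" as delimiter. */
    static int[] consumedPerRecord(String data, int bufferSize) {
        byte[] bytes = data.getBytes(StandardCharsets.US_ASCII);
        SplitDelimiterSketch reader = new SplitDelimiterSketch(
                new ByteArrayInputStream(bytes), bufferSize, new byte[] {'\r', '\n'});
        List<Integer> counts = new ArrayList<>();
        StringBuilder record = new StringBuilder();
        try {
            int n;
            while ((n = reader.readRecord(record)) > 0) counts.add(n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // A 4-byte buffer makes the first fill "row\r" and the second start with "\n".
        int[] counts = consumedPerRecord("row\r\nnext\r\n", 4);
        System.out.println(java.util.Arrays.toString(counts)); // prints [5, 6]
    }
}
```

With a 4-byte buffer the first record straddles the refill exactly as described above, and the count for it should be 5. A reader that reports 4 here has dropped the delimiter byte that arrived in the second fill.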

This causes correctness issues in LineRecordReader: if this off-by-one error occurs enough times while reading a split, the reader continues to read records past its split boundary, so records appear to come from multiple splits.
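
The effect on split handling can be illustrated with a simplified model. The sketch below is not the LineRecordReader implementation; it only assumes that the reader advances its position by the reported byte count and keeps reading while that position has not passed the split end. The class SplitOvershootSketch, the method recordsEmitted, and all parameter names are hypothetical.

```java
// Hypothetical model of position tracking; NOT the real LineRecordReader.
class SplitOvershootSketch {
    /**
     * Counts the records a reader emits for a split ending at splitEnd when
     * every record occupies recordLen bytes in the file.
     * undercountEvery = 0 means byte counts are always correct; a value k > 0
     * means every k-th record is reported one byte short.
     */
    static int recordsEmitted(long splitStart, long splitEnd, int recordLen,
                              int undercountEvery, long fileLen) {
        long pos = splitStart;    // position the reader believes it is at
        long actual = splitStart; // true offset in the file
        int records = 0;
        while (pos <= splitEnd && actual < fileLen) {
            records++;
            boolean undercounted = undercountEvery > 0 && records % undercountEvery == 0;
            actual += recordLen;
            pos += undercounted ? recordLen - 1 : recordLen;
        }
        return records;
    }

    public static void main(String[] args) {
        // 100-byte file of 5-byte records ("row\r\n"), split covering bytes 0..50.
        System.out.println(recordsEmitted(0, 50, 5, 0, 100)); // prints 11
        System.out.println(recordsEmitted(0, 50, 5, 1, 100)); // prints 13
    }
}
```

In this model the believed position lags the true offset by one byte per undercounted record, so the loop condition stays true for longer and the reader emits records that the reader of the next split will also emit.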



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)