Posted to common-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/09/19 04:46:00 UTC

[jira] [Commented] (HADOOP-18400) Fix file split duplicating records from a succeeding split when reading BZip2 text files

    [ https://issues.apache.org/jira/browse/HADOOP-18400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17606417#comment-17606417 ] 

ASF GitHub Bot commented on HADOOP-18400:
-----------------------------------------

aajisaka merged PR #4732:
URL: https://github.com/apache/hadoop/pull/4732




>  Fix file split duplicating records from a succeeding split when reading BZip2 text files 
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-18400
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18400
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 3.3.3, 3.3.4
>            Reporter: Ashutosh Gupta
>            Assignee: Ashutosh Gupta
>            Priority: Critical
>              Labels: pull-request-available
>
> Fix a data correctness issue in TextInputFormat that can occur when reading BZip2-compressed text files. When a file split's range does not include the start position of any BZip2 block, the split is expected to contain no records (i.e. the split is empty). However, if the exclusive end of that split falls exactly at the start of a BZip2 block, LineRecordReader returns all the records in that block. The job then reads those records twice, because the next split also returns every record in the same block (its range includes the start of that block).
> The bug is not triggered when the file split's range includes the start of at least one block and ends just before the start of another block. The reason lies in when BZip2CompressionInputStream updates its position in the BYBLOCK READMODE. In this read mode, the stream's position is updated only when the first byte past an end-of-block marker is read. The bug is that if the stream, when initialized, was adjusted to be at the end of one block, the position is not updated after the first byte of the next block is read. Instead, the position stays equal to the block marker the stream was initialized to. If the exclusive end position of the split equals the stream's position, LineRecordReader continues to read lines until the position is updated (and an additional record in the next block is read if needed).
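
The position handling described in the issue can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Hadoop's code and not the merged patch: `Block`, `ToyByBlockStream`, `read_split`, and the `update_after_init` switch are hypothetical names, the block offsets are made up, and the model ignores line straddling and the extra-record handling mentioned above. It only shows how a position that stays pinned to the block marker lets a `pos <= end` style check keep passing.

```python
from dataclasses import dataclass


@dataclass
class Block:
    start: int      # compressed offset of the block marker (made-up values)
    records: list   # decompressed lines stored in this block


class ToyByBlockStream:
    """Toy stand-in for a BYBLOCK stream; not Hadoop's implementation."""

    def __init__(self, blocks, split_start, update_after_init):
        self.blocks = blocks
        # Seek forward to the first block marker at or after split_start.
        self.idx = next(
            (i for i, b in enumerate(blocks) if b.start >= split_start),
            len(blocks),
        )
        self.rec = 0
        # The reported position starts at the marker we were adjusted to.
        self.pos = blocks[self.idx].start if self.idx < len(blocks) else None
        # True means the next byte read is the first one past a marker,
        # so reading it must move the reported position.
        # Hypothetical switch: False models the reported bug.
        self.pending_update = update_after_init

    def next_record(self):
        while self.idx < len(self.blocks):
            block = self.blocks[self.idx]
            if self.rec < len(block.records):
                if self.pending_update:
                    self.pos = block.start + 1  # toy "past the marker" value
                    self.pending_update = False
                self.rec += 1
                return block.records[self.rec - 1]
            # Block exhausted: crossing into the next block updates the position.
            self.idx += 1
            self.rec = 0
            self.pending_update = True
        return None


def read_split(blocks, start, end, update_after_init):
    """Records the toy reader emits for the split [start, end)."""
    stream = ToyByBlockStream(blocks, start, update_after_init)
    out = []
    while True:
        record = stream.next_record()
        # Stop at end of file or once the reported position passes the split end.
        if record is None or stream.pos > end:
            return out
        out.append(record)


blocks = [Block(0, ["a1", "a2"]), Block(100, ["b1", "b2"])]
splits = [(0, 50), (50, 100), (100, 200)]

buggy = [read_split(blocks, s, e, update_after_init=False) for s, e in splits]
fixed = [read_split(blocks, s, e, update_after_init=True) for s, e in splits]
print(buggy)  # [['a1', 'a2'], ['b1', 'b2'], ['b1', 'b2']]
print(fixed)  # [['a1', 'a2'], [], ['b1', 'b2']]
```

In this model the split [50, 100) contains no block start and ends exactly at the block that starts at offset 100. With the position left at 100 after initialization, the reader emits the whole block, and the split [100, 200) emits it again. The splits that do contain a block start behave the same either way, which matches the observation that the bug needs an otherwise empty split to show up.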



--
This message was sent by Atlassian Jira
(v8.20.10#820010)