You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-issues@hadoop.apache.org by "Dustin Cote (JIRA)" <ji...@apache.org> on 2015/11/14 22:39:11 UTC

[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dustin Cote updated MAPREDUCE-6549:
-----------------------------------
    Attachment: MAPREDUCE-6549-1.patch

Attaching a patch to basically remove the attempt to read the last incomplete record of an input and change the tests to test a more generic, imperfect scenario.  I'll add some more tests if review deems it necessary.  As far as I am aware, we should drop an incomplete record at the end of the input, which now this happens with this patch in addition to the correct number of records coming up in the middle of the input (where previously there were duplicates).

> multibyte delimiters with LineRecordReader cause duplicate records
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6549
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>            Reporter: Dustin Cote
>            Assignee: Dustin Cote
>         Attachments: MAPREDUCE-6549-1.patch
>
>
> LineRecorderReader currently produces duplicate records under certain scenarios such as:
> 1) input string: "abc+++def++ghi++" 
> delimiter string: "+++" 
> test passes with all sizes of the split 
> 2) input string: "abc++def+++ghi++" 
> delimiter string: "+++" 
> test fails with a split size of 4 
> 2) input string: "abc+++def++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 5 
> 3) input string "abc+++defg++hij++" 
> delimiter string: "++" 
> test fails with a split size of 4 
> 4) input string "abc++def+++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 9 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)