Posted to mapreduce-issues@hadoop.apache.org by "Michele Giusto (JIRA)" <ji...@apache.org> on 2014/01/13 16:11:13 UTC

[jira] [Commented] (MAPREDUCE-2254) Allow setting of end-of-record delimiter for TextInputFormat

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869600#comment-13869600 ] 

Michele Giusto commented on MAPREDUCE-2254:
-------------------------------------------

Hi everybody, I believe there is a bug when the custom record delimiter is longer than one character. For example, take the delimiter "#$&" and the line "...record1#$&record2#$&record3#$&..." divided between two consecutive input splits, with the second input split beginning just after the first "$" (so it starts with "&record2#$&record3#$&..."). In this case "record2" will not be read.
This happens because the mapper that processes the second split starts reading from the last character of the first input split (the "$"), so it loses the delimiter between "record1" and "record2". As a result, the constructor of the mapper tries to skip the last line of the previous input split, but it instead skips the first line of its own split and reports "record3" as the first line.
If you agree that this is a bug, a possible solution may be to modify the LineRecordReader class so that each input split (except the first one) starts reading not from the last character of the previous input split, but from a position further back by a number of characters equal to the length of the record delimiter (3 in my example).
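
To make the scenario concrete, here is a small Python model of the behaviour described above. It is only a sketch and not the actual LineRecordReader code: the function read_split and its back_off parameter are hypothetical names, and the model assumes that a reader for a non-first split backs up by back_off characters, discards everything up to and including the first full delimiter it finds, and then reads every record that starts before the end of its split.

```python
def read_split(data, start, end, delim, back_off=1):
    """Simplified model (an assumption, not Hadoop's code) of reading one split.

    data     -- whole file content as a string
    start    -- offset of the first character of this split
    end      -- offset just past the last character of this split
    delim    -- custom record delimiter
    back_off -- how many characters before `start` the reader begins scanning
    """
    pos = start
    if start != 0:
        # Back up, then skip the (presumed partial) record that belongs
        # to the previous split, up to and including the first delimiter.
        pos = max(0, start - back_off)
        idx = data.find(delim, pos)
        if idx == -1:
            return []
        pos = idx + len(delim)

    records = []
    # A record that starts before `end` is read completely,
    # even if it runs past the end of the split.
    while pos < end and pos < len(data):
        idx = data.find(delim, pos)
        if idx == -1:
            records.append(data[pos:])
            pos = len(data)
        else:
            records.append(data[pos:idx])
            pos = idx + len(delim)
    return records


data = "record1#$&record2#$&record3#$&"
delim = "#$&"
boundary = data.index("$") + 1  # second split starts right after the first "$"

first = read_split(data, 0, boundary, delim)
# Backing up by one character: the delimiter after "record1" is never seen whole.
buggy = read_split(data, boundary, len(data), delim, back_off=1)
# Backing up by the delimiter length, as proposed above.
fixed = read_split(data, boundary, len(data), delim, back_off=len(delim))

print(first)  # ['record1']
print(buggy)  # ['record3']
print(fixed)  # ['record2', 'record3']
```

In this model, backing up by the delimiter length recovers "record2" wherever the split boundary falls in the sample line, but the real reader also has to deal with buffering and compressed input, which the sketch ignores.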

> Allow setting of end-of-record delimiter for TextInputFormat
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-2254
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2254
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Ahmed Radwan
>            Assignee: Ahmed Radwan
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-2245.patch, MAPREDUCE-2254_r2.patch, MAPREDUCE-2254_r3.patch
>
>
> It will be useful to allow setting the end-of-record delimiter for TextInputFormat. The current implementation hardcodes '\n', '\r' or '\r\n' as the only possible record delimiters. This is a problem if users have embedded newlines in their data fields (which is pretty common). This is also a problem for other tools using this TextInputFormat (See for example: https://issues.apache.org/jira/browse/PIG-836 and https://issues.cloudera.org/browse/SQOOP-136).
> I have written a patch to address this issue. This patch allows users to specify any custom end-of-record delimiter using a newly added configuration property. For backward compatibility, if this new configuration property is absent, then exactly the same delimiters as before are used (i.e., '\n', '\r' or '\r\n').



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)