Posted to common-issues@hadoop.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2011/02/01 23:19:29 UTC

[jira] Commented: (HADOOP-7096) Allow setting of end-of-record delimiter for TextInputFormat

    [ https://issues.apache.org/jira/browse/HADOOP-7096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989411#comment-12989411 ] 

Todd Lipcon commented on HADOOP-7096:
-------------------------------------

A few style nits:
- after recordDelimiterBytes, need a blank line before the javadoc starts
- the comment on this line:
{code}
    } else { //recordDelimiterBytes != null, use default delimite
{code}
seems like it's backwards - don't you mean "== null" in the else clause? Also typo "delimite"
- Worth considering splitting readLine into two methods, one for the true case, one for the false case
- Can you make recordDelimiterBytes final?
- In the MAX_VALUE case with a custom delimiter, the IOException should say "Too many bytes before delimiter" rather than "before newline"
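
To make the nits above concrete, here is a rough sketch of the shape I have in mind. It is not the code from the attached patch: the class name DelimitedLineReader, the method names readDefaultLine/readCustomLine, and the helpers checkedLength, endsWith and readAll are all made up for illustration. It also reads byte by byte from a plain InputStream instead of using LineReader's buffering, so treat it as an outline only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical stand-in for LineReader; not the patch's code. */
class DelimitedLineReader {
  private final PushbackInputStream in;
  private final byte[] recordDelimiterBytes;

  /** A null delimiter selects the default '\n', '\r' or '\r\n' handling. */
  DelimitedLineReader(InputStream in, byte[] recordDelimiterBytes) {
    if (recordDelimiterBytes != null && recordDelimiterBytes.length == 0) {
      throw new IllegalArgumentException("Empty record delimiter");
    }
    this.in = new PushbackInputStream(in);
    this.recordDelimiterBytes = recordDelimiterBytes;
  }

  /** Appends one record to out; returns bytes consumed, 0 at end of stream. */
  int readLine(ByteArrayOutputStream out) throws IOException {
    if (recordDelimiterBytes != null) {
      return readCustomLine(out);
    } else { // recordDelimiterBytes == null, use default delimiters
      return readDefaultLine(out);
    }
  }

  private int readDefaultLine(ByteArrayOutputStream out) throws IOException {
    long consumed = 0;
    int b;
    while ((b = in.read()) != -1) {
      consumed++;
      if (b == '\n') break;
      if (b == '\r') {
        int next = in.read();
        if (next == '\n') consumed++;           // '\r\n' is one delimiter
        else if (next != -1) in.unread(next);   // lone '\r': put the byte back
        break;
      }
      out.write(b);
    }
    return checkedLength(consumed, "newline");
  }

  private int readCustomLine(ByteArrayOutputStream out) throws IOException {
    ByteArrayOutputStream record = new ByteArrayOutputStream();
    int delimLen = recordDelimiterBytes.length;
    byte last = recordDelimiterBytes[delimLen - 1];
    long consumed = 0;
    int b;
    while ((b = in.read()) != -1) {
      consumed++;
      record.write(b);
      if ((byte) b == last) { // only check the tail when it could match
        byte[] bytes = record.toByteArray();
        if (endsWith(bytes, recordDelimiterBytes)) {
          out.write(bytes, 0, bytes.length - delimLen);
          return checkedLength(consumed, "delimiter");
        }
      }
    }
    record.writeTo(out); // end of stream without a trailing delimiter
    return checkedLength(consumed, "delimiter");
  }

  private static boolean endsWith(byte[] data, byte[] suffix) {
    if (data.length < suffix.length) return false;
    int offset = data.length - suffix.length;
    for (int i = 0; i < suffix.length; i++) {
      if (data[offset + i] != suffix[i]) return false;
    }
    return true;
  }

  private static int checkedLength(long consumed, String what) throws IOException {
    if (consumed > Integer.MAX_VALUE) {
      throw new IOException("Too many bytes before " + what + ": " + consumed);
    }
    return (int) consumed;
  }

  /** Convenience helper for experiments: splits a whole string into records. */
  static List<String> readAll(String data, String delimiter) {
    byte[] delim = delimiter == null ? null : delimiter.getBytes(StandardCharsets.UTF_8);
    DelimitedLineReader reader = new DelimitedLineReader(
        new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8)), delim);
    List<String> records = new ArrayList<>();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
      while (reader.readLine(out) > 0) {
        records.add(new String(out.toByteArray(), StandardCharsets.UTF_8));
        out.reset();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return records;
  }
}
```

With a custom delimiter of "||", DelimitedLineReader.readAll("a\nb||c", "||") keeps the embedded newline inside the first record, while passing null as the delimiter falls back to splitting on '\n', '\r' or '\r\n'.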


Although there aren't currently any unit tests for LineReader in Common, it would be really great if you had time to add a couple. Otherwise we can probably consider this as being tested by the new InputFormat unit tests in MAPREDUCE-2254.

> Allow setting of end-of-record delimiter for TextInputFormat
> ------------------------------------------------------------
>
>                 Key: HADOOP-7096
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7096
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Ahmed Radwan
>         Attachments: HADOOP-7096.patch, HADOOP-7096_r2.patch
>
>
> The patch for https://issues.apache.org/jira/browse/MAPREDUCE-2254 required minor changes to the LineReader class to allow extensions (see attached 2.patch). Description copied below:
> It will be useful to allow setting the end-of-record delimiter for TextInputFormat. The current implementation hardcodes '\n', '\r' or '\r\n' as the only possible record delimiters. This is a problem if users have embedded newlines in their data fields (which is pretty common). This is also a problem for other tools using this TextInputFormat (See for example: https://issues.apache.org/jira/browse/PIG-836 and https://issues.cloudera.org/browse/SQOOP-136).
> I have written a patch to address this issue. This patch allows users to specify any custom end-of-record delimiter using a newly added configuration property. For backward compatibility, if this new configuration property is absent, then exactly the same delimiters as before are used (i.e., '\n', '\r' or '\r\n').
