Posted to mapreduce-issues@hadoop.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2014/09/11 17:42:33 UTC

[jira] [Commented] (MAPREDUCE-6085) Facilitate processing of text files without key/value split

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130160#comment-14130160 ] 

Hadoop QA commented on MAPREDUCE-6085:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12668096/IdentifierResolver.java.patch
  against trunk revision 4be9517.

    -1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4868//console

This message is automatically generated.

> Facilitate processing of text files without key/value split
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-6085
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6085
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.4.1
>            Reporter: Dmitry Sivachenko
>         Attachments: IdentifierResolver.java.patch
>
>
> A common type of task is processing text files line by line in streaming mode without splitting each line into a key/value pair (UNIX commands such as grep and awk, or any filter script).
> By default, the Hadoop streaming interface uses TextInputFormat, which suits this task well: it passes the input line itself to the streaming job's stdin.
> The TextOutputReader class, which receives the streaming job's output, splits each line into a key/value pair, and TextOutputFormat then tries to join this pair back together with a separator.
> This results in an extra separator appearing in the output in some cases.
> KeyOnlyTextOutputReader solves this problem: it passes the whole line as the key with a null value, and TextOutputFormat correctly writes it without inserting any separator.
> I propose to add another IdentifierResolver entry, "keyonlytextoutput", which uses the standard TextInputWriter but replaces TextOutputReader with KeyOnlyTextOutputReader.
> As a result, lines of text are never split into key/value pairs and never joined back, so they appear in the output unmodified.
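
The split-and-rejoin behaviour described in the issue can be illustrated with a small standalone sketch. It does not use the Hadoop classes: the class KeyOnlySketch and the methods splitKeyValue, keyOnly and write are hypothetical stand-ins, and the tab separator and the "empty value when no separator is found" rule are assumptions about the default behaviour rather than code taken from the patch.

```java
public class KeyOnlySketch {
    // Assumed separator; a tab is used here purely for illustration.
    static final String SEP = "\t";

    // Stand-in for a reader that splits a line at the first separator.
    // Assumption: a line with no separator yields an empty, non-null value.
    static String[] splitKeyValue(String line) {
        int pos = line.indexOf(SEP);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + SEP.length()) };
    }

    // Stand-in for a key-only reader: the whole line is the key, the value is null.
    static String[] keyOnly(String line) {
        return new String[] { line, null };
    }

    // Stand-in for a writer that emits the separator only when both
    // the key and the value are non-null.
    static String write(String key, String value) {
        StringBuilder sb = new StringBuilder();
        if (key != null) {
            sb.append(key);
        }
        if (key != null && value != null) {
            sb.append(SEP);
        }
        if (value != null) {
            sb.append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = "a line without tabs";

        String[] kv = splitKeyValue(line);
        // Brackets make a trailing separator visible.
        System.out.println("[" + write(kv[0], kv[1]) + "]");

        String[] k = keyOnly(line);
        System.out.println("[" + write(k[0], k[1]) + "]");
    }
}
```

Under these assumptions, the first line printed ends with a trailing tab before the closing bracket, while the second reproduces the input line unchanged, which is the effect the key-only reader is meant to achieve.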



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)