Posted to mapreduce-issues@hadoop.apache.org by "Dmitry Sivachenko (JIRA)" <ji...@apache.org> on 2014/09/11 17:36:34 UTC

[jira] [Created] (MAPREDUCE-6085) Facilitate processing of text files without key/value split

Dmitry Sivachenko created MAPREDUCE-6085:
--------------------------------------------

             Summary: Facilitate processing of text files without key/value split
                 Key: MAPREDUCE-6085
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6085
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
    Affects Versions: 2.4.1
            Reporter: Dmitry Sivachenko


A rather common type of task is processing text files line by line in streaming mode, without splitting each line into a key/value pair (UNIX commands such as grep and awk, or any filter script).

By default, the Hadoop streaming interface uses TextInputFormat, which suits this task well: it passes the input line itself to the streaming job's stdin.

The TextOutputReader class, which receives the streaming job's output, splits each line into a key/value pair, and TextOutputFormat then tries to join this pair back together with a separator.
This results in an extra separator appearing in the output in some cases.

KeyOnlyTextOutputReader solves this problem: it passes the whole line as the key with a null value, and TextOutputFormat correctly writes it without inserting any separator.
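
The difference between the two readers can be illustrated with a small standalone model. This is only a sketch and not the Hadoop source: it uses plain strings instead of Text/NullWritable, the class and method names (SeparatorDemo, readAsKeyValue, readAsKeyOnly, write) are made up for illustration, and it assumes the extra separator shows up for lines that contain no tab, because the value then becomes an empty string rather than null.

```java
public class SeparatorDemo {
    // Assumed default key/value separator (a tab).
    static final String SEPARATOR = "\t";

    /** Simplified model of TextOutputReader: split at the first separator.
     *  Assumption: with no separator, the value is an empty string, not null. */
    public static String[] readAsKeyValue(String line) {
        int pos = line.indexOf(SEPARATOR);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] {
            line.substring(0, pos),
            line.substring(pos + SEPARATOR.length())
        };
    }

    /** Simplified model of KeyOnlyTextOutputReader: whole line is the key, value is null. */
    public static String[] readAsKeyOnly(String line) {
        return new String[] { line, null };
    }

    /** Simplified model of TextOutputFormat: the separator is written
     *  only when both key and value are non-null. */
    public static String write(String key, String value) {
        StringBuilder out = new StringBuilder();
        if (key != null) {
            out.append(key);
        }
        if (key != null && value != null) {
            out.append(SEPARATOR);
        }
        if (value != null) {
            out.append(value);
        }
        return out.toString();
    }

    public static String roundTripKeyValue(String line) {
        String[] kv = readAsKeyValue(line);
        return write(kv[0], kv[1]);
    }

    public static String roundTripKeyOnly(String line) {
        String[] kv = readAsKeyOnly(line);
        return write(kv[0], kv[1]);
    }

    public static void main(String[] args) {
        String line = "no tab here";
        // Brackets make a trailing separator visible in the printed output.
        System.out.println("[" + roundTripKeyValue(line) + "]");
        System.out.println("[" + roundTripKeyOnly(line) + "]");
    }
}
```

In this model the key/value round trip appends a trailing tab to a line that had none, while the key-only round trip returns the line unchanged.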

I propose to add another IdentifierResolver, "keyonlytextoutput", which uses the standard TextInputWriter but replaces TextOutputReader with KeyOnlyTextOutputReader.
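
The proposed mapping could be sketched roughly as follows. This is a standalone illustration, not a patch: ResolverSketch and IoClasses are hypothetical names, class names are carried as plain strings so the example compiles without Hadoop on the classpath, and it assumes "keyonlytextoutput" is handled as an identifier string that selects the writer/reader pair.

```java
public class ResolverSketch {

    /** Hypothetical holder for the writer/reader pair chosen for an identifier.
     *  Strings stand in for the real streaming classes. */
    public static final class IoClasses {
        public final String inputWriter;
        public final String outputReader;

        IoClasses(String inputWriter, String outputReader) {
            this.inputWriter = inputWriter;
            this.outputReader = outputReader;
        }
    }

    /** Proposed identifier from this issue. */
    public static final String KEY_ONLY_TEXT_OUTPUT_ID = "keyonlytextoutput";

    /** Resolve an identifier to a writer/reader pair (simplified model). */
    public static IoClasses resolve(String identifier) {
        if (KEY_ONLY_TEXT_OUTPUT_ID.equalsIgnoreCase(identifier)) {
            // Standard input writer, key-only output reader.
            return new IoClasses("TextInputWriter", "KeyOnlyTextOutputReader");
        }
        // Anything else falls back to the plain text pair in this sketch.
        return new IoClasses("TextInputWriter", "TextOutputReader");
    }

    public static void main(String[] args) {
        IoClasses io = resolve("keyonlytextoutput");
        System.out.println(io.inputWriter + " / " + io.outputReader);
    }
}
```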

As a result, lines of text are never split into key/value pairs and never joined back, so they appear in the output unmodified.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)