
Posted to mapreduce-issues@hadoop.apache.org by "Dmitry Sivachenko (JIRA)" <ji...@apache.org> on 2014/09/11 17:36:34 UTC

[jira] [Created] (MAPREDUCE-6085) Facilitate processing of text files without key/value split

Dmitry Sivachenko created MAPREDUCE-6085:
--------------------------------------------

Summary: Facilitate processing of text files without key/value split
Key: MAPREDUCE-6085
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6085
Project: Hadoop Map/Reduce
Issue Type: Improvement
Affects Versions: 2.4.1
Reporter: Dmitry Sivachenko

A common kind of streaming task is processing text files line by line without splitting each line into a key/value pair (for example, UNIX commands such as grep and awk, or any filter script).

By default, the Hadoop streaming interface uses TextInputFormat, which suits this task well: it passes the input line itself to the streaming job's stdin.

The TextOutputReader class, which receives the streaming job's output, splits each line into a key/value pair, and TextOutputFormat then joins the pair back together with a separator.
In some cases this leaves an extra separator in the output.

KeyOnlyTextOutputReader solves this problem: it passes the whole line as a key with null value, and TextOutputFormat correctly writes it without any separators inserted.
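
The difference between the two readers can be illustrated with a small standalone model. This is a sketch in plain Java, not the Hadoop classes themselves: the class SeparatorDemo and its methods are hypothetical names, the separator is assumed to be a tab, and the model assumes that a line with no separator becomes a key with an empty (non-null) value, which is one case where the extra separator shows up.

```java
// Simplified model only; SeparatorDemo and its methods are hypothetical,
// not part of Hadoop. The tab separator is an assumption.
final class SeparatorDemo {
    static final String SEP = "\t";

    // Models a reader that splits at the first separator.
    // Assumption: with no separator, the whole line is the key and the value is "".
    static String[] splitKeyValue(String line) {
        int pos = line.indexOf(SEP);
        if (pos == -1) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + SEP.length()) };
    }

    // Models a key-only reader: the whole line is the key, the value is null.
    static String[] keyOnly(String line) {
        return new String[] { line, null };
    }

    // Models an output format that writes the separator only when
    // both key and value are non-null.
    static String join(String key, String value) {
        StringBuilder sb = new StringBuilder();
        if (key != null) {
            sb.append(key);
        }
        if (key != null && value != null) {
            sb.append(SEP);
        }
        if (value != null) {
            sb.append(value);
        }
        return sb.toString();
    }

    // Makes tabs visible when printing.
    static String show(String s) {
        return "\"" + s.replace("\t", "\\t") + "\"";
    }

    public static void main(String[] args) {
        String line = "hello world"; // contains no tab

        String[] split = splitKeyValue(line);
        System.out.println("split/join: " + show(join(split[0], split[1])));

        String[] whole = keyOnly(line);
        System.out.println("key only:   " + show(join(whole[0], whole[1])));
    }
}
```

In this model the split-and-join path turns "hello world" into "hello world\t", while the key-only path returns the line unchanged. A line that already contains a tab comes out the same either way.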

I propose adding another IdentifierResolver identifier, "keyonlytextoutput", which uses the standard TextInputWriter but replaces TextOutputReader with KeyOnlyTextOutputReader.

As a result, lines of text are never split into key/value pairs and never joined back together, so they appear in the output unmodified.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)