Posted to common-dev@hadoop.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2008/05/06 07:13:55 UTC

[jira] Commented: (HADOOP-3341) make key-value separators in hadoop streaming fully configurable

    [ https://issues.apache.org/jira/browse/HADOOP-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594469#action_12594469 ] 

Hadoop QA commented on HADOOP-3341:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12381472/3341-1.patch
  against trunk revision 653638.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no tests are needed for this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2404/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2404/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2404/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2404/console

This message is automatically generated.

> make key-value separators in hadoop streaming fully configurable
> ----------------------------------------------------------------
>
>                 Key: HADOOP-3341
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3341
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Zheng Shao
>         Attachments: 3341-1.patch
>
>
> By default, hadoop streaming uses TAB as the separator in all places.  However, in some environments, users may want to use customized separators (e.g., ^A = \u0001).
> The separator logic in hadoop streaming is very convoluted. Here is a brief summary:
> InputFormat {
>     KeyValueLineRecordReader.java:59:
> S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
> }
> Mapper {
>     PipeMapper.java:88: 
> S2: clientOut_.write('\t');
>     Call mapper process
>     PipeMapRed.java:124:
> S3: String mapOutputFieldSeparator = job_.get("stream.map.output.field.separator", "\t");
>     PipeMapRed.java:128:
>     this.numOfMapOutputKeyFields = job_.getInt("stream.num.map.output.key.fields", 1);
> }
> Reducer {
>     PipeReducer.java:78:
> S4: clientOut_.write('\t');
>     Call reducer process
>     PipeMapRed.java:125:
> S5: String reduceOutputFieldSeparator = job_.get("stream.reduce.output.field.separator", "\t");
>     PipeMapRed.java:129:
>     this.numOfReduceOutputKeyFields = job_.getInt("stream.num.reduce.output.key.fields", 1);
> }
> OutputFormat {
>     TextOutputFormat.java:112:
> S6: String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
> }
> Short-cuts: 
> 1. In case we use "TextInputFormat", S1 and S2 are not used at all. Lines are fed directly into the mapper (through the value part of the key-value pair; the keys, which are offsets, are ignored).
> 2. For jobs with no reducers, the "Reducer" step is skipped.
> We need to make S2 and S4 (the two hard-coded clientOut_.write('\t') calls) configurable, possibly under the following names for conformity:
> stream.map.input.field.separator
> stream.reduce.input.field.separator
> Then, by specifying: -jobconf key.value.separator.in.input.line=^A -jobconf stream.map.input.field.separator=^A -jobconf stream.map.output.field.separator=^A -jobconf stream.reduce.input.field.separator=^A -jobconf stream.reduce.output.field.separator=^A -jobconf mapred.textoutputformat.separator=^A, we will be able to use ^A instead of TAB in every place!
> Maybe hadoop streaming can also provide a single option to override these 6 options.
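
For illustration only, the "single option" idea in the quoted description could be sketched roughly as below. This is not taken from 3341-1.patch. The class name StreamSeparators and the option name stream.all.field.separator are hypothetical, and a plain java.util.Map stands in for the job configuration so that the sketch runs without Hadoop on the classpath. Each of the six keys falls back to the global option, which in turn falls back to TAB.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; not part of Hadoop streaming or the attached patch. */
public class StreamSeparators {
    // The six per-stage option names from the issue description, in pipeline order.
    public static final String[] STAGE_KEYS = {
        "key.value.separator.in.input.line",     // S1
        "stream.map.input.field.separator",      // S2 (proposed name)
        "stream.map.output.field.separator",     // S3
        "stream.reduce.input.field.separator",   // S4 (proposed name)
        "stream.reduce.output.field.separator",  // S5
        "mapred.textoutputformat.separator"      // S6
    };

    // Assumed name for the single override option; the issue does not name it.
    public static final String GLOBAL_KEY = "stream.all.field.separator";

    /** Resolves every stage separator: stage-specific value, else global value, else TAB. */
    public static Map<String, String> resolve(Map<String, String> conf) {
        String fallback = conf.getOrDefault(GLOBAL_KEY, "\t");
        Map<String, String> resolved = new LinkedHashMap<>();
        for (String key : STAGE_KEYS) {
            resolved.put(key, conf.getOrDefault(key, fallback));
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(GLOBAL_KEY, "\u0001");                       // ^A everywhere...
        conf.put("mapred.textoutputformat.separator", ",");   // ...except the final output
        resolve(conf).forEach((key, sep) ->
            System.out.println(key + " -> U+" + String.format("%04X", (int) sep.charAt(0))));
    }
}
```

In this sketch a stage-specific setting always wins over the global one, so existing jobs that already set, say, stream.map.output.field.separator would keep their behavior.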
