Posted to common-dev@hadoop.apache.org by "Johan Oskarsson (JIRA)" <ji...@apache.org> on 2009/05/19 18:32:45 UTC

[jira] Resolved: (HADOOP-4913) When using the Hadoop streaming jar if the reduce job outputs only a value (no key) the code incorrectly outputs the value along with the tab character (key/value) separator.

     [ https://issues.apache.org/jira/browse/HADOOP-4913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson resolved HADOOP-4913.
-------------------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: site)

You can do this in user code by implementing an output format that ignores the key and only saves the value. Have a look at TextOutputFormat for guidance.
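A minimal sketch of that approach, modeled on the shape of TextOutputFormat's internal line writer but written here as plain Java so the write logic is visible on its own (the class name ValueOnlyWriter is illustrative, not the real Hadoop API; a real implementation would extend FileOutputFormat and return a RecordWriter):

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Illustrative value-only writer: mirrors the shape of TextOutputFormat's
// line writer, but drops the key and the tab separator entirely.
public class ValueOnlyWriter {
    private static final String NEWLINE = "\n";
    private final Writer out;

    public ValueOnlyWriter(Writer out) {
        this.out = out;
    }

    // Ignore the key; emit only the value followed by a newline.
    public void write(Object key, Object value) throws IOException {
        if (value != null) {
            out.write(value.toString());
        }
        out.write(NEWLINE);
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        ValueOnlyWriter writer = new ValueOnlyWriter(sink);
        writer.write("ignoredKey", "line one");
        writer.write("", "line two");
        System.out.print(sink.toString());
    }
}
```

Because the key is never consulted, it does not matter whether streaming hands over a null key, an empty key, or a populated one; no separator is ever written.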

> When using the Hadoop streaming jar if the reduce job outputs only a value (no key) the code incorrectly outputs the value along with the tab character (key/value) separator.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4913
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4913
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/streaming
>    Affects Versions: 0.18.2
>         Environment: Red Hat Linux 5.
>            Reporter: John Fisher
>            Priority: Minor
>
> I would like the output of my streaming job to be only the value, omitting the key and the key/value separator.  However, when printing only the value I am noticing that each line ends with a tab character.  I believe I have tracked down the issue (described below), but I'm not 100% sure.  The fix is working for me, though, so I figured it should perhaps be incorporated into the code base.
> The tab gets printed because of a faulty check in the TextOutputFormat code.  It checks whether the "key" and "value" objects are null: if both are non-null, the line is printed as <key><separator><value>; otherwise only the key or the value is printed, depending on which is defined.  The bug is that in streaming the key and value are always defined.  I traced further up to see whether these objects were being defined when they shouldn't be, but that appears to be how it is meant to work.  I changed the Hadoop code to check for a null object and also for an empty string.
> *** Patch code begin ***
> if( ! nullKey ) {
>   nullKey = ( key.toString().length() == 0 );
> }
> if( ! nullValue ) {
>   nullValue = ( value.toString().length() == 0 );
> }
> *** Patch code end ***
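To see the effect of the quoted patch outside of Hadoop, here is a self-contained sketch of TextOutputFormat's line-writing decision with the reporter's extra empty-string checks folded in (simplified and illustrative; the real method writes bytes through a DataOutputStream rather than returning a String):

```java
// Simplified model of TextOutputFormat's write() decision, with the
// reporter's patch applied: an empty string is treated like a null.
public class PatchedLineWriter {
    private static final String SEPARATOR = "\t";

    public static String formatLine(Object key, Object value) {
        boolean nullKey = (key == null);
        boolean nullValue = (value == null);
        // The patch: an empty toString() also counts as "not there".
        if (!nullKey) {
            nullKey = (key.toString().length() == 0);
        }
        if (!nullValue) {
            nullValue = (value.toString().length() == 0);
        }
        StringBuilder line = new StringBuilder();
        if (!nullKey) {
            line.append(key.toString());
        }
        if (!nullKey && !nullValue) {
            line.append(SEPARATOR);
        }
        if (!nullValue) {
            line.append(value.toString());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Streaming always hands over non-null objects, so without the
        // patch an empty key still produced "<key>\t<value>", i.e. a
        // spurious separator. With the patch the separator is dropped.
        System.out.println(formatLine("", "value"));    // prints "value"
        System.out.println(formatLine("key", "value")); // prints "key\tvalue"
    }
}
```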
> The OutputCollector calls the TextOutputFormat.write() method with whatever objects are passed into it (see ReduceTask.java, line 300), so that is fine.
> But above that, if you look at PipeMapRed.java, in the run() method you will see that the code creates new key and value objects and then starts reading lines and feeding them to the OutputCollector.  This is why the key and value are always defined by the time they reach TextOutputFormat.write(), and why we always see the tab.
> Thanks,
> John

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.