
Posted to common-dev@hadoop.apache.org by "Milind Bhandarkar (JIRA)" <ji...@apache.org> on 2008/03/06 19:26:58 UTC

[jira] Created: (HADOOP-2954) In streaming, map-output cannot have empty keys

In streaming, map-output cannot have empty keys
-----------------------------------------------

                 Key: HADOOP-2954
                 URL: https://issues.apache.org/jira/browse/HADOOP-2954
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/streaming
    Affects Versions: 0.16.0
         Environment: All
            Reporter: Milind Bhandarkar
            Assignee: Sameer Paranjpye
             Fix For: 0.17.0


Here is the analysis for the case where both the mapper and the reducer are /bin/cat.

default key field separator: '\t' (or tab)


for example, if the input line is:

\tSDSDFIKSDFSDFJS

the input for the mapper ('cat' in this case) is:

\tSDSDFIKSDFSDFJS

-

the output of the mapper is split into a key, value pair as below:

(key, value) -> (\tSDSDFIKSDFSDFJS, "")
(i.e. the value is empty)

the function that splits the output into a (key, value) pair for
streaming jobs ignores the first character of the line, so a leading
separator is never treated as the key/value boundary
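
A minimal Python sketch of the behavior described above, for illustration
only. It is not the streaming code itself: the function name split_observed
is hypothetical, and starting the separator search at index 1 is an
assumption that models "ignores the first character of the line".

```python
SEP = "\t"  # default key field separator


def split_observed(line, sep=SEP):
    """Model of the reported behavior (hypothetical helper, not Hadoop code).

    The separator search starts at index 1, so a separator in the first
    position of the line is never found.
    """
    pos = line.find(sep, 1)
    if pos == -1:
        # No separator found: the whole line becomes the key.
        return line, ""
    return line[:pos], line[pos + len(sep):]


key, value = split_observed("\tSDSDFIKSDFSDFJS")
print(repr(key), repr(value))
```

Under this model the leading tab stays inside the key and the value is
empty, which matches the pair shown above.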

-

from the above (key, value) pair, the input for the reducer is:
(key followed by separator followed by value)

\tSDSDFIKSDFSDFJS\t

if the reducer is set to NONE, the above line is the output of
the map task

-

the output of the reducer ('cat' in this case) is:

\tSDSDFIKSDFSDFJS\t

-

if lines start with the field separator, mapper output that should
share the empty key can be assigned to different reducers, because a
line may contain more than one instance of the field separator and the
text up to the second instance is then taken as the key - for example:

input-line=\tABCDEFGH
key=\tABCDEFGH
value=
(value is empty)
output-line=\tABCDEFGH\t

input-line=\tABCDEFGHYH\tJHUHJH
key=\tABCDEFGHYH
value=JHUHJH
output-line=\tABCDEFGHYH\tJHUHJH

assuming defaults (HashPartitioner), they are likely to be assigned to
different reducers because the keys are different.
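
To illustrate why different keys can land on different reducers, here is a
rough Python sketch of hash partitioning. The function name partition, the
choice of zlib.crc32 as the hash, and the reducer count of 4 are all
assumptions made for the example; Hadoop's HashPartitioner uses the key's
own hash code, so the actual partition numbers will differ.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index from a hash of the key (illustrative only).

    zlib.crc32 is a stand-in hash, not the one Hadoop uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# The two keys produced by the examples above.
keys = ["\tABCDEFGH", "\tABCDEFGHYH"]
for k in keys:
    print(repr(k), "-> reducer", partition(k, 4))
```

Equal keys always map to the same reducer; unequal keys carry no such
guarantee.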

The streaming contract says that everything from the beginning of the line up to the first tab is the key, so the key should be the empty string. But it is not.
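
For comparison, a short Python sketch of the split the contract describes.
The helper name split_by_contract is hypothetical, and this is not the fix
that was committed; it only shows the expected result for the two example
lines.

```python
def split_by_contract(line, sep="\t"):
    """Split at the FIRST separator, even when it is the first character.

    If the line has no separator, the whole line is the key and the
    value is empty.
    """
    key, _found, value = line.partition(sep)
    return key, value


for line in ["\tABCDEFGH", "\tABCDEFGHYH\tJHUHJH"]:
    key, value = split_by_contract(line)
    print(repr(key), repr(value))
```

Both example lines then yield the empty string as the key, so a hash-based
partitioner would send them to the same reducer.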


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (HADOOP-2954) In streaming, map-output cannot have empty keys

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu resolved HADOOP-2954.
---------------------------------------------

    Resolution: Duplicate

Fixed by HADOOP-3040



[jira] Updated: (HADOOP-2954) In streaming, map-output cannot have empty keys

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-2954:
------------------------------------

    Fix Version/s:     (was: 0.17.0)

