Posted to common-dev@hadoop.apache.org by "Milind Bhandarkar (JIRA)" <ji...@apache.org> on 2008/03/06 19:26:58 UTC
[jira] Created: (HADOOP-2954) In streaming, map-output cannot have
empty keys
In streaming, map-output cannot have empty keys
-----------------------------------------------
Key: HADOOP-2954
URL: https://issues.apache.org/jira/browse/HADOOP-2954
Project: Hadoop Core
Issue Type: Bug
Components: contrib/streaming
Affects Versions: 0.16.0
Environment: All
Reporter: Milind Bhandarkar
Assignee: Sameer Paranjpye
Fix For: 0.17.0
Here is the analysis for a job where the mapper and reducer are both /bin/cat,
with the default key field separator '\t' (tab).
For example, if the input line is:
\tSDSDFIKSDFSDFJS
the input for the mapper ('cat' in this case) is:
\tSDSDFIKSDFSDFJS
-
The output of the mapper is split into a (key, value) pair as below:
(key, value) -> (\tSDSDFIKSDFSDFJS, "")
(i.e. the value is empty)
The function that splits the output into a (key, value) pair for
streaming jobs ignores the first character of the line.
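The difference between the contract and the behavior described above can be sketched in a few lines of Python. This is only a model built from the description in this report, not the actual streaming source; the function names split_per_contract and split_as_reported are hypothetical.

```python
SEP = "\t"  # default key field separator, as stated in the report


def split_per_contract(line, sep=SEP):
    """Key is everything before the first separator; value is the rest."""
    key, _, value = line.partition(sep)
    return (key, value)


def split_as_reported(line, sep=SEP):
    """Model of the reported behavior: the separator search skips
    the first character of the line (assumption based on the report)."""
    pos = line.find(sep, 1)
    if pos == -1:
        # No separator found after the first character:
        # the whole line becomes the key and the value is empty.
        return (line, "")
    return (line[:pos], line[pos + len(sep):])


line = "\tSDSDFIKSDFSDFJS"
print(repr(split_per_contract(line)))
print(repr(split_as_reported(line)))
```

Under these assumptions the contract version yields an empty key for the sample line, while the modeled behavior keeps the leading tab inside the key and leaves the value empty.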
-
From the above (key, value) pair, the input for the reducer is
(key followed by separator followed by value):
\tSDSDFIKSDFSDFJS\t
If the reducer is set to NONE, the above line is the output of
the map task.
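As a small illustration of why the line gains a trailing tab, the following sketch rebuilds the reducer input as key, separator, value. The helper name join_key_value is hypothetical, and the pairs are taken from the example in this report.

```python
def join_key_value(key, value, sep="\t"):
    """Rebuild a line as key + separator + value."""
    return key + sep + value


# Pair as reported in this issue: the leading tab stays in the key.
reported_line = join_key_value("\tSDSDFIKSDFSDFJS", "")

# Pair the streaming contract would give: empty key, rest is the value.
contract_line = join_key_value("", "SDSDFIKSDFSDFJS")

print(repr(reported_line))
print(repr(contract_line))
```

With the reported pair the rebuilt line ends in an extra separator; with the empty-key pair it is identical to the original input line.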
-
The output of the reducer ('cat' in this case) is:
\tSDSDFIKSDFSDFJS\t
-
If the line starts with the field separator, records that should share
the empty key can be assigned to different reducers, because the line
may contain more than one instance of the field separator. For example:
input-line=\tABCDEFGH
key=\tABCDEFGH
value=
(value is empty)
output-line=\tABCDEFGH\t
input-line=\tABCDEFGHYH\tJHUHJH
key=\tABCDEFGHYH
value=JHUHJH
output-line=\tABCDEFGHYH\tJHUHJH
Assuming defaults (HashPartitioner), they are likely to be assigned to
different reducers because the keys are different.
The streaming contract says that everything from the beginning of the line up to the first tab is the key, so the key should be the empty string. But it is not.
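The partitioning consequence can be sketched as follows. The hash below is a simple stand-in written for illustration (a 31-based string hash masked to a non-negative value and taken modulo the reducer count); it is an assumption, not the hash Hadoop applies to its key type, and the names simple_hash and partition are hypothetical.

```python
def simple_hash(s):
    """Stand-in 32-bit string hash (assumption, not Hadoop's own)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def partition(key, num_reducers):
    """Hash partitioning: non-negative hash modulo the reducer count."""
    return (simple_hash(key) & 0x7FFFFFFF) % num_reducers


NUM_REDUCERS = 4  # arbitrary value chosen for the example

# Keys as reported in this issue: the leading tab is part of the key.
reported_keys = ["\tABCDEFGH", "\tABCDEFGHYH"]
# Keys the streaming contract would give for the same two lines.
contract_keys = ["", ""]

for key in reported_keys + contract_keys:
    print(repr(key), "->", partition(key, NUM_REDUCERS))
```

Because both lines have the same (empty) key under the contract, they always land in the same partition. With the reported keys the two records have distinct keys, so nothing guarantees that they reach the same reducer.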
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HADOOP-2954) In streaming, map-output cannot have
empty keys
Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amareshwari Sriramadasu resolved HADOOP-2954.
---------------------------------------------
Resolution: Duplicate
Fixed by HADOOP-3040
[jira] Updated: (HADOOP-2954) In streaming, map-output cannot have
empty keys
Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HADOOP-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Chansler updated HADOOP-2954:
------------------------------------
Fix Version/s: (was: 0.17.0)