Posted to common-dev@hadoop.apache.org by "Abdul Qadeer (JIRA)" <ji...@apache.org> on 2008/09/16 07:37:44 UTC

[jira] Created: (HADOOP-4182) Streaming Documentation Update

Streaming Documentation Update
------------------------------

                 Key: HADOOP-4182
                 URL: https://issues.apache.org/jira/browse/HADOOP-4182
             Project: Hadoop Core
          Issue Type: Improvement
          Components: contrib/streaming
    Affects Versions: 0.19.0
            Reporter: Abdul Qadeer
            Priority: Minor
             Fix For: 0.19.0


When Text input data is used with streaming, every line is expected to end with a newline.  Hadoop results are undefined if input files do not end in a newline.  (The results will depend on how files are assigned to mappers.)

Example:

In streaming, if

mapper = xargs cat
reducer = cat

and the input consists of two lines, each of which is a symbolic link in HDFS:

link1\n
link2\n
EOF

link1 points to a file which contains

This is line1EOF

link2 points to a file which  contains

This is line2EOF

Running a streaming job with only one split produces:

This is line1This is line2\t\n

But with two splits, the result is:

This is line1\t\n
This is line2\t\n

In summary, the output depends on how many mappers were invoked.  As a caution, the Streaming wiki should note that users must always end each line with a newline to avoid such problems.
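
The effect can be reproduced outside Hadoop. The following is a minimal Python sketch, not Hadoop code: cat, to_records, and run are hypothetical helper names, and splitting the mapper output on newlines stands in for the line-based record handling described above.

```python
def cat(files):
    """Mimic `cat`: concatenate file contents exactly as they are."""
    return "".join(files)


def to_records(mapper_output):
    """Split mapper output into line records (assumed line-based handling)."""
    return mapper_output.splitlines()


def run(splits):
    """Run one simulated `xargs cat` mapper per split and collect the records."""
    records = []
    for files_in_split in splits:
        records.extend(to_records(cat(files_in_split)))
    return records


file1 = "This is line1"  # no trailing newline, as in the example
file2 = "This is line2"

one_split = run([[file1, file2]])     # both files go to a single mapper
two_splits = run([[file1], [file2]])  # one file per mapper

# With a trailing newline on each file, the split count no longer matters.
fixed_one_split = run([[file1 + "\n", file2 + "\n"]])
fixed_two_splits = run([[file1 + "\n"], [file2 + "\n"]])

print(one_split)
print(two_splits)
print(fixed_one_split == fixed_two_splits)
```

In this sketch the single-split run yields one merged record while the two-split run yields two, and the difference disappears once each file ends with a newline.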

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4182) Streaming Documentation Update

Posted by "Abdul Qadeer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633973#action_12633973 ] 

Abdul Qadeer commented on HADOOP-4182:
--------------------------------------

I agree with you that it is a problem at the application / user level.  I
only wanted to put a simple note somewhere on the Hadoop Wiki saying that
a line must end with an end-of-line delimiter.  Otherwise, users might see
different behavior, as I explained earlier.  This simple note can protect
users from unexpected results.




[jira] Commented: (HADOOP-4182) Streaming Documentation Update

Posted by "Abdul Qadeer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633982#action_12633982 ] 

Abdul Qadeer commented on HADOOP-4182:
--------------------------------------

I updated the wiki page http://wiki.apache.org/hadoop/HadoopStreaming?action=diff as follows.

Line 28: 

-Default Map input format: a line is a record in UTF-8
-  the key part ends at first TAB, the rest of the line is the value 


+Default Map input format: a line is a record in UTF-8. Every line must end
+  with an 'end of line' delimiter. The key part ends at first TAB, the rest
+  of the line is the value 
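
The key/value rule in the wiki text above can be illustrated with a short Python sketch. This is an assumption-laden illustration, not the Hadoop implementation: split_record is a hypothetical name, and the handling of a line with no TAB (whole line as key, empty value) is inferred from the trailing \t in the outputs shown in the issue description.

```python
def split_record(line):
    """Split one newline-terminated line into (key, value) at the first TAB."""
    # Drop the end-of-line delimiter, then cut at the first TAB only.
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def format_record(key, value):
    """Write a record back out as key, TAB, value, newline."""
    return key + "\t" + value + "\n"


with_tab = split_record("k1\tsome value\twith a tab\n")
without_tab = split_record("This is line1\n")
round_trip = format_record(*without_tab)

print(with_tab)
print(without_tab)
print(repr(round_trip))
```

Only the first TAB separates key from value; any later TAB stays inside the value.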



[jira] Resolved: (HADOOP-4182) Streaming Documentation Update

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved HADOOP-4182.
-----------------------------------

    Resolution: Won't Fix

There isn't much that streaming can do. In the first case, your application gives the streaming framework:

line1line2EOF

In the second case you give it:

line1EOF in one map

and 

line2EOF in the second map

Streaming needs line-based data, so the entire input is treated as one line, leading to the differences that you observed.
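
Since the fix belongs in the application, one possible approach is for the mapper to append a newline whenever a file lacks one. The following Python sketch is only an illustration under that assumption: copy_with_newline is a hypothetical helper, and in-memory streams stand in for the two files from the example.

```python
import io


def copy_with_newline(src, dst):
    """Copy binary stream src to dst, adding a final newline if it is missing."""
    data = src.read()
    dst.write(data)
    if data and not data.endswith(b"\n"):
        dst.write(b"\n")


# In-memory stand-ins for the files behind link1 and link2.
out = io.BytesIO()
for content in (b"This is line1", b"This is line2"):
    copy_with_newline(io.BytesIO(content), out)

result = out.getvalue()
print(result)
```

With this in place of a plain cat, each file contributes complete lines regardless of how the files are grouped into splits.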
