Posted to common-user@hadoop.apache.org by Dan Starr <ds...@gmail.com> on 2010/02/18 04:45:40 UTC

Hadoop Streaming File-not-found error on Cloudera's training VM

Hi, I've tried posting this to Cloudera's community support site, but
the community website, getsatisfaction.com, is returning server
errors at the moment.  I believe the following issue is related to
my environment within Cloudera's Training virtual machine.

Despite having had success running Hadoop streaming on other Hadoop
clusters, and on Cloudera's Training VM in local mode, I'm currently
getting an error when attempting to run a simple Hadoop streaming job
in the normal queue-based mode on the Training VM.  I suspect the
error described below stems from the worker node not recognizing the
Python interpreter referenced in the script's shebang line.

The hadoop command I am executing is:

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar \
  -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -input test_input/* -output output

Where the test_input directory contains three UNIX-formatted, single-line files:

training-vm: 3$ hadoop dfs -ls /user/training/test_input/
Found 3 items
-rw-r--r--   1 training supergroup         11 2010-02-17 10:48
/user/training/test_input/file1
-rw-r--r--   1 training supergroup         11 2010-02-17 10:48
/user/training/test_input/file2
-rw-r--r--   1 training supergroup         11 2010-02-17 10:48
/user/training/test_input/file3

training-vm: 3$ hadoop dfs -cat /user/training/test_input/*
test_line1
test_line2
test_line3

And where blah.py looks like this (UNIX-formatted):

#!/usr/bin/python
import sys
for line in sys.stdin:
    # each input line keeps its trailing '\n'; write() avoids print's extra newline
    sys.stdout.write(line)
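
(Before submitting, the script can be sanity-checked locally, e.g.:

$ chmod +x blah.py
$ echo test_line1 | ./blah.py
test_line1

which also confirms that the shebang resolves on this machine.)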

The resulting Hadoop Streaming error is:

java.io.IOException: Cannot run program "blah.py":
java.io.IOException: error=2, No such file or directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
	at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
	...


I get the same error when placing the Python script on HDFS and
then using this in the hadoop command:

... -mapper hdfs:///user/training/blah.py ...
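
(For completeness: streaming also has a -cacheFile option that
symlinks an HDFS file into each task's working directory.  A sketch,
assuming the VM's namenode runs at the default localhost:8020 address:

... -cacheFile hdfs://localhost:8020/user/training/blah.py#blah.py -mapper blah.py ...

I haven't verified whether it fares any better here.)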


One suggestion found online, which may not be relevant to Cloudera's
distribution, is that the first line of the streaming Python script
(the shebang line) may not point to a valid interpreter path on the
worker nodes.  The suggested workaround is to use: ... -mapper "python
blah.py" ... in the Hadoop streaming command.  This doesn't work
correctly for me: I find that the lines from the input data files are
also parsed by the Python interpreter.  It does at least show that
python is available on the worker node.  I have also tried, without
success, the '-mapper blah.py' technique with the shebang line
"#!/usr/bin/env python", even though on the Training VM Python is
installed at /usr/bin/python.
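
(A side note on the error itself: a missing shebang interpreter fails
with the same errno 2, "No such file or directory", as a missing
executable, so the stack trace alone can't distinguish the two cases.
A quick sketch with a deliberately bogus interpreter path:

$ printf '#!/no/such/python\nprint "hi"\n' > bogus.py
$ chmod +x bogus.py
$ ./bogus.py
bash: ./bogus.py: /no/such/python: bad interpreter: No such file or directory

Java's ProcessBuilder surfaces the same condition as "error=2, No such
file or directory".)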

Maybe the issue is something else.  Any suggestions or insights will be helpful.

Re: Hadoop Streaming File-not-found error on Cloudera's training VM

Posted by Dan Starr <ds...@gmail.com>.
Todd, Thanks!
This solved it.

-Dan

On Wed, Feb 17, 2010 at 8:00 PM, Todd Lipcon <to...@cloudera.com> wrote:
> Hi Dan,
>
> This is actually a bug in the release you're using. Please run:
>
> $ sudo apt-get update
> $ sudo apt-get install hadoop-0.20
>
> Then restart the daemons (or the entire VM) and give it another go.
>
> Thanks
> -Todd

Re: Hadoop Streaming File-not-found error on Cloudera's training VM

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Dan,

This is actually a bug in the release you're using. Please run:

$ sudo apt-get update
$ sudo apt-get install hadoop-0.20

Then restart the daemons (or the entire VM) and give it another go.
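
If you restart the daemons rather than rebooting, something like the
following should work (the exact init-script names depend on which
daemons the image runs):

$ for svc in /etc/init.d/hadoop-0.20-*; do sudo "$svc" restart; done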

Thanks
-Todd

On Wed, Feb 17, 2010 at 7:56 PM, Dan Starr <ds...@gmail.com> wrote:
> Yes, I have tried that when passing the script.  Just now I tried:
>
> hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar
> -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer
> -input test_input/* -output output -file blah.py
>
> And got this error for a map task:
>
> java.io.IOException: Cannot run program "blah.py":
> java.io.IOException: error=2, No such file or directory
>        at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>        at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
>        at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
>        ...
>
> -Dan

Re: Hadoop Streaming File-not-found error on Cloudera's training VM

Posted by Dan Starr <ds...@gmail.com>.
Yes, I have tried that when passing the script.  Just now I tried:

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-0.20.1+133-streaming.jar \
  -mapper blah.py -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -input test_input/* -output output -file blah.py

And got this error for a map task:

java.io.IOException: Cannot run program "blah.py":
java.io.IOException: error=2, No such file or directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
	at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:214)
	at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66)
        ...

-Dan


On Wed, Feb 17, 2010 at 7:47 PM, Todd Lipcon <to...@cloudera.com> wrote:
> Are you passing the Python script to the cluster using the -file
> option? e.g. -mapper foo.py -file foo.py
>
> Thanks
> -Todd

Re: Hadoop Streaming File-not-found error on Cloudera's training VM

Posted by Todd Lipcon <to...@cloudera.com>.
Are you passing the Python script to the cluster using the -file
option? e.g. -mapper foo.py -file foo.py

Thanks
-Todd
