Posted to common-user@hadoop.apache.org by Martin Blom <va...@gmail.com> on 2008/05/23 01:39:43 UTC

Import path for hadoop streaming with python

Hello all,

I'm trying to stream a little python script on my small hadoop
cluster, and it doesn't work like I thought it would.

The script looks something like

#!/usr/bin/env python
import mylib
dostuff

where mylib is a small python library that I want included, and I
launch the whole thing with something like

bin/hadoop jar contrib/streaming/hadoop-0.16.4-streaming.jar
-cacheFile "hdfs://master:54310/user/hadoop/mylib.py#mylib.py" -file
script.py -mapper "script.py" -input input -output output

so it seems to me like the library should be available to the script.
When I run the script locally on my machine everything works perfectly
fine. However, when I run it on the cluster the script can't find the library.
Does hadoop do anything strange to default paths? Am I missing
something obvious? Any pointers or ideas on how to fix this would be
great.

Martin Blom

Re: Import path for hadoop streaming with python

Posted by Saptarshi Guha <sa...@gmail.com>.
I haven't done this using Hadoop, but before 0.16.4 I had written my
own distributed batch processor using HDFS as common file storage
and remote execution of Python scripts.
They all required a custom module, which was copied to the remote temp
folders (a primitive implementation of cacheFile).

So this is what I did:  just after #!/usr/bin/env python

import sys
sys.path.append('.')
import mylib
dostuff

so that your module can be found in the current working directory.
It should work thereafter
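
A slightly more defensive version of the same idea, as a rough sketch. Python puts the directory containing the script at the front of sys.path, not the current working directory, so if the task runs script.py from one place while the cached mylib.py is linked into the working directory, a plain "import mylib" can fail. That is only a guess at what happens on this cluster. The helper names add_import_dirs and describe_environment are made up for illustration and are not part of Hadoop or Python.

```python
#!/usr/bin/env python
import os
import sys


def add_import_dirs(extra_dirs=None):
    """Put the current working directory (and any extra dirs) at the
    front of sys.path so modules shipped next to the task can be imported.

    Returns the list of directories that were newly added.
    """
    candidates = [os.getcwd()] + list(extra_dirs or [])
    added = []
    for d in candidates:
        d = os.path.abspath(d)
        if d not in sys.path:
            # insert(0, ...) so these directories are searched first
            sys.path.insert(0, d)
            added.append(d)
    return added


def describe_environment():
    """Return a short report of the working directory, the files in it,
    and the import path, useful for writing to stderr in a task."""
    return "cwd=%s\nfiles=%s\nsys.path=%s" % (
        os.getcwd(),
        sorted(os.listdir(".")),
        sys.path,
    )
```

In the streaming script you would call add_import_dirs() before "import mylib". If the import still fails, writing describe_environment() to sys.stderr should show whether mylib.py is actually present in the task's working directory; where that stderr output ends up depends on the cluster setup.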
Regards
Saptarshi

Saptarshi Guha | saptarshi.guha@gmail.com | http://www.stat.purdue.edu/~sguha