Posted to hdfs-user@hadoop.apache.org by Andrei <fa...@gmail.com> on 2013/08/12 08:50:22 UTC

How to import custom Python module in MapReduce job?

(cross-posted from
StackOverflow<http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208>
)

I have a MapReduce job defined in file *main.py*, which imports module lib from
file *lib.py*. I use Hadoop Streaming to submit this job to the Hadoop cluster
as follows:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files lib.py,main.py \
    -mapper "./main.py map" -reducer "./main.py reduce" \
    -input input -output output

In my understanding, this should put both main.py and lib.py into the *distributed
cache folder* on each compute machine and thus make module lib available
to main. But that doesn't happen: from the log file I can see that the files *are
really copied* to the same directory, yet main can't import lib and throws an
*ImportError*.

Adding the current directory to the path didn't work:

import os
import sys

sys.path.append(os.path.realpath(__file__))
import lib  # ImportError

though loading the module manually did the trick:

import imp
lib = imp.load_source('lib', 'lib.py')

But that's not what I want. So why can the Python interpreter see other .py files
in the same directory but not import them? Note that I have already tried
adding an empty __init__.py file to the same directory, to no effect.
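
For reference, sys.path entries are directories, while the snippet above appends a file
path. A minimal sketch of appending the script's directory instead (whether this helps
here depends on the symlink behaviour discussed later in the thread):

import os
import sys

# Append the directory containing this script, not the script file itself
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
import lib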

Re: How to import custom Python module in MapReduce job?

Posted by Andrei <fa...@gmail.com>.
For some reason using the -archives option leads to "Error in configuring
object" without any further information. However, I found out that the -files
option works well for this purpose. I was able to run my example as
follows.

1. I put `main.py` and `lib.py` into an `app` directory.
2. In `main.py` I used `lib.py` directly, that is, the import statement is just

    import lib

3. Instead of uploading to HDFS and using the -archives option, I just pointed
the -files option at the `app` directory:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files app \
        -mapper "app/main.py map" -reducer "app/main.py reduce" \
        -input input -output output

It did the trick. Note that I tested with both CPython (2.6) and PyPy
(1.9), so I think it's quite safe to assume this approach works for Python
scripts.
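
For context, a minimal sketch of what the dispatch in app/main.py might look like
under this layout (the helper names in lib are hypothetical, not from the thread):

#!/usr/bin/env python
# app/main.py -- dispatch to map or reduce based on the first argument
import sys

import lib  # resolves because lib.py sits next to main.py in the app directory

def run_map():
    for line in sys.stdin:
        for key, value in lib.map_record(line):  # map_record is a hypothetical helper
            sys.stdout.write("%s\t%s\n" % (key, value))

def run_reduce():
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        sys.stdout.write("%s\t%s\n" % (key, lib.reduce_value(key, value)))  # hypothetical helper

if __name__ == "__main__":
    if sys.argv[1] == "map":
        run_map()
    else:
        run_reduce()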

Thanks for your help, Binglin, without it I wouldn't have been able to figure
it out.




On Mon, Aug 12, 2013 at 1:12 PM, Binglin Chang <de...@gmail.com> wrote:

> Maybe you doesn't specify symlink name in you cmd line, so the symlink
> name will be just lib.jar, so I am not sure how you import lib module in
> your main.py file.
> Please try this:
> put main.py lib.py in same jar file, e.g.  app.zip
> *-archives hdfs://hdfs-namenode/user/me/app.zip#app* -mapper "app/main.py
> map" -reducer "app/main.py reduce"
> in main.py:
> import app.lib
> or:
> import .lib
>
>

Re: How to import custom Python module in MapReduce job?

Posted by Binglin Chang <de...@gmail.com>.
Maybe you didn't specify a symlink name in your command line, so the symlink name
will be just lib.jar; in that case I am not sure how you import the lib module in
your main.py file.
Please try this:
put main.py and lib.py in the same archive file, e.g. app.zip
-archives hdfs://hdfs-namenode/user/me/app.zip#app -mapper "app/main.py map" -reducer "app/main.py reduce"
in main.py:
import app.lib
or:
from . import lib
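
A minimal sketch of an alternative inside app/main.py that avoids package-style
imports: put the script's own directory on sys.path and import lib as a plain module
(this variant is an assumption, not something confirmed in the thread):

# top of app/main.py
import os
import sys

# Make the directory that holds main.py (the unpacked app/ dir) importable
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
import lib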




On Mon, Aug 12, 2013 at 6:01 PM, Andrei <fa...@gmail.com> wrote:

> Hi Binglin,
>
> thanks for your explanation, now it makes sense. However, I'm not sure how
> to implement suggested method with.
>
> First of all, I found out that `-cachArchive` option is deprecated, so I
> had to use `-archives` instead. I put my `lib.py` to directory `lib` and
> then zipped it to `lib.zip`. After that I uploaded archive to HDFS and
>  linked it in call to Streaming API as follows:
>
>   hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar  -files
> main.py *-archives hdfs://hdfs-namenode/user/me/lib.jar* -mapper
> "./main.py map" -reducer "./main.py reduce" -combiner "./main.py combine"
> -input input -output output
>
> But script failed, and from logs I see that lib.jar hasn't been unpacked.
> What am I missing?
>
>
>
>
> On Mon, Aug 12, 2013 at 11:33 AM, Binglin Chang <de...@gmail.com>wrote:
>
>> Hi,
>>
>> The problem seems to caused by symlink, hadoop uses file cache, so every
>> file is in fact a symlink.
>>
>> lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py ->
>> /root/hadoop3/data/nodemanager/usercache/root/filecache/13/lib.py
>> lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py ->
>> /root/hadoop3/data/nodemanager/usercache/root/filecache/12/main.py
>> [root@master01 tmp]# ./main.py
>> Traceback (most recent call last):
>>   File "./main.py", line 3, in ?
>>     import lib
>> ImportError: No module named lib
>>
>> This should be a python bug: when using import, it can't handle symlink
>>
>> You can try to use a directory containing lib.py and use -cacheArchive,
>> so the symlink actually links to a directory, python may handle this case
>> well.
>>
>> Thanks,
>> Binglin
>>
>>
>>
>> On Mon, Aug 12, 2013 at 2:50 PM, Andrei <fa...@gmail.com>wrote:
>>
>>> (cross-posted from StackOverflow<http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208>
>>> )
>>>
>>> I have a MapReduce job defined in file *main.py*, which imports module
>>> lib from file *lib.py*. I use Hadoop Streaming to submit this job to
>>> Hadoop cluster as follows:
>>>
>>> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
>>>
>>>     -files lib.py,main.py
>>>     -mapper "./main.py map" -reducer "./main.py reduce"
>>>     -input input -output output
>>>
>>>  In my understanding, this should put both main.py and lib.py into *distributed
>>> cache folder* on each computing machine and thus make module lib available
>>> to main. But it doesn't happen - from log file I see, that files *are
>>> really copied* to the same directory, but main can't import lib,
>>> throwing*ImportError*.
>>>
>>> Adding current directory to the path didn't work:
>>>
>>> import sys
>>> sys.path.append(os.path.realpath(__file__))import lib# ImportError
>>>
>>> though, loading module manually did the trick:
>>>
>>> import imp
>>> lib = imp.load_source('lib', 'lib.py')
>>>
>>>  But that's not what I want. So why Python interpreter can see other .py files
>>> in the same directory, but can't import them? Note, I have already tried
>>> adding empty __init__.py file to the same directory without effect.
>>>
>>>
>>>
>>
>

Re: How to import custom Python module in MapReduce job?

Posted by Andrei <fa...@gmail.com>.
Hi Binglin,

thanks for your explanation, now it makes sense. However, I'm not sure how
to implement the suggested method.

First of all, I found out that the `-cacheArchive` option is deprecated, so I
had to use `-archives` instead. I put my `lib.py` into directory `lib` and
then zipped it to `lib.zip`. After that I uploaded the archive to HDFS and
referenced it in the call to the Streaming API as follows:

  hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files main.py \
      -archives hdfs://hdfs-namenode/user/me/lib.jar \
      -mapper "./main.py map" -reducer "./main.py reduce" -combiner "./main.py combine" \
      -input input -output output

But the script failed, and from the logs I see that lib.jar hasn't been unpacked.
What am I missing?
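
For reference, a minimal sketch of building such an archive so that it unpacks to a
lib/ directory containing lib.py (file names as in the message above; the #name fragment
on -archives, discussed elsewhere in the thread, controls the unpacked directory's link name):

import os
import zipfile

# Package lib/lib.py into lib.zip so the archive expands to lib/lib.py
zf = zipfile.ZipFile("lib.zip", "w", zipfile.ZIP_DEFLATED)
zf.write(os.path.join("lib", "lib.py"), arcname="lib/lib.py")
zf.close()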




On Mon, Aug 12, 2013 at 11:33 AM, Binglin Chang <de...@gmail.com> wrote:

> Hi,
>
> The problem seems to caused by symlink, hadoop uses file cache, so every
> file is in fact a symlink.
>
> lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py ->
> /root/hadoop3/data/nodemanager/usercache/root/filecache/13/lib.py
> lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py ->
> /root/hadoop3/data/nodemanager/usercache/root/filecache/12/main.py
> [root@master01 tmp]# ./main.py
> Traceback (most recent call last):
>   File "./main.py", line 3, in ?
>     import lib
> ImportError: No module named lib
>
> This should be a python bug: when using import, it can't handle symlink
>
> You can try to use a directory containing lib.py and use -cacheArchive,
> so the symlink actually links to a directory, python may handle this case
> well.
>
> Thanks,
> Binglin
>
>
>
> On Mon, Aug 12, 2013 at 2:50 PM, Andrei <fa...@gmail.com> wrote:
>
>> (cross-posted from StackOverflow<http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208>
>> )
>>
>> I have a MapReduce job defined in file *main.py*, which imports module
>> lib from file *lib.py*. I use Hadoop Streaming to submit this job to
>> Hadoop cluster as follows:
>>
>> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
>>
>>     -files lib.py,main.py
>>     -mapper "./main.py map" -reducer "./main.py reduce"
>>     -input input -output output
>>
>>  In my understanding, this should put both main.py and lib.py into *distributed
>> cache folder* on each computing machine and thus make module lib available
>> to main. But it doesn't happen - from log file I see, that files *are
>> really copied* to the same directory, but main can't import lib, throwing
>> *ImportError*.
>>
>> Adding current directory to the path didn't work:
>>
>> import sys
>> sys.path.append(os.path.realpath(__file__))import lib# ImportError
>>
>> though, loading module manually did the trick:
>>
>> import imp
>> lib = imp.load_source('lib', 'lib.py')
>>
>>  But that's not what I want. So why Python interpreter can see other .py files
>> in the same directory, but can't import them? Note, I have already tried
>> adding empty __init__.py file to the same directory without effect.
>>
>>
>>
>

Re: How to import custom Python module in MapReduce job?

Posted by Binglin Chang <de...@gmail.com>.
Hi,

The problem seems to be caused by symlinks: hadoop uses a file cache, so every
file is in fact a symlink.

lrwxrwxrwx 1 root root 65 Aug 12 15:22 lib.py ->
/root/hadoop3/data/nodemanager/usercache/root/filecache/13/lib.py
lrwxrwxrwx 1 root root 66 Aug 12 15:23 main.py ->
/root/hadoop3/data/nodemanager/usercache/root/filecache/12/main.py
[root@master01 tmp]# ./main.py
Traceback (most recent call last):
  File "./main.py", line 3, in ?
    import lib
ImportError: No module named lib

This looks like a Python bug: when importing, it can't handle the symlink.
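
A quick diagnostic along these lines (a sketch) can be run from inside the mapper to see
what the task directory actually contains; output goes to stderr so it shows up in the
task logs rather than in the job output:

import os
import sys

sys.stderr.write("sys.path[0]=%s cwd=%s\n" % (sys.path[0], os.getcwd()))
for name in ("main.py", "lib.py"):
    if os.path.lexists(name):
        sys.stderr.write("%s islink=%s -> %s\n"
                         % (name, os.path.islink(name), os.path.realpath(name)))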

You can try putting lib.py in a directory and using -cacheArchive, so the symlink
actually points to a directory; Python may handle that case well.

Thanks,
Binglin



On Mon, Aug 12, 2013 at 2:50 PM, Andrei <fa...@gmail.com> wrote:

> (cross-posted from StackOverflow<http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208>
> )
>
> I have a MapReduce job defined in file *main.py*, which imports module lib from
> file *lib.py*. I use Hadoop Streaming to submit this job to Hadoop
> cluster as follows:
>
> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar
>
>     -files lib.py,main.py
>     -mapper "./main.py map" -reducer "./main.py reduce"
>     -input input -output output
>
>  In my understanding, this should put both main.py and lib.py into *distributed
> cache folder* on each computing machine and thus make module lib available
> to main. But it doesn't happen - from log file I see, that files *are
> really copied* to the same directory, but main can't import lib, throwing*
> ImportError*.
>
> Adding current directory to the path didn't work:
>
> import sys
> sys.path.append(os.path.realpath(__file__))import lib# ImportError
>
> though, loading module manually did the trick:
>
> import imp
> lib = imp.load_source('lib', 'lib.py')
>
>  But that's not what I want. So why Python interpreter can see other .py files
> in the same directory, but can't import them? Note, I have already tried
> adding empty __init__.py file to the same directory without effect.
>
>
>
