Posted to common-user@hadoop.apache.org by Xiong Deng <db...@gmail.com> on 2011/09/01 14:01:19 UTC
Problem with Python + Hadoop: how to link .so outside Python?
Hi,
I have successfully installed scipy for my Python 2.7 on my local Linux box, and
I want to pack that Python 2.7 (with scipy) onto Hadoop and run my Python
MapReduce scripts, like this:
${HADOOP_HOME}/bin/hadoop streaming \
    -input "${input}" \
    -output "${output}" \
    -mapper "python27/bin/python27.sh rp_extractMap.py" \
    -reducer "python27/bin/python27.sh rp_extractReduce.py" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -file rp_extractMap.py \
    -file rp_extractReduce.py \
    -file shitu_conf.py \
    -cacheArchive "/share/python27.tar.gz#python27" \
    -outputformat org.apache.hadoop.mapred.TextOutputFormat \
    -inputformat org.apache.hadoop.mapred.CombineTextInputFormat \
    -jobconf mapred.max.split.size="512000000" \
    -jobconf mapred.job.name="[reserve_price][rp_extract]" \
    -jobconf mapred.job.priority=HIGH \
    -jobconf mapred.job.map.capacity=1000 \
    -jobconf mapred.job.reduce.capacity=200 \
    -jobconf mapred.reduce.tasks=200 \
    -jobconf num.key.fields.for.partition=2
I have to do this because the Hadoop cluster has its own Python of a very old
version that may not support some of my scripts, and I do not have the
privilege to install the scipy lib on that server. So I have to use the
-cacheArchive option to ship my own Python 2.7 with scipy.
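For context, python27.sh is just a small wrapper that runs the bundled
interpreter; it is essentially this:

#!/bin/bash
# Rough sketch of the wrapper. Paths are relative to the task's working
# directory, where -cacheArchive unpacks the tarball under "python27".
export PYTHONHOME=python27
exec python27/bin/python2.7 "$@"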
But I found out that some of the .so files in scipy are linked against dynamic
libs outside of Python 2.7. For example:
$ ldd ~/local/python-2.7.2/lib/python2.7/site-packages/scipy/linalg/flapack.so
        liblapack.so => /usr/local/atlas/lib/liblapack.so (0x0000002a956fd000)
        libatlas.so => /usr/local/atlas/lib/libatlas.so (0x0000002a95df3000)
        libgfortran.so.3 => /home/work/local/gcc-4.6.1/lib64/libgfortran.so.3 (0x0000002a9668d000)
        libm.so.6 => /lib64/tls/libm.so.6 (0x0000002a968b6000)
        libgcc_s.so.1 => /home/work/local/gcc-4.6.1/lib64/libgcc_s.so.1 (0x0000002a96a3c000)
        libquadmath.so.0 => /home/work/local/gcc-4.6.1/lib64/libquadmath.so.0 (0x0000002a96b51000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a96c87000)
        libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000002a96ebb000)
        /lib64/ld-linux-x86-64.so.2 (0x000000552aaaa000)
So, my question is: how can I include these libs? Should I search for all the
linked .so and .a files under my local Linux and pack them together with
Python 2.7? If so, how can I get a full list of the libs needed, and how can I
make the packed Python 2.7 know where to find the new libs?
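In case it helps, the naive approach I have in mind is roughly this (untested;
the extra-libs directory name is just something I made up):

# Untested sketch: walk every .so in the bundled Python tree, collect the
# libraries ldd resolves (skipping the core ones under /lib*), and copy
# them into a directory that gets packed into python27.tar.gz as well.
mkdir -p python-2.7.2/extra-libs
find python-2.7.2 -name '*.so' | while read so; do
    ldd "$so" | awk '/=> \//{print $3}'
done | sort -u | grep -v '^/lib' | while read lib; do
    cp -v "$lib" python-2.7.2/extra-libs/
done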
Thanks
Xiong
Re: Problem with Python + Hadoop: how to link .so outside Python?
Posted by Guang-Nan Cheng <ch...@gmail.com>.
You can do it.
If you understand how Hadoop streaming works, you will realize that this is
really a Python question and a Linux question.
Pass the native files via -files and set up the environment variables
via "mapred.child.env".
I've done a similar thing with Ruby. For Ruby, the environment
variables are PATH, GEM_HOME, GEM_PATH, LD_LIBRARY_PATH and RUBYLIB.
-D mapred.child.env=PATH=ruby-1.9.2-p180/bin:'$PATH',GEM_HOME=ruby-1.9.2-p180,LD_LIBRARY_PATH=ruby-1.9.2-p180/lib,GEM_PATH=ruby-1.9.2-p180,RUBYLIB=ruby-1.9.2-p180/lib/ruby/site_ruby/1.9.1:ruby-1.9.2-p180/lib/ruby/site_ruby/1.9.1/x86_64-linux:ruby-1.9.2-p180/lib/ruby/site_ruby:ruby-1.9.2-p180/lib/ruby/vendor_ruby/1.9.1:ruby-1.9.2-p180/lib/ruby/vendor_ruby/1.9.1/x86_64-linux:ruby-1.9.2-p180/lib/ruby/vendor_ruby:ruby-1.9.2-p180/lib/ruby/1.9.1:ruby-1.9.2-p180/lib/ruby/1.9.1/x86_64-linux
\
-files ruby-1.9.2-p180 \
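For your Python case, the same pattern should translate to something like this
(untested; adjust it to however your archive is actually laid out, and note
that "extra-libs" is hypothetical):

# Untested sketch: mapred.child.env takes comma-separated VAR=value pairs;
# LD_LIBRARY_PATH points the runtime loader at the libs shipped inside the
# unpacked archive, and PYTHONHOME points Python at its own tree.
${HADOOP_HOME}/bin/hadoop streaming \
    -D mapred.child.env=LD_LIBRARY_PATH=python27/lib:python27/extra-libs:'$LD_LIBRARY_PATH',PYTHONHOME=python27 \
    -cacheArchive "/share/python27.tar.gz#python27" \
    -mapper "python27/bin/python27.sh rp_extractMap.py" \
    -reducer "python27/bin/python27.sh rp_extractReduce.py"
    # ...keep the rest of the -file/-jobconf options from your original command

One thing to watch: generic options like -D have to come before the
streaming-specific options, or the command will fail to parse.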
On Thu, Sep 1, 2011 at 8:01 PM, Xiong Deng <db...@gmail.com> wrote:
> [...]