Posted to user@spark.apache.org by Elkhan Dadashov <el...@gmail.com> on 2015/07/17 20:23:56 UTC

Has anyone run a Python Spark application in yarn-cluster mode? (one that ships 3rd-party Python modules, e.g. numpy, with it)

Hi all,

After the SPARK-5479 <https://issues.apache.org/jira/browse/SPARK-5479> fix
(thanks to Marcelo Vanzin), pyspark now correctly adds several Python files
(or a zip of a package with __init__.py) to PYTHONPATH in yarn-cluster
mode.
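
For example, a pure-Python package zipped together with its __init__.py can
now be imported on the executors. A minimal sketch, assuming a hypothetical
pure-Python package packed as mylib.zip and an existing SparkContext sc
(sc.addPyFile uses the same distribution mechanism as --py-files):

# mylib.zip is assumed to contain mylib/__init__.py (pure Python only)
sc.addPyFile('mylib.zip')

def use_mylib(x):
    import mylib  # resolved from the shipped zip on the executor
    return (mylib.__name__, x)

print(sc.parallelize([1, 2]).map(use_mylib).collect())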

But adding a Python module as a zip still fails if that zip contains file
types other than plain Python files (compiled byte code or C code).

For example, passing the numpy package (downloaded as numpy-1.9.2.zip from
this link <https://pypi.python.org/pypi/numpy>) to the --py-files flag does
not work - the job fails on the 'import numpy' line.
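
A quick way to confirm this is to list the compiled/C sources inside the
archive. This is just a local inspection sketch, not any Spark API:

import zipfile

def native_files(path):
    # Extensions that need to be built/installed and cannot be shipped
    # as-is through --py-files.
    exts = ('.so', '.pyd', '.c', '.h', '.pyx')
    with zipfile.ZipFile(path) as zf:
        return [n for n in zf.namelist() if n.endswith(exts)]

# Non-empty for numpy-1.9.2.zip, which contains C sources.
print(native_files('numpy-1.9.2.zip'))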

The numpy module needs to be *installed* before it can be imported in a
Spark Python script.
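
One way to check whether numpy is actually installed on the worker nodes is
to attempt the import on the executors themselves. A sketch, assuming an
existing SparkContext sc:

def check_numpy(_):
    # Runs on the executors, not on the driver.
    try:
        import numpy
        return [numpy.__version__]
    except ImportError:
        return ['numpy missing']

print(sc.parallelize(range(4), 4).mapPartitions(check_numpy).collect())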

So does that mean the required Python modules need to be installed on all
machines before using pyspark?

Or what is the best pattern for using a 3rd-party Python module in a Spark
Python job?

Thanks.


On Thu, Jun 25, 2015 at 12:55 PM, Marcelo Vanzin <va...@cloudera.com>
wrote:

> Please take a look at the pull request with the actual fix; that will
> explain why it's the same issue.
>
> On Thu, Jun 25, 2015 at 12:51 PM, Elkhan Dadashov <el...@gmail.com>
> wrote:
>
>> Thanks Marcelo.
>>
>> But my case is different. My mypython/libs/numpy-1.9.2.zip is in a *local
>> directory* (it can also be put in HDFS), but it still fails.
>>
>> But SPARK-5479 <https://issues.apache.org/jira/browse/SPARK-5479> is about:
>> PySpark in yarn mode needs to support *non-local* Python files.
>>
>> The job fails only when I try to include a 3rd-party dependency from the
>> local machine with --py-files (in Spark 1.4).
>>
>> Both of these commands succeed:
>>
>> ./bin/spark-submit --master yarn-cluster --verbose hdfs:///pi.py
>> ./bin/spark-submit --master yarn-cluster --deploy-mode cluster  --verbose
>> examples/src/main/python/pi.py
>>
>> But this particular example with the 3rd-party numpy module fails:
>>
>> ./bin/spark-submit --verbose --master yarn-cluster --py-files
>>  mypython/libs/numpy-1.9.2.zip --deploy-mode cluster
>> mypython/scripts/kmeans.py /kmeans_data.txt 5 1.0
>>
>>
>> Of these files, mypython/libs/numpy-1.9.2.zip and
>> mypython/scripts/kmeans.py are local files; kmeans_data.txt is in HDFS.
>>
>>
>> Thanks.
>>
>>
>> On Thu, Jun 25, 2015 at 12:22 PM, Marcelo Vanzin <va...@cloudera.com>
>> wrote:
>>
>>> That sounds like SPARK-5479, which is not in 1.4...
>>>
>>> On Thu, Jun 25, 2015 at 12:17 PM, Elkhan Dadashov <el...@gmail.com>
>>> wrote:
>>>
>>>> In addition to my previous emails: when I try to execute this command
>>>> from the command line:
>>>>
>>>> ./bin/spark-submit --verbose --master yarn-cluster --py-files
>>>>  mypython/libs/numpy-1.9.2.zip --deploy-mode cluster
>>>> mypython/scripts/kmeans.py /kmeans_data.txt 5 1.0
>>>>
>>>>
>>>> - numpy-1.9.2.zip - the downloaded numpy package
>>>> - kmeans.py - the default k-means example that ships with Spark 1.4
>>>> - kmeans_data.txt - the default data file that ships with Spark 1.4
>>>>
>>>>
>>>> It fails, saying that it could not find numpy:
>>>>
>>>> File "kmeans.py", line 31, in <module>
>>>>     import numpy
>>>> ImportError: No module named numpy
>>>>
>>>> Has anyone run a Python Spark application in yarn-cluster mode (one
>>>> that ships 3rd-party Python modules with it)?
>>>>
>>>> What configuration or installation steps are needed before running a
>>>> Python Spark job with 3rd-party dependencies in yarn-cluster mode?
>>>>
>>>> Thanks in advance.
>>>>
>>>>
>>> --
>>> Marcelo
>>>
>>
>>
>>
>> --
>>
>> Best regards,
>> Elkhan Dadashov
>>
>
>
>
> --
> Marcelo
>



-- 

Best regards,
Elkhan Dadashov