Posted to user@spark.apache.org by Daniel Rodriguez <df...@gmail.com> on 2014/09/02 18:31:41 UTC

Spark on Mesos: Pyspark python libraries

Hi all,

I am getting started with Spark and Mesos. I already have Spark running on
a Mesos cluster and I am able to start the Scala Spark and PySpark shells,
yay! I still have questions about how to distribute 3rd party Python
libraries, since I want to use things like nltk and MLlib on PySpark, which
requires numpy.

I am using Salt for configuration management, so it is really easy for
me to create an Anaconda virtual environment and install all the libraries
there on each Mesos slave.

My main question is whether that is the recommended way of handling 3rd party
libraries.
If the answer is yes, how do I tell PySpark to use that virtual
environment (and not the default Python) on the Spark workers?

I noticed that there are addFile and addPyFile methods on the
SparkContext, but I don't want to distribute the libraries every single time
if I can just do that once by writing some Salt states. I am
especially worried about numpy and its requirements.

Hopefully this makes some sense.

Thanks,
Daniel Rodriguez

Re: Spark on Mesos: Pyspark python libraries

Posted by Davies Liu <da...@databricks.com>.
PYSPARK_PYTHON may work for you; it's used to specify which Python
interpreter should be used in both the driver and the workers. For example,
if Anaconda is installed at /anaconda on all the machines, then you can set
PYSPARK_PYTHON=/anaconda/bin/python to use the Anaconda environment in
PySpark:

PYSPARK_PYTHON=/anaconda/bin/python spark-submit xxxx.py

Or, if you want to use it by default, you can export this environment
variable somewhere that is picked up on every node (for example, conf/spark-env.sh):

export PYSPARK_PYTHON=/anaconda/bin/python
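
If you want to double-check that the executors really pick up that
interpreter, you can look at sys.executable from a task. A minimal sketch,
assuming a running SparkContext sc (e.g. in the pyspark shell):

    import sys

    # each partition reports which Python interpreter the executor is running;
    # with PYSPARK_PYTHON set, this should print the /anaconda/bin/python path
    print(sc.parallelize(range(4), 4).map(lambda _: sys.executable).distinct().collect())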

