Posted to user@spark.apache.org by rajat kumar <ku...@gmail.com> on 2021/01/16 18:16:23 UTC

Running pyspark job from virtual environment

Hey Users,

I want to run a Spark job from a virtual environment using Python.

Please note I am creating the virtual env using python3 -m venv env.

I see that there are 3 Python-related variables which we have to set:
PYTHONPATH
PYSPARK_DRIVER_PYTHON
PYSPARK_PYTHON

I have 2 questions:
1. If I want to use the virtual env, do I need to point all of these
variables to the virtual environment's Python?
2. Should I set these variables in spark-env.sh, or should I set them
using export statements?

Regards
Rajat

Re: Running pyspark job from virtual environment

Posted by Mich Talebzadeh <mi...@gmail.com>.
Well, when you or an application logs in to a Linux host (whether a
physical tin box or a virtual node), the shell executes a script called
.bashrc in the home directory.

If it is a scheduled job, it will execute the same script as well,
provided the scheduler invokes a login shell (cron on its own does not
source .bashrc).

In my Google Dataproc cluster of three nodes (one master and two
workers), I automatically activate the virtual environment on the master
node as below:

cd /usr/src/Python-3.7.9/environments; source virtualenv/bin/activate

Then I execute the spark-submit script as follows:

spark-submit \
 --master yarn \
 --deploy-mode client \
 --jars /home/hduser/jars/spark-bigquery-latest.jar \
   analyze_house_prices_GCP.py
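
For a scheduled run, a minimal sketch of a wrapper that does the
activation and the submission in one place (the wrapper name, schedule,
and log path are illustrative; the venv and jar paths are mine from
above):

#!/bin/bash
# run_job.sh -- activate the venv, then submit the job
cd /usr/src/Python-3.7.9/environments
source virtualenv/bin/activate
spark-submit \
 --master yarn \
 --deploy-mode client \
 --jars /home/hduser/jars/spark-bigquery-latest.jar \
   analyze_house_prices_GCP.py

A crontab entry can then call the wrapper directly:

# illustrative: run daily at 02:00, appending output to a log
0 2 * * * /home/hduser/run_job.sh >> /home/hduser/run_job.log 2>&1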

Note that, as I understand it, using the virtual environment is only
necessary on the master node; I don't touch the worker nodes.

My suggestion is that you first test running a script interactively on
your master node once you have activated your virtual environment, and
see how it goes.

HTH




Re: Running pyspark job from virtual environment

Posted by rajat kumar <ku...@gmail.com>.
Hi Mich,

Thanks for the response. I am running it through the CLI (on the cluster).

Since this will be a scheduled job, I do not want to activate the
environment manually. It should automatically pick up the path of the
virtual environment to run the job.

For that, I saw the 3 properties which I mentioned. I think setting some
of them to point to the environment's binary will help to run the job
from the venv, as sketched after this list:

PYTHONPATH
PYSPARK_DRIVER_PYTHON
PYSPARK_PYTHON
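
For example, a minimal sketch of what I mean (the venv path and job name
are illustrative):

# driver-side interpreter (client mode runs the driver on the submitting host)
export PYSPARK_DRIVER_PYTHON=/home/hduser/env/bin/python
# executor-side interpreter (the path must exist on the worker nodes too)
export PYSPARK_PYTHON=/home/hduser/env/bin/python
spark-submit --master yarn --deploy-mode client my_job.py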

Also, do these have to be set in spark-env.sh or in the .bashrc file?
What is the difference between spark-env.sh and .bashrc?

Thanks
Rajat




Re: Running pyspark job from virtual environment

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Rajat,

Are you running this through an IDE like PyCharm, or on the CLI?

If you already have a Python virtual environment, then just activate it.

The only env variable you need to set is PYTHONPATH, which you can
export in your startup shell script, .bashrc etc.
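
For instance, a minimal sketch of such a .bashrc entry (the SPARK_HOME
location and the py4j version are illustrative; check the actual zip name
under $SPARK_HOME/python/lib on your install):

# make the PySpark libraries importable by your Python
export SPARK_HOME=/opt/spark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH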

Once you are in the virtual environment, you run:

$SPARK_HOME/bin/spark-submit <python file>

Alternatively you can chmod +x <python file> and add the following line
at the top of the file:

#!/usr/bin/env python3

and then you can run it as:

./<python file>
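
Concretely, a minimal sketch of that flow (the venv path and file name
are illustrative):

source /home/hduser/env/bin/activate   # puts the venv's bin/ first on PATH
chmod +x my_job.py                     # my_job.py starts with #!/usr/bin/env python3
./my_job.py                            # env now resolves python3 to the venv's interpreter

This works because activating the venv prepends its bin directory to
PATH, so /usr/bin/env picks up the venv's python3 first.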

HTH




Re: Running pyspark job from virtual environment

Posted by rajat kumar <ku...@gmail.com>.
Hello,

Can anyone confirm here please?

Regards
Rajat
