Posted to dev@spark.apache.org by Kevin Su <pi...@gmail.com> on 2022/12/12 22:29:07 UTC

How can I get the same spark context in two different python processes

Hey there, how can I get the same spark context in two different python
processes?
Let’s say I create a context in Process A, and then I want to use python
subprocess B to get the spark context created by Process A. How can I
achieve that?

I've tried pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(),
but it will create a new spark context.

Re: How can I get the same spark context in two different python processes

Posted by Kevin Su <pi...@gmail.com>.
Hi Jack,

My use case is a bit different: I created a subprocess instead of a thread, so I
can't pass the args to the subprocess.
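
To illustrate (a minimal sketch, using multiprocessing and a made-up my_func
rather than my actual code): the threading version works because both threads
share one in-memory SparkSession, but handing that object to another process
would mean serializing it, and PySpark does not allow its context to be pickled.

import multiprocessing as mp

from pyspark.sql import SparkSession


def my_func(spark: SparkSession) -> None:
    # Fine when called from a thread in the same process.
    spark.range(10).show()


if __name__ == "__main__":
    spark = SparkSession.builder.appName("spark").getOrCreate()

    # Threads can share the driver's SparkSession object directly, but a child
    # process cannot: with the "spawn" start method the arguments have to be
    # pickled, and SparkContext/SparkSession are not picklable, so this
    # start() is expected to raise.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=my_func, args=(spark,))
    p.start()
    p.join()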

Jack Goodson <ja...@gmail.com> wrote on Mon, Dec 12, 2022 at 8:03 PM:

> apologies, the code should read as below
>
> from threading import Thread
>
> context = pyspark.sql.SparkSession.builder.appName("spark").getOrCreate()
>
> t1 = Thread(target=my_func, args=(context,))
> t1.start()
>
> t2 = Thread(target=my_func, args=(context,))
> t2.start()
>
> On Tue, Dec 13, 2022 at 4:10 PM Jack Goodson <ja...@gmail.com>
> wrote:
>
>> Hi Kevin,
>>
>> I had a similar use case (see below code) but with something that wasn’t
>> spark related. I think the below should work for you; you may need to edit
>> the context variable to suit your needs but hopefully it gives the general
>> idea of sharing a single object between multiple threads.
>>
>> Thanks
>>
>>
>> from threading import Thread
>>
>> context = pyspark.sql.SparkSession.builder.appName("spark").getOrCreate()
>>
>> t1 = Thread(target=order_creator, args=(app_id, sleep_time,))
>> t1.start(target=my_func, args=(context,))
>>
>> t2 = Thread(target=order_creator, args=(app_id, sleep_time,))
>> t2.start(target=my_func, args=(context,))
>>
>

Re: How can I get the same spark context in two different python processes

Posted by Jack Goodson <ja...@gmail.com>.
apologies, the code should read as below

from threading import Thread

context = pyspark.sql.SparkSession.builder.appName("spark").getOrCreate()

t1 = Thread(target=my_func, args=(context,))
t1.start()

t2 = Thread(target=my_func, args=(context,))
t2.start()
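
For completeness, a self-contained version of the above (the import, the
placeholder my_func, and the join calls are additions for illustration; put
whatever work your threads actually need to do inside my_func):

import pyspark
from threading import Thread


def my_func(spark):
    # Placeholder workload: each thread submits its own jobs through the one
    # shared SparkSession; Spark's scheduler accepts jobs from multiple threads.
    spark.range(100).selectExpr("sum(id)").show()


context = pyspark.sql.SparkSession.builder.appName("spark").getOrCreate()

t1 = Thread(target=my_func, args=(context,))
t1.start()

t2 = Thread(target=my_func, args=(context,))
t2.start()

# Wait for both threads before letting the driver exit.
t1.join()
t2.join()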

On Tue, Dec 13, 2022 at 4:10 PM Jack Goodson <ja...@gmail.com> wrote:

> Hi Kevin,
>
> I had a similar use case (see below code) but with something that wasn’t
> spark related. I think the below should work for you; you may need to edit
> the context variable to suit your needs but hopefully it gives the general
> idea of sharing a single object between multiple threads.
>
> Thanks
>
>
> from threading import Thread
>
> context = pyspark.sql.SparkSession.builder.appName("spark").getOrCreate()
>
> t1 = Thread(target=order_creator, args=(app_id, sleep_time,))
> t1.start(target=my_func, args=(context,))
>
> t2 = Thread(target=order_creator, args=(app_id, sleep_time,))
> t2.start(target=my_func, args=(context,))
>

Re: How can I get the same spark context in two different python processes

Posted by Jack Goodson <ja...@gmail.com>.
Hi Kevin,

I had a similar use case (see below code) but with something that wasn’t
spark related. I think the below should work for you; you may need to edit
the context variable to suit your needs but hopefully it gives the general
idea of sharing a single object between multiple threads.

Thanks


from threading import Thread

context = pyspark.sql.SparkSession.builder.appName("spark").getOrCreate()

t1 = Thread(target=order_creator, args=(app_id, sleep_time,))
t1.start(target=my_func, args=(context,))

t2 = Thread(target=order_creator, args=(app_id, sleep_time,))
t2.start(target=my_func, args=(context,))

Re: How can I get the same spark context in two different python processes

Posted by Maciej <ms...@gmail.com>.
Hi,

Unfortunately, I don't have a working example I could share at hand, but
the flow will be roughly like this:

- Retrieve the existing Python ClientServer (gateway) from the SparkContext
- Get its gateway_parameters (some are constant for PySpark, but you'll
need at least the port and auth_token)
- Pass these to the new process and use them to initialize a new ClientServer there
- From that ClientServer's jvm, retrieve the binding for the JVM SparkContext
- Use the JVM binding and the gateway to initialize a Python SparkContext in your
process.
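
A rough, untested sketch of those steps (the driver.py/child.py file names, the
use of the PYSPARK_GATEWAY_PORT/PYSPARK_GATEWAY_SECRET environment variables to
hand over the connection details, and the JVM calls are my assumptions here;
everything below leans on PySpark/Py4j internals, not on a public API):

# driver.py: the process that already owns the SparkContext
import os
import subprocess
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark").getOrCreate()
gw_params = spark.sparkContext._gateway.gateway_parameters  # Py4j internals

# Hand the port and token to the child; treat the token like a password.
env = dict(
    os.environ,
    PYSPARK_GATEWAY_PORT=str(gw_params.port),
    PYSPARK_GATEWAY_SECRET=gw_params.auth_token,
)
subprocess.run([sys.executable, "child.py"], env=env, check=True)

# child.py: attaches to the parent's JVM instead of launching a second one
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.java_gateway import launch_gateway
from pyspark.sql import SparkSession

# With PYSPARK_GATEWAY_PORT/SECRET set, launch_gateway() connects to the
# existing gateway instead of spawning a new JVM.
gateway = launch_gateway()
jvm = gateway.jvm

# Wrap the JVM-side SparkContext that the parent already created.
jsc = jvm.org.apache.spark.api.java.JavaSparkContext(
    jvm.org.apache.spark.SparkContext.getOrCreate()
)
sc = SparkContext(conf=SparkConf(_jconf=jsc.getConf()), gateway=gateway, jsc=jsc)

spark = SparkSession(sc)  # Python-side session backed by the shared context
print(spark.range(10).count())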


Just to reiterate: this is not something that we support (or Py4j, for
that matter), so don't do it unless you fully understand the implications
(including, but not limited to, the risk of leaking the token). Use this
approach at your own risk.


On 12/13/22 03:52, Kevin Su wrote:
> Maciej, Thanks for the reply.
> Could you share an example to achieve it?
> 
> Maciej <mszymkiewicz@gmail.com> wrote on Mon, Dec 12, 2022 at 4:41 PM:
> 
>     Technically speaking, it is possible in stock distribution (can't speak
>     for Databricks) and not super hard to do (just check out how we
>     initialize sessions), but definitely not something that we test or
>     support, especially in a scenario you described.
> 
>     If you want to achieve concurrent execution, multithreading is normally
>     more than sufficient and avoids problems with the context.
> 
> 
> 
>     On 12/13/22 00:40, Kevin Su wrote:
>      > I ran my spark job by using databricks job with a single python
>     script.
>      > IIUC, the databricks platform will create a spark context for this
>      > python script.
>      > However, I create a new subprocess in this script and run some spark
>      > code in this subprocess, but this subprocess can't find the
>      > context created by databricks.
>      > Not sure if there is any api I can use to get the default context.
>      >
>      > bo yang <bobyangbo@gmail.com> wrote on Mon, Dec 12, 2022 at 3:27 PM:
>      >
>      >     In theory, maybe a Jupyter notebook or something similar could
>      >     achieve this? e.g. running some Jupyter kernel inside Spark driver,
>      >     then another Python process could connect to that kernel.
>      >
>      >     But in the end, this is like Spark Connect :)
>      >
>      >
>      >     On Mon, Dec 12, 2022 at 2:55 PM Kevin Su <pingsutw@gmail.com> wrote:
>      >
>      >         Also, is there any way to work around this issue without
>      >         using Spark Connect?
>      >
>      >         Kevin Su <pingsutw@gmail.com> wrote on Mon, Dec 12, 2022 at 2:52 PM:
>      >
>      >             nvm, I found the ticket.
>      >             Also, is there any way to work around this issue without
>      >             using Spark Connect?
>      >
>      >             Kevin Su <pingsutw@gmail.com> wrote on Mon, Dec 12, 2022 at 2:42 PM:
>      >
>      >                 Thanks for the quick response! Do we have any PR or Jira
>      >                 ticket for it?
>      >
>      >                 Reynold Xin <rxin@databricks.com> wrote on Mon, Dec 12, 2022 at 2:39 PM:
>      >
>      >                     Spark Connect :)
>      >
>      >                     (It’s work in progress)
>      >
>      >
>      >                     On Mon, Dec 12 2022 at 2:29 PM, Kevin Su
>      >                     <pingsutw@gmail.com> wrote:
>      >
>      >                         Hey there, how can I get the same spark context
>      >                         in two different python processes?
>      >                         Let’s say I create a context in Process A, and
>      >                         then I want to use python subprocess B to get
>      >                         the spark context created by Process A. How can
>      >                         I achieve that?
>      >
>      >                         I've tried
>      >                         pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(),
>      >                         but it will create a new spark context.
>      >
> 
>     -- 
>     Best regards,
>     Maciej Szymkiewicz
> 
>     Web: https://zero323.net
>     PGP: A30CEF0C31A501EC
> 

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


Re: How can I get the same spark context in two different python processes

Posted by Kevin Su <pi...@gmail.com>.
Maciej, Thanks for the reply.
Could you share an example to achieve it?

Maciej <ms...@gmail.com> wrote on Mon, Dec 12, 2022 at 4:41 PM:

> Technically speaking, it is possible in stock distribution (can't speak
> for Databricks) and not super hard to do (just check out how we
> initialize sessions), but definitely not something that we test or
> support, especially in a scenario you described.
>
> If you want to achieve concurrent execution, multithreading is normally
> more than sufficient and avoids problems with the context.
>
>
>
> On 12/13/22 00:40, Kevin Su wrote:
> > I ran my spark job by using databricks job with a single python script.
> > IIUC, the databricks platform will create a spark context for this
> > python script.
> > However, I create a new subprocess in this script and run some spark
> > code in this subprocess, but this subprocess can't find the
> > context created by databricks.
> > Not sure if there is any api I can use to get the default context.
> >
> > bo yang <bobyangbo@gmail.com> wrote on Mon, Dec 12, 2022 at 3:27 PM:
> >
> >     In theory, maybe a Jupyter notebook or something similar could
> >     achieve this? e.g. running some Jupyter kernel inside Spark driver,
> >     then another Python process could connect to that kernel.
> >
> >     But in the end, this is like Spark Connect :)
> >
> >
> >     On Mon, Dec 12, 2022 at 2:55 PM Kevin Su <pingsutw@gmail.com> wrote:
> >
> >         Also, is there any way to work around this issue without
> >         using Spark Connect?
> >
> >         Kevin Su <pingsutw@gmail.com> wrote on Mon, Dec 12, 2022 at 2:52 PM:
> >
> >             nvm, I found the ticket.
> >             Also, is there any way to work around this issue without
> >             using Spark Connect?
> >
> >             Kevin Su <pingsutw@gmail.com> wrote on Mon, Dec 12, 2022 at 2:42 PM:
> >
> >                 Thanks for the quick response! Do we have any PR or Jira
> >                 ticket for it?
> >
> >                 Reynold Xin <rxin@databricks.com> wrote on Mon, Dec 12, 2022 at 2:39 PM:
> >
> >                     Spark Connect :)
> >
> >                     (It’s work in progress)
> >
> >
> >                     On Mon, Dec 12 2022 at 2:29 PM, Kevin Su
> >                     <pingsutw@gmail.com> wrote:
> >
> >                         Hey there, how can I get the same spark context
> >                         in two different python processes?
> >                         Let’s say I create a context in Process A, and
> >                         then I want to use python subprocess B to get
> >                         the spark context created by Process A. How can
> >                         I achieve that?
> >
> >                         I've tried
> >                         pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(),
> >                         but it will create a new spark context.
> >
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> PGP: A30CEF0C31A501EC
>
>

Re: How can I get the same spark context in two different python processes

Posted by Maciej <ms...@gmail.com>.
Technically speaking, it is possible in stock distribution (can't speak 
for Databricks) and not super hard to do (just check out how we 
initialize sessions), but definitely not something that we test or 
support, especially in a scenario you described.

If you want to achieve concurrent execution, multithreading is normally 
more than sufficient and avoids problems with the context.
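
For example, a minimal sketch (with made-up job logic; any callable that uses
the shared session works the same way) of concurrent execution from several
driver threads against a single session:

from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark").getOrCreate()


def count_multiples(n: int) -> int:
    # Each call submits its own Spark job through the shared session;
    # Spark's scheduler accepts jobs from multiple driver threads.
    return spark.range(0, 1_000_000).filter(f"id % {n} = 0").count()


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(count_multiples, [2, 3, 5, 7]))

print(results)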



On 12/13/22 00:40, Kevin Su wrote:
> I ran my spark job by using databricks job with a single python script.
> IIUC, the databricks platform will create a spark context for this 
> python script.
> However, I create a new subprocess in this script and run some spark 
> code in this subprocess, but this subprocess can't find the 
> context created by databricks.
> Not sure if there is any api I can use to get the default context.
> 
> bo yang <bobyangbo@gmail.com> wrote on Mon, Dec 12, 2022 at 3:27 PM:
> 
>     In theory, maybe a Jupyter notebook or something similar could
>     achieve this? e.g. running some Jupyter kernel inside Spark driver,
>     then another Python process could connect to that kernel.
> 
>     But in the end, this is like Spark Connect :)
> 
> 
>     On Mon, Dec 12, 2022 at 2:55 PM Kevin Su <pingsutw@gmail.com> wrote:
> 
>         Also, is there any way to work around this issue without
>         using Spark Connect?
> 
>         Kevin Su <pingsutw@gmail.com> wrote on Mon, Dec 12, 2022 at 2:52 PM:
> 
>             nvm, I found the ticket.
>             Also, is there any way to work around this issue without
>             using Spark Connect?
> 
>             Kevin Su <pingsutw@gmail.com> wrote on Mon, Dec 12, 2022 at 2:42 PM:
> 
>                 Thanks for the quick response! Do we have any PR or Jira
>                 ticket for it?
> 
>                 Reynold Xin <rxin@databricks.com> wrote on Mon, Dec 12, 2022 at 2:39 PM:
> 
>                     Spark Connect :)
> 
>                     (It’s work in progress)
> 
> 
>                     On Mon, Dec 12 2022 at 2:29 PM, Kevin Su
>                     <pingsutw@gmail.com> wrote:
> 
>                         Hey there, how can I get the same spark context
>                         in two different python processes?
>                         Let’s say I create a context in Process A, and
>                         then I want to use python subprocess B to get
>                         the spark context created by Process A. How can
>                         I achieve that?
> 
>                         I've tried
>                         pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(),
>                         but it will create a new spark context.
> 

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC


Re: How can I get the same spark context in two different python processes

Posted by Kevin Su <pi...@gmail.com>.
I ran my spark job by using databricks job with a single python script.
IIUC, the databricks platform will create a spark context for this python
script.
However, I create a new subprocess in this script and run some spark code
in this subprocess, but this subprocess can't find the context created by
databricks.
Not sure if there is any api I can use to get the default context.
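
A simplified sketch of the pattern (the file names and the toy logic are
placeholders, not the real job):

# main.py: entry point of the Databricks job; a Spark context already exists here
import subprocess
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # the context the platform created
print(spark.sparkContext.applicationId)

# child.py runs in a separate interpreter: it inherits no Py4j gateway from
# this process, so SparkSession.builder.getOrCreate() over there tries to
# spin up a brand-new Spark context instead of finding this one.
subprocess.run([sys.executable, "child.py"], check=True)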

bo yang <bo...@gmail.com> wrote on Mon, Dec 12, 2022 at 3:27 PM:

> In theory, maybe a Jupyter notebook or something similar could achieve
> this? e.g. running some Jupyter kernel inside Spark driver, then another
> Python process could connect to that kernel.
>
> But in the end, this is like Spark Connect :)
>
>
> On Mon, Dec 12, 2022 at 2:55 PM Kevin Su <pi...@gmail.com> wrote:
>
>> Also, is there any way to work around this issue without using Spark
>> Connect?
>>
>> Kevin Su <pi...@gmail.com> wrote on Mon, Dec 12, 2022 at 2:52 PM:
>>
>>> nvm, I found the ticket.
>>> Also, is there any way to work around this issue without using Spark
>>> Connect?
>>>
>>> Kevin Su <pi...@gmail.com> wrote on Mon, Dec 12, 2022 at 2:42 PM:
>>>
>>>> Thanks for the quick response! Do we have any PR or Jira ticket for it?
>>>>
>>>> Reynold Xin <rx...@databricks.com> wrote on Mon, Dec 12, 2022 at 2:39 PM:
>>>>
>>>>> Spark Connect :)
>>>>>
>>>>> (It’s work in progress)
>>>>>
>>>>>
>>>>> On Mon, Dec 12 2022 at 2:29 PM, Kevin Su <pi...@gmail.com> wrote:
>>>>>
>>>>>> Hey there, how can I get the same spark context in two different
>>>>>> python processes?
>>>>>> Let’s say I create a context in Process A, and then I want to use
>>>>>> python subprocess B to get the spark context created by Process A. How can
>>>>>> I achieve that?
>>>>>>
>>>>>> I've
>>>>>> tried pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), but
>>>>>> it will create a new spark context.
>>>>>>
>>>>>

Re: How can I get the same spark context in two different python processes

Posted by bo yang <bo...@gmail.com>.
In theory, maybe a Jupyter notebook or something similar could achieve
this? e.g. running some Jupyter kernel inside Spark driver, then another
Python process could connect to that kernel.

But in the end, this is like Spark Connect :)


On Mon, Dec 12, 2022 at 2:55 PM Kevin Su <pi...@gmail.com> wrote:

> Also, is there any way to work around this issue without using Spark
> Connect?
>
> Kevin Su <pi...@gmail.com> wrote on Mon, Dec 12, 2022 at 2:52 PM:
>
>> nvm, I found the ticket.
>> Also, is there any way to work around this issue without using Spark
>> Connect?
>>
>> Kevin Su <pi...@gmail.com> wrote on Mon, Dec 12, 2022 at 2:42 PM:
>>
>>> Thanks for the quick response! Do we have any PR or Jira ticket for it?
>>>
>>> Reynold Xin <rx...@databricks.com> wrote on Mon, Dec 12, 2022 at 2:39 PM:
>>>
>>>> Spark Connect :)
>>>>
>>>> (It’s work in progress)
>>>>
>>>>
>>>> On Mon, Dec 12 2022 at 2:29 PM, Kevin Su <pi...@gmail.com> wrote:
>>>>
>>>>> Hey there, how can I get the same spark context in two different
>>>>> python processes?
>>>>> Let’s say I create a context in Process A, and then I want to use
>>>>> python subprocess B to get the spark context created by Process A. How can
>>>>> I achieve that?
>>>>>
>>>>> I've
>>>>> tried pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), but
>>>>> it will create a new spark context.
>>>>>
>>>>

Re: How can I get the same spark context in two different python processes

Posted by Kevin Su <pi...@gmail.com>.
Also, is there any way to work around this issue without using Spark Connect?

Kevin Su <pi...@gmail.com> wrote on Mon, Dec 12, 2022 at 2:52 PM:

> nvm, I found the ticket.
> Also, is there any way to work around this issue without using Spark
> Connect?
>
> Kevin Su <pi...@gmail.com> wrote on Mon, Dec 12, 2022 at 2:42 PM:
>
>> Thanks for the quick response! Do we have any PR or Jira ticket for it?
>>
>> Reynold Xin <rx...@databricks.com> wrote on Mon, Dec 12, 2022 at 2:39 PM:
>>
>>> Spark Connect :)
>>>
>>> (It’s work in progress)
>>>
>>>
>>> On Mon, Dec 12 2022 at 2:29 PM, Kevin Su <pi...@gmail.com> wrote:
>>>
>>>> Hey there, how can I get the same spark context in two different python
>>>> processes?
>>>> Let’s say I create a context in Process A, and then I want to use
>>>> python subprocess B to get the spark context created by Process A. How can
>>>> I achieve that?
>>>>
>>>> I've
>>>> tried pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), but
>>>> it will create a new spark context.
>>>>
>>>

Re: How can I get the same spark context in two different python processes

Posted by Reynold Xin <rx...@databricks.com.INVALID>.
Spark Connect :)

(It’s work in progress)
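
The intended shape is roughly this: each Python process creates its own
lightweight client session against a shared Spark Connect server (the builder
API, address, and port below are assumptions based on the work in progress and
may change):

from pyspark.sql import SparkSession

# Any process, including a subprocess, connects as a thin client to the same
# Spark Connect server running alongside the driver.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print(spark.range(10).count())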

On Mon, Dec 12 2022 at 2:29 PM, Kevin Su <pingsutw@gmail.com> wrote:

> 
> Hey there, how can I get the same spark context in two different python
> processes?
> Let’s say I create a context in Process A, and then I want to use python
> subprocess B to get the spark context created by Process A. How can I
> achieve that?
> 
> 
> I've tried
> pyspark.sql.SparkSession.builder.appName("spark").getOrCreate(), but it
> will create a new spark context.
>