Posted to dev@spark.apache.org by Li Jin <ic...@gmail.com> on 2022/07/19 15:54:05 UTC
How does PySpark send "import" to the worker when executing Python UDFs?
Hi,
I have a question about how "imports" get sent to the Python worker.
For example, I have
def foo(x):
    return np.abs(x)
If I run this code directly, it obviously fails with a NameError (np is
never defined in the driver process, so there is nothing to ship to the workers):
sc.parallelize([1, 2, 3]).map(foo).collect()
However, if I add the statement "import numpy as np" on the driver,
it works. So somehow the driver is sending that import to the worker when
executing foo there, but I cannot seem to find the code that does
this. Can someone please send me a pointer?
Thanks,
Li
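[Editor's note: a stdlib-only sketch of the symptom described above. Inside foo, np is just a name looked up in the function's globals at call time, so the same code succeeds or fails depending on what the executing process has bound. math stands in for numpy (and math.fabs for np.abs) so the sketch has no third-party dependencies.]

```python
import math

# The function body references "np" as a free global name; nothing is
# resolved until foo is actually called.
src = "def foo(x):\n    return np.fabs(x)\n"

# A "worker" namespace that never ran the import: calling foo raises NameError.
ns_missing = {}
exec(src, ns_missing)
try:
    ns_missing["foo"](-3.0)
    raised = False
except NameError:
    raised = True
print(raised)  # True

# The same code with a module bound under the name np: it works.
ns_ok = {"np": math}
exec(src, ns_ok)
print(ns_ok["foo"](-3.0))  # 3.0
```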
Re: How does PySpark send "import" to the worker when executing Python UDFs?
Posted by Li Jin <ic...@gmail.com>.
Aha I see. Thanks Hyukjin!
Re: How does PySpark send "import" to the worker when executing Python UDFs?
Posted by Hyukjin Kwon <gu...@gmail.com>.
This is done by cloudpickle. It pickles the global variables referenced
within the func together with it; modules among them are recorded by name
and re-imported into the function's globals on the worker.
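[Editor's note: a rough, stdlib-only sketch of the mechanism Hyukjin describes, not cloudpickle's actual implementation. The ship/receive helpers are hypothetical names; math again stands in for numpy. The idea: capture which module globals the function references, serialize the modules by name only, and re-import and rebind them when the function is reconstructed on the "worker".]

```python
import importlib
import marshal
import math
import pickle
import types

def ship(func):
    # Record which globals the function body actually names; for modules
    # among them, keep only the module *name* (pickling by reference).
    mods = {alias: mod.__name__
            for alias, mod in func.__globals__.items()
            if alias in func.__code__.co_names
            and isinstance(mod, types.ModuleType)}
    return pickle.dumps((marshal.dumps(func.__code__), func.__name__, mods))

def receive(payload):
    # On the "worker": re-import the recorded modules and rebuild the
    # function with them bound in a fresh globals dict.
    code_bytes, fname, mods = pickle.loads(payload)
    g = {alias: importlib.import_module(name) for alias, name in mods.items()}
    g["__builtins__"] = __builtins__  # so builtins still resolve when called
    return types.FunctionType(marshal.loads(code_bytes), g, fname)

np = math  # stand-in for numpy; math.fabs plays the role of np.abs

def foo(x):
    return np.fabs(x)

worker_foo = receive(ship(foo))
print(worker_foo(-2.5))  # 2.5
```

cloudpickle does considerably more (closures, pickling functions and classes by value), but this is the essence of why an import made on the driver is visible when foo runs on the worker.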