Posted to dev@spark.apache.org by Li Jin <ic...@gmail.com> on 2022/07/19 15:54:05 UTC

How does PySpark send "import" to the worker when executing Python UDFs?

Hi,

I have a question about how "imports" get sent to the Python worker.

For example, I have

def foo(x):
    return np.abs(x)

If I run this code directly, it obviously fails (because np is undefined
in the driver process):

sc.parallelize([1, 2, 3]).map(foo).collect()

However, if I add the import statement "import numpy as np" on the driver,
it works. So somehow the driver is sending those "imports" to the worker
when executing foo on the worker, but I cannot seem to find the code that
does this - can someone please send me a pointer?

Thanks,
Li

Re: How does PySpark send "import" to the worker when executing Python UDFs?

Posted by Li Jin <ic...@gmail.com>.
Aha I see. Thanks Hyukjin!

On Tue, Jul 19, 2022 at 9:09 PM Hyukjin Kwon <gu...@gmail.com> wrote:

> This is done by cloudpickle. It pickles the global variables referred to
> within the function together with it, and registers them among the globally
> imported modules.
>
> On Wed, 20 Jul 2022 at 00:55, Li Jin <ic...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a question about how "imports" get sent to the Python worker.
>>
>> For example, I have
>>
>> def foo(x):
>>     return np.abs(x)
>>
>> If I run this code directly, it obviously fails (because np is undefined
>> in the driver process):
>>
>> sc.parallelize([1, 2, 3]).map(foo).collect()
>>
>> However, if I add the import statement "import numpy as np" on the
>> driver, it works. So somehow the driver is sending those "imports" to the
>> worker when executing foo on the worker, but I cannot seem to find the
>> code that does this - can someone please send me a pointer?
>>
>> Thanks,
>> Li
>>
>

Re: How does PySpark send "import" to the worker when executing Python UDFs?

Posted by Hyukjin Kwon <gu...@gmail.com>.
This is done by cloudpickle. It pickles the global variables referred to
within the function together with it, and registers them among the globally
imported modules.

On Wed, 20 Jul 2022 at 00:55, Li Jin <ic...@gmail.com> wrote:

> Hi,
>
> I have a question about how "imports" get sent to the Python worker.
>
> For example, I have
>
> def foo(x):
>     return np.abs(x)
>
> If I run this code directly, it obviously fails (because np is undefined
> in the driver process):
>
> sc.parallelize([1, 2, 3]).map(foo).collect()
>
> However, if I add the import statement "import numpy as np" on the driver,
> it works. So somehow the driver is sending those "imports" to the worker
> when executing foo on the worker, but I cannot seem to find the code that
> does this - can someone please send me a pointer?
>
> Thanks,
> Li
>