Posted to user@spark.apache.org by "Eskilson,Aleksander" <Al...@Cerner.com> on 2015/07/06 21:55:43 UTC

User Defined Functions - Execution on Clusters

Hi there,

I’m trying to get a feel for how User Defined Functions from SparkSQL (as written in Python and registered using the udf function from pyspark.sql.functions) are run behind the scenes. Trying to grok the source, it seems that the native Python function is serialized for distribution to the cluster. In practice, it seems to be able to check for other variables and functions defined elsewhere in the namespace and include those in the function’s serialization.
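
That capture step can be sketched in pure Python. This is not PySpark’s actual serializer (PySpark uses cloudpickle), just an illustration of the information — closure cells and free-variable names — that such a serializer has to walk and carry along with the function:

```python
def make_udf(offset):
    # 'offset' is a free variable the serializer must capture
    def add_offset(x):
        return x + offset
    return add_offset

f = make_udf(10)

# The captured values live in closure cells, in the same order
# as the free-variable names recorded on the code object.
captured = [cell.cell_contents for cell in f.__closure__]
print(captured)                 # [10]
print(f.__code__.co_freevars)   # ('offset',)
```

cloudpickle extends this idea to module-level globals the function body references, which is why helpers defined elsewhere in the namespace get pulled into the serialized payload.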

Following all this though, when actually run, are Python interpreter instances on each node brought up to actually run the function against the RDDs, or can the serialized function somehow be run on just the JVM? If bringing up Python instances is the execution model, what is the overhead of PySpark UDFs like as compared to those registered in Scala?
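
The execution model can be mimicked on one machine. This is a simplified sketch, not Spark’s real protocol (real PySpark reuses daemon worker processes and streams batches of pickled rows over sockets): the driver serializes the function’s code object, ships it to a separate Python interpreter, and that interpreter rebuilds the function and applies it to the data:

```python
import marshal
import pickle
import subprocess
import sys

# "Executor" side: rebuild the function from its code object and apply it.
WORKER = r'''
import sys, pickle, marshal, types
payload = pickle.load(sys.stdin.buffer)
code = marshal.loads(payload["code"])
fn = types.FunctionType(code, {})
out = [fn(x) for x in payload["data"]]
pickle.dump(out, sys.stdout.buffer)
'''

def run_on_worker(fn, data):
    # "Driver" side: serialize the code object and the rows, spawn a
    # fresh interpreter, and read the pickled results back.
    payload = pickle.dumps({"code": marshal.dumps(fn.__code__),
                            "data": data})
    proc = subprocess.run([sys.executable, "-c", WORKER],
                          input=payload, capture_output=True, check=True)
    return pickle.loads(proc.stdout)

def double(x):
    return 2 * x

print(run_on_worker(double, [1, 2, 3]))  # [2, 4, 6]
```

So the answer to the question is yes for Python-defined UDFs: the serialized function cannot run on the JVM alone; rows are handed to Python worker processes and the results handed back.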

Thanks,
Alek

CONFIDENTIALITY NOTICE This message and any included attachments are from Cerner Corporation and are intended only for the addressee. The information contained in this message is confidential and may constitute inside or non-public information under international, federal, or state securities laws. Unauthorized forwarding, printing, copying, distribution, or use of such information is strictly prohibited and may be unlawful. If you are not the addressee, please promptly delete this message and notify the sender of the delivery error by e-mail or you may call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024.

Re: User Defined Functions - Execution on Clusters

Posted by "Eskilson,Aleksander" <Al...@Cerner.com>.
Interesting, thanks for the heads up.

On 7/6/15, 7:19 PM, "Davies Liu" <da...@databricks.com> wrote:

>Currently, Python UDFs run in Python worker processes and are MUCH
>slower than Scala ones (from 10 to 100x). There is a JIRA to improve
>the performance: https://issues.apache.org/jira/browse/SPARK-8632.
>Even after that, they will still be much slower than Scala ones
>(because Python itself is slower, plus the overhead of calling into
>Python).
>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: User Defined Functions - Execution on Clusters

Posted by Davies Liu <da...@databricks.com>.
Currently, Python UDFs run in Python worker processes and are MUCH slower than Scala ones (from 10 to 100x). There is a JIRA to improve the performance: https://issues.apache.org/jira/browse/SPARK-8632. Even after that, they will still be much slower than Scala ones (because Python itself is slower, plus the overhead of calling into Python).
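
A rough single-machine analogy for that overhead (this is not Spark itself): a Python UDF pays one interpreter-level function call per row, while a built-in expression runs its loop at a lower level. The real gap on a cluster is larger still, because rows must also be pickled and shipped to the Python workers and back:

```python
import time

data = ["spark"] * 200_000

def upper_udf(s):
    # Stand-in for a Python UDF: one interpreter call per row
    return s.upper()

t0 = time.perf_counter()
via_udf = [upper_udf(s) for s in data]
t1 = time.perf_counter()
# C-level loop, loosely analogous to a built-in column expression
via_builtin = list(map(str.upper, data))
t2 = time.perf_counter()

print(f"per-row Python calls: {t1 - t0:.3f}s, builtin loop: {t2 - t1:.3f}s")
```

The practical takeaway from this thread: prefer the built-in functions in pyspark.sql.functions where they exist, and reach for a Python UDF only when no built-in expresses the logic.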

