Posted to issues@spark.apache.org by "Jacob Eisinger (JIRA)" <ji...@apache.org> on 2016/09/29 17:44:20 UTC

[jira] [Updated] (SPARK-17728) UDFs are run too many times

     [ https://issues.apache.org/jira/browse/SPARK-17728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacob Eisinger updated SPARK-17728:
-----------------------------------
    Attachment: Defect - Over Optimized UDF.html

> UDFs are run too many times
> ---------------------------
>
>                 Key: SPARK-17728
>                 URL: https://issues.apache.org/jira/browse/SPARK-17728
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>         Environment: Databricks Cloud / Spark 2.0.0
>            Reporter: Jacob Eisinger
>            Priority: Minor
>         Attachments: Defect - Over Optimized UDF.html
>
>
> h3. Background
> The UDF functionality is very useful in Spark. In particular, it suits longer-running processes that run analytics or contact external services. The response might not be a single field, but a structure of information. When breaking this structure out into separate columns, it is critical that the query is optimized correctly.
> h3. Steps to Reproduce
> # Create some sample data.
> # Create a UDF that returns multiple attributes.
> # Run the UDF over some data.
> # Create new columns from the multiple attributes.
> # Observe run time.
> h3. Actual Results
> The UDF is executed multiple times **per row.**
> h3. Expected Results
> The UDF should only be executed once **per row.**
> h3. Workaround
> Cache the Dataset after UDF execution.
> h3. Details
> See attached Databricks Notebook.
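
The difference between the actual and expected results can be sketched with a toy model in plain Python. This is not Spark code and does not reproduce the attached notebook. It assumes the extra executions arise because each derived column ends up carrying its own copy of the UDF call, which is one plausible reading of the report. `CountingUdf`, `project_inlined`, and `project_materialized` are hypothetical names introduced only for illustration.

```python
from typing import Dict, List, Tuple


class CountingUdf:
    """Hypothetical stand-in for a slow UDF that returns several attributes."""

    def __init__(self) -> None:
        self.calls = 0  # number of times the UDF body has run

    def __call__(self, value: int) -> Dict[str, int]:
        self.calls += 1
        return {"doubled": value * 2, "squared": value * value}


def project_inlined(udf: CountingUdf, rows: List[int]) -> List[Tuple[int, int]]:
    # Assumption: each derived column re-evaluates the UDF expression,
    # so the UDF runs once per extracted attribute.
    return [(udf(x)["doubled"], udf(x)["squared"]) for x in rows]


def project_materialized(udf: CountingUdf, rows: List[int]) -> List[Tuple[int, int]]:
    # The UDF result is computed once per row and then reused, which is
    # the effect the caching workaround is meant to achieve.
    results = [udf(x) for x in rows]
    return [(r["doubled"], r["squared"]) for r in results]


rows = [1, 2, 3]
inlined_udf = CountingUdf()
materialized_udf = CountingUdf()

inlined_out = project_inlined(inlined_udf, rows)
materialized_out = project_materialized(materialized_udf, rows)

# Both strategies produce the same columns; only the call counts differ.
print("inlined calls:", inlined_udf.calls)
print("materialized calls:", materialized_udf.calls)
```

With three rows and two extracted attributes, the inlined version calls the UDF twice per row and the materialized version once per row. For a UDF that contacts an external service, that multiplier applies directly to the run time.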



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)