Posted to issues@spark.apache.org by "Jacob Eisinger (JIRA)" <ji...@apache.org> on 2016/09/29 17:44:20 UTC

[jira] [Created] (SPARK-17728) UDFs are run too many times

Jacob Eisinger created SPARK-17728:
--------------------------------------

             Summary: UDFs are run too many times
                 Key: SPARK-17728
                 URL: https://issues.apache.org/jira/browse/SPARK-17728
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.0
         Environment: Databricks Cloud / Spark 2.0.0
            Reporter: Jacob Eisinger
            Priority: Minor
         Attachments: Defect - Over Optimized UDF.html

h3. Background
UDFs are very useful in Spark, particularly for long-running work such as running analytics or contacting external services. Such a UDF may return a structure of information rather than a single field. When that structure is broken out into separate columns, it is critical that the query is optimized correctly, because every extra invocation repeats the expensive work.

h3. Steps to Reproduce
# Create some sample data.
# Create a UDF that returns multiple attributes.
# Run the UDF over the sample data.
# Create new columns from the multiple attributes.
# Observe the run time.
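
The steps above could be sketched roughly as follows. This is not the attached notebook: the {{Analysis}} case class, the column names, and the sample values are hypothetical, and an accumulator is used to count invocations directly instead of measuring run time. The count you see may depend on the Spark version and optimizer settings, and accumulator updates made inside transformations can be over-counted if tasks are retried.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical result type; the report only says the UDF returns "multiple attributes".
case class Analysis(length: Int, upper: String)

object UdfCallCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("udf-call-count")
      .getOrCreate()
    import spark.implicits._

    // Counts how many times the UDF body actually runs.
    val calls = spark.sparkContext.longAccumulator("udfCalls")

    // Step 2: a UDF that returns a struct with two attributes.
    val analyze = udf { (s: String) =>
      calls.add(1)
      Analysis(s.length, s.toUpperCase)
    }

    // Step 1: some sample data.
    val data = Seq("a", "bb", "ccc").toDF("text")

    // Step 3: run the UDF over the data.
    val withResult = data.withColumn("result", analyze(col("text")))

    // Step 4: create new columns from the struct's attributes.
    val expanded = withResult
      .withColumn("length", col("result.length"))
      .withColumn("upper", col("result.upper"))
      .drop("result")

    // Step 5: force execution, then compare UDF calls with the row count.
    expanded.collect()
    println(s"rows = ${data.count()}, UDF calls = ${calls.value}")

    spark.stop()
  }
}
```

If the UDF call count printed at the end exceeds the row count, the UDF ran more than once per row, which is the behavior described under Actual Results.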

h3. Actual Results
The UDF is executed multiple times **per row.**

h3. Expected Results
The UDF should only be executed once **per row.**

h3. Workaround
Cache the Dataset after UDF execution.
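
A minimal sketch of this workaround, assuming the UDF result is held in a struct column named {{result}} (the {{Scored}} type, the column names, and the sample values are hypothetical). The idea is to materialize the UDF output once with {{cache()}} and derive the new columns from the cached Dataset, so that later projections read stored values instead of re-evaluating the UDF. {{cache()}} is lazy, so an action is needed before the derived columns are computed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical result type, used only for illustration.
case class Scored(score: Double, label: String)

object CachedUdfResult {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cached-udf-result")
      .getOrCreate()
    import spark.implicits._

    // Counts how many times the UDF body actually runs.
    val calls = spark.sparkContext.longAccumulator("udfCalls")

    val score = udf { (x: Int) =>
      calls.add(1)
      Scored(x * 0.5, if (x % 2 == 0) "even" else "odd")
    }

    // Run the UDF and mark the result for caching.
    val scored = Seq(1, 2, 3, 4).toDF("x")
      .withColumn("result", score(col("x")))
      .cache()

    // cache() is lazy; this action populates the cache.
    scored.count()

    // Break the struct out into columns, reading from the cached data.
    val expanded = scored.select(col("x"), col("result.score"), col("result.label"))
    expanded.collect()

    println(s"UDF calls after expanding from cache = ${calls.value}")

    scored.unpersist()
    spark.stop()
  }
}
```

The trade-off is that the cached Dataset occupies executor memory (or spills to disk) until it is unpersisted, so this suits cases where the UDF is expensive relative to the size of its output.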

h3. Details
See attached Databricks Notebook.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
