Posted to issues@spark.apache.org by "Jacob Eisinger (JIRA)" <ji...@apache.org> on 2016/09/29 17:44:20 UTC

[jira] [Created] (SPARK-17728) UDFs are run too many times

Jacob Eisinger created SPARK-17728:
--------------------------------------

Summary: UDFs are run too many times
Key: SPARK-17728
URL: https://issues.apache.org/jira/browse/SPARK-17728
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.0.0
Environment: Databricks Cloud / Spark 2.0.0
Reporter: Jacob Eisinger
Priority: Minor
Attachments: Defect - Over Optimized UDF.html

h3. Background
UDFs are very useful in Spark, particularly for wrapping longer-running work such as analytics or calls to external services. The result is often not a single field but a structure with several attributes. When that structure is broken out into separate columns, it is critical that the query is optimized correctly, because every extra UDF invocation repeats the expensive work.

h3. Steps to Reproduce
# Create some sample data.
# Create a UDF that returns multiple attributes.
# Run the UDF over the sample data.
# Create new columns from the multiple attributes.
# Observe run time.
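
The steps above might look like the following in PySpark. This is a minimal sketch under stated assumptions, not the code from the attached notebook: `lookup`, `count_udf_calls`, and the column names are hypothetical, and an accumulator that counts UDF invocations stands in for observing run time. Whether the count exceeds the number of rows depends on the Spark version and on the plan the optimizer produces.

```python
def lookup(record_id):
    """Hypothetical stand-in for an expensive call; returns two attributes."""
    return (record_id * 2, "id-%d" % record_id)


def count_udf_calls(spark, rows=5):
    """Run the steps on `rows` rows and return how many times `lookup` ran."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    calls = spark.sparkContext.accumulator(0)

    def counted(record_id):
        calls.add(1)  # one increment per UDF invocation
        return lookup(record_id)

    # The UDF returns a struct with two attributes.
    schema = StructType([
        StructField("doubled", LongType()),
        StructField("label", StringType()),
    ])
    enrich = F.udf(counted, schema)

    data = spark.range(rows)                                # step 1: sample data
    with_result = data.withColumn("result", enrich("id"))   # steps 2-3: apply the UDF
    expanded = (
        with_result
        .withColumn("doubled", F.col("result.doubled"))     # step 4: new columns
        .withColumn("label", F.col("result.label"))
    )
    expanded.collect()                                      # step 5: force execution
    return calls.value
```

With an active `SparkSession` named `spark`, a return value from `count_udf_calls(spark)` that is larger than 5 would indicate that the UDF ran more than once per row.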

h3. Actual Results
The UDF is executed multiple times **per row.**

h3. Expected Results
The UDF should only be executed once **per row.**

h3. Workaround
Cache the Dataset after applying the UDF and before creating the new columns.
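
A minimal sketch of this workaround in PySpark follows. The helper name `expand_struct_once` and its parameters are hypothetical. It assumes that `df` already carries the UDF result in a struct column, and that calling `count()` on the cached DataFrame is an acceptable way to materialize it before the attributes are split out.

```python
def expand_struct_once(df, struct_col, fields):
    """Cache `df`, materialize it, then break `struct_col` out into columns.

    `df` is expected to be a DataFrame whose column `struct_col` holds the
    struct returned by the UDF; `fields` lists the attributes to extract.
    """
    cached = df.cache()   # mark the UDF output for caching
    cached.count()        # run an action so the cache is actually populated
    # Later field accesses are meant to read the cached struct values
    # rather than re-run the UDF.
    paths = ["%s.%s" % (struct_col, name) for name in fields]
    return cached.select("*", *paths)
```

For example, `expand_struct_once(with_result, "result", ["doubled", "label"])` would keep the existing columns and add `doubled` and `label`. Caching costs memory or disk in proportion to the size of the Dataset, so this trades storage for fewer UDF calls.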

h3. Details
See attached Databricks Notebook.
