Posted to issues@spark.apache.org by "Bijay Kumar Pathak (JIRA)" <ji...@apache.org> on 2016/03/24 05:44:25 UTC

[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

    [ https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209743#comment-15209743 ] 

Bijay Kumar Pathak commented on SPARK-8632:
-------------------------------------------

I am still hitting this issue in Spark 1.6.0 on EMR: the job fails with an OOM. I have a DataFrame with 250 columns and apply a UDF to more than 50 of them. I register the DataFrame as a temp table and apply the UDFs in a hive_context SQL statement, after a sort-merge join of two DataFrames (each around 4 GB) and multiple broadcast joins against 22 dimension tables.

Below is how I am applying the UDF.

```python
data_frame.registerTempTable("temp_table")
new_df = hive_context.sql("select python_udf(column_1), python_udf(column_2), ... from temp_table")
```
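With 50+ UDF calls in one SELECT, building the column list programmatically keeps the statement manageable. This is only a sketch of that idea: the column names, the `python_udf` name, and the 50/250 split are assumptions for illustration, not details from the report above.

```python
# Hypothetical sketch: generate the SELECT list instead of writing 50+
# python_udf(...) calls by hand. All names here are illustrative.
udf_cols = ["column_{}".format(i) for i in range(1, 51)]            # columns fed to the UDF
passthrough_cols = ["column_{}".format(i) for i in range(51, 251)]  # columns passed through as-is

select_exprs = ["python_udf({0}) AS {0}".format(c) for c in udf_cols] + passthrough_cols
query = "SELECT {} FROM temp_table".format(", ".join(select_exprs))

# new_df = hive_context.sql(query)   # would run against the registered temp table
```

Keeping the untouched columns as plain pass-through expressions (rather than wrapping everything in the UDF) at least limits how much work goes through the Python worker.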


> Poor Python UDF performance because of RDD caching
> --------------------------------------------------
>
>                 Key: SPARK-8632
>                 URL: https://issues.apache.org/jira/browse/SPARK-8632
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0
>            Reporter: Justin Uang
>            Assignee: Davies Liu
>            Priority: Blocker
>             Fix For: 1.5.1, 1.6.0
>
>
> {quote}
> We have been running into performance problems using Python UDFs with DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was to reuse the PythonRDD code. It caches the entire child RDD so that it can make two passes over the data: one to feed the PythonRDD, and one to join the Python lambda results with the original rows (which may contain Java objects that should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be processed by the Python UDF. In the case I was working with, I had a 500-column table, and I wanted to use a Python UDF for one column, and it ended up caching all 500 columns.
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html
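A rough plain-Python model of the two-pass pattern described in the quote (this is not Spark source code; the function and its shape are illustrative only): the evaluator materializes every full row, runs the Python lambda over the UDF's input column, then joins the results back onto the cached rows, so all columns are held even when the UDF reads only one.

```python
# Illustrative model of the two-pass evaluation described above; not Spark internals.
def batch_evaluate(rows, udf, udf_col):
    cached = list(rows)                                 # pass 1: cache every full row
    results = [udf(row[udf_col]) for row in cached]     # run the Python lambda
    # pass 2: join lambda results back onto the original rows
    return [dict(row, **{udf_col: res}) for row, res in zip(cached, results)]

rows = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
out = batch_evaluate(rows, lambda v: v * 10, "a")
```

The cost the reporter describes falls out of `cached`: it holds every column of every row for the duration of both passes, regardless of how few columns the UDF actually touches.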



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org