Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2016/11/21 22:16:58 UTC

[jira] [Commented] (SPARK-18532) Code generation memory issue

    [ https://issues.apache.org/jira/browse/SPARK-18532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684908#comment-15684908 ] 

Herman van Hovell commented on SPARK-18532:
-------------------------------------------

The code generated by whole-stage code generation is hitting a JVM limit: the JVM does not allow a single method to grow beyond 64 KB. Spark should fall back to a non-codegenerated path in that case, which is what you see happening with 20 columns. It should not OOM, but that depends on the complexity of the UDF used and the number of conversions Spark has to do when invoking the UDF.
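As a possible workaround (a sketch, assuming the problem really is whole-stage codegen and not expression codegen), you could disable whole-stage code generation for the session and force the iterator-based path:
{noformat}
// Disable whole-stage code generation for the current session; Spark then
// uses the interpreted iterator path instead of one large generated method.
spark.conf.set("spark.sql.codegen.wholeStage", false)
{noformat}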

Could you share a reproducible example? Otherwise, an overview of the query plan (df.explain()) and the generated code would also help.

You can get the generated code by executing the following lines in the shell:
{noformat}
import org.apache.spark.sql.execution.debug._
df.debugCodegen
{noformat}
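For the plan overview, df.explain(true) in the same shell prints the parsed, analyzed, and optimized logical plans in addition to the physical plan:
{noformat}
// Extended explain: logical plans (parsed/analyzed/optimized) plus the physical plan.
df.explain(true)
{noformat}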

> Code generation memory issue
> ----------------------------
>
>                 Key: SPARK-18532
>                 URL: https://issues.apache.org/jira/browse/SPARK-18532
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2
>         Environment: osx / macbook pro / local spark
>            Reporter: Georg Heiler
>
> I am trying to create a Spark DataFrame with multiple additional columns based on conditions, like this:
> df
>     .withColumn("name1", someCondition1)
>     .withColumn("name2", someCondition2)
>     .withColumn("name3", someCondition3)
>     .withColumn("name4", someCondition4)
>     .withColumn("name5", someCondition5)
>     .withColumn("name6", someCondition6)
>     .withColumn("name7", someCondition7)
> I am faced with the following exception when more than 6 .withColumn clauses are added:
> org.codehaus.janino.JaninoRuntimeException: Code of method "()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB
> If even more columns are created, e.g. around 20, I no longer receive the aforementioned exception, but instead get the following error after 5 minutes of waiting:
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> What I want to perform is spelling/error correction. Some simple cases could be handled easily via a map & replacement in a UDF (see the sketch below). Still, several other cases with multiple chained conditions remain.
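> For illustration, a minimal sketch of the map & replacement UDF I have in mind; the dictionary and column names are hypothetical:
> import org.apache.spark.sql.functions.{col, udf}
> // Hypothetical correction dictionary; the harder chained-condition cases
> // are not covered by this simple lookup.
> val corrections = Map("teh" -> "the", "recieve" -> "receive")
> val fixSpelling = udf { (word: String) =>
>   if (word == null) null else corrections.getOrElse(word, word)
> }
> val fixed = df.withColumn("name1", fixSpelling(col("name1")))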


