Posted to issues@spark.apache.org by "Yahui Liu (Jira)" <ji...@apache.org> on 2021/05/25 02:46:01 UTC

[jira] [Resolved] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query performance

     [ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yahui Liu resolved SPARK-35500.
-------------------------------
    Resolution: Duplicate

> GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query performance
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-35500
>                 URL: https://issues.apache.org/jira/browse/SPARK-35500
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.5
>            Reporter: Yahui Liu
>            Priority: Minor
>              Labels: codegen
>
> Steps to reproduce:
>  # Create a new table with an array-type column: create table test_code_gen(a array<int>);
>  # Add log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = DEBUG to log4j.properties;
>  # Start spark-shell and run a query: spark.sql("select * from test_code_gen").collect
>  # Every time Dataset.collect is called, a SpecificSafeProjection class is generated, but the generated code cannot be reused because the ids embedded in two variable names, MapObjects_loopValue and MapObjects_loopIsNull, change on every run. Even though the previously generated class has been cached, the new code does not match the cache key, so it has to be compiled again, which costs time. The compile time grows with the number of columns; for a wide table it can exceed 2s.
> {code:java}
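> object MapObjects {
>   // Counter shared by every MapObjects expression in the JVM.
>   private val curId = new java.util.concurrent.atomic.AtomicInteger()
>   // Each use takes the next id, so the generated names differ from run to run.
>   val id = curId.getAndIncrement()
>   val loopValue = s"MapObjects_loopValue$id"
>   val loopIsNull = if (elementNullable) {
>     s"MapObjects_loopIsNull$id"
>   } else {
>     "false"
>   }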
> {code}
> Generated class on the first run:
> {code:java}
> class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
>  private int MapObjects_loopValue1;
>  private boolean MapObjects_loopIsNull1;
>  private UTF8String MapObjects_loopValue2;
>  private boolean MapObjects_loopIsNull2;
> }
> {code}
> Generated class on the second run of the same query:
> {code:java}
> class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
>  private int MapObjects_loopValue3;
>  private boolean MapObjects_loopIsNull3;
>  private UTF8String MapObjects_loopValue4;
>  private boolean MapObjects_loopIsNull4;
> }
> {code}
> Expectation:
> The code generated by GenerateSafeProjection should be reusable from the cache when the same query is run again.
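
The effect described in the report can be reproduced outside Spark with a small sketch. Everything here is hypothetical and for illustration only: CodegenCacheSketch, generateWithGlobalIds, generateWithLocalIds and compileCached are made-up names, and the HashMap merely stands in for a compiled-class cache keyed by source text; it is not Spark's CodeGenerator. The sketch assumes only what the report states, namely that variable names embed an id taken from a process-wide counter and that the cache lookup depends on the generated source.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class CodegenCacheSketch {
    // Process-wide counter, analogous to curId in the MapObjects excerpt.
    private static final AtomicInteger GLOBAL_ID = new AtomicInteger();

    // Hypothetical stand-in for a compiled-class cache keyed by source text.
    private static final Map<String, Object> CACHE = new HashMap<>();
    static int compileCount = 0;

    // Emits one loopValue/loopIsNull pair per column using the given id source.
    private static String generate(int columns, AtomicInteger ids) {
        StringBuilder sb = new StringBuilder("class SpecificSafeProjection {\n");
        for (int i = 0; i < columns; i++) {
            int id = ids.getAndIncrement();
            sb.append("  private int MapObjects_loopValue").append(id).append(";\n");
            sb.append("  private boolean MapObjects_loopIsNull").append(id).append(";\n");
        }
        return sb.append("}\n").toString();
    }

    // Ids keep growing across calls, so the source text never repeats.
    static String generateWithGlobalIds(int columns) {
        return generate(columns, GLOBAL_ID);
    }

    // Ids restart at 0 for every generation, so the same input gives the same text.
    static String generateWithLocalIds(int columns) {
        return generate(columns, new AtomicInteger());
    }

    // "Compiles" only when the exact source text has not been seen before.
    static Object compileCached(String source) {
        return CACHE.computeIfAbsent(source, s -> {
            compileCount++;
            return new Object();
        });
    }

    public static void main(String[] args) {
        compileCached(generateWithGlobalIds(2));
        compileCached(generateWithGlobalIds(2));
        System.out.println("global ids, compilations: " + compileCount);

        int before = compileCount;
        compileCached(generateWithLocalIds(2));
        compileCached(generateWithLocalIds(2));
        System.out.println("local ids, compilations: " + (compileCount - before));
    }
}
```

With the global counter, the two generations produce different text and both trigger a compilation. With a counter scoped to a single generation, the second call hits the cache. This is only one way to make the generated text stable; the report itself does not say how the issue was addressed.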


