Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2016/04/02 21:02:25 UTC

[jira] [Comment Edited] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

    [ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223010#comment-15223010 ] 

Josh Rosen edited comment on SPARK-14083 at 4/2/16 7:02 PM:
------------------------------------------------------------

Here's one example of how we might aim to preserve Java/Scala closure API null behavior for field accesses:

Consider the following closure:

{code}
val ds = Seq[(String, Integer)](("a", 1), ("b", 2), ("c", 3), (null, null)).toDF()
ds.filter(r => r.getInt(1) == 2).collect()
{code}

This code will fail with a NullPointerException in the {{getInt()}} call (per its contract). The closure's bytecode looks like this:

{code}
 0: aload_1
 1: iconst_1
 2: invokeinterface #22 = Method org.apache.spark.sql.Row.getInt((I)I)
 7: iconst_2
 8: if_icmpne 15
11: iconst_1
12: goto 16
15: iconst_0
16: ireturn
{code}
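
Read back into source form, the disassembly above amounts to the following (a hand-decompilation shown for reference; it is not output from the prototype):

{code}
// Hand-decompiled equivalent of the bytecode above. The branch at offset 8
// materializes the comparison result as 0/1, which is why the translated
// Catalyst expression below has an if/else wrapped in a cast to boolean.
def apply(r: org.apache.spark.sql.Row): Boolean =
  if (r.getInt(1) != 2) false else true // i.e., r.getInt(1) == 2
{code}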

My most recent prototype converts this bytecode into:

{code}
cast(if (NOT (npeonnull(_2#3) = 2)) 0 else 1 as boolean)
{code}

where {{npeonnull}} is a new non-SQL expression which throws a NullPointerException on null inputs. If we trust our nullability-analysis optimization rules, then we could add a trivial optimizer rule that eliminates {{npeonnull}} calls whose children are non-nullable.
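
To make this concrete, here is a minimal sketch of what such an expression and rule could look like ({{NpeOnNull}} and {{EliminateNpeOnNull}} are illustrative names, not the prototype's actual code):

{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.DataType

// Hypothetical expression: passes its child's value through, but throws
// instead of propagating null -- mirroring Row.getInt's contract.
case class NpeOnNull(child: Expression) extends UnaryExpression with CodegenFallback {
  override def dataType: DataType = child.dataType
  // It never *returns* null (it throws first), so it is non-nullable.
  override def nullable: Boolean = false
  override def eval(input: InternalRow): Any = {
    val result = child.eval(input)
    if (result == null) throw new NullPointerException("null field access in closure")
    result
  }
}

// Trivial optimizer rule: when nullability analysis proves the child can
// never be null, the guard is dead code and can be stripped.
object EliminateNpeOnNull extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case NpeOnNull(child) if !child.nullable => child
  }
}
{code}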

If a user wanted the SQL filter semantics here (where a null in column 2 simply fails the predicate instead of throwing), then they could rewrite their closure to

{code}
ds.filter(r => !r.isNullAt(1) && r.getInt(1) == 2)
{code}

My prototype translates this closure into

{code}
cast(if (isnull(_2#3)) 0 else if (NOT (npeonnull(_2#3) = 2)) 0 else 1 as boolean)
{code}

Again, I think that this could be easily simplified given some new optimizer rules:

- We can propagate the negation of the {{if}} condition into the attributes of the else branch.
- Therefore, when analyzing the else branch, we can conclude that column 2 is not null and can strip out the {{npeonnull}} check.
- After both optimizations, plus cast pushdown, constant folding, and a rule that rewrites {{if(condition, trueLiteral, falseLiteral)}} expressions with non-nullable conditions into the condition itself, I think we could produce exactly the same {{_2#3 = 2}} filter expression that the Catalyst expression DSL would have given us. (A rough sketch of the last two rules follows this list.)
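
Here is that rough sketch ({{PropagateNotNullIntoElse}} and {{SimplifyBooleanIf}} are hypothetical names building on the {{NpeOnNull}} sketch above, not actual prototype code):

{code}
import org.apache.spark.sql.catalyst.expressions.{Attribute, If, IsNull, Literal, Not}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

// Inside the else branch of if(isnull(a), ..., e), attribute a is known to
// be non-null, so any npeonnull(a) guard in e is dead and can be dropped.
object PropagateNotNullIntoElse extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case If(IsNull(a: Attribute), trueValue, falseValue) =>
      val prunedElse = falseValue transform {
        case NpeOnNull(child) if child.semanticEquals(a) => child
      }
      If(IsNull(a), trueValue, prunedElse)
  }
}

// After cast pushdown and constant folding turn the 0/1 literals into
// booleans, if(c, true, false) with a non-nullable condition is just c
// (and if(c, false, true) is !c). The nullability guard matters because
// if() treats a null condition like false.
object SimplifyBooleanIf extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case If(cond, Literal(true, BooleanType), Literal(false, BooleanType))
        if !cond.nullable => cond
    case If(cond, Literal(false, BooleanType), Literal(true, BooleanType))
        if !cond.nullable => Not(cond)
  }
}
{code}

For comparison, a Column-DSL filter such as {{ds.filter(ds("_2") === 2)}} gets these SQL semantics directly: a null in column 2 simply fails the predicate.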


> Analyze JVM bytecode and turn closures into Catalyst expressions
> ----------------------------------------------------------------
>
>                 Key: SPARK-14083
>                 URL: https://issues.apache.org/jira/browse/SPARK-14083
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> One big advantage of the Dataset API is type safety, which comes at a performance cost due to the heavy reliance on user-defined closures/lambdas. These closures are typically slower than expressions because we have more flexibility to optimize expressions (known data types, no virtual function calls, etc.). In many cases, it's actually not going to be very difficult to look into the bytecode of these closures and figure out what they are trying to do. If we can understand them, then we can turn them directly into Catalyst expressions for more optimized execution.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), lit(18))
> df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for bytecode analysis and use it to convert closures/lambdas into Catalyst expressions in order to speed up Dataset execution. It is a little bit futuristic, but I believe it is very doable. The framework should be easy to reason about (e.g. similar to Catalyst).
> Note the big emphasis on "small" and "easy to reason about"; a patch should be rejected if it is too complicated or difficult to reason about.


