Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2016/04/02 03:06:25 UTC

[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

    [ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222611#comment-15222611 ] 

Josh Rosen commented on SPARK-14083:
------------------------------------

Null-handling is going to present a major design challenge here. If we want to exactly preserve the behavior of the Java closure, then we need to ensure that our translation does not introduce implicit null handling that differs from the JVM's own. For example (see the sketch after this list):

- What happens today if a user calls .getInt(...) on a column whose value is null? We need to preserve the current behavior.
- What if a user calls .getString(...).equals("foo") on a row where the string column is null? Today the user's code throws a NullPointerException, so to preserve this behavior we might need to add an expression which throws exceptions on null values.
- In Scala, casting null to a numeric type with asInstanceOf returns the zero value of that type (e.g. null.asInstanceOf[Int] == 0), whereas SQL casts preserve nulls.
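
To make the second bullet concrete, here is a minimal sketch of the behavioral mismatch (the Person class and its data are made up for illustration, and a SparkSession named spark is assumed to be in scope):

{code}
import spark.implicits._
import org.apache.spark.sql.functions.col

case class Person(name: String, age: Int)
val ds = Seq(Person("alice", 30), Person(null, 40)).toDS()

// Closure semantics: the JVM dereferences the null field and throws a
// NullPointerException when the second row is processed.
ds.map(_.name.equals("foo")).collect()

// SQL semantics: col("name") === "foo" evaluates to NULL on the null row,
// so that row is silently dropped instead of throwing.
ds.filter(col("name") === "foo").collect()
{code}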

I think we have no choice but to faithfully preserve Java's null semantics: if we didn't, then subtle differences in Java closures or in compilers' emitted bytecode could silently alter the results of queries.
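
Following the second bullet above, here is a toy model of such a null-asserting expression (this is not the real Catalyst API; the names and the map-based row representation are made up for illustration):

{code}
// Toy expression tree with interpreted evaluation; rows are modeled as
// simple name -> value maps purely for illustration.
sealed trait Expr { def eval(row: Map[String, Any]): Any }

case class Col(name: String) extends Expr {
  def eval(row: Map[String, Any]): Any = row.getOrElse(name, null)
}

// Wraps a child expression and throws, just as the original closure would,
// instead of silently propagating SQL NULL up the expression tree.
case class ThrowOnNull(child: Expr) extends Expr {
  def eval(row: Map[String, Any]): Any = {
    val v = child.eval(row)
    if (v == null) throw new NullPointerException(s"null value in $child")
    v
  }
}

// ThrowOnNull(Col("name")).eval(Map("name" -> null))  // throws NPE
{code}

A translation of .getString(...).equals("foo") would then wrap the column access in ThrowOnNull before the equality comparison, instead of relying on SQL's null-propagating comparison.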

> Analyze JVM bytecode and turn closures into Catalyst expressions
> ----------------------------------------------------------------
>
>                 Key: SPARK-14083
>                 URL: https://issues.apache.org/jira/browse/SPARK-14083
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> One big advantage of the Dataset API is its type safety, which comes at a performance cost due to the heavy reliance on user-defined closures/lambdas. These closures are typically slower than expressions because we have more flexibility to optimize expressions (known data types, no virtual function calls, etc.). In many cases it's actually not going to be very difficult to look into the bytecode of these closures and figure out what they are trying to do. If we can understand them, then we can turn them directly into Catalyst expressions for more optimized execution.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), lit(18))
> df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for bytecode analysis and to use it to convert closures/lambdas into Catalyst expressions in order to speed up Dataset execution. It is a little bit futuristic, but I believe it is very doable. The framework should be easy to reason about (e.g. similar to Catalyst).
> Note that there is a big emphasis on "small" and "easy to reason about". A patch should be rejected if it is too complicated or difficult to reason about.
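
As a starting point for such a framework, here is a rough sketch of the bytecode-inspection step. It assumes the ASM library is on the classpath and only works for closures that compile to ordinary anonymous classes (e.g. Scala 2.11); Java 8-style lambdas are generated at runtime and would need a different lookup:

{code}
import org.objectweb.asm.ClassReader
import org.objectweb.asm.tree.ClassNode
import org.objectweb.asm.util.Printer
import scala.collection.JavaConverters._

def dumpApply(closure: AnyRef): Unit = {
  val clazz = closure.getClass
  val resource = clazz.getName.replace('.', '/') + ".class"
  val in = clazz.getClassLoader.getResourceAsStream(resource)
  require(in != null, s"no class file for $resource (runtime-generated lambda?)")
  val node = new ClassNode()
  new ClassReader(in).accept(node, 0)
  for (m <- node.methods.asScala if m.name == "apply") {
    println(s"${m.name}${m.desc}:")
    // Each opcode (ALOAD, GETFIELD, IADD, IF_ICMPLE, ...) is a candidate
    // for pattern-matching into col(...), Add(...), GreaterThan(...), etc.
    for (insn <- m.instructions.iterator().asScala if insn.getOpcode >= 0) {
      println("  " + Printer.OPCODES(insn.getOpcode))
    }
  }
}
{code}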



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org