
[GitHub] [spark] aokolnychyi opened a new pull request #24515: [SPARK-14083][WIP] Basic bytecode analyzer to speed up Datasets
URL: https://github.com/apache/spark/pull/24515
 
 
   **Disclaimer!** The purpose of this PoC PR is to trigger a discussion on whether it is feasible to leverage bytecode analysis to speed up Datasets. This PR shows what can be achieved by interpreting stack-based bytecode directly. The current version does not handle all edge cases. If the community agrees that bytecode analysis can give substantial benefits, we will need to define the scope (i.e., which cases we are targeting) and decide which approach to take (e.g., stack-based bytecode interpretation, control-flow graphs, AST, Jimple). I would like to emphasize that this PR doesn't answer those questions. Instead, it shows what can be done with the simplest approach. Also, there is no intention to integrate this logic into Spark directly; it might live in a separate package.
   
   ### Scope
   
   Bytecode analysis can be used for several purposes:
   - Provide information on what's going on inside closures so that Spark can perform additional optimizations while still relying on closures during execution (e.g., analyze which fields are accessed/modified and use that information to optimize the plan).
   - Replace typed operations that rely on closures with equivalent untyped operations that rely on Catalyst expressions.
   
   Rewriting closures is the more challenging option, but it gives more benefits, so this PR focuses on the second use case.
   
   Right now, the code covers typed map and filter operations that involve primitive values, boxed values (not all of them are implemented yet), and objects that are represented as structs (e.g., case classes). The same logic can be applied to UDFs/UDAFs and other operations.
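   For illustration, this is the kind of rewrite being targeted (a hand-written before/after sketch, not output of this PR):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.col
   
   case class Person(name: String, age: Int)
   
   object TypedVsUntyped {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
       import spark.implicits._
   
       val ds = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()
   
       // Typed filter: the predicate is an opaque closure, so each row has to be
       // deserialized into a Person before the closure can run, and the predicate
       // cannot be pushed down or fully optimized by Catalyst.
       val typed = ds.filter(p => p.age > 21)
   
       // Equivalent untyped filter: the predicate is a Catalyst expression, so it
       // benefits from codegen, pruning, and pushdown without object conversion.
       val untyped = ds.filter(col("age") > 21)
   
       typed.explain()
       untyped.explain()
       spark.stop()
     }
   }
   ```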
   
   ### Algorithm
   
   There are two ways to derive Catalyst expressions:
   - Translate stack-based bytecode directly.
   - Convert bytecode into an intermediate format that is easier to work with (e.g., control-flow graphs, AST, Jimple).
   
   Both approaches have their own tradeoffs, which are well described in the comments on [SPARK-14083](https://issues.apache.org/jira/browse/SPARK-14083). This PR translates stack-based bytecode directly and simulates what happens to the operand stack. The optimizer is then supposed to simplify the derived expression. I think it is valid to start with the simplest approach and evaluate its limitations before considering a more advanced implementation.
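   To make this concrete, here is a hand simulation for a trivial closure (an illustration of the idea, not code from this PR):
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{Add, AttributeReference, Literal}
   import org.apache.spark.sql.types.IntegerType
   
   // The closure (x: Int) => x + 1 compiles to roughly the following bytecode:
   //   ILOAD 0    // push the argument x onto the operand stack
   //   ICONST_1   // push the constant 1
   //   IADD       // pop both values, push their sum
   //   IRETURN    // return the top of the stack
   //
   // Simulating the same instructions with Catalyst expressions on the stack
   // (assuming x has already been mapped to the input attribute `x`) leaves the
   // expression computed by the closure on top of the stack:
   val x = AttributeReference("x", IntegerType)()
   val result = Add(x, Literal(1)) // what the simulated IRETURN pops off the stack
   ```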
   
   Originally, we wanted to cover only trivial cases. However, the current scope is much bigger, so we should consider an intermediate format, and we would be glad to hear more opinions. As mentioned before, the decision depends heavily on our target use cases.
   
   This PR adds a new optimizer rule `RewriteTypedOperations` that uses `TypedOperations` to convert typed operations into untyped ones. `TypedOperations` follows a trivial algorithm, which is described below.
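   A minimal sketch of the shape such a rule can take (the object name and the `convertFilter` helper below are placeholders, not this PR's actual API):
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.Expression
   import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, TypedFilter}
   import org.apache.spark.sql.catalyst.rules.Rule
   
   object RewriteTypedOperationsSketch extends Rule[LogicalPlan] {
   
     // Placeholder: returns Some(predicate) if the closure bytecode could be
     // translated into a Catalyst expression, None otherwise.
     private def convertFilter(tf: TypedFilter): Option[Expression] = None
   
     override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
       // Replace a typed filter with an ordinary Filter when translation succeeds;
       // otherwise leave the node untouched and keep executing the closure.
       case tf: TypedFilter =>
         convertFilter(tf).map(cond => Filter(cond, tf.child)).getOrElse(tf)
     }
   }
   ```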
   
   #### Step 1: Get closure bytecode
   
   First of all, we need to get the bytecode of the closure. Scala 2.12 migrated to Java lambdas and uses `LambdaMetafactory` (LMF) (there seem to be rare cases when Scala doesn't use LMF). This PR relies on the existing logic in Spark to obtain a `SerializedLambda`, which has enough information to locate the bytecode of the closure.
   
   Scala uses "adapted" methods for LMF to encapsulate differences in boxing semantics (i.e., unboxing null in Scala gives 0 and NPE in Java). The current code will obtain bytecode of the non-adapted method whenever the args are primitives to avoid a round of unnecessary boxing/unboxing.
   
   See `ClosureUtils$getMethod` for more information.
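   A rough sketch of the underlying mechanics, assuming plain ASM (Spark ships a shaded copy of ASM under a different package); the `lambdaBodyMethod` name is hypothetical and `ClosureUtils$getMethod` in this PR is the authoritative version:
   
   ```scala
   import java.lang.invoke.SerializedLambda
   
   import scala.collection.JavaConverters._
   
   import org.objectweb.asm.ClassReader
   import org.objectweb.asm.tree.{ClassNode, MethodNode}
   
   // A serializable Scala 2.12 closure exposes a writeReplace method that returns
   // a java.lang.invoke.SerializedLambda, which names the class and the method
   // that hold the lambda body.
   def lambdaBodyMethod(closure: AnyRef): MethodNode = {
     val writeReplace = closure.getClass.getDeclaredMethod("writeReplace")
     writeReplace.setAccessible(true)
     val lambda = writeReplace.invoke(closure).asInstanceOf[SerializedLambda]
   
     // Read the bytecode of the implementing class from the classpath and pick
     // the method whose name and descriptor match the lambda implementation.
     val classNode = new ClassNode()
     new ClassReader(lambda.getImplClass.replace('/', '.')).accept(classNode, 0)
     classNode.methods.asScala
       .find(m => m.name == lambda.getImplMethodName && m.desc == lambda.getImplMethodSignature)
       .getOrElse(throw new IllegalArgumentException("Could not find the lambda body method"))
   }
   ```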
   
   #### Step 2: Build a correct local variable array
   
   Once we have the bytecode of our closure, we need to build a local variable array that references the correct Catalyst expressions. This can be achieved by translating the deserializer of the typed operation. Deserializers define how data is converted from the Spark internal format into Java objects. Frequently, deserializers contain `StaticInvoke`, `Invoke`, and `NewInstance` expressions, which can be translated using the same algorithm.
   
   See `TypedOperations$convertDeserializer` for more information.
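   As a quick way to see what such a deserializer looks like (the class `Person` is just an example and the output is abbreviated):
   
   ```scala
   import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
   
   case class Person(name: String, age: Int)
   
   // The deserializer of the encoder is the expression tree described above: a
   // NewInstance over Invoke/StaticInvoke children that read the fields of the
   // input row. Translating it yields the initial value of the closure's
   // argument slot in the local variable array.
   println(ExpressionEncoder[Person]().deserializer)
   // prints, roughly: newInstance(class Person) with children reading the
   // name and age fields of the input row
   ```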
   
   #### Step 3: Create an operand stack
   
   The next step is to create an operand stack for storing partial Catalyst expressions. The current code uses `mutable.ArrayStack[Expression]` for this purpose.
   
   #### Step 4: Interpret instructions
   
   Once we have the bytecode, the array of local variables, and the operand stack, we can follow the bytecode instructions one by one and simulate what happens to the operand stack.
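   A minimal sketch of what the interpretation loop can look like for a handful of integer opcodes, assuming straight-line bytecode (no branches, method calls, or boxing) and plain ASM; the `interpret` helper is hypothetical, and the real code has to cover far more instructions:
   
   ```scala
   import scala.collection.mutable
   
   import org.apache.spark.sql.catalyst.expressions.{Add, Expression, Literal}
   import org.objectweb.asm.Opcodes._
   import org.objectweb.asm.tree.{AbstractInsnNode, InsnList, VarInsnNode}
   
   def interpret(instructions: InsnList, localVars: Array[Expression]): Expression = {
     val stack = mutable.ArrayStack[Expression]()
     var insn: AbstractInsnNode = instructions.getFirst
     while (insn != null) {
       insn.getOpcode match {
         case -1 =>
           // Labels, line numbers, and stack map frames carry no opcode; skip them.
         case ILOAD =>
           stack.push(localVars(insn.asInstanceOf[VarInsnNode].`var`)) // push a local variable
         case ICONST_1 =>
           stack.push(Literal(1))                                      // push the constant 1
         case IADD =>
           val right = stack.pop()
           val left = stack.pop()
           stack.push(Add(left, right))                                // value1 + value2
         case IRETURN =>
           return stack.pop()                                          // the derived expression
         case other =>
           throw new UnsupportedOperationException(s"Unsupported opcode: $other")
       }
       insn = insn.getNext
     }
     throw new IllegalStateException("Reached the end of the method without IRETURN")
   }
   ```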
   
   #### Step 5: Assign the result back to the expected attributes
   
   Once we have the result expression, we need to assign it to columns. At this point, the serializer is important as it contains information about the expected attributes and their data types.
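   For the simplest case of a single output attribute, the binding boils down to an alias that preserves the attribute's name and exprId (a hypothetical helper, not this PR's API):
   
   ```scala
   import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Expression, NamedExpression}
   
   // Give the translated result expression the name and exprId that downstream
   // operators already expect, so references to the old output still resolve.
   def bindToOutput(result: Expression, attr: Attribute): NamedExpression =
     Alias(result, attr.name)(exprId = attr.exprId)
   ```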
   
   ### Trying things out
   
   `DatasetBytecodeAnalysisBenchmark` and `BytecodeAnalysisSuite` show currently supported cases. The focus is on the conceptual approach and not on addressing every possible method/instruction.
   
   ### Open Questions
   
   - We need to discuss the scope of this work. As mentioned before, we can either translate closures into Catalyst expressions or just derive additional information about the content of closures (if that's useful enough).
   - We need to discuss the overall approach we want to take (e.g., bytecode interpretation, control-flow graphs, AST, Jimple).
   - We need to discuss which library (if any) to use.
   - We need to handle overly complicated result expressions, as they can slow down the computation. For example, we can introduce some thresholds. Apart from that, we can introduce more optimization rules to simplify expressions.
   - We need to ensure the conversion is lightweight and doesn't slow down the job planning time.
   - We need to handle cases when the conversion doesn't terminate (e.g., infinite recursion).
   - We need to ensure that edge cases work properly (e.g., null handling, exceptions, arithmetic expressions).
   - We need to decide how to properly handle control-flow instructions (e.g., if statements). The current code handles them via recursion and jumps.
