You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/06 12:38:18 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #2703: [EPIC] Full JIT support for `DataFusion`

alamb opened a new issue, #2703:
URL: https://github.com/apache/arrow-datafusion/issues/2703

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   TLDR: The benefit of this are:
   1. Speed up fundamentally row oriented operations like hash table lookup or comparisons (e.g. [#2427](https://github.com/apache/arrow-datafusion/issues/2427))
   2. (secondarily) Avoid unnecessary intermediate allocation that we can speed up complex / nested expressions 
   
   DataFusion, like many Arrow systems, is a classic "vectorized computation engine" which works quite well for many common operations. The following paper, gives a good treatment on the various tradeoffs between vectorized and JIT's compilation of query plans: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de
   
   As mentioned in the paper, there are some fundamentally "row oriented" operations in a database that are not typically amenable to vectorization. The "classics" are: Hash table updates in Joins and Hash Aggregates, as well as comparing tuples in sort.
   
   In addition, the datafusion `PhysicalExpr` and `arrow-rs` library currently evaluate expressions by "materializing intermediate results" -- for example `(a + b) + c` results in first evaluating `(a+b)` to a temporary location and then adding `c` to form the final result. Note however, there is a tradeoff here between the speed gained using the LLVM optimized vectorized kernels in `arrow-rs` and cranelift generated JIT expressions where JIT may not actually be faster. 
   
   Another example can be found in [these slides](https://github.com/apache/arrow-datafusion/files/8843957/Expression_evaluation.pdf) from [this presentation](https://docs.google.com/presentation/d/1owNlmpNpC2-eBd-jEYRCt0L8_sW4PWYx70gqnFC0twM/edit#slide=id.gba4e8a221c_0_125)
   
   @yjshen added initial support for JIT'ing in https://github.com/apache/arrow-datafusion/pull/1849 and it currently lives in https://github.com/apache/arrow-datafusion/tree/master/datafusion/jit. 
   
   This ticket aims to be a central location for tracking the status of JIT compiling expressions for anyone who wants to contribute to this effort
   
   
   **Describe the solution you'd like**
   
   We should be able to take in a collection of a `RecordBatch` / named `Array`s and compile an expression like `(a + b)/ 2` to a loop that results in a new `Array`.
   ```
   fn compile(schema: SchemaRef, expr: Expr) ->  CompiledFunction {
   
   }
   ```
   
   The loop itself also must be included in the to-be compiled expression, to remove call overhead and allow for possible use of SIMD instructions, either explicitly by instrumenting cranelift enough or through auto-vectorization.
   
   **Describe alternatives you've considered**
   n/a
   
   **Additional context**
   
   **Work Items**
   - [ ] https://github.com/apache/arrow-datafusion/pull/2587
   - [ ] https://github.com/apache/arrow-datafusion/issues/2122


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #2703: [EPIC] JIT support for `DataFusion`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2703:
URL: https://github.com/apache/arrow-datafusion/issues/2703#issuecomment-1147406507

   Actually, this is basically the same as https://github.com/apache/arrow-datafusion/issues/1861 so closing in favor of that ticket


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed issue #2703: [EPIC] JIT support for `DataFusion`

Posted by GitBox <gi...@apache.org>.
alamb closed issue #2703: [EPIC] JIT support for `DataFusion`
URL: https://github.com/apache/arrow-datafusion/issues/2703


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org