You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/29 19:26:35 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new pull request #2124: JIT-compille DataFusion expression with alias / column name #2123

Dandandan opened a new pull request #2124:
URL: https://github.com/apache/arrow-datafusion/pull/2124


   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/2123
   
    # Rationale for this change
    Allowing to compile more realistic expressions, like (a+1) as a step towards generating code to run on Arrays.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] yjshen commented on pull request #2124: JIT-compille DataFusion expression with column name

Posted by GitBox <gi...@apache.org>.

yjshen commented on pull request #2124:
URL: https://github.com/apache/arrow-datafusion/pull/2124#issuecomment-1082546362


   Thanks again @Dandandan for working on JIT! I hope I can have time to work with you on jit soon. 
   
   FWIW, one of our blockers is Cranelift has no ability to inline assembly, so it causes performance regression for columnar to row conversion (as recognized in https://github.com/apache/arrow-datafusion/pull/1975). A possible solution is to use LLVM instead of Cranelift because PostgreSQL and Impala have proved that inline is feasible in LLVM. I would pick it up if it's not taken already at the time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on pull request #2124: JIT-compille DataFusion expression with column name

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #2124:
URL: https://github.com/apache/arrow-datafusion/pull/2124#issuecomment-1082673469


   > Thanks again @Dandandan for working on JIT! I hope I can have time to work with you on jit soon. 
   > 
   > FWIW, one of our blockers is Cranelift has no ability to inline assembly, so it causes performance regression for columnar to row conversion (as recognized in https://github.com/apache/arrow-datafusion/pull/1975). A possible solution is to use LLVM instead of Cranelift because PostgreSQL and Impala have proved that inline is feasible in LLVM. I would pick it up if it's not taken already at the time.
   
   Cool, thanks for the context.
   
   I was thinking for generating the code for compiling expressions to cranelift jit, this would not be bad, as we could generate the whole loop instead to generate the new array contents which doesn't need to call any function. I am not sure if we run into the same problem as that thread, as it is about accessors for datastructures? It seems if we can operate on arrays with primitive datatypes this wouldn't be problematic?
   
   I was thinking of the following strategy:
   
   * Get the start address, array length and the bytes for the primitive type.
   * Pre-allocate an output array, pass the address
   * Use this info to generate the full loop where the loop uses the length, the assignment uses the addresses and increases the pointer with the number of bytes.
   
   Is this somewhere going to get stuck?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #2124: JIT-compille DataFusion expression with column name

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #2124:
URL: https://github.com/apache/arrow-datafusion/pull/2124#issuecomment-1083480888


   >  chaining expressions and avoid repeated array allocation through Jit？
   
   FWIW I think if we "JIT'd" the entire expression, an approach like https://github.com/jorgecarleitao/arrow2/issues/627  will likely not help all that much (as the temporaries will be in registers rather than being materialized in intermediate arrow arrays)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan edited a comment on pull request #2124: JIT-compille DataFusion expression with column name

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on pull request #2124:
URL: https://github.com/apache/arrow-datafusion/pull/2124#issuecomment-1083532239


   > I'm a bit confused here. Do you mean https://github.com/jorgecarleitao/arrow2/issues/627 chaining expressions and avoid repeated array allocation through Jit？
   
   The goal in that issue is similar yes, but here I suggest using generated code instead of reusing arrays.
   
   The idea is that we can compile the entire loop.
   The compiled expression `a + b` would roughly compile to something like the following pseudo code:
   
   ```
   i = 0
   while i < length {
       *item = *a + *b;
   
       a += size_a;
       b += size_b;
       item += size_item;
   }
   ```
   
   Here item is the pointer to items in the target array and a / b  are pointing to items in arrays a and b.
   In this case we only need to allocate one target array, instead of intermediate arrays.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on pull request #2124: JIT-compille DataFusion expression with column name

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #2124:
URL: https://github.com/apache/arrow-datafusion/pull/2124#issuecomment-1084366628


   > I think we are good as long as we don't invoke rust functions from the JIT'd code. I am curious if cranelift will be smart enough to auto vectorize the loop though :)
   > 
   > If we keep the code gen API similar, then when we switch to LLVM, the migration shouldn't be too hard.
   
   :100: Maybe it could event coexist together (both a Cranelift / LLVM backend).
   
   Also not sure on the autovectorization, though I am hopeful we might be able to instrument cranelift enough to have it compile to SIMD instructions (there are for example SIMD types https://docs.rs/cranelift/latest/cranelift/prelude/types/index.html).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan merged pull request #2124: JIT-compille DataFusion expression with column name

Posted by GitBox <gi...@apache.org>.

Dandandan merged pull request #2124:
URL: https://github.com/apache/arrow-datafusion/pull/2124


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on pull request #2124: JIT-compille DataFusion expression with column name

Posted by GitBox <gi...@apache.org>.

Dandandan commented on pull request #2124:
URL: https://github.com/apache/arrow-datafusion/pull/2124#issuecomment-1083532239


   > I'm a bit confused here. Do you mean https://github.com/jorgecarleitao/arrow2/issues/627 chaining expressions and avoid repeated array allocation through Jit？
   
   The goal in that issue is similar yes, but using generated code instead of reusing arrays.
   
   The idea is that we can compile the entire loop.
   The compiled expression `a + b` would roughly compile to something like the following pseudo code:
   
   ```
   i = 0
   while i < length {
       *item = *a + *b;
   
       a += size_a;
       b += size_b;
       item += size_item;
   }
   ```
   
   Here item is the pointer to items in the target array and a / b  are pointing to items in arrays a and b.
   In this case we only need to allocate one target array, instead of intermediate arrays.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on pull request #2124: JIT-compille DataFusion expression with column name

Posted by GitBox <gi...@apache.org>.

houqp commented on pull request #2124:
URL: https://github.com/apache/arrow-datafusion/pull/2124#issuecomment-1084085956


   I think we are good as long as we don't invoke rust functions from the JIT'd code. I am curious if cranelift will be smart enough to auto vectorize the loop though :)
   
   If we keep the code gen API similar, then when we switch to LLVM, the migration shouldn't be too hard.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] yjshen commented on pull request #2124: JIT-compille DataFusion expression with column name

Posted by GitBox <gi...@apache.org>.

yjshen commented on pull request #2124:
URL: https://github.com/apache/arrow-datafusion/pull/2124#issuecomment-1083434280


   I'm a bit confused here. Do you mean https://github.com/jorgecarleitao/arrow2/issues/627 chaining expressions and avoid repeated array allocation through Jit？


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan edited a comment on pull request #2124: JIT-compille DataFusion expression with column name

Posted by GitBox <gi...@apache.org>.

Dandandan edited a comment on pull request #2124:
URL: https://github.com/apache/arrow-datafusion/pull/2124#issuecomment-1084366628


   > I think we are good as long as we don't invoke rust functions from the JIT'd code. I am curious if cranelift will be smart enough to auto vectorize the loop though :)
   > 
   > If we keep the code gen API similar, then when we switch to LLVM, the migration shouldn't be too hard.
   
   :100: Maybe it could event coexist together (both a Cranelift / LLVM backend).
   
   Also not sure on the autovectorization, though I am hopeful we might be able to instrument cranelift enough to have it compile to SIMD instructions, given that we now more about the data types and code than a generic for-loop.
    (there are for example SIMD types https://docs.rs/cranelift/latest/cranelift/prelude/types/index.html).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #2124: JIT-compille DataFusion expression with column name

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #2124:
URL: https://github.com/apache/arrow-datafusion/pull/2124#issuecomment-1086610708


   I am sorry I am not following along this work this too carefully, but I really like where this is headed. JIT compiling predicate evaluation is definitely state of the art and I love to see it arriving in DataFusion. I think we can potentially get pretty far just JIT compiling the most common predicates even if for more complicated one we (initially) fall back to vectorized evaluation
   
   So thank you all for pushing this forward


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org