You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/22 14:36:15 UTC

[GitHub] [arrow-datafusion] waynexia opened a new pull request, #2587: Evaluate JIT'd expression over arrays

waynexia opened a new pull request, #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   This PR tries a way to perform JIT'd computation over Arrow array.
   
   As I understand, we have (at least) two ways to JIT the query execution. One is to glue all the low-level compute functions together (those in arrow compute kernel), and another is this PR, which tries to perform all the computation in the JIT engine.
   
   The first way is easier to implement (compared to the second one). And can get performance improvement from eliminated dispatch and branch. However, the second fully compiled way will take lots of effort as it requires a JIT version of compute kernel. [Gandiva](https://github.com/apache/arrow/tree/master/cpp/src/gandiva) in Arrow C++ is an LLVM-based compute kernel that might help, but I'm not very familiar with it. Whatever, being able to combine both ways should be a better situation :laughing:.
   
   Back to this PR, it will generate a loop like @Dandandan presented [here](https://github.com/apache/arrow-datafusion/pull/2124#issuecomment-1083532239). I haven't inspected whether the compiler will vectorize it. Currently, it only wraps over one `expr`, but we can explore the possibility to compile multiple `plan`s into one loop like [here](https://www.vldb.org/pvldb/vol4/p539-neumann.pdf). The row format for pipeline breaker is also significant to fully JIT.
   
   This PR only implements a very early stage "example" with many hard-code like types and fn sig. Please let me know what do you think of it!
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on a diff in pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#discussion_r879851451


##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),
+        |w| {
+            w.declare_as("offset", w.mul(w.id("index")?, w.lit_i(PTR_SIZE as i64))?)?;
+            for input in &input_names {
+                w.declare_as(
+                    format!("{}_ptr", input),
+                    w.add(w.id(format!("{}_array", input))?, w.id("offset")?)?,
+                )?;
+                w.declare_as(input, w.deref(w.id(format!("{}_ptr", input))?, I64)?)?;

Review Comment:
   Hmm, so the generated func only accepts i64 arrays?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#discussion_r879773733


##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),
+        |w| {
+            w.declare_as("offset", w.mul(w.id("index")?, w.lit_i(PTR_SIZE as i64))?)?;
+            for input in &input_names {
+                w.declare_as(
+                    format!("{}_ptr", input),
+                    w.add(w.id(format!("{}_array", input))?, w.id("offset")?)?,
+                )?;
+                w.declare_as(input, w.deref(w.id(format!("{}_ptr", input))?, I64)?)?;
+            }
+            w.declare_as("res_ptr", w.add(w.id("result")?, w.id("offset")?)?)?;
+            w.declare_as("res", jit_expr.clone())?;
+            w.store(w.id("res")?, w.id("res_ptr")?)?;
+
+            w.assign("index", w.add(w.id("index")?, w.lit_i(1))?)?;
+            Ok(())
+        },
+    )?;
+
+    let gen_func = fn_body.build();
+    Ok(gen_func)
+}
+
+#[cfg(test)]
+mod test {
+    use std::{collections::HashMap, sync::Arc};
+
+    use arrow::{
+        array::{Array, PrimitiveArray},
+        datatypes::{DataType, Int64Type},
+    };
+    use datafusion_common::{DFField, DFSchema, DataFusionError};
+    use datafusion_expr::Expr as DFExpr;
+
+    use crate::ast::BinaryExpr;
+
+    use super::*;
+
+    fn run_df_expr(
+        df_expr: DFExpr,
+        schema: Arc<DFSchema>,
+        lhs: PrimitiveArray<Int64Type>,
+        rhs: PrimitiveArray<Int64Type>,
+    ) -> Result<PrimitiveArray<Int64Type>> {
+        if lhs.null_count() != 0 || rhs.null_count() != 0 {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on nullable array not yet supported".to_string(),
+            ));
+        }
+        if lhs.len() != rhs.len() {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on different length arrays not yet supported".to_string(),
+            ));
+        }
+
+        // translate DF Expr to JIT Expr
+        let input_fields = schema.field_names();
+        let jit_expr: JITExpr = (df_expr, schema).try_into()?;
+
+        // allocate memory for calc result
+        let len = lhs.len();
+        let result = vec![0i64; len];
+
+        // compile and run JIT code
+        let assembler = Assembler::default();
+        let gen_func = build_calc_fn(&assembler, jit_expr, input_fields)?;
+        let mut jit = assembler.create_jit();
+        let code_ptr = jit.compile(gen_func)?;
+        let code_fn =
+            unsafe { core::mem::transmute::<_, fn(i64, i64, i64, i64) -> ()>(code_ptr) };
+        code_fn(
+            lhs.values().as_ptr() as i64,
+            rhs.values().as_ptr() as i64,
+            result.as_ptr() as i64,
+            len as i64,
+        );
+
+        let result_array = PrimitiveArray::<Int64Type>::from_iter(result);
+        Ok(result_array)
+    }
+
+    #[test]
+    fn array_add() {
+        let array_a: PrimitiveArray<Int64Type> =
+            PrimitiveArray::from_iter_values((0..10).map(|x| x + 1));
+        let array_b: PrimitiveArray<Int64Type> =
+            PrimitiveArray::from_iter_values((0..10).map(|x| x + 1));
+        let expected =
+            arrow::compute::kernels::arithmetic::add(&array_a, &array_b).unwrap();
+
+        let df_expr = datafusion_expr::col("a") + datafusion_expr::col("b");
+        let schema = Arc::new(
+            DFSchema::new_with_metadata(
+                vec![
+                    DFField::new(Some("table1"), "a", DataType::Int64, false),
+                    DFField::new(Some("table1"), "b", DataType::Int64, false),
+                ],
+                HashMap::new(),
+            )
+            .unwrap(),
+        );
+
+        let result = run_df_expr(df_expr, schema, array_a, array_b).unwrap();
+        assert_eq!(result, expected);

Review Comment:
   Whoo, really nice 🎉



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] waynexia commented on a diff in pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
waynexia commented on code in PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#discussion_r880084522


##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),
+        |w| {
+            w.declare_as("offset", w.mul(w.id("index")?, w.lit_i(PTR_SIZE as i64))?)?;
+            for input in &input_names {
+                w.declare_as(
+                    format!("{}_ptr", input),
+                    w.add(w.id(format!("{}_array", input))?, w.id("offset")?)?,
+                )?;
+                w.declare_as(input, w.deref(w.id(format!("{}_ptr", input))?, I64)?)?;

Review Comment:
   Changed to parameter 😉 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] waynexia commented on a diff in pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
waynexia commented on code in PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#discussion_r880054517


##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),
+        |w| {
+            w.declare_as("offset", w.mul(w.id("index")?, w.lit_i(PTR_SIZE as i64))?)?;
+            for input in &input_names {
+                w.declare_as(
+                    format!("{}_ptr", input),
+                    w.add(w.id(format!("{}_array", input))?, w.id("offset")?)?,
+                )?;
+                w.declare_as(input, w.deref(w.id(format!("{}_ptr", input))?, I64)?)?;
+            }
+            w.declare_as("res_ptr", w.add(w.id("result")?, w.id("offset")?)?)?;
+            w.declare_as("res", jit_expr.clone())?;
+            w.store(w.id("res")?, w.id("res_ptr")?)?;
+
+            w.assign("index", w.add(w.id("index")?, w.lit_i(1))?)?;
+            Ok(())
+        },
+    )?;
+
+    let gen_func = fn_body.build();
+    Ok(gen_func)
+}
+
+#[cfg(test)]
+mod test {
+    use std::{collections::HashMap, sync::Arc};
+
+    use arrow::{
+        array::{Array, PrimitiveArray},
+        datatypes::{DataType, Int64Type},
+    };
+    use datafusion_common::{DFField, DFSchema, DataFusionError};
+    use datafusion_expr::Expr as DFExpr;
+
+    use crate::ast::BinaryExpr;
+
+    use super::*;
+
+    fn run_df_expr(
+        df_expr: DFExpr,
+        schema: Arc<DFSchema>,
+        lhs: PrimitiveArray<Int64Type>,
+        rhs: PrimitiveArray<Int64Type>,
+    ) -> Result<PrimitiveArray<Int64Type>> {
+        if lhs.null_count() != 0 || rhs.null_count() != 0 {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on nullable array not yet supported".to_string(),
+            ));
+        }
+        if lhs.len() != rhs.len() {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on different length arrays not yet supported".to_string(),
+            ));
+        }
+
+        // translate DF Expr to JIT Expr
+        let input_fields = schema.field_names();
+        let jit_expr: JITExpr = (df_expr, schema).try_into()?;
+
+        // allocate memory for calc result
+        let len = lhs.len();
+        let result = vec![0i64; len];
+
+        // compile and run JIT code
+        let assembler = Assembler::default();
+        let gen_func = build_calc_fn(&assembler, jit_expr, input_fields)?;
+        let mut jit = assembler.create_jit();
+        let code_ptr = jit.compile(gen_func)?;
+        let code_fn =
+            unsafe { core::mem::transmute::<_, fn(i64, i64, i64, i64) -> ()>(code_ptr) };

Review Comment:
   Good suggestion 👍 
   
   > Extremely unsafe but feels very powerful
   
   Same!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] waynexia commented on a diff in pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
waynexia commented on code in PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#discussion_r880053159


##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),

Review Comment:
   The generated code is working on pointer directly, it might be hard to do these check inside it. I check inputs' length before pass them to generated code at [here](https://github.com/apache/arrow-datafusion/pull/2587/files#diff-4d319f92665cd672813b17b0693fad4817830e8cd19c719ec738d28fceee642aR92-R96). 
   ```rust
           if lhs.len() != rhs.len() {
               return Err(DataFusionError::NotImplemented(
                   "Computing on different length arrays not yet supported".to_string(),
               ));
           }
   ```
   But I agree that we should consider how to improve safety of generated code when its logic get complicated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#issuecomment-1135022377

   I think it is important to understand what cranelift is, and what it isn't. Cranelift is a code generator originally intended to take optimised WASM and convert it to native code. It is **not** an optimising compiler like LLVM.
   
   I could see it being very well suited for doing runtime monomorphisation, i.e. removing conditional branches. I think it will struggle to out-perform the kernels in arrow-rs, some of which are hand-rolled and all of which benefit from LLVM compilation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#discussion_r879772780


##########
datafusion/jit/src/api.rs:
##########
@@ -604,6 +606,15 @@ impl<'a> CodeBlock<'a> {
             internal_err!("No func with the name {} exist", fn_name)
         }
     }
+
+    pub fn deref(&self, ptr: Expr, ty: JITType) -> Result<Expr> {
+        Ok(Expr::Deref(Box::new(ptr), ty))
+    }
+
+    pub fn store(&mut self, value: Expr, ptr: Expr) -> Result<()> {

Review Comment:
   ```suggestion
       /// Store the value in `value` to the address in `ptr`
       pub fn store(&mut self, value: Expr, ptr: Expr) -> Result<()> {
   ```



##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),
+        |w| {
+            w.declare_as("offset", w.mul(w.id("index")?, w.lit_i(PTR_SIZE as i64))?)?;
+            for input in &input_names {
+                w.declare_as(
+                    format!("{}_ptr", input),
+                    w.add(w.id(format!("{}_array", input))?, w.id("offset")?)?,
+                )?;
+                w.declare_as(input, w.deref(w.id(format!("{}_ptr", input))?, I64)?)?;
+            }
+            w.declare_as("res_ptr", w.add(w.id("result")?, w.id("offset")?)?)?;
+            w.declare_as("res", jit_expr.clone())?;
+            w.store(w.id("res")?, w.id("res_ptr")?)?;
+
+            w.assign("index", w.add(w.id("index")?, w.lit_i(1))?)?;
+            Ok(())
+        },
+    )?;
+
+    let gen_func = fn_body.build();
+    Ok(gen_func)
+}
+
+#[cfg(test)]
+mod test {
+    use std::{collections::HashMap, sync::Arc};
+
+    use arrow::{
+        array::{Array, PrimitiveArray},
+        datatypes::{DataType, Int64Type},
+    };
+    use datafusion_common::{DFField, DFSchema, DataFusionError};
+    use datafusion_expr::Expr as DFExpr;
+
+    use crate::ast::BinaryExpr;
+
+    use super::*;
+
+    fn run_df_expr(

Review Comment:
   In the longer term I would like to see this type of logic encapsulated somehow
   
   So we would have a function or struct that took an `Expr` and several `ArrayRefs` and then called a JIT or non-JIT version of evaluation depending on flags or options. 



##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),
+        |w| {
+            w.declare_as("offset", w.mul(w.id("index")?, w.lit_i(PTR_SIZE as i64))?)?;
+            for input in &input_names {
+                w.declare_as(
+                    format!("{}_ptr", input),
+                    w.add(w.id(format!("{}_array", input))?, w.id("offset")?)?,
+                )?;
+                w.declare_as(input, w.deref(w.id(format!("{}_ptr", input))?, I64)?)?;
+            }
+            w.declare_as("res_ptr", w.add(w.id("result")?, w.id("offset")?)?)?;
+            w.declare_as("res", jit_expr.clone())?;
+            w.store(w.id("res")?, w.id("res_ptr")?)?;
+
+            w.assign("index", w.add(w.id("index")?, w.lit_i(1))?)?;
+            Ok(())
+        },
+    )?;
+
+    let gen_func = fn_body.build();
+    Ok(gen_func)
+}
+
+#[cfg(test)]
+mod test {
+    use std::{collections::HashMap, sync::Arc};
+
+    use arrow::{
+        array::{Array, PrimitiveArray},
+        datatypes::{DataType, Int64Type},
+    };
+    use datafusion_common::{DFField, DFSchema, DataFusionError};
+    use datafusion_expr::Expr as DFExpr;
+
+    use crate::ast::BinaryExpr;
+
+    use super::*;
+
+    fn run_df_expr(
+        df_expr: DFExpr,
+        schema: Arc<DFSchema>,
+        lhs: PrimitiveArray<Int64Type>,
+        rhs: PrimitiveArray<Int64Type>,
+    ) -> Result<PrimitiveArray<Int64Type>> {
+        if lhs.null_count() != 0 || rhs.null_count() != 0 {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on nullable array not yet supported".to_string(),
+            ));
+        }
+        if lhs.len() != rhs.len() {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on different length arrays not yet supported".to_string(),
+            ));
+        }
+
+        // translate DF Expr to JIT Expr
+        let input_fields = schema.field_names();
+        let jit_expr: JITExpr = (df_expr, schema).try_into()?;
+
+        // allocate memory for calc result
+        let len = lhs.len();
+        let result = vec![0i64; len];
+
+        // compile and run JIT code
+        let assembler = Assembler::default();
+        let gen_func = build_calc_fn(&assembler, jit_expr, input_fields)?;
+        let mut jit = assembler.create_jit();
+        let code_ptr = jit.compile(gen_func)?;
+        let code_fn =
+            unsafe { core::mem::transmute::<_, fn(i64, i64, i64, i64) -> ()>(code_ptr) };

Review Comment:
   I wonder why not cast to the types you really want, like `fn(*i64, *i64, *i64, i64)` and then you can avoid the `as i64` stuff below



##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),
+        |w| {
+            w.declare_as("offset", w.mul(w.id("index")?, w.lit_i(PTR_SIZE as i64))?)?;
+            for input in &input_names {
+                w.declare_as(
+                    format!("{}_ptr", input),
+                    w.add(w.id(format!("{}_array", input))?, w.id("offset")?)?,
+                )?;
+                w.declare_as(input, w.deref(w.id(format!("{}_ptr", input))?, I64)?)?;
+            }
+            w.declare_as("res_ptr", w.add(w.id("result")?, w.id("offset")?)?)?;
+            w.declare_as("res", jit_expr.clone())?;
+            w.store(w.id("res")?, w.id("res_ptr")?)?;
+
+            w.assign("index", w.add(w.id("index")?, w.lit_i(1))?)?;
+            Ok(())
+        },
+    )?;
+
+    let gen_func = fn_body.build();
+    Ok(gen_func)
+}
+
+#[cfg(test)]
+mod test {
+    use std::{collections::HashMap, sync::Arc};
+
+    use arrow::{
+        array::{Array, PrimitiveArray},
+        datatypes::{DataType, Int64Type},
+    };
+    use datafusion_common::{DFField, DFSchema, DataFusionError};
+    use datafusion_expr::Expr as DFExpr;
+
+    use crate::ast::BinaryExpr;
+
+    use super::*;
+
+    fn run_df_expr(
+        df_expr: DFExpr,
+        schema: Arc<DFSchema>,
+        lhs: PrimitiveArray<Int64Type>,
+        rhs: PrimitiveArray<Int64Type>,
+    ) -> Result<PrimitiveArray<Int64Type>> {
+        if lhs.null_count() != 0 || rhs.null_count() != 0 {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on nullable array not yet supported".to_string(),
+            ));
+        }
+        if lhs.len() != rhs.len() {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on different length arrays not yet supported".to_string(),
+            ));
+        }
+
+        // translate DF Expr to JIT Expr
+        let input_fields = schema.field_names();
+        let jit_expr: JITExpr = (df_expr, schema).try_into()?;
+
+        // allocate memory for calc result
+        let len = lhs.len();
+        let result = vec![0i64; len];
+
+        // compile and run JIT code
+        let assembler = Assembler::default();
+        let gen_func = build_calc_fn(&assembler, jit_expr, input_fields)?;
+        let mut jit = assembler.create_jit();
+        let code_ptr = jit.compile(gen_func)?;
+        let code_fn =
+            unsafe { core::mem::transmute::<_, fn(i64, i64, i64, i64) -> ()>(code_ptr) };
+        code_fn(
+            lhs.values().as_ptr() as i64,
+            rhs.values().as_ptr() as i64,
+            result.as_ptr() as i64,
+            len as i64,
+        );
+
+        let result_array = PrimitiveArray::<Int64Type>::from_iter(result);
+        Ok(result_array)
+    }
+
+    #[test]
+    fn array_add() {
+        let array_a: PrimitiveArray<Int64Type> =

Review Comment:
   I recommend using different values for `array_a` and `array_b` so issues in argument handling would be evident
   
   Like maybe
   
   ```rust
           let array_b: PrimitiveArray<Int64Type> =
               PrimitiveArray::from_iter_values((10..20).map(|x| x + 1));
   ```



##########
datafusion/jit/src/api.rs:
##########
@@ -604,6 +606,15 @@ impl<'a> CodeBlock<'a> {
             internal_err!("No func with the name {} exist", fn_name)
         }
     }
+
+    pub fn deref(&self, ptr: Expr, ty: JITType) -> Result<Expr> {

Review Comment:
   What do you think about adding docstrings? 
   
   ```suggestion
       /// Return the value pointed to by the ptr stored in `ptr`
       pub fn deref(&self, ptr: Expr, ty: JITType) -> Result<Expr> {
   ```
   



##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),
+        |w| {
+            w.declare_as("offset", w.mul(w.id("index")?, w.lit_i(PTR_SIZE as i64))?)?;
+            for input in &input_names {
+                w.declare_as(
+                    format!("{}_ptr", input),
+                    w.add(w.id(format!("{}_array", input))?, w.id("offset")?)?,
+                )?;
+                w.declare_as(input, w.deref(w.id(format!("{}_ptr", input))?, I64)?)?;
+            }
+            w.declare_as("res_ptr", w.add(w.id("result")?, w.id("offset")?)?)?;
+            w.declare_as("res", jit_expr.clone())?;
+            w.store(w.id("res")?, w.id("res_ptr")?)?;
+
+            w.assign("index", w.add(w.id("index")?, w.lit_i(1))?)?;
+            Ok(())
+        },
+    )?;
+
+    let gen_func = fn_body.build();
+    Ok(gen_func)
+}
+
+#[cfg(test)]
+mod test {
+    use std::{collections::HashMap, sync::Arc};
+
+    use arrow::{
+        array::{Array, PrimitiveArray},
+        datatypes::{DataType, Int64Type},
+    };
+    use datafusion_common::{DFField, DFSchema, DataFusionError};
+    use datafusion_expr::Expr as DFExpr;
+
+    use crate::ast::BinaryExpr;
+
+    use super::*;
+
+    fn run_df_expr(
+        df_expr: DFExpr,
+        schema: Arc<DFSchema>,
+        lhs: PrimitiveArray<Int64Type>,
+        rhs: PrimitiveArray<Int64Type>,
+    ) -> Result<PrimitiveArray<Int64Type>> {
+        if lhs.null_count() != 0 || rhs.null_count() != 0 {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on nullable array not yet supported".to_string(),
+            ));
+        }
+        if lhs.len() != rhs.len() {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on different length arrays not yet supported".to_string(),
+            ));
+        }
+
+        // translate DF Expr to JIT Expr
+        let input_fields = schema.field_names();
+        let jit_expr: JITExpr = (df_expr, schema).try_into()?;
+
+        // allocate memory for calc result
+        let len = lhs.len();
+        let result = vec![0i64; len];
+
+        // compile and run JIT code
+        let assembler = Assembler::default();
+        let gen_func = build_calc_fn(&assembler, jit_expr, input_fields)?;
+        let mut jit = assembler.create_jit();
+        let code_ptr = jit.compile(gen_func)?;
+        let code_fn =
+            unsafe { core::mem::transmute::<_, fn(i64, i64, i64, i64) -> ()>(code_ptr) };
+        code_fn(
+            lhs.values().as_ptr() as i64,
+            rhs.values().as_ptr() as i64,
+            result.as_ptr() as i64,
+            len as i64,
+        );
+
+        let result_array = PrimitiveArray::<Int64Type>::from_iter(result);
+        Ok(result_array)
+    }
+
+    #[test]
+    fn array_add() {
+        let array_a: PrimitiveArray<Int64Type> =
+            PrimitiveArray::from_iter_values((0..10).map(|x| x + 1));
+        let array_b: PrimitiveArray<Int64Type> =
+            PrimitiveArray::from_iter_values((0..10).map(|x| x + 1));
+        let expected =
+            arrow::compute::kernels::arithmetic::add(&array_a, &array_b).unwrap();
+
+        let df_expr = datafusion_expr::col("a") + datafusion_expr::col("b");
+        let schema = Arc::new(
+            DFSchema::new_with_metadata(
+                vec![
+                    DFField::new(Some("table1"), "a", DataType::Int64, false),
+                    DFField::new(Some("table1"), "b", DataType::Int64, false),
+                ],
+                HashMap::new(),
+            )
+            .unwrap(),
+        );
+
+        let result = run_df_expr(df_expr, schema, array_a, array_b).unwrap();
+        assert_eq!(result, expected);
+    }
+
+    #[test]
+    fn calc_fn_builder() {
+        let expr = JITExpr::Binary(BinaryExpr::Add(
+            Box::new(JITExpr::Identifier("table1.a".to_string(), I64)),
+            Box::new(JITExpr::Identifier("table1.b".to_string(), I64)),
+        ));
+        let fields = vec!["table1.a".to_string(), "table1.b".to_string()];
+
+        let expected = r#"fn calc_fn_0(table1.a_array: i64, table1.b_array: i64, result: i64, len: i64) -> () {
+    let index: i64;
+    index = 0;
+    while index < len {

Review Comment:
   I looked at this code and it looks 👍  to me



##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),
+        |w| {
+            w.declare_as("offset", w.mul(w.id("index")?, w.lit_i(PTR_SIZE as i64))?)?;
+            for input in &input_names {
+                w.declare_as(
+                    format!("{}_ptr", input),
+                    w.add(w.id(format!("{}_array", input))?, w.id("offset")?)?,
+                )?;
+                w.declare_as(input, w.deref(w.id(format!("{}_ptr", input))?, I64)?)?;
+            }
+            w.declare_as("res_ptr", w.add(w.id("result")?, w.id("offset")?)?)?;
+            w.declare_as("res", jit_expr.clone())?;
+            w.store(w.id("res")?, w.id("res_ptr")?)?;
+
+            w.assign("index", w.add(w.id("index")?, w.lit_i(1))?)?;
+            Ok(())
+        },
+    )?;
+
+    let gen_func = fn_body.build();
+    Ok(gen_func)
+}
+
+#[cfg(test)]
+mod test {
+    use std::{collections::HashMap, sync::Arc};
+
+    use arrow::{
+        array::{Array, PrimitiveArray},
+        datatypes::{DataType, Int64Type},
+    };
+    use datafusion_common::{DFField, DFSchema, DataFusionError};
+    use datafusion_expr::Expr as DFExpr;
+
+    use crate::ast::BinaryExpr;
+
+    use super::*;
+
+    fn run_df_expr(
+        df_expr: DFExpr,
+        schema: Arc<DFSchema>,
+        lhs: PrimitiveArray<Int64Type>,
+        rhs: PrimitiveArray<Int64Type>,
+    ) -> Result<PrimitiveArray<Int64Type>> {
+        if lhs.null_count() != 0 || rhs.null_count() != 0 {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on nullable array not yet supported".to_string(),
+            ));
+        }
+        if lhs.len() != rhs.len() {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on different length arrays not yet supported".to_string(),
+            ));
+        }
+
+        // translate DF Expr to JIT Expr
+        let input_fields = schema.field_names();
+        let jit_expr: JITExpr = (df_expr, schema).try_into()?;
+
+        // allocate memory for calc result
+        let len = lhs.len();
+        let result = vec![0i64; len];
+
+        // compile and run JIT code
+        let assembler = Assembler::default();
+        let gen_func = build_calc_fn(&assembler, jit_expr, input_fields)?;
+        let mut jit = assembler.create_jit();
+        let code_ptr = jit.compile(gen_func)?;
+        let code_fn =
+            unsafe { core::mem::transmute::<_, fn(i64, i64, i64, i64) -> ()>(code_ptr) };

Review Comment:
   code like this is some of my favorite ❤️  -- cast some memory to a function pointer and call it ;)
   
   Extremely unsafe but feels very powerful
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#discussion_r880579996


##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),

Review Comment:
   I think in general the idea is that all the safety checks are done during JIT generation as suggested by @waynexia 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#issuecomment-1133954002

   This looks very cool @waynexia  -- I will give it a good look tomorrow. 
   
   > The first way is easier to implement (compared to the second one). And can get performance improvement from eliminated dispatch and branch. However, the second fully compiled way will take lots of effort as it requires a JIT version of compute kernel
   
   The other reason that JIT execution for DataFusion is interesting is due to a few operations which are fundamentally row-oriented (and thus not amenable to vectorized execution), the key being a multi-tuple comparison (not just equality) which appears in sorting, grouping, and joining)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] waynexia commented on pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
waynexia commented on PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#issuecomment-1135469545

   Thanks for all the reviews and information!
   
   >I think it is important to understand what cranelift is, and what it isn't. Cranelift is a code generator originally intended to take optimised WASM and convert it to native code. It is not an optimising compiler like LLVM.
   
   That makes sense. I could see a long way to implement this JIT framework and (another long way) to make it outperform the existing interpreter executor 🤣
   
   >I wonder why you chose the name deref rather than store? The underlying cranelift library seems to use load and store which I think are the more common terms in compilers for what Rust terms deref
   
   I also have hesitated at the naming, but have now changed it to `load()`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on a diff in pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#discussion_r879852643


##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),

Review Comment:
   Hmm, do we need sanity check for equal array lengths?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#issuecomment-1136007628

   I also agree with @tustvold  that in general it will be very hard to beat the performance of using the optimized, vectorized kernels in arrow-rs and in general I think we should continue to use them whenever possible.
   
   As I mentioned in https://github.com/apache/arrow-datafusion/pull/2587#issuecomment-1133954002 I think the super valuable usecase for JIT'ed Exprs is for operations that can't easily (or at all) be vectorized, namely those doing multi column comparisons during Merge / Group 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
alamb merged PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] waynexia commented on pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
waynexia commented on PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#issuecomment-1134774026

   I got this from compiled function:
   ```
   function u0:0(i64, i64, i64, i64) system_v {
   block0(v0: i64, v1: i64, v2: i64, v3: i64):
       v11 -> v0
       v14 -> v1
       v17 -> v2
       v6 -> v3
       v4 = iconst.i64 0
       jump block1(v4)
   
   block1(v5: i64):
       v7 = icmp slt v5, v6
       v8 = bint.i64 v7
       brz v8, block3
       jump block2
   
   block2:
       v9 = iconst.i64 8
       v10 = imul.i64 v5, v9
       v12 = iadd.i64 v11, v10
       v13 = load.i64 v12
       v15 = iadd.i64 v14, v10
       v16 = load.i64 v15
       v18 = iadd.i64 v17, v10
       v19 = iadd v13, v16
       store v19, v18
       v20 = iconst.i64 1
       v21 = iadd.i64 v5, v20
       jump block1(v21)
   
   block3:
       return
   }
   ```
   
   It just looks like a bare translation of what we build. So I suspect the vectorization is not done here (after translation). Further, I find this [`I64X8` type](https://docs.rs/cranelift-codegen/latest/cranelift_codegen/ir/types/constant.I64X8.html) from the document (we are currently using [`I64`](https://docs.rs/cranelift-codegen/latest/cranelift_codegen/ir/types/constant.I64.html)). Perhaps this means that we need to manually vectorize our computation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#issuecomment-1135017030

   > It just looks like a bare translation of what we build. So I suspect the vectorization is not done here (after translation). Further, I find this [I64X8 type](https://docs.rs/cranelift-codegen/latest/cranelift_codegen/ir/types/constant.I64X8.html) from the document (we are currently using [I64](https://docs.rs/cranelift-codegen/latest/cranelift_codegen/ir/types/constant.I64.html)). Perhaps this means that we need to manually vectorize our computation.
   
   I believe this is a known limitation with cranelift -- we can also potentially consider using llvm in the future, but that would likely result in longer query planning times. I believe @tustvold  has thought about this as well.
   
   Writing some sort of basic vectorization optimizations in cranelift (or in datafusion) is also a possbility


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] waynexia commented on a diff in pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
waynexia commented on code in PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#discussion_r880053266


##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),
+        |w| {
+            w.declare_as("offset", w.mul(w.id("index")?, w.lit_i(PTR_SIZE as i64))?)?;
+            for input in &input_names {
+                w.declare_as(
+                    format!("{}_ptr", input),
+                    w.add(w.id(format!("{}_array", input))?, w.id("offset")?)?,
+                )?;
+                w.declare_as(input, w.deref(w.id(format!("{}_ptr", input))?, I64)?)?;
+            }
+            w.declare_as("res_ptr", w.add(w.id("result")?, w.id("offset")?)?)?;
+            w.declare_as("res", jit_expr.clone())?;
+            w.store(w.id("res")?, w.id("res_ptr")?)?;
+
+            w.assign("index", w.add(w.id("index")?, w.lit_i(1))?)?;
+            Ok(())
+        },
+    )?;
+
+    let gen_func = fn_body.build();
+    Ok(gen_func)
+}
+
+#[cfg(test)]
+mod test {
+    use std::{collections::HashMap, sync::Arc};
+
+    use arrow::{
+        array::{Array, PrimitiveArray},
+        datatypes::{DataType, Int64Type},
+    };
+    use datafusion_common::{DFField, DFSchema, DataFusionError};
+    use datafusion_expr::Expr as DFExpr;
+
+    use crate::ast::BinaryExpr;
+
+    use super::*;
+
+    fn run_df_expr(
+        df_expr: DFExpr,
+        schema: Arc<DFSchema>,
+        lhs: PrimitiveArray<Int64Type>,
+        rhs: PrimitiveArray<Int64Type>,
+    ) -> Result<PrimitiveArray<Int64Type>> {
+        if lhs.null_count() != 0 || rhs.null_count() != 0 {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on nullable array not yet supported".to_string(),
+            ));
+        }
+        if lhs.len() != rhs.len() {
+            return Err(DataFusionError::NotImplemented(
+                "Computing on different length arrays not yet supported".to_string(),
+            ));
+        }
+
+        // translate DF Expr to JIT Expr
+        let input_fields = schema.field_names();
+        let jit_expr: JITExpr = (df_expr, schema).try_into()?;
+
+        // allocate memory for calc result
+        let len = lhs.len();
+        let result = vec![0i64; len];
+
+        // compile and run JIT code
+        let assembler = Assembler::default();
+        let gen_func = build_calc_fn(&assembler, jit_expr, input_fields)?;
+        let mut jit = assembler.create_jit();
+        let code_ptr = jit.compile(gen_func)?;
+        let code_fn =
+            unsafe { core::mem::transmute::<_, fn(i64, i64, i64, i64) -> ()>(code_ptr) };
+        code_fn(
+            lhs.values().as_ptr() as i64,
+            rhs.values().as_ptr() as i64,
+            result.as_ptr() as i64,
+            len as i64,
+        );
+
+        let result_array = PrimitiveArray::<Int64Type>::from_iter(result);
+        Ok(result_array)
+    }
+
+    #[test]
+    fn array_add() {
+        let array_a: PrimitiveArray<Int64Type> =

Review Comment:
   Nice catch 😆 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] waynexia commented on a diff in pull request #2587: Evaluate JIT'd expression over arrays

Posted by GitBox <gi...@apache.org>.
waynexia commented on code in PR #2587:
URL: https://github.com/apache/arrow-datafusion/pull/2587#discussion_r880063760


##########
datafusion/jit/src/compile.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Compile DataFusion Expr to JIT'd function.
+
+use datafusion_common::Result;
+
+use crate::api::Assembler;
+use crate::{
+    api::GeneratedFunction,
+    ast::{Expr as JITExpr, I64, PTR_SIZE},
+};
+
+/// Wrap JIT Expr to array compute function.
+pub fn build_calc_fn(
+    assembler: &Assembler,
+    jit_expr: JITExpr,
+    input_names: Vec<String>,
+) -> Result<GeneratedFunction> {
+    let mut builder = assembler.new_func_builder("calc_fn");
+    for input in &input_names {
+        builder = builder.param(format!("{}_array", input), I64);
+    }
+    let mut builder = builder.param("result", I64).param("len", I64);
+
+    let mut fn_body = builder.enter_block();
+    fn_body.declare_as("index", fn_body.lit_i(0))?;
+    fn_body.while_block(
+        |cond| cond.lt(cond.id("index")?, cond.id("len")?),
+        |w| {
+            w.declare_as("offset", w.mul(w.id("index")?, w.lit_i(PTR_SIZE as i64))?)?;
+            for input in &input_names {
+                w.declare_as(
+                    format!("{}_ptr", input),
+                    w.add(w.id(format!("{}_array", input))?, w.id("offset")?)?,
+                )?;
+                w.declare_as(input, w.deref(w.id(format!("{}_ptr", input))?, I64)?)?;
+            }
+            w.declare_as("res_ptr", w.add(w.id("result")?, w.id("offset")?)?)?;
+            w.declare_as("res", jit_expr.clone())?;
+            w.store(w.id("res")?, w.id("res_ptr")?)?;
+
+            w.assign("index", w.add(w.id("index")?, w.lit_i(1))?)?;
+            Ok(())
+        },
+    )?;
+
+    let gen_func = fn_body.build();
+    Ok(gen_func)
+}
+
+#[cfg(test)]
+mod test {
+    use std::{collections::HashMap, sync::Arc};
+
+    use arrow::{
+        array::{Array, PrimitiveArray},
+        datatypes::{DataType, Int64Type},
+    };
+    use datafusion_common::{DFField, DFSchema, DataFusionError};
+    use datafusion_expr::Expr as DFExpr;
+
+    use crate::ast::BinaryExpr;
+
+    use super::*;
+
+    fn run_df_expr(

Review Comment:
   Definitely! 
   
   I'm going to figure out how to play with var length param list and type casting, and then encapsulate a plan node level interface (maybe take `LogicalPlan` and data as input, haven't thought in detail) as next step.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org