You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/28 23:30:49 UTC

[GitHub] [arrow] seddonm1 opened a new pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

seddonm1 opened a new pull request #9038:
URL: https://github.com/apache/arrow/pull/9038


   This PR is a work-in-progress simple implementation of `InList` (`'ABC' IN ('ABC', 'DEF')`) which currently only operates on strings.
   
   It uses the `kernels::comparison::contains` implementation but there are a few issues I am struggling with:
   
   1. `kernels::comparison::contains` allows each value in the input array to match against potentially different value arrays. My implementation is very inefficiently creating the same array n times to prevent the error of mismatched input lengths (https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs#L696). Is there a more efficient way to create these `ListArray`s?
   
   2. `kernels::comparison::contains` returns `false` if either of the comparison values is `null`. Is this the desired behavior? If not I can modify the kernel to return null instead.
   
   3. If the basic implementation looks correct I can add the rest of the data types (via macros).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r552947771



##########
File path: rust/datafusion/src/optimizer/utils.rs
##########
@@ -416,6 +424,7 @@ pub fn rewrite_expression(expr: &Expr, expressions: &Vec<Expr>) -> Result<Expr>
                 Ok(expr)
             }
         }
+        Expr::InList { .. } => Ok(expr.clone()),

Review comment:
       Here this is just cloning the while `InList` expression (not the `expr` in `InList`) as the optimiser is not doing anything for this Expression yet.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-754223732


   Ok, I have done a major refactor against a rebased master.
   
   I believe this now meets the ANSI behavior with regard to `NULL` handling but it does not yet support syntax where columns are referenced in the `list` parameter like:
   
   ```sql
   SELECT TRUE IN (col1, col2, FALSE)
   ```
   
   This has been implemented with a "make it work then make it fast" approach as this `InList` expression actually has a lot of complexity.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753241531


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=h1) Report
   > Merging [#9038](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=desc) (e4f69e6) into [master](https://codecov.io/gh/apache/arrow/commit/5db1d2aad0b6aaca6dfe4e01b0afeefb0311d109?el=desc) (5db1d2a) will **decrease** coverage by `0.04%`.
   > The diff coverage is `78.04%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9038/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9038      +/-   ##
   ==========================================
   - Coverage   82.60%   82.55%   -0.05%     
   ==========================================
     Files         204      204              
     Lines       50200    50664     +464     
   ==========================================
   + Hits        41467    41827     +360     
   - Misses       8733     8837     +104     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvdXRpbHMucnM=) | `58.18% <45.45%> (-0.54%)` | :arrow_down: |
   | [rust/datafusion/src/sql/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvdXRpbHMucnM=) | `53.92% <47.05%> (-0.68%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.92% <77.27%> (+0.02%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `83.77% <78.53%> (-0.71%)` | :arrow_down: |
   | [rust/datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGxhbm5lci5ycw==) | `84.15% <81.81%> (+0.17%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=) | `78.63% <82.53%> (+0.74%)` | :arrow_up: |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.82% <100.00%> (-0.01%)` | :arrow_down: |
   | [rust/arrow/src/array/equal/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvdXRpbHMucnM=) | `73.78% <0.00%> (-11.22%)` | :arrow_down: |
   | [rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=) | `71.69% <0.00%> (-3.75%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=) | `84.61% <0.00%> (-0.27%)` | :arrow_down: |
   | ... and [19 more](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=footer). Last update [5db1d2a...9e8f888](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-756245785


   I also changed the title of this PR so that it doesn't say "WIP" anymore -- as I don't think it is WIP


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-756716167


   Thanks again @seddonm1 .


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753241531


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=h1) Report
   > Merging [#9038](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=desc) (96cd76d) into [master](https://codecov.io/gh/apache/arrow/commit/98159f18dae0fdccfa967bfc452966a054e41cce?el=desc) (98159f1) will **decrease** coverage by `0.03%`.
   > The diff coverage is `77.75%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9038/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9038      +/-   ##
   ==========================================
   - Coverage   82.60%   82.57%   -0.04%     
   ==========================================
     Files         204      204              
     Lines       50496    50879     +383     
   ==========================================
   + Hits        41713    42013     +300     
   - Misses       8783     8866      +83     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow-pyarrow-integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy1weWFycm93LWludGVncmF0aW9uLXRlc3Rpbmcvc3JjL2xpYi5ycw==) | `0.00% <ø> (ø)` | |
   | [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `7.02% <ø> (ø)` | |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `87.50% <ø> (ø)` | |
   | [rust/datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvdXRpbHMucnM=) | `58.18% <45.45%> (-0.54%)` | :arrow_down: |
   | [rust/datafusion/src/sql/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvdXRpbHMucnM=) | `53.92% <47.05%> (-0.68%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `78.51% <63.63%> (-1.49%)` | :arrow_down: |
   | [rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=) | `93.56% <68.62%> (+0.14%)` | :arrow_up: |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.92% <77.27%> (+0.02%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `83.77% <78.53%> (-0.71%)` | :arrow_down: |
   | [rust/arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvZmZpLnJz) | `75.95% <80.00%> (+0.28%)` | :arrow_up: |
   | ... and [8 more](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=footer). Last update [0050795...96cd76d](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753308076


   Note that the clippy error https://github.com/apache/arrow/pull/9038/checks?check_run_id=1632226138 has been fixed on master so if you rebase this PR against master that CI check should pass


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r549551740



##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -2414,6 +2417,129 @@ impl PhysicalSortExpr {
     }
 }
 
+/// Not expression
+#[derive(Debug)]
+pub struct InListExpr {
+    expr: Arc<dyn PhysicalExpr>,
+    list: Vec<Arc<dyn PhysicalExpr>>,
+}
+
+impl InListExpr {
+    /// Create a new InList function
+    pub fn new(expr: Arc<dyn PhysicalExpr>, list: Vec<Arc<dyn PhysicalExpr>>) -> Self {
+        Self { expr, list }
+    }
+
+    fn compare_utf8(
+        &self,
+        array: Arc<dyn Array>,
+        list_values: Vec<ColumnarValue>,
+    ) -> Result<ColumnarValue> {
+        let array = array
+            .as_any()
+            .downcast_ref::<GenericStringArray<i32>>()
+            .unwrap();
+
+        // create the ListArray expected by comparison::contains_utf8
+        let values_builder = StringBuilder::new(list_values.len());
+        let mut builder = ListBuilder::new(values_builder);
+        for _ in 0..array.len() {
+            list_values.iter().for_each(|expr| match expr {
+                ColumnarValue::Scalar(s) => match s {
+                    ScalarValue::Utf8(Some(v)) => {
+                        builder.values().append_value(v).unwrap();
+                    }
+                    _ => unimplemented!("not yet implemented"),
+                },
+                ColumnarValue::Array(_) => {
+                    unimplemented!("InList should not receive Array")
+                }
+            });
+            builder.append(true).unwrap();
+        }
+        let list_array = builder.finish();
+
+        Ok(ColumnarValue::Array(Arc::new(
+            kernels::comparison::contains_utf8(array, &list_array)
+                .map_err(DataFusionError::ArrowError)?,
+        )))
+    }
+}
+
+impl fmt::Display for InListExpr {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        write!(f, "{} IN ({:?})", self.expr, self.list)
+    }
+}
+
+impl PhysicalExpr for InListExpr {
+    fn data_type(&self, _input_schema: &Schema) -> Result<DataType> {
+        Ok(DataType::Boolean)
+    }
+
+    fn nullable(&self, input_schema: &Schema) -> Result<bool> {
+        self.expr.nullable(input_schema)
+    }
+
+    fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
+        let value = self.expr.evaluate(batch)?;
+        let value_data_type = value.data_type();
+
+        let list_values = self
+            .list
+            .iter()
+            .map(|expr| expr.evaluate(batch))
+            .collect::<Result<Vec<_>>>()?;
+        let list_values_data_types = list_values
+            .iter()
+            .map(|expr| expr.data_type())
+            .collect::<Vec<DataType>>();
+
+        if list_values_data_types
+            .iter()
+            .any(|dt| *dt != value_data_type)
+        {
+            return Err(DataFusionError::Internal(format!(

Review comment:
       I think we should do this earlier already when creating/checking the logical plan.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753241531


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=h1) Report
   > Merging [#9038](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=desc) (7a44a99) into [master](https://codecov.io/gh/apache/arrow/commit/5db1d2aad0b6aaca6dfe4e01b0afeefb0311d109?el=desc) (5db1d2a) will **decrease** coverage by `0.01%`.
   > The diff coverage is `78.97%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9038/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9038      +/-   ##
   ==========================================
   - Coverage   82.60%   82.58%   -0.02%     
   ==========================================
     Files         204      204              
     Lines       50200    50533     +333     
   ==========================================
   + Hits        41467    41735     +268     
   - Misses       8733     8798      +65     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/sql/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvdXRpbHMucnM=) | `53.92% <47.05%> (-0.68%)` | :arrow_down: |
   | [rust/datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvdXRpbHMucnM=) | `59.04% <71.42%> (+0.32%)` | :arrow_up: |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.92% <77.27%> (+0.02%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `83.77% <78.53%> (-0.71%)` | :arrow_down: |
   | [rust/datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGxhbm5lci5ycw==) | `84.06% <81.81%> (+0.09%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=) | `78.63% <82.53%> (+0.74%)` | :arrow_up: |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.83% <100.00%> (+<0.01%)` | :arrow_up: |
   | [rust/datafusion/src/scalar.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zY2FsYXIucnM=) | `57.37% <0.00%> (+1.19%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=footer). Last update [5db1d2a...7a44a99](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753468520


   FWIW I think having a native `col IN (constant list)` is a useful thing to have (eventually). On other words I do like the direction of this PR
   
   The rationale for having `col IN (list)` type predicates is that they make recognizing the special case of "single column predicates" more easily (e.g. so you can split them off and push them down the plans into scans)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r549550179



##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,

Review comment:
       I think it might be easier to convert it here already to a vec where each element should have the same datatype,. And we check that while generating it?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] yordan-pavlov edited a comment on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
yordan-pavlov edited a comment on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-755316885


   @seddonm1 looks great, the 'IN' operator is one of the features I have been missing and thinking about implementing myself but looks like you beat me to it :) 
   my only concern is that the current implementation doesn't appear to make use of SIMD; have you looked into comparing performance against an expression of the form `(eq(array, scalar1) or eq(array, scalar2) or ... or eq(array, scalarN)`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r550877362



##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,

Review comment:
       Ok, thanks @seddonm1 for checking . sounds good to me




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753392565


   @jhorstmann @Dandandan I like the simplicity of this idea but there are a lot of strange cases that need to be considered given how ANSI SQL handles NULL values.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-756239119


   @seddonm1  sadly this PR has some small conflicts -- can you please rebase it so I can merge it in? Thanks again for all your work to get this done.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r549888900



##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,

Review comment:
       I cannot find another example where we do validation like checking same datatypes in the Logical Plan. Most of this type of validation is performed in the Physical Plan: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/expressions.rs#L1650




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9038:
URL: https://github.com/apache/arrow/pull/9038


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753241531


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=h1) Report
   > Merging [#9038](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=desc) (119c021) into [master](https://codecov.io/gh/apache/arrow/commit/cd22be6efedbf9832b5ea875ca59bb42de7b6c28?el=desc) (cd22be6) will **decrease** coverage by `0.06%`.
   > The diff coverage is `64.97%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9038/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9038      +/-   ##
   ==========================================
   - Coverage   82.55%   82.49%   -0.07%     
   ==========================================
     Files         203      203              
     Lines       50043    50240     +197     
   ==========================================
   + Hits        41315    41445     +130     
   - Misses       8728     8795      +67     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvdXRpbHMucnM=) | `58.42% <33.33%> (-0.29%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.28% <50.00%> (-0.62%)` | :arrow_down: |
   | [rust/datafusion/src/sql/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvdXRpbHMucnM=) | `54.49% <53.33%> (-0.11%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `82.39% <59.84%> (-2.09%)` | :arrow_down: |
   | [rust/datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGxhbm5lci5ycw==) | `84.06% <81.81%> (+0.09%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=) | `77.96% <83.33%> (+0.06%)` | :arrow_up: |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.83% <100.00%> (+<0.01%)` | :arrow_up: |
   | [rust/datafusion/src/scalar.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zY2FsYXIucnM=) | `59.49% <0.00%> (+0.42%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=footer). Last update [cd22be6...119c021](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-752807355


   I have updated this PR with a reimplementation of the logic so that the kernel which has two undesired behaviour (see points 1 and 2) is no longer invoked. It should also support the full range of types as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r550808963



##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -2414,6 +2417,212 @@ impl PhysicalSortExpr {
     }
 }
 
+/// InList
+#[derive(Debug)]
+pub struct InListExpr {
+    expr: Arc<dyn PhysicalExpr>,
+    list: Vec<Arc<dyn PhysicalExpr>>,
+}
+
+macro_rules! make_contains {
+    ($ARRAY:expr, $LIST_VALUES:expr, $SCALAR_VALUE:ident, $ARRAY_TYPE:ident) => {{
+        let array = $ARRAY.as_any().downcast_ref::<$ARRAY_TYPE>().unwrap();
+
+        let values = $LIST_VALUES
+            .iter()
+            .map(|expr| match expr {
+                ColumnarValue::Scalar(s) => match s {
+                    ScalarValue::$SCALAR_VALUE(Some(v)) => v,
+                    datatype => unimplemented!("Unexpected type {} for InList", datatype),
+                },
+                ColumnarValue::Array(_) => {
+                    unimplemented!("InList should not receive Array")
+                }
+            })
+            .collect::<Vec<_>>();
+
+        Ok(ColumnarValue::Array(Arc::new(
+            array
+                .iter()
+                .map(|x| x.map(|x| values.contains(&&x)))

Review comment:
       This approach of mapping the array was suggested by @jorgecarleitao when helping me with the StringExpressions: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/string_expressions.rs#L84. The benefit is that if the input value is `NULL` (i.e. `None`) then we don't have to do any work on it (the second `map`).
   
   I have confirmed this is the desired behavior against Postgres 13.1 so that any `NULL` input `expr` should return null:
   
   ```sql
   SELECT NULL IN ('a'); -> NULL
   SELECT NULL NOT IN ('a'); -> NULL
   SELECT NULL IN (NULL, 'a'); -> NULL
   SELECT NULL NOT IN (NULL, 'a'); -> NULL
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-756715134






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r550760379



##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -2414,6 +2417,212 @@ impl PhysicalSortExpr {
     }
 }
 
+/// InList
+#[derive(Debug)]
+pub struct InListExpr {
+    expr: Arc<dyn PhysicalExpr>,
+    list: Vec<Arc<dyn PhysicalExpr>>,
+}
+
+macro_rules! make_contains {
+    ($ARRAY:expr, $LIST_VALUES:expr, $SCALAR_VALUE:ident, $ARRAY_TYPE:ident) => {{
+        let array = $ARRAY.as_any().downcast_ref::<$ARRAY_TYPE>().unwrap();
+
+        let values = $LIST_VALUES
+            .iter()
+            .map(|expr| match expr {
+                ColumnarValue::Scalar(s) => match s {
+                    ScalarValue::$SCALAR_VALUE(Some(v)) => v,
+                    datatype => unimplemented!("Unexpected type {} for InList", datatype),
+                },
+                ColumnarValue::Array(_) => {
+                    unimplemented!("InList should not receive Array")
+                }
+            })
+            .collect::<Vec<_>>();
+
+        Ok(ColumnarValue::Array(Arc::new(
+            array
+                .iter()
+                .map(|x| x.map(|x| values.contains(&&x)))
+                .collect::<BooleanArray>(),
+        )))
+    }};
+}
+
+impl InListExpr {
+    /// Create a new InList expression
+    pub fn new(expr: Arc<dyn PhysicalExpr>, list: Vec<Arc<dyn PhysicalExpr>>) -> Self {
+        Self { expr, list }
+    }
+
+    /// Compare for specific utf8 types
+    fn compare_utf8<T: StringOffsetSizeTrait>(
+        &self,
+        array: ArrayRef,
+        list_values: Vec<ColumnarValue>,
+    ) -> Result<ColumnarValue> {
+        let array = array
+            .as_any()
+            .downcast_ref::<GenericStringArray<T>>()
+            .unwrap();
+
+        let values = list_values
+            .iter()
+            .map(|expr| match expr {
+                ColumnarValue::Scalar(s) => match s {
+                    ScalarValue::Utf8(Some(v)) => v.as_str(),
+                    datatype => unimplemented!("Unexpected type {} for InList", datatype),
+                },
+                ColumnarValue::Array(_) => {
+                    unimplemented!("InList should not receive Array")

Review comment:
       this should probably generate an error earlier in planing too (e.g. if you see an expression like `my_col IN (my_other_col, 'foo')` )

##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,

Review comment:
       I think the rationale / idea (largely expressed by @jorgecarleitao ) was that actual type coercion happens during physical planning (so that we could potentially have different backend physical planning mechanisms but the same logical mechanisms).
   
   You could potentially use the coercion logic here: https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/type_coercion.rs#L118
   
   And coerce the in list items all to the same types 

##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -2414,6 +2417,212 @@ impl PhysicalSortExpr {
     }
 }
 
+/// InList
+#[derive(Debug)]
+pub struct InListExpr {
+    expr: Arc<dyn PhysicalExpr>,
+    list: Vec<Arc<dyn PhysicalExpr>>,
+}
+
+macro_rules! make_contains {
+    ($ARRAY:expr, $LIST_VALUES:expr, $SCALAR_VALUE:ident, $ARRAY_TYPE:ident) => {{
+        let array = $ARRAY.as_any().downcast_ref::<$ARRAY_TYPE>().unwrap();
+
+        let values = $LIST_VALUES
+            .iter()
+            .map(|expr| match expr {
+                ColumnarValue::Scalar(s) => match s {
+                    ScalarValue::$SCALAR_VALUE(Some(v)) => v,
+                    datatype => unimplemented!("Unexpected type {} for InList", datatype),
+                },
+                ColumnarValue::Array(_) => {
+                    unimplemented!("InList should not receive Array")
+                }
+            })
+            .collect::<Vec<_>>();
+
+        Ok(ColumnarValue::Array(Arc::new(
+            array
+                .iter()
+                .map(|x| x.map(|x| values.contains(&&x)))

Review comment:
       I wonder if this handles `NULL` correctly -- like for a value of where `expr` is NULL the output should be NULL (not true/false). The semantics when there is a literal `NULL` in the inlist are even stranger (but likely could be handled as a follow on PR)
   
   For example:
   
   sqlite> create table t(c1 int);
   sqlite> insert into t values (10);
   sqlite> insert into t values (20);
   sqlite> insert into t values(NULL);
   sqlite> select c1, c1 IN (20, NULL) from t;
   10|
   20|1
   |
   sqlite> select c1, c1 IN (20) from t;
   10|0
   20|1
   |
   ```
   
   Note that `10 IN (20, NULL)` is actually `NULL` rather than `FALSE`. Crazy
   

##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -2414,6 +2417,129 @@ impl PhysicalSortExpr {
     }
 }
 
+/// Not expression
+#[derive(Debug)]
+pub struct InListExpr {
+    expr: Arc<dyn PhysicalExpr>,
+    list: Vec<Arc<dyn PhysicalExpr>>,
+}
+
+impl InListExpr {
+    /// Create a new InList function
+    pub fn new(expr: Arc<dyn PhysicalExpr>, list: Vec<Arc<dyn PhysicalExpr>>) -> Self {
+        Self { expr, list }
+    }
+
+    fn compare_utf8(
+        &self,
+        array: Arc<dyn Array>,
+        list_values: Vec<ColumnarValue>,
+    ) -> Result<ColumnarValue> {
+        let array = array
+            .as_any()
+            .downcast_ref::<GenericStringArray<i32>>()
+            .unwrap();
+
+        // create the ListArray expected by comparison::contains_utf8
+        let values_builder = StringBuilder::new(list_values.len());
+        let mut builder = ListBuilder::new(values_builder);
+        for _ in 0..array.len() {
+            list_values.iter().for_each(|expr| match expr {
+                ColumnarValue::Scalar(s) => match s {
+                    ScalarValue::Utf8(Some(v)) => {
+                        builder.values().append_value(v).unwrap();
+                    }
+                    _ => unimplemented!("not yet implemented"),
+                },
+                ColumnarValue::Array(_) => {
+                    unimplemented!("InList should not receive Array")
+                }
+            });
+            builder.append(true).unwrap();
+        }
+        let list_array = builder.finish();
+
+        Ok(ColumnarValue::Array(Arc::new(
+            kernels::comparison::contains_utf8(array, &list_array)
+                .map_err(DataFusionError::ArrowError)?,
+        )))
+    }
+}
+
+impl fmt::Display for InListExpr {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        write!(f, "{} IN ({:?})", self.expr, self.list)
+    }
+}
+
+impl PhysicalExpr for InListExpr {
+    fn data_type(&self, _input_schema: &Schema) -> Result<DataType> {
+        Ok(DataType::Boolean)
+    }
+
+    fn nullable(&self, input_schema: &Schema) -> Result<bool> {
+        self.expr.nullable(input_schema)
+    }
+
+    fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
+        let value = self.expr.evaluate(batch)?;
+        let value_data_type = value.data_type();
+
+        let list_values = self
+            .list
+            .iter()
+            .map(|expr| expr.evaluate(batch))
+            .collect::<Result<Vec<_>>>()?;
+        let list_values_data_types = list_values
+            .iter()
+            .map(|expr| expr.data_type())
+            .collect::<Vec<DataType>>();
+
+        if list_values_data_types
+            .iter()
+            .any(|dt| *dt != value_data_type)
+        {
+            return Err(DataFusionError::Internal(format!(

Review comment:
       I suggest that the appropriate place would be to "coerce" all the in list  item types to the same data type during Logical --> Physical plan creation. 

##########
File path: rust/datafusion/src/sql/planner.rs
##########
@@ -761,6 +761,28 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 high: Box::new(self.sql_expr_to_logical_expr(&high)?),
             }),
 
+            SQLExpr::InList {
+                ref expr,
+                ref list,
+                ref negated,
+            } => {
+                let list_expr = list
+                    .iter()
+                    .map(|e| self.sql_expr_to_logical_expr(e))

Review comment:
       Here is where I think you could add the type coercion / checking logic

##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1849,3 +1849,45 @@ async fn string_expressions() -> Result<()> {
     assert_eq!(expected, actual);
     Ok(())
 }
+
+#[tokio::test]

Review comment:
       Since there is also support in this PR for numeric types, I would also suggest some basic tests for IN lists with numbers as well (e.g. `c1 IN (1, 2 3)` as well as `c1 IN (1, NULL)`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753240012


   @Dandandan @alamb rebased and added some tests. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb edited a comment on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-756245785


   I also changed the title of this PR so that it doesn't say "WIP" anymore -- as I don't think it is WIP (I hope not, given that I plan to merge it!)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-752382480


   1. Yes I think there should be a different/more efficient implementation that handles the "scalar" case, where the scalar in this case is the list with values.
   2. I believe the current behavior related to NULL is correct, just as field=NULL will also be false.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r550809056



##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -2414,6 +2417,212 @@ impl PhysicalSortExpr {
     }
 }
 
+/// InList
+#[derive(Debug)]
+pub struct InListExpr {
+    expr: Arc<dyn PhysicalExpr>,
+    list: Vec<Arc<dyn PhysicalExpr>>,
+}
+
+macro_rules! make_contains {
+    ($ARRAY:expr, $LIST_VALUES:expr, $SCALAR_VALUE:ident, $ARRAY_TYPE:ident) => {{
+        let array = $ARRAY.as_any().downcast_ref::<$ARRAY_TYPE>().unwrap();
+
+        let values = $LIST_VALUES
+            .iter()
+            .map(|expr| match expr {
+                ColumnarValue::Scalar(s) => match s {
+                    ScalarValue::$SCALAR_VALUE(Some(v)) => v,
+                    datatype => unimplemented!("Unexpected type {} for InList", datatype),
+                },
+                ColumnarValue::Array(_) => {
+                    unimplemented!("InList should not receive Array")
+                }
+            })
+            .collect::<Vec<_>>();
+
+        Ok(ColumnarValue::Array(Arc::new(
+            array
+                .iter()
+                .map(|x| x.map(|x| values.contains(&&x)))

Review comment:
       As to the second problem, `NULL` in the `list` component it gets even crazier than your example (Postgres 13.1). What a mess.
   
   ```sql
   SELECT 'a' IN (NULL); -> NULL
   SELECT 'a' IN (NULL, 'a'); -> TRUE
   SELECT 'a' IN (NULL, 'b'); -> NULL
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r552950551



##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -3769,4 +4002,166 @@ mod tests {
         let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)])?;
         Ok(batch)
     }
+
+    // applies the in_list expr to an input batch and list
+    macro_rules! in_list {
+        ($BATCH:expr, $LIST:expr, $NEGATED:expr, $EXPECTED:expr) => {{
+            let expr = in_list(col("a"), $LIST, $NEGATED).unwrap();
+            let result = expr.evaluate(&$BATCH)?.into_array($BATCH.num_rows());
+            let result = result
+                .as_any()
+                .downcast_ref::<BooleanArray>()
+                .expect("failed to downcast to BooleanArray");
+            let expected = &BooleanArray::from($EXPECTED);
+            assert_eq!(expected, result);
+        }};
+    }
+
+    #[test]
+    fn in_list_utf8() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("a", DataType::Utf8, true)]);
+        let a = StringArray::from(vec![Some("a"), Some("d"), None]);
+        let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)])?;
+
+        // expression: "a in ("a", "b")"
+        let list = vec![
+            lit(ScalarValue::Utf8(Some("a".to_string()))),
+            lit(ScalarValue::Utf8(Some("b".to_string()))),
+        ];
+        in_list!(batch, list, &false, vec![Some(true), Some(false), None]);
+
+        // expression: "a not in ("a", "b")"
+        let list = vec![
+            lit(ScalarValue::Utf8(Some("a".to_string()))),
+            lit(ScalarValue::Utf8(Some("b".to_string()))),
+        ];
+        in_list!(batch, list, &true, vec![Some(false), Some(true), None]);
+
+        // expression: "a not in ("a", "b")"
+        let list = vec![
+            lit(ScalarValue::Utf8(Some("a".to_string()))),
+            lit(ScalarValue::Utf8(Some("b".to_string()))),
+            lit(ScalarValue::Utf8(None)),
+        ];
+        in_list!(batch, list, &false, vec![Some(true), None, None]);
+
+        // expression: "a not in ("a", "b")"
+        let list = vec![
+            lit(ScalarValue::Utf8(Some("a".to_string()))),
+            lit(ScalarValue::Utf8(Some("b".to_string()))),
+            lit(ScalarValue::Utf8(None)),
+        ];
+        in_list!(batch, list, &true, vec![Some(false), None, None]);
+
+        Ok(())
+    }
+
+    #[test]
+    fn in_list_int64() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("a", DataType::Int64, true)]);
+        let a = Int64Array::from(vec![Some(0), Some(2), None]);
+        let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)])?;
+
+        // expression: "a in (0, 1)"
+        let list = vec![
+            lit(ScalarValue::Int64(Some(0))),
+            lit(ScalarValue::Int64(Some(1))),
+        ];
+        in_list!(batch, list, &false, vec![Some(true), Some(false), None]);
+
+        // expression: "a not in (0, 1)"
+        let list = vec![
+            lit(ScalarValue::Int64(Some(0))),
+            lit(ScalarValue::Int64(Some(1))),
+        ];
+        in_list!(batch, list, &true, vec![Some(false), Some(true), None]);
+
+        // expression: "a in (0, 1, NULL)"
+        let list = vec![
+            lit(ScalarValue::Int64(Some(0))),
+            lit(ScalarValue::Int64(Some(1))),
+            lit(ScalarValue::Utf8(None)),
+        ];
+        in_list!(batch, list, &false, vec![Some(true), None, None]);
+
+        // expression: "a not in (0, 1, NULL)"
+        let list = vec![
+            lit(ScalarValue::Int64(Some(0))),
+            lit(ScalarValue::Int64(Some(1))),
+            lit(ScalarValue::Utf8(None)),
+        ];
+        in_list!(batch, list, &true, vec![Some(false), None, None]);
+
+        Ok(())
+    }
+
+    #[test]
+    fn in_list_float64() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("a", DataType::Float64, true)]);
+        let a = Float64Array::from(vec![Some(0.0), Some(0.2), None]);
+        let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)])?;
+
+        // expression: "a in (0.0, 0.2)"
+        let list = vec![
+            lit(ScalarValue::Float64(Some(0.0))),
+            lit(ScalarValue::Float64(Some(0.1))),
+        ];
+        in_list!(batch, list, &false, vec![Some(true), Some(false), None]);
+
+        // expression: "a not in (0.0, 0.2)"
+        let list = vec![
+            lit(ScalarValue::Float64(Some(0.0))),
+            lit(ScalarValue::Float64(Some(0.1))),
+        ];
+        in_list!(batch, list, &true, vec![Some(false), Some(true), None]);
+
+        // expression: "a in (0.0, 0.2, NULL)"
+        let list = vec![
+            lit(ScalarValue::Float64(Some(0.0))),
+            lit(ScalarValue::Float64(Some(0.1))),
+            lit(ScalarValue::Utf8(None)),
+        ];
+        in_list!(batch, list, &false, vec![Some(true), None, None]);
+
+        // expression: "a not in (0.0, 0.2, NULL)"
+        let list = vec![
+            lit(ScalarValue::Float64(Some(0.0))),
+            lit(ScalarValue::Float64(Some(0.1))),
+            lit(ScalarValue::Utf8(None)),

Review comment:
       The literal `Expr::Literal(ScalarValue::Utf8(None))` is a special case in DataFusion at the moment which represents the SQL logical `NULL`. It is being passed through to the evaluation as it is required to identify whether the list contains any literal `NULL` so that we can override the return value with `NULL`. I think this could be optimised in future.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r550761831



##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,

Review comment:
       I think supporting sql style `NOT IN` would be nice (though no changes needed in this PR) 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-751931814


   I think this is a great start @seddonm1 !
   
   Just some thoughts: an "ideal" implementation would convert the items upfront to an `Vec<i64>`, `Vec<String>`, etc on creation, so less work would be needed when executing the plan. Also, at some threshold for bigger arrays we could create a `HashSet<i64>`, `HashSet<String>` etc instead and write a kernel using a this HashSet.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753311070


   An alternative implementation would be to translate `x IN ('ABC', 'DEF', 'GHI')` into `(x = 'ABC') OR (x = 'DEF') OR (x = 'GHI')` on the logical plan level. That should handle nulls correctly and for a short set of values it should also have good performance since there are special comparison kernels for comparing against literals. For a larger set of values the `contains` might be more performant because of the possibility for returning early.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753399579


   > But you are right, it's not that simple since the arrow `or` / `and` kernels currently do not follow this sql behaviour. We might want to change that or rather introduce separate boolean kernels with the sql behaviour regarding nulls.
   
   @jhorstmann I like this approach the best. We temporarily shelve this PR (and I can do more work on the early validation) whilst these kernels are implemented then invoke them like your idea.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r553017434



##########
File path: rust/benchmarks/src/bin/tpch.rs
##########
@@ -656,7 +656,7 @@ fn create_logical_plan(ctx: &mut ExecutionContext, query: usize) -> Result<Logic
             on
                 l_orderkey = o_orderkey
             where
-                (l_shipmode = 'MAIL' or l_shipmode = 'SHIP')
+                l_shipmode in ('MAIL', 'SHIP')

Review comment:
       👍 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r550807996



##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,

Review comment:
       Thanks for the comments. I have done some testing with Postgres 13.1 and found that it does not appear to make a difference. These are all equivalent and return `NULL`.
   
   ```sql
   SELECT NOT NULL IN ('a');
   SELECT NULL NOT IN ('a');
   SELECT NOT 'a' IN (NULL);
   SELECT 'a' NOT IN (NULL);
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r552950902



##########
File path: rust/datafusion/src/optimizer/utils.rs
##########
@@ -305,6 +312,7 @@ pub fn expr_sub_expressions(expr: &Expr) -> Result<Vec<Expr>> {
             low.as_ref().to_owned(),
             high.as_ref().to_owned(),
         ]),
+        Expr::InList { expr, .. } => Ok(vec![expr.as_ref().to_owned()]),

Review comment:
       Thanks. I have updated the PR with this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-751897778


   https://issues.apache.org/jira/browse/ARROW-10356


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753397568


   Semantics for the `IN` and `OR` versions should be the same *in SQL*, for example [try the following query in postgres][1]:
   
   ```
   SELECT 'abc' in ('abc', 'def')
        , 'abc' in ('abc', 'def', null)
        , 'abc' in ('def', null)
        , null in ('abc', 'def')
        -- same expressions rewritten using OR
        , (('abc' = 'abc') OR ('abc' = 'def'))
        , (('abc' = 'abc') OR ('abc' = 'def') or ('abc' = null))
        , (('abc' = 'def') or ('abc' = null))
        , ((null = 'abc') OR (null = 'def'))
   ```
   
    1. result is `true`, no nulls involved
    2. result is `true`, additional `null` on the rhs does not change this since `(true OR null) = true`
    3. result is `null` since `(false OR null) IS NULL`
    4. result is `null` since lhs of each comparison is null
   
   But you are right, it's not that simple since the arrow `or` / `and` kernels currently do not follow this sql behaviour. We might want to change that or rather introduce separate boolean kernels with the sql behaviour regarding nulls.
   
   
    [1]: http://sqlfiddle.com/#!17/9eecb/67863


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9038:
URL: https://github.com/apache/arrow/pull/9038


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-756366060


   Thanks @alamb . Yes the WIP was a leftover :D I have rebased so once the CI passes it should merge!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r552193411



##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1849,3 +1849,89 @@ async fn string_expressions() -> Result<()> {
     assert_eq!(expected, actual);
     Ok(())
 }
+
+#[tokio::test]
+async fn in_list_array() -> Result<()> {
+    let mut ctx = ExecutionContext::new();
+    register_aggregate_csv_by_sql(&mut ctx).await;
+    let sql = "SELECT
+            c1 IN ('a', 'c') AS utf8_in_true
+            ,c1 IN ('x', 'y') AS utf8_in_false
+            ,c1 NOT IN ('x', 'y') AS utf8_not_in_true
+            ,c1 NOT IN ('a', 'c') AS utf8_not_in_false
+            ,CAST(CAST(c1 AS int) AS varchar) IN ('a', 'c') AS utf8_in_null
+        FROM aggregate_test_100 WHERE c12 < 0.05";
+    let actual = execute(&mut ctx, sql).await;
+    let expected = vec![
+        vec!["true", "false", "true", "false", "NULL"],
+        vec!["true", "false", "true", "false", "NULL"],
+        vec!["true", "false", "true", "false", "NULL"],
+        vec!["false", "false", "true", "true", "NULL"],
+        vec!["false", "false", "true", "true", "NULL"],
+        vec!["false", "false", "true", "true", "NULL"],
+        vec!["false", "false", "true", "true", "NULL"],
+    ];
+    assert_eq!(expected, actual);
+    Ok(())
+}
+
+#[tokio::test]
+async fn in_list_scalar() -> Result<()> {

Review comment:
       ❤️ 

##########
File path: rust/datafusion/src/optimizer/utils.rs
##########
@@ -416,6 +424,7 @@ pub fn rewrite_expression(expr: &Expr, expressions: &Vec<Expr>) -> Result<Expr>
                 Ok(expr)
             }
         }
+        Expr::InList { .. } => Ok(expr.clone()),

Review comment:
       likewise here, I think we might want to include the `list` -- even though at the moment it only contains constants, it is a `Vec<Expr>`

##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -3769,4 +4002,166 @@ mod tests {
         let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)])?;
         Ok(batch)
     }
+
+    // applies the in_list expr to an input batch and list
+    macro_rules! in_list {
+        ($BATCH:expr, $LIST:expr, $NEGATED:expr, $EXPECTED:expr) => {{
+            let expr = in_list(col("a"), $LIST, $NEGATED).unwrap();
+            let result = expr.evaluate(&$BATCH)?.into_array($BATCH.num_rows());
+            let result = result
+                .as_any()
+                .downcast_ref::<BooleanArray>()
+                .expect("failed to downcast to BooleanArray");
+            let expected = &BooleanArray::from($EXPECTED);
+            assert_eq!(expected, result);
+        }};
+    }
+
+    #[test]
+    fn in_list_utf8() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("a", DataType::Utf8, true)]);
+        let a = StringArray::from(vec![Some("a"), Some("d"), None]);
+        let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)])?;
+
+        // expression: "a in ("a", "b")"
+        let list = vec![
+            lit(ScalarValue::Utf8(Some("a".to_string()))),
+            lit(ScalarValue::Utf8(Some("b".to_string()))),
+        ];
+        in_list!(batch, list, &false, vec![Some(true), Some(false), None]);
+
+        // expression: "a not in ("a", "b")"
+        let list = vec![
+            lit(ScalarValue::Utf8(Some("a".to_string()))),
+            lit(ScalarValue::Utf8(Some("b".to_string()))),
+        ];
+        in_list!(batch, list, &true, vec![Some(false), Some(true), None]);
+
+        // expression: "a not in ("a", "b")"
+        let list = vec![
+            lit(ScalarValue::Utf8(Some("a".to_string()))),
+            lit(ScalarValue::Utf8(Some("b".to_string()))),
+            lit(ScalarValue::Utf8(None)),
+        ];
+        in_list!(batch, list, &false, vec![Some(true), None, None]);
+
+        // expression: "a not in ("a", "b")"
+        let list = vec![
+            lit(ScalarValue::Utf8(Some("a".to_string()))),
+            lit(ScalarValue::Utf8(Some("b".to_string()))),
+            lit(ScalarValue::Utf8(None)),
+        ];
+        in_list!(batch, list, &true, vec![Some(false), None, None]);
+
+        Ok(())
+    }
+
+    #[test]
+    fn in_list_int64() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("a", DataType::Int64, true)]);
+        let a = Int64Array::from(vec![Some(0), Some(2), None]);
+        let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)])?;
+
+        // expression: "a in (0, 1)"
+        let list = vec![
+            lit(ScalarValue::Int64(Some(0))),
+            lit(ScalarValue::Int64(Some(1))),
+        ];
+        in_list!(batch, list, &false, vec![Some(true), Some(false), None]);
+
+        // expression: "a not in (0, 1)"
+        let list = vec![
+            lit(ScalarValue::Int64(Some(0))),
+            lit(ScalarValue::Int64(Some(1))),
+        ];
+        in_list!(batch, list, &true, vec![Some(false), Some(true), None]);
+
+        // expression: "a in (0, 1, NULL)"
+        let list = vec![
+            lit(ScalarValue::Int64(Some(0))),
+            lit(ScalarValue::Int64(Some(1))),
+            lit(ScalarValue::Utf8(None)),
+        ];
+        in_list!(batch, list, &false, vec![Some(true), None, None]);
+
+        // expression: "a not in (0, 1, NULL)"
+        let list = vec![
+            lit(ScalarValue::Int64(Some(0))),
+            lit(ScalarValue::Int64(Some(1))),
+            lit(ScalarValue::Utf8(None)),
+        ];
+        in_list!(batch, list, &true, vec![Some(false), None, None]);
+
+        Ok(())
+    }
+
+    #[test]
+    fn in_list_float64() -> Result<()> {
+        let schema = Schema::new(vec![Field::new("a", DataType::Float64, true)]);
+        let a = Float64Array::from(vec![Some(0.0), Some(0.2), None]);
+        let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)])?;
+
+        // expression: "a in (0.0, 0.2)"
+        let list = vec![
+            lit(ScalarValue::Float64(Some(0.0))),
+            lit(ScalarValue::Float64(Some(0.1))),
+        ];
+        in_list!(batch, list, &false, vec![Some(true), Some(false), None]);
+
+        // expression: "a not in (0.0, 0.2)"
+        let list = vec![
+            lit(ScalarValue::Float64(Some(0.0))),
+            lit(ScalarValue::Float64(Some(0.1))),
+        ];
+        in_list!(batch, list, &true, vec![Some(false), Some(true), None]);
+
+        // expression: "a in (0.0, 0.2, NULL)"
+        let list = vec![
+            lit(ScalarValue::Float64(Some(0.0))),
+            lit(ScalarValue::Float64(Some(0.1))),
+            lit(ScalarValue::Utf8(None)),
+        ];
+        in_list!(batch, list, &false, vec![Some(true), None, None]);
+
+        // expression: "a not in (0.0, 0.2, NULL)"
+        let list = vec![
+            lit(ScalarValue::Float64(Some(0.0))),
+            lit(ScalarValue::Float64(Some(0.1))),
+            lit(ScalarValue::Utf8(None)),

Review comment:
       I don't think it hurts, but given the coercion logic you added in the planner, I think the literals at this point should all be the same type as the expr value. In other words, can you really see `a NOT IN (NULL::Utf8)`?

##########
File path: rust/datafusion/src/optimizer/utils.rs
##########
@@ -305,6 +312,7 @@ pub fn expr_sub_expressions(expr: &Expr) -> Result<Vec<Expr>> {
             low.as_ref().to_owned(),
             high.as_ref().to_owned(),
         ]),
+        Expr::InList { expr, .. } => Ok(vec![expr.as_ref().to_owned()]),

Review comment:
       Shouldn't this also include the exprs in `list` as well ? 

##########
File path: rust/datafusion/src/physical_plan/planner.rs
##########
@@ -882,6 +937,53 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn in_list_types() -> Result<()> {
+        let testdata = arrow::util::test_util::arrow_test_data();
+        let path = format!("{}/csv/aggregate_test_100.csv", testdata);
+        let options = CsvReadOptions::new().schema_infer_max_records(100);
+
+        // expression: "a in ('a', 1)"
+        let list = vec![
+            Expr::Literal(ScalarValue::Utf8(Some("a".to_string()))),
+            Expr::Literal(ScalarValue::Int64(Some(1))),
+        ];
+        let logical_plan = LogicalPlanBuilder::scan_csv(&path, options, None)?
+            // filter clause needs the type coercion rule applied
+            .filter(col("c12").lt(lit(0.05)))?
+            .project(vec![col("c1").in_list(list, false)])?
+            .build()?;
+        let execution_plan = plan(&logical_plan)?;
+        // verify that the plan correctly adds cast from Int64(1) to Utf8

Review comment:
       👍 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 edited a comment on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 edited a comment on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-752776874


   > 1. Yes I think there should be a different/more efficient implementation that handles the "scalar" case, where the scalar in this case is the list with values.
   
   Agree. I can have a look as part of this PR.
   
   > 2. I believe the current behavior related to NULL is correct, just as field=NULL will also be false.
   
   I have tested in Postgres and it will return `NULL` if value is `NULL` unlike this implementation which returns `false`.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-752776874


   > 1. Yes I think there should be a different/more efficient implementation that handles the "scalar" case, where the scalar in this case is the list with values.
   Agree. I can have a look as part of this PR.
   
   > 2. I believe the current behavior related to NULL is correct, just as field=NULL will also be false.
   I have tested in Postgres and it will return `NULL` if value is `NULL` unlike this implementation which returns `false`.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] yordan-pavlov edited a comment on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
yordan-pavlov edited a comment on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-755316885


   @seddonm1 looks great, the 'IN' operator is one of the features I have been missing and thinking about implementing myself but looks like you beat me to it :) 
   my only concern is that the current implementation doesn't appear to make use of SIMD; have you looked into comparing performance against an expression of the form `(eq(array, scalar1) or eq(array, scalar2) or ... or eq(array, scalarN))`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r549554046



##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,

Review comment:
       This helps keeping the logical plan simple, and also makes future code that uses the LP tree simple, e.g. an optimization rule on `not(..)`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r549889188



##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,

Review comment:
       I mainly included `negated` to allow pretty printing like: `'z' NOT IN ('x','y')`. I have changed this so it now uses the `not` `expr` so will now display `NOT 'z' IN ('x','y')`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-752956409


   The full set of Rust CI tests did not run on this PR :(
   
   Can you please rebase this PR against [apache/master](https://github.com/apache/arrow) to pick up the changes in https://github.com/apache/arrow/pull/9056 so that they do? 
   
   I apologize for the inconvenience. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-752278501


   Thanks @Dandandan 
   
   Do you think we should re-evaluate the current behavior of the `kernels::comparison::contains` based on my comments 1 and 2?
   
   Creating the same array `n` times feels very inefficient if we can modify the kernel.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r550763046



##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,

Review comment:
       I can't remember exactly, but I think there might be some semantic difference (regarding NULLs, of course) in SQL between `c NOT IN (..)` and `NOT c IN (....)` FWIW that might require representing them differnetly

##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,

Review comment:
       I can't remember exactly, but I think there might be some semantic difference (regarding NULLs, of course) in SQL between `c NOT IN (...)` and `NOT c IN (...)` FWIW that might require representing them differnetly

##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,

Review comment:
       I can't remember exactly, but I think there might be some semantic difference (regarding NULLs, of course) in SQL between `c NOT IN (...)` and `NOT c IN (...)` FWIW that might require representing them differently




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r550762173



##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,

Review comment:
       Would be nice indeed for next PR, I think we could have a special case to match on Not (ListIn (...) in the formatter instead :+1: 

##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,

Review comment:
       Would be nice indeed for a next PR, I think we could have a special case to match on Not (ListIn (...) in the formatter instead :+1: 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] yordan-pavlov commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
yordan-pavlov commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-755316885


   @seddonm1 looks great, the 'IN' operator is one of the features I have been missing and thinking about implementing myself but looks like you beat me to it :) 
   my only concern is that the current implementation doesn't appear to make use of SIMD; have you looked into comparing performance against an expression of the form `(eq(array, scalar1) or eq(array, scalar2) or ...)`?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r550764406



##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,

Review comment:
       hm ok... in that case my initial suggestion might have been wrong... would good to have some tests for this




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r549551271



##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,
+        /// Whether the expression is negated
+        negated: bool,

Review comment:
       We might keep negated out and use `not` instead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753241531


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=h1) Report
   > Merging [#9038](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=desc) (ae61b57) into [master](https://codecov.io/gh/apache/arrow/commit/dd5fe7095bb662236e27d3343eb82bc4375f93ef?el=desc) (dd5fe70) will **decrease** coverage by `0.06%`.
   > The diff coverage is `64.97%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9038/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9038      +/-   ##
   ==========================================
   - Coverage   82.61%   82.55%   -0.07%     
   ==========================================
     Files         202      202              
     Lines       50052    50249     +197     
   ==========================================
   + Hits        41350    41481     +131     
   - Misses       8702     8768      +66     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/optimizer/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvdXRpbHMucnM=) | `58.42% <33.33%> (-0.29%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `76.28% <50.00%> (-0.62%)` | :arrow_down: |
   | [rust/datafusion/src/sql/utils.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvdXRpbHMucnM=) | `54.49% <53.33%> (-0.11%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `82.39% <59.84%> (-2.09%)` | :arrow_down: |
   | [rust/datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGxhbm5lci5ycw==) | `84.06% <81.81%> (+0.09%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=) | `77.96% <83.33%> (+0.06%)` | :arrow_up: |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.83% <100.00%> (+<0.01%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
   | [rust/datafusion/src/scalar.rs](https://codecov.io/gh/apache/arrow/pull/9038/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zY2FsYXIucnM=) | `59.49% <0.00%> (+0.42%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=footer). Last update [709f20d...ae61b57](https://codecov.io/gh/apache/arrow/pull/9038?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753309911


   > kernels::comparison::contains returns false if either of the comparison values is null. Is this the desired behavior? If not I can modify the kernel to return null instead.
   
   FWIW I think the `contains` kernel should return NULL in these cases


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r550507926



##########
File path: rust/datafusion/src/logical_plan/expr.rs
##########
@@ -158,6 +158,15 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
     },
+    /// Returns whether the list contains the expr value.
+    InList {
+        /// The value to compare
+        expr: Box<Expr>,
+        /// The low end of the range
+        list: Vec<Expr>,

Review comment:
       Yes, I see. Maybe could be a future optimization so that we can convert it to a more efficient representation upfront, and generating an error earlier when it can not be executed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-755664456


   @alamb thanks for taking the time to review this as I know it ended up as quite a large PR 👍 . I have updated based on your comment.
   
   @yordan-pavlov yes this is basically as naive implementation as possible and could be heavily optimised. I think we should merge this PR to unblock TPC-H Query 12: `l_shipmode in ('MAIL', 'SHIP')` then look at optimisation. The test cases should help with any future optimisation work anyway.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-755778331


   I'll plan to merge this in as soon as the CI passes


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#discussion_r550761457



##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -2414,6 +2417,212 @@ impl PhysicalSortExpr {
     }
 }
 
+/// InList
+#[derive(Debug)]
+pub struct InListExpr {
+    expr: Arc<dyn PhysicalExpr>,
+    list: Vec<Arc<dyn PhysicalExpr>>,
+}
+
+macro_rules! make_contains {
+    ($ARRAY:expr, $LIST_VALUES:expr, $SCALAR_VALUE:ident, $ARRAY_TYPE:ident) => {{
+        let array = $ARRAY.as_any().downcast_ref::<$ARRAY_TYPE>().unwrap();
+
+        let values = $LIST_VALUES
+            .iter()
+            .map(|expr| match expr {
+                ColumnarValue::Scalar(s) => match s {
+                    ScalarValue::$SCALAR_VALUE(Some(v)) => v,
+                    datatype => unimplemented!("Unexpected type {} for InList", datatype),
+                },
+                ColumnarValue::Array(_) => {
+                    unimplemented!("InList should not receive Array")
+                }
+            })
+            .collect::<Vec<_>>();
+
+        Ok(ColumnarValue::Array(Arc::new(
+            array
+                .iter()
+                .map(|x| x.map(|x| values.contains(&&x)))

Review comment:
       I wonder if this handles `NULL` correctly -- like for a value of where `expr` is NULL the output should be NULL (not true/false). The semantics when there is a literal `NULL` in the inlist are even stranger (but likely could be handled as a follow on PR)
   
   For example:
   
   ```
   sqlite> create table t(c1 int);
   sqlite> insert into t values (10);
   sqlite> insert into t values (20);
   sqlite> insert into t values(NULL);
   sqlite> select c1, c1 IN (20, NULL) from t;
   10|
   20|1
   |
   sqlite> select c1, c1 IN (20) from t;
   10|0
   20|1
   |
   ```
   
   Note that `10 IN (20, NULL)` is actually `NULL` rather than `FALSE`. Crazy
   

##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -2414,6 +2417,212 @@ impl PhysicalSortExpr {
     }
 }
 
+/// InList
+#[derive(Debug)]
+pub struct InListExpr {
+    expr: Arc<dyn PhysicalExpr>,
+    list: Vec<Arc<dyn PhysicalExpr>>,
+}
+
+macro_rules! make_contains {
+    ($ARRAY:expr, $LIST_VALUES:expr, $SCALAR_VALUE:ident, $ARRAY_TYPE:ident) => {{
+        let array = $ARRAY.as_any().downcast_ref::<$ARRAY_TYPE>().unwrap();
+
+        let values = $LIST_VALUES
+            .iter()
+            .map(|expr| match expr {
+                ColumnarValue::Scalar(s) => match s {
+                    ScalarValue::$SCALAR_VALUE(Some(v)) => v,
+                    datatype => unimplemented!("Unexpected type {} for InList", datatype),
+                },
+                ColumnarValue::Array(_) => {
+                    unimplemented!("InList should not receive Array")
+                }
+            })
+            .collect::<Vec<_>>();
+
+        Ok(ColumnarValue::Array(Arc::new(
+            array
+                .iter()
+                .map(|x| x.map(|x| values.contains(&&x)))

Review comment:
       I wonder if this handles `NULL` correctly -- like for a value of where `expr` is NULL the output should be NULL (not true/false). The semantics when there is a literal `NULL` in the inlist are even stranger (but likely could be handled as a follow on PR)
   
   For example:
   
   ```SQL
   sqlite> create table t(c1 int);
   sqlite> insert into t values (10);
   sqlite> insert into t values (20);
   sqlite> insert into t values(NULL);
   sqlite> select c1, c1 IN (20, NULL) from t;
   10|
   20|1
   |
   sqlite> select c1, c1 IN (20) from t;
   10|0
   20|1
   |
   ```
   
   Note that `10 IN (20, NULL)` is actually `NULL` rather than `FALSE`. Crazy
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in (WIP)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-753347779


   @jhorstmann nice idea! Maybe it would be better to do that in an optimization rule?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9038: ARROW-10356: [Rust][DataFusion] Add support for is_in

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9038:
URL: https://github.com/apache/arrow/pull/9038#issuecomment-756715134


   I filed https://issues.apache.org/jira/browse/ARROW-11182 to track possible improvements to performance


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org