You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/04 14:07:10 UTC

[GitHub] [arrow-datafusion] Ted-Jiang opened a new pull request, #2156: Add an InSet as an optimized version for IN_LIST

Ted-Jiang opened a new pull request, #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156

   # Which issue does this PR close?
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #2093.
   
    # Rationale for this change
   Optimized of In_List clause, when all filter values of In clause are static.
   Default list values use `Vec` to store, it  has time complexity O(n) to check `contains`,  In some situation use `Set` it has complexity O(1) 
   
   test sql:
   ```select count(*) from orders where o_orderkey in (2785313,
   2785314,
   2785315,
   2785316,
   ''' (1000 elements)
   2786311);
   ```
   Master branch:
   ```
   2786309,
   2786310,
   2786311);
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 255             |
   +-----------------+
   1 row in set. Query took 4.713 seconds.
   
   ```
   This pr:
   ```
   2786309,
   2786310,
   2786311);
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 255             |
   +-----------------+
   1 row in set. Query took 0.566 seconds.
   ```
   
   
   # What changes are included in this PR?
   
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845260984


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   <img width="653" alt="image" src="https://user-images.githubusercontent.com/37145547/162232125-28122bd8-64c7-410f-9996-2df4bf27f942.png">
   Cool, you are right! @Dandandan Thanks!  utf8 columns threshold near 50.
   Maybe i will Set it by type👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Dandandan commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r844793209


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   I don't think we should follow Spark, but should use some benchmarking to find a good heuristic for this threshold (where it's faster to use a hash set).
   
   The optimal value might be quite a bit higher at this moment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845784661


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   ❤️ @Dandandan  Thanks a lot for your info ! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r844675334


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -342,119 +434,202 @@ impl PhysicalExpr for InListExpr {
     fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
         let value = self.expr.evaluate(batch)?;
         let value_data_type = value.data_type();
-        let list_values = self
-            .list
-            .iter()
-            .map(|expr| expr.evaluate(batch))
-            .collect::<Result<Vec<_>>>()?;
-
-        let array = match value {
-            ColumnarValue::Array(array) => array,
-            ColumnarValue::Scalar(scalar) => scalar.to_array(),
-        };
 
-        match value_data_type {
-            DataType::Float32 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    Float32,
-                    Float32Array
-                )
-            }
-            DataType::Float64 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    Float64,
-                    Float64Array
-                )
-            }
-            DataType::Int16 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    Int16,
-                    Int16Array
-                )
-            }
-            DataType::Int32 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    Int32,
-                    Int32Array
-                )
-            }
-            DataType::Int64 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    Int64,
-                    Int64Array
-                )
-            }
-            DataType::Int8 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    Int8,
-                    Int8Array
-                )
-            }
-            DataType::UInt16 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    UInt16,
-                    UInt16Array
-                )
-            }
-            DataType::UInt32 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    UInt32,
-                    UInt32Array
-                )
-            }
-            DataType::UInt64 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    UInt64,
-                    UInt64Array
-                )
-            }
-            DataType::UInt8 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    UInt8,
-                    UInt8Array
-                )
-            }
-            DataType::Boolean => {
-                make_contains!(array, list_values, self.negated, Boolean, BooleanArray)
-            }
-            DataType::Utf8 => self.compare_utf8::<i32>(array, list_values, self.negated),
-            DataType::LargeUtf8 => {
-                self.compare_utf8::<i64>(array, list_values, self.negated)
+        if let Some(in_set) = &self.set {
+            let array = match value {
+                ColumnarValue::Array(array) => array,
+                ColumnarValue::Scalar(scalar) => scalar.to_array(),

Review Comment:
   Thanks! @alamb  i agree it doesn't make any performance,  it's rare to match ` ColumnarValue::Scalar`
   it also appears in https://github.com/apache/arrow-datafusion/blob/72a1194b9817df5ec7d87df6f5c3e45ed0e1ecd9/datafusion/physical-expr/src/expressions/in_list.rs#L517-L520.
   Maybe we can file an issue to improve it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

yjshen commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r842323638


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -69,6 +72,23 @@ pub struct InListExpr {
     expr: Arc<dyn PhysicalExpr>,
     list: Vec<Arc<dyn PhysicalExpr>>,
     negated: bool,
+    set: Option<InSet>,
+}
+
+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,

Review Comment:
   I think it's because `ScalarValue` is an enum of a wrapper of value. So it would be overheads for both memory footprint and computation compared to HashSet of native data values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #2156: Optimize the evaluation of `IN` for large lists using InSet

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#issuecomment-1093196033

   Thanks everyone who contributed code and review to this PR 🎉 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845211540


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   I have try to use cast 'u32.to_string' get same result. In my opinion, use string may cause more cost in `hash`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845783834


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen by the benchmark at
+/// https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845198369
+/// TODO: add switch codeGen in In_List

Review Comment:
   Yes change to  discuss link in this pr.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845211540


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   I have try to use cast 'u32.to_string' get same conclusion. In my opinion, use string may cause more cost in `hash`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r842320547


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -69,6 +72,23 @@ pub struct InListExpr {
     expr: Arc<dyn PhysicalExpr>,
     list: Vec<Arc<dyn PhysicalExpr>>,
     negated: bool,
+    set: Option<InSet>,
+}
+
+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,

Review Comment:
   @Dandandan Thanks for your information❤️, Is there any specific reason why using ScalarValue is slower?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Dandandan commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845769945


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   > where array -> ScalarValue copies / allocates happened in code
   
   FYI, (String) allocations happen here, when converting to a` ScalarValue`. 



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   > where array -> ScalarValue copies / allocates happened in code
   
   FYI, (String) allocations happen here, when converting to a` ScalarValue`. https://github.com/apache/arrow-datafusion/pull/2156/files#diff-ff8086fafbfe5021e5f7d51d96aaae2cf65f779ac3fae5fc182f87e956bb0550R214



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r844685982


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -220,17 +260,63 @@ fn not_in_list_utf8<OffsetSize: StringOffsetSizeTrait>(
     compare_op_scalar!(array, values, |x, v: &[&str]| !v.contains(&x))
 }
 
+//check all filter values of In clause are static.
+//include `CastExpr + Literal` or `Literal`
+fn check_all_static_filter_expr(list: &[Arc<dyn PhysicalExpr>]) -> bool {
+    list.iter().all(|v| {
+        let cast = v.as_any().downcast_ref::<expressions::CastExpr>();
+        if let Some(c) = cast {
+            c.expr()
+                .as_any()
+                .downcast_ref::<expressions::Literal>()
+                .is_some()
+        } else {
+            let cast = v.as_any().downcast_ref::<expressions::Literal>();
+            cast.is_some()
+        }
+    })
+}
+
+fn cast_static_filter_to_set(list: &[Arc<dyn PhysicalExpr>]) -> HashSet<ScalarValue> {
+    HashSet::from_iter(list.iter().map(|expr| {
+        if let Some(cast) = expr.as_any().downcast_ref::<expressions::CastExpr>() {
+            cast.expr()
+                .as_any()
+                .downcast_ref::<expressions::Literal>()
+                .unwrap()
+                .value()
+                .clone()
+        } else {
+            expr.as_any()
+                .downcast_ref::<expressions::Literal>()
+                .unwrap()
+                .value()
+                .clone()
+        }
+    }))
+}
+
 impl InListExpr {
     /// Create a new InList expression
     pub fn new(
         expr: Arc<dyn PhysicalExpr>,
         list: Vec<Arc<dyn PhysicalExpr>>,
         negated: bool,
     ) -> Self {
-        Self {
-            expr,
-            list,
-            negated,
+        if list.len() > OPTIMIZER_INSET_THRESHOLD && check_all_static_filter_expr(&list) {

Review Comment:
   According to not support `switch codeGen` change to 10, like spark 2.x



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r841780981


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -220,17 +260,63 @@ fn not_in_list_utf8<OffsetSize: StringOffsetSizeTrait>(
     compare_op_scalar!(array, values, |x, v: &[&str]| !v.contains(&x))
 }
 
+//check all filter values of In clause are static.
+//include `CastExpr + Literal` or `Literal`
+fn check_all_static_filter_expr(list: &[Arc<dyn PhysicalExpr>]) -> bool {
+    list.iter().all(|v| {
+        let cast = v.as_any().downcast_ref::<expressions::CastExpr>();
+        if let Some(c) = cast {
+            c.expr()
+                .as_any()
+                .downcast_ref::<expressions::Literal>()
+                .is_some()
+        } else {
+            let cast = v.as_any().downcast_ref::<expressions::Literal>();
+            cast.is_some()
+        }
+    })
+}
+
+fn cast_static_filter_to_set(list: &[Arc<dyn PhysicalExpr>]) -> HashSet<ScalarValue> {
+    HashSet::from_iter(list.iter().map(|expr| {
+        if let Some(cast) = expr.as_any().downcast_ref::<expressions::CastExpr>() {
+            cast.expr()
+                .as_any()
+                .downcast_ref::<expressions::Literal>()
+                .unwrap()
+                .value()
+                .clone()
+        } else {
+            expr.as_any()
+                .downcast_ref::<expressions::Literal>()
+                .unwrap()
+                .value()
+                .clone()
+        }
+    }))
+}
+
 impl InListExpr {
     /// Create a new InList expression
     pub fn new(
         expr: Arc<dyn PhysicalExpr>,
         list: Vec<Arc<dyn PhysicalExpr>>,
         negated: bool,
     ) -> Self {
-        Self {
-            expr,
-            list,
-            negated,
+        if list.len() > OPTIMIZER_INSET_THRESHOLD && check_all_static_filter_expr(&list) {

Review Comment:
   According to Spark, default set 400. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Dandandan commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r841898112


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -69,6 +72,23 @@ pub struct InListExpr {
     expr: Arc<dyn PhysicalExpr>,
     list: Vec<Arc<dyn PhysicalExpr>>,
     negated: bool,
+    set: Option<InSet>,
+}
+
+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,

Review Comment:
   Just a note: because of using `ScalarValue` we are a bit slower than if we could use basic types, like `HashSet<u32>`, etc. The same apllies to the existing implementation based on `Vec`. I think that could be a couple times faster for both implementations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

yjshen commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r842323638


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -69,6 +72,23 @@ pub struct InListExpr {
     expr: Arc<dyn PhysicalExpr>,
     list: Vec<Arc<dyn PhysicalExpr>>,
     negated: bool,
+    set: Option<InSet>,
+}
+
+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,

Review Comment:
   I think it's because `ScalarValue` is an enum of an option wrapper of value. So it would be overheads for both computation and memory footprint compared to HashSet of native data values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Dandandan commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r842061419


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -69,6 +72,23 @@ pub struct InListExpr {
     expr: Arc<dyn PhysicalExpr>,
     list: Vec<Arc<dyn PhysicalExpr>>,
     negated: bool,
+    set: Option<InSet>,
+}
+
+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,

Review Comment:
   But this makes something for a future PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Dandandan commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r841898112


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -69,6 +72,23 @@ pub struct InListExpr {
     expr: Arc<dyn PhysicalExpr>,
     list: Vec<Arc<dyn PhysicalExpr>>,
     negated: bool,
+    set: Option<InSet>,
+}
+
+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,

Review Comment:
   Just a note: because of using `ScalarValue` we are a bit slower than if we could use basic types, like `HashSet<u32>`, etc. ~The same apllies to the existing implementation based on `Vec`~. I think that could be a couple times faster for both implementations.



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -69,6 +72,23 @@ pub struct InListExpr {
     expr: Arc<dyn PhysicalExpr>,
     list: Vec<Arc<dyn PhysicalExpr>>,
     negated: bool,
+    set: Option<InSet>,
+}
+
+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,

Review Comment:
   Just a note: because of using `ScalarValue` we are a bit slower than if we could use basic types, like `HashSet<u32>`, etc. ~The same apllies to the existing implementation based on `Vec`~. I think that could be a couple times faster.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Dandandan commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845373001


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   Amazing, thanks for the comparison. I would also be OK with changing it to some value like 30. Seems for both numeric as for utf8 data a good enough.
   At some point we could further optimize the implementation (https://github.com/apache/arrow-datafusion/issues/2165) after this we can adjust to a (lower) value.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Dandandan commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845206360


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   Cool analysis!
   I am wondering if other data types make it a bit different, such as strings / utf8 arrays? I expect the conversion to be a bit slower there, because of the extra conversion/allocations needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Dandandan commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845235335


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   I meant to benchmark it on utf8 columns. The conversion from array -> `ScalarValue` for each string is extra expensive, as it copies / allocates new `String`s.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang closed pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang closed pull request #2156: Add an InSet as an optimized version for IN_LIST
URL: https://github.com/apache/arrow-datafusion/pull/2156


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r844685982


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -220,17 +260,63 @@ fn not_in_list_utf8<OffsetSize: StringOffsetSizeTrait>(
     compare_op_scalar!(array, values, |x, v: &[&str]| !v.contains(&x))
 }
 
+//check all filter values of In clause are static.
+//include `CastExpr + Literal` or `Literal`
+fn check_all_static_filter_expr(list: &[Arc<dyn PhysicalExpr>]) -> bool {
+    list.iter().all(|v| {
+        let cast = v.as_any().downcast_ref::<expressions::CastExpr>();
+        if let Some(c) = cast {
+            c.expr()
+                .as_any()
+                .downcast_ref::<expressions::Literal>()
+                .is_some()
+        } else {
+            let cast = v.as_any().downcast_ref::<expressions::Literal>();
+            cast.is_some()
+        }
+    })
+}
+
+fn cast_static_filter_to_set(list: &[Arc<dyn PhysicalExpr>]) -> HashSet<ScalarValue> {
+    HashSet::from_iter(list.iter().map(|expr| {
+        if let Some(cast) = expr.as_any().downcast_ref::<expressions::CastExpr>() {
+            cast.expr()
+                .as_any()
+                .downcast_ref::<expressions::Literal>()
+                .unwrap()
+                .value()
+                .clone()
+        } else {
+            expr.as_any()
+                .downcast_ref::<expressions::Literal>()
+                .unwrap()
+                .value()
+                .clone()
+        }
+    }))
+}
+
 impl InListExpr {
     /// Create a new InList expression
     pub fn new(
         expr: Arc<dyn PhysicalExpr>,
         list: Vec<Arc<dyn PhysicalExpr>>,
         negated: bool,
     ) -> Self {
-        Self {
-            expr,
-            list,
-            negated,
+        if list.len() > OPTIMIZER_INSET_THRESHOLD && check_all_static_filter_expr(&list) {

Review Comment:
   According to we not Support `switch codeGen` change to 10, like spark 2.x



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r844828456


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   Agree!  i will post a bench result. By the way, without subquery implement, Is there some way easy to do?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r843210109


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,15 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+static OPTIMIZER_INSET_THRESHOLD: usize = 400;

Review Comment:
   ```suggestion
   /// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
   /// Value chosen to be consistent with Spark
   static OPTIMIZER_INSET_THRESHOLD: usize = 400;
   ```



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -318,7 +404,13 @@ impl InListExpr {
 impl std::fmt::Display for InListExpr {
     fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
         if self.negated {
-            write!(f, "{} NOT IN ({:?})", self.expr, self.list)
+            if self.set.is_some() {
+                write!(f, "Use In_Set{} NOT IN ({:?})", self.expr, self.list)
+            } else {
+                write!(f, "{} NOT IN ({:?})", self.expr, self.list)
+            }
+        } else if self.set.is_some() {
+            write!(f, "Use In_Set{} IN ({:?} use In_Set)", self.expr, self.list)

Review Comment:
   I wonder if this would be clearer
   
   ```suggestion
               if self.set.is_some() {
                   write!(f, "{} NOT IN (SET) ({:?})", self.expr, self.list)
               } else {
                   write!(f, "{} NOT IN ({:?})", self.expr, self.list)
               }
           } else if self.set.is_some() {
               write!(f, "Use {} IN (SET) ({:?})", self.expr, self.list)
   ```
   
   



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -69,6 +72,23 @@ pub struct InListExpr {
     expr: Arc<dyn PhysicalExpr>,
     list: Vec<Arc<dyn PhysicalExpr>>,
     negated: bool,
+    set: Option<InSet>,
+}
+
+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,

Review Comment:
   Filed https://github.com/apache/arrow-datafusion/issues/2165



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,15 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+static OPTIMIZER_INSET_THRESHOLD: usize = 400;

Review Comment:
   @Ted-Jiang  do you  happen to  have a link we could point to (as in I don't know how to verify this claim, I am just copying it from the PR comments)?



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -342,119 +434,202 @@ impl PhysicalExpr for InListExpr {
     fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
         let value = self.expr.evaluate(batch)?;
         let value_data_type = value.data_type();
-        let list_values = self
-            .list
-            .iter()
-            .map(|expr| expr.evaluate(batch))
-            .collect::<Result<Vec<_>>>()?;
-
-        let array = match value {
-            ColumnarValue::Array(array) => array,
-            ColumnarValue::Scalar(scalar) => scalar.to_array(),
-        };
 
-        match value_data_type {
-            DataType::Float32 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    Float32,
-                    Float32Array
-                )
-            }
-            DataType::Float64 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    Float64,
-                    Float64Array
-                )
-            }
-            DataType::Int16 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    Int16,
-                    Int16Array
-                )
-            }
-            DataType::Int32 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    Int32,
-                    Int32Array
-                )
-            }
-            DataType::Int64 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    Int64,
-                    Int64Array
-                )
-            }
-            DataType::Int8 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    Int8,
-                    Int8Array
-                )
-            }
-            DataType::UInt16 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    UInt16,
-                    UInt16Array
-                )
-            }
-            DataType::UInt32 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    UInt32,
-                    UInt32Array
-                )
-            }
-            DataType::UInt64 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    UInt64,
-                    UInt64Array
-                )
-            }
-            DataType::UInt8 => {
-                make_contains_primitive!(
-                    array,
-                    list_values,
-                    self.negated,
-                    UInt8,
-                    UInt8Array
-                )
-            }
-            DataType::Boolean => {
-                make_contains!(array, list_values, self.negated, Boolean, BooleanArray)
-            }
-            DataType::Utf8 => self.compare_utf8::<i32>(array, list_values, self.negated),
-            DataType::LargeUtf8 => {
-                self.compare_utf8::<i64>(array, list_values, self.negated)
+        if let Some(in_set) = &self.set {
+            let array = match value {
+                ColumnarValue::Array(array) => array,
+                ColumnarValue::Scalar(scalar) => scalar.to_array(),

Review Comment:
   This is unfortunate -- turning a scalar into an array just to convert it back to a scalar if using InList
   
   I wonder if we can pass the columnar_value to `set_contains_with_negated` and only do the conversion when using the Vec (not the Set)?
   
   This probably doesn't really make any sort of performance difference for any case we care about, I just noticed it and thought I would mention it)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb merged pull request #2156: Optimize the evaluation of `IN` for large lists using InSet

Posted by GitBox <gi...@apache.org>.

alamb merged PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845757545


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

yjshen commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r842323638


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -69,6 +72,23 @@ pub struct InListExpr {
     expr: Arc<dyn PhysicalExpr>,
     list: Vec<Arc<dyn PhysicalExpr>>,
     negated: bool,
+    set: Option<InSet>,
+}
+
+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,

Review Comment:
   I think it's because `ScalarValue` is an enum of an option wrapper of value. So it would be overheads for both memory footprint and computation compared to HashSet of native data values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Dandandan commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r843228610


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -69,6 +72,23 @@ pub struct InListExpr {
     expr: Arc<dyn PhysicalExpr>,
     list: Vec<Arc<dyn PhysicalExpr>>,
     negated: bool,
+    set: Option<InSet>,
+}
+
+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,

Review Comment:
   In addition to higher memory usage and dispatching overhead there are two extra sources of overhead
   
   * Having to convert all values from array items to `ScalarValue`
   > * Hashing a `Scalarvalue` is slower than hashing a native type.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845260984


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   <img width="653" alt="image" src="https://user-images.githubusercontent.com/37145547/162232125-28122bd8-64c7-410f-9996-2df4bf27f942.png">
   Cool, you are right! @Dandandan Thanks!  utf8 columns threshold near 50.
   Maybe i will Set it by type👍
   
   btw: where array -> ScalarValue copies / allocates happened in code😂



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845198369


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen to be consistent with Spark
+/// https://github.com/apache/spark/blob/4e95738fdfc334c25f44689ff8c2db5aa7c726f2/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L259-L265
+/// TODO: add switch codeGen in In_List
+static OPTIMIZER_INSET_THRESHOLD: usize = 10;

Review Comment:
   ![image](https://user-images.githubusercontent.com/37145547/162219787-c7a0941e-72f1-426e-bd03-f86687f8b040.png)
   
   `X`: cost time. `Y`: filter value numbers 
   
   `Blue`: is use List, `orange`: is use Set.  
   
   I did a benchMark in my local, use `select count(*) from orders where o_orderkey in (x1, x2, ..., xn)`
   Obviously, Set has a fixed gradient, List cost time increases with the parameter number. 
   The intersection of the two lines is located is between 10～20 （same as Spark set 10). 
   So, i decided set `OPTIMIZER_INSET_THRESHOLD ` = 10 align with spark.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

Dandandan commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r843228610


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -69,6 +72,23 @@ pub struct InListExpr {
     expr: Arc<dyn PhysicalExpr>,
     list: Vec<Arc<dyn PhysicalExpr>>,
     negated: bool,
+    set: Option<InSet>,
+}
+
+/// InSet
+#[derive(Debug)]
+pub struct InSet {
+    set: HashSet<ScalarValue>,

Review Comment:
   In addition to higher memory usage and dispatching overhead there are two extra sources of overhead
   
   * Having to convert all values from array items to `ScalarValue`
   * Hashing a `Scalarvalue` is slower than hashing a native type.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] yjshen commented on a diff in pull request #2156: Add an InSet as an optimized version for IN_LIST

Posted by GitBox <gi...@apache.org>.

yjshen commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845769129


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -32,13 +33,19 @@ use arrow::{
     record_batch::RecordBatch,
 };
 
-use crate::PhysicalExpr;
+use crate::{expressions, PhysicalExpr};
 use arrow::array::*;
 use arrow::buffer::{Buffer, MutableBuffer};
 use datafusion_common::ScalarValue;
 use datafusion_common::{DataFusionError, Result};
 use datafusion_expr::ColumnarValue;
 
+/// Size at which to use a Set rather than Vec for `IN` / `NOT IN`
+/// Value chosen by the benchmark at
+/// https://github.com/apache/arrow-datafusion/pull/2156#discussion_r845198369
+/// TODO: add switch codeGen in In_List

Review Comment:
   Is this line of doc still valid?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] Optimize the evaluation of `IN` for large lists using InSet [arrow-datafusion]

Posted by "my-vegetable-has-exploded (via GitHub)" <gi...@apache.org>.

my-vegetable-has-exploded commented on code in PR #2156:
URL: https://github.com/apache/arrow-datafusion/pull/2156#discussion_r1413122706


##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -318,7 +408,13 @@ impl InListExpr {
 impl std::fmt::Display for InListExpr {
     fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
         if self.negated {
-            write!(f, "{} NOT IN ({:?})", self.expr, self.list)
+            if self.set.is_some() {
+                write!(f, "{} NOT IN (SET) ({:?})", self.expr, self.list)
+            } else {
+                write!(f, "{} NOT IN ({:?})", self.expr, self.list)
+            }
+        } else if self.set.is_some() {
+            write!(f, "Use {} IN (SET) ({:?})", self.expr, self.list)

Review Comment:
   Sorry to bother you, is use in the "Use {} IN (SET) ({:?})" a typo?  @Ted-Jiang 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org