You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/06/11 13:11:07 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #6635: Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)`

alamb opened a new issue, #6635:
URL: https://github.com/apache/arrow-datafusion/issues/6635

   ### Is your feature request related to a problem or challenge?
   
   A user on slack was using datafusion to query parquet files from S3: https://the-asf.slack.com/archives/C01QUFS30TD/p1686147411917959
   
   They reported that the following predicate got 1000x slower when it had 100,000 distinct filter values:
   ```sql
   WHERE
   (rate.code = 'SC1' and rate.type = 'UT1' and rate.operation = 'RO1') 
   OR 
   (rate.code = 'SC2' and rate.type = 'UT2' and rate.operation = 'RO2')
   ...
   ```
   
   However, when it was rewritten as an inlist it went much faster:
   ```sql
   WHERE (rate.code || '_' || rate.type || '_' || rate.operation) IN ('SC1_UT1_RO1', 'SC2_UT2_RO2', ...)
   ```
   
   However, this rewrite is not general (for example, if `code`, `type` or `operation` contain `_` characters.
   
   SQL supports this type of predicate natively with "multi-column inlists" that look like;
   
   ```sql
   WHERE (rate.code,  rate.type, rate.operation) IN (('SC1', 'UT1', 'RO1'), ('SC2', 'UT2', 'RO2') ...)
   ```
   
    Substrait supports this kind of predicate too, which I take as some evidence it is widely used
   
   https://substrait.io/expressions/specialized_record_expressions/#or-list-equality-expression
   
   ```
   > A specialized structure that is often used is a large list of possible values. In SQL, these are typically large IN lists. They can be composed from one or more fields. There are two common patterns, single value and multi value. In pseudocode they are represented as:
   ```
   
   ### Describe the solution you'd like
   
   I would like multi-column predicates to work in DataFusion. 
   
   today, they result in an "unimplemented" error:
   
   ```sql
   ❯ create table foo (x int, y varchar);
   0 rows in set. Query took 0.001 seconds.
   ❯ insert into foo values (1, 'a'), (2, 'b'), (3, 'c');
   +-------+
   | count |
   +-------+
   | 3     |
   +-------+
   1 row in set. Query took 0.002 seconds.
   ❯ select * from foo where (x, y) IN ((1,'a'), (2, 'b'), (5, 'e'));
   This feature is not implemented: Unsupported ast node in sqltorel: Tuple([Value(Number("1", false)), Value(SingleQuotedString("a"))])
   ```
   
   Here is an example showing this feature working in posgres:
   
   ```sql
   postgres=# create table foo (x int, y varchar);
   CREATE TABLE
   postgres=# insert into foo values (1, 'a'), (2, 'b'), (3, 'c');
   INSERT 0 3
   postgres=# select * from foo;
    x | y
   ---+---
    1 | a
    2 | b
    3 | c
   (3 rows)
   postgres=# select * from foo where (x, y) IN ((1,'a'), (2, 'b'), (5, 'e'));
    x | y
   ---+---
    1 | a
    2 | b
   (2 rows)
   ```
   
   ### Describe alternatives you've considered
   
   The existing InList structure looks like this: https://docs.rs/datafusion-expr/26.0.0/datafusion_expr/expr/struct.InList.html
   
   ```rust
   pub struct InList {
       pub expr: Box<Expr>,
       pub list: Vec<Expr>,
       pub negated: bool,
   }
   ```
   
   I am not sure how best to implement this. One idea is to simply special case multi-inputs, something like 
   
   ```rust
   struct MultIInList {
       pub exprs: Vec<Expr>,
       pub lists: Vec<Vec<Expr>>,
       pub negated: bool,
   }
   ```
   
   However, my preferred approach would be to support `StructArray`s in `InList` and then implement a rewrite from
   
   ```sql
   (col1, col2, ...) in (('a1', 'a2', ..), ('b1', 'b2', ..), ...)
   ```
   into
   
   ```
   struct(col1, col2) In ({ col1: 'a', col2: 'a2', ..}, {col1: 'b1', col2: 'b2', ..}, ...)
   ```
   
   While likely more complicated this approach would then support structs in INLISTs directly which I think will be more and more valuable over time
   
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #6635:
URL: https://github.com/apache/arrow-datafusion/issues/6635#issuecomment-1843704044

   > Tuple(Vec<Column>)
   
   Perhaps it could be represented like `ScalarValue::List(..)` 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #6635: Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)`

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #6635:
URL: https://github.com/apache/arrow-datafusion/issues/6635#issuecomment-1590004849

   > We can add a new type of Expr called StructArrays
   
   I think we could also use the existing `ScalarValue` like this : https://docs.rs/datafusion/latest/datafusion/scalar/enum.ScalarValue.html#variant.Struct
   
   The trick in that case would be implementing `StructArray` support in https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-expr/src/expressions/in_list.rs  -- @tustvold  said on slack "it would be straightforward" 😆  which it might be for him, I don't think it would be a good first issue
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #6635:
URL: https://github.com/apache/arrow-datafusion/issues/6635#issuecomment-1846114004

   I think the idea is that 
   
   | col1 | col2 |
   |--------|--------|
   | A | 1 |
   | B | 2 |
   | C | 3 | 
   
   So a query like `SELECT * FROM t WHERE (col1, col2) IN (('A', 1), ('B', 2))`
   
   Rather than actually implementing a second dimension in `InExpr` we could potentially keep things simpler with a rewrite like this:
   
   ```
   SELECT * 
   FROM t 
   WHERE (struct(col1, col2)) IN (struct('A', 1)), struct ('B', 2))`
   ```
   
   Where the struct function does this:
   
   ```
   DataFusion CLI v33.0.0
   ❯ select struct('A', 1);
   +----------------------------+
   | struct(Utf8("A"),Int64(1)) |
   +----------------------------+
   | {c0: A, c1: 1}             |
   +----------------------------+
   ```
   
   This would likely be a much simpler implementation and would be faster than the chained or shown above
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)` [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #6635:
URL: https://github.com/apache/arrow-datafusion/issues/6635#issuecomment-1847655478

   > But StructArray is not comparable in arrow-rs since it is nested. Should we implement compare in datafusion or upstream?
   
   I see your point. 
   
   The limitation appears to be in arrow itself: https://github.com/apache/arrow-rs/blob/7e289134a8d9f46a92a2759a7b2488b17993fd5b/arrow-ord/src/cmp.rs#L202-L204
   
   I think it might make sense to support equality of nested types upstream, though I am not sure about ordering comparsions (`<`, `<=`, etc)
   
   Perhaps it is worth filing a ticket in arrow-rs and proposing adding `StructArray` equality comparison support to eq and not_eq 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] mingmwang commented on issue #6635: Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)`

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.

mingmwang commented on issue #6635:
URL: https://github.com/apache/arrow-datafusion/issues/6635#issuecomment-1589701466

   I also prefer the `StructArrays` approach.  We can add a new type of `Expr` called  `StructArrays`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)` [arrow-datafusion]

Posted by "my-vegetable-has-exploded (via GitHub)" <gi...@apache.org>.

my-vegetable-has-exploded commented on issue #6635:
URL: https://github.com/apache/arrow-datafusion/issues/6635#issuecomment-1844407739

   > Perhaps it could be represented like `ScalarValue::List(..)` 🤔
   
   Sorry, maybe I don't get your point. I think ```(col1, col2, ...)``` likes multi-column projection from table rather than scalarvalue, so I don't think it can be be represented like ```ScalarValue::List(..)```.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)` [arrow-datafusion]

Posted by "my-vegetable-has-exploded (via GitHub)" <gi...@apache.org>.

my-vegetable-has-exploded commented on issue #6635:
URL: https://github.com/apache/arrow-datafusion/issues/6635#issuecomment-1846796006

   Thanks for your patience @alamb. After your explanation, I feel like I understand it a little bit more clearly.
   
   ```Tuple([Value('a'), Value('b)], ....)``` can be handled as ScalarValue(struct)， ```struct(col1, col2)``` will be ```BuiltinScalarFunction::Struct``` whose args are col1 and col2 ...
   
   And in_list code is generalized.
   
   https://github.com/apache/arrow-datafusion/blob/c0c9e8888878c5d7f2586cf605702430c94ea425/datafusion/physical-expr/src/expressions/in_list.rs#L360-L363
   
   But StructArray is not comparable in arrow-rs since it is nested. Should we implement compare in datafusion or upstream?
   
   For example,
   
   ```
   ❯ CREATE TABLE colors (
       color_id INT PRIMARY KEY,
       color_name VARCHAR(50)
   );
   INSERT INTO colors (color_id, color_name) VALUES (1, 'Red'), (2, 'Blue');
   
   ❯ SELECT * FROM colors WHERE struct(color_id) IN (struct(arrow_cast(1,'Int32')));
   Arrow error: Invalid argument error: Invalid comparison operation: Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) == Struct([Field { name: "c0", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)` [arrow-datafusion]

Posted by "my-vegetable-has-exploded (via GitHub)" <gi...@apache.org>.

my-vegetable-has-exploded commented on issue #6635:
URL: https://github.com/apache/arrow-datafusion/issues/6635#issuecomment-1859198097

   https://github.com/apache/arrow-rs/blob/9a1e8b572d11078e813fffe3d5c7106b6953d58c/arrow-cast/src/cast.rs#L163-L164
   The cast between structarrays also needs to support, I think these are the inherent problems of nested types. I'd like complete this ticket when these are all done in upstream.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Support multi column IN lists like `(c1, c2) IN ((c1, c2), ,,,)` [arrow-datafusion]

Posted by "my-vegetable-has-exploded (via GitHub)" <gi...@apache.org>.

my-vegetable-has-exploded commented on issue #6635:
URL: https://github.com/apache/arrow-datafusion/issues/6635#issuecomment-1837509932

   I am interesting in this issue. But I'm not sure about some of the questions.
   
   - ```('a1', 'a2', ..)``` will be resolved as ```Tuple([Value('a'), Value('b)], ....)```, and ```(col1, col2, ...)``` will be ```Tuple([Identifier("col1"), Identifier("col1"), ......])```,  I don't know how represent ```Tuple(Vec<Column>)``` in logical_expr? Do we need to add a new type of Expr for it?
   
   > then implement a rewrite from
   - Does the Rewrite have a unified interface for that, or can I just write it where I want it to be ？
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org