You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/06/30 21:18:44 UTC

[GitHub] [arrow] westonpace opened a new issue, #36420: [C++] Add an sql-compatible is_in variant

westonpace opened a new issue, #36420:
URL: https://github.com/apache/arrow/issues/36420

   ### Describe the enhancement requested
   
   This could be a new compute function or it could be new options for the existing is_in function.  Currently, the way `is_in` treats `null` does not match SQL semantics which makes it difficult to use when interoperability is expected.
   
   The current behavior is summed up nicely in the `SetLookupOptions::skip_nulls` description:
   
   ```
   /// Whether nulls in `value_set` count for lookup.
   ///
   /// If true, any null in `value_set` is ignored and nulls in the input
   /// produce null (IndexIn) or false (IsIn) values in the output.
   /// If false, any null in `value_set` is successfully matched in
   /// the input.
   ```
   
   Or, to put it another way.  The behavior we currently get (regardless of skip_nulls) is:
   
   | Input Value | Lookup Set | Skip Nulls | Result |
   | --- | --- | --- | --- |
   | 5 | [1, NULL] | any value | false |
   | 5 | [5, NULL] | any value | true |
   | NULL | [5] | any value | false |
   | NULL | [5, NULL] | true | false |
   | NULL | [5, NULL] | false | true |
   
   This makes perfect sense for the "null as sentinel" understanding of null.  However, SQL typically takes a "null is unknown" approach.  In which case we would want the following behavior:
   
   | Input Value | Lookup Set | Skip Nulls | Result |
   | --- | --- | --- | --- |
   | 5 | [1, NULL] | any value | NULL |
   | 5 | [5, NULL] | any value | true |
   | NULL | [5] | any value | NULL |
   | NULL | [5, NULL] | any value | NULL |
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on issue #36420: [C++] Add an sql-compatible is_in variant

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #36420:
URL: https://github.com/apache/arrow/issues/36420#issuecomment-1642008260

   @westonpace essentially, the behaviour you propose is to have a variant of `is_in` that is actually the equivalent of `(arr == val1) or (arr == val2) or ...`, right? 
   (i.e. how you can explain the "is_in" function, being an easier and more performant version of checking equality for multiple values, although at the moment it's not exactly equivalent when it comes to null handling)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] R-JunmingChen commented on issue #36420: [C++] Add an sql-compatible is_in variant

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #36420:
URL: https://github.com/apache/arrow/issues/36420#issuecomment-1628874700

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] R-JunmingChen commented on issue #36420: [C++] Add an sql-compatible is_in variant

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #36420:
URL: https://github.com/apache/arrow/issues/36420#issuecomment-1630850687

   I think adding a new Function named `in` could be a good way. Let me implement it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #36420: [C++] Add an sql-compatible is_in variant

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #36420:
URL: https://github.com/apache/arrow/issues/36420#issuecomment-1642543049

   > @westonpace essentially, the behaviour you propose is to have a variant of is_in that is actually the equivalent of (arr == val1) or (arr == val2) or ..., right?
   
   Yes.  That is correct.  I encountered this when working on #34834 because Substrait translates SQL's `WHERE x IN [1, 5]` to a `SingularOrList` and I want to translate that `SingularOrList` to `is_in` to get better performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz closed issue #36420: [C++] Add an sql-compatible is_in variant

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz closed issue #36420: [C++] Add an sql-compatible is_in variant
URL: https://github.com/apache/arrow/issues/36420


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] R-JunmingChen commented on issue #36420: [C++] Add an sql-compatible is_in variant

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #36420:
URL: https://github.com/apache/arrow/issues/36420#issuecomment-1628879930

   > This could be a new compute function or it could be new options for the existing is_in function. 
   
   Let me investigate, I will leave a message about my choice before I implement it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #36420: [C++] Add an sql-compatible is_in variant

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #36420:
URL: https://github.com/apache/arrow/issues/36420#issuecomment-1629400547

   Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] R-JunmingChen commented on issue #36420: [C++] Add an sql-compatible is_in variant

Posted by "R-JunmingChen (via GitHub)" <gi...@apache.org>.
R-JunmingChen commented on issue #36420:
URL: https://github.com/apache/arrow/issues/36420#issuecomment-1639989878

   Hi, @westonpace, I have implemented an `In` Function. However, the test codes for `is_in` is very large. Do i need to add the coresponding amount of tests or I just need to test the different part between `In` and  `is_in`? Besides, do I also need to add benchmark for `In` in `scalar_set_lookup_benchmark.cc`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org