You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/11 19:17:52 UTC

[GitHub] [arrow] pitrou opened a new pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

pitrou opened a new pull request #9164:
URL: https://github.com/apache/arrow/pull/9164


   * Reject duplicates in SetLookupOptions::value_set, because otherwise the indices returned by index_in would be relative to a deduplicated value_set.
   
   * Honour SetLookup::skip_nulls in is_in.
   
   * Revamp tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #9164:
URL: https://github.com/apache/arrow/pull/9164#discussion_r555793625



##########
File path: cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
##########
@@ -272,10 +278,9 @@ struct IsInVisitor {
 
   Status Visit(const DataType&) {
     const auto& state = checked_cast<const SetLookupState<NullType>&>(*ctx->state());
+    // XXX should skip_nulls be taken into account?

Review comment:
       Good point. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #9164:
URL: https://github.com/apache/arrow/pull/9164


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #9164:
URL: https://github.com/apache/arrow/pull/9164#discussion_r555650652



##########
File path: cpp/src/arrow/compute/api_scalar.h
##########
@@ -290,17 +290,19 @@ Result<Datum> KleeneAndNot(const Datum& left, const Datum& right,
 /// \brief IsIn returns true for each element of `values` that is contained in
 /// `value_set`
 ///
-/// If null occurs in left, if null count in right is not 0,
-/// it returns true, else returns null.
+/// Behaviour of nulls is governed by SetLookupOptions::skip_nulls.

Review comment:
       Also, could maybe add a similar sentence "Behaviour of nulls is governed by SetLookupOptions::skip_nulls" to the `is_in_doc` ?

##########
File path: cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
##########
@@ -272,10 +278,9 @@ struct IsInVisitor {
 
   Status Visit(const DataType&) {
     const auto& state = checked_cast<const SetLookupState<NullType>&>(*ctx->state());
+    // XXX should skip_nulls be taken into account?

Review comment:
       It's a bit strange corner case, but for consistency probably yes? 
   
   Right now for null array the skip_nulls argument has no effect (using this PR):
   
   ```
   In [27]: pc.is_in(pa.array([None]), value_set=pa.array([None]), skip_null=False)
   Out[27]: 
   <pyarrow.lib.BooleanArray object at 0x7fac1d800880>
   [
     true
   ]
   
   In [28]: pc.is_in(pa.array([None]), value_set=pa.array([None]), skip_null=True)
   Out[28]: 
   <pyarrow.lib.BooleanArray object at 0x7fac1d800ca0>
   [
     true
   ]
   ```
   while if you add a non-null type to the array creation, but use the same values, you get a different result:
   
   ```
   In [30]: pc.is_in(pa.array([None], type="int64"), value_set=pa.array([None], type="int64"), skip_null=False)
   Out[30]: 
   <pyarrow.lib.BooleanArray object at 0x7fac1d82f880>
   [
     true
   ]
   
   In [31]: pc.is_in(pa.array([None], type="int64"), value_set=pa.array([None], type="int64"), skip_null=True)
   Out[31]: 
   <pyarrow.lib.BooleanArray object at 0x7fac1d7c2040>
   [
     false
   ]
   ```

##########
File path: cpp/src/arrow/compute/api_scalar.h
##########
@@ -290,17 +290,19 @@ Result<Datum> KleeneAndNot(const Datum& left, const Datum& right,
 /// \brief IsIn returns true for each element of `values` that is contained in
 /// `value_set`
 ///
-/// If null occurs in left, if null count in right is not 0,
-/// it returns true, else returns null.
+/// Behaviour of nulls is governed by SetLookupOptions::skip_nulls.

Review comment:
       Could mention here that the default is `skip_nulls=False` (or in the docstring of SetLookupOptions)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche edited a comment on pull request #9164:
URL: https://github.com/apache/arrow/pull/9164#issuecomment-758671039


   Thanks for the PR! 
   One additional thing I am wondering, while testing it out, would we ever want a behaviour where a `null` in the input gives a `null` in the output? Right now it's only possible to get `false` (if there is no null in the value_set, or if skip_nulls=True) or `true` (if there is a null in the value_set and skip_nulls=False).
   
   So something like `isin([1, 2, null], value_set=[1, 3]) -> [true, false, null]`
   
   If you see "isin"  as a shortcut to write multiple equality comparisons (`isin(input, value_set=[1, 3, ...]` -> `(input == 1) | (input == 3) | ...`), then you would get such behaviour. 
   But so it's a bit the question whether for "isin" we use "equality" semantics or "identity/lookup" semantics for nulls (and given it's now called "SetLookup" in the function names, we clearly go for the second, but I am not fully sure which of the two are most useful / expected in practice).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #9164:
URL: https://github.com/apache/arrow/pull/9164#issuecomment-758671039


   Thanks for the PR! 
   One additional thing I am wondering, while testing it out, would we ever want a behaviour where a null in the input gives a null in the output? Right now it's only possible to get `


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche edited a comment on pull request #9164:
URL: https://github.com/apache/arrow/pull/9164#issuecomment-758671039


   Thanks for the PR! 
   One additional thing I am wondering, while testing it out, would we ever want a behaviour where a `null` in the input gives a `null` in the output? Right now it's only possible to get `false` (if there is no null in the value_set, or if skip_nulls=True) or `true` (if there is a null in the value_set and skip_nulls=False).
   
   So something like `isin([1, 2, null], value_set=[1, 3]) -> [true, false, null]` (instead of true, false, false)
   
   If you see "isin"  as a shortcut to write multiple equality comparisons (`isin(input, value_set=[1, 3, ...]` -> `(input == 1) | (input == 3) | ...`), then you would get such behaviour. 
   But so it's a bit the question whether for "isin" we use "equality" semantics or "identity/lookup" semantics for nulls (and given it's now called "SetLookup" in the function names, we clearly go for the second, but I am not fully sure which of the two are most useful / expected in practice).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #9164:
URL: https://github.com/apache/arrow/pull/9164#issuecomment-758677360


   @jorisvandenbossche  I have no idea. Perhaps @bkietz or @michalursa can share their opinion.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #9164:
URL: https://github.com/apache/arrow/pull/9164#issuecomment-758682512


   Looking at the behavior of `%in%` in R (cc @nealrichardson), there NA's also get matched (eg `c(1, 2, NA) %in% c(1, 3)` gives true,false,false and `c(1, 2, NA) %in% c(1, 3, NA)` gives true,false,true), so that is consistent with the behaviour we have in Arrow right now. 
   
   The SQL `IN` operator does not seem to match Nulls, because there it is a short-hand for multiple comparisons. But, in practice, you can only use this (as far as I know, only limited SQL knowledge) in a WHERE clause. So whether the Null in the column gives False or Null doesn't matter much, because in both cases the row does not get preserved in a WHERE filter. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #9164:
URL: https://github.com/apache/arrow/pull/9164#discussion_r555954700



##########
File path: cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
##########
@@ -36,53 +36,59 @@ namespace {
 
 template <typename Type>
 struct SetLookupState : public KernelState {
-  explicit SetLookupState(MemoryPool* pool)
-      : lookup_table(pool, 0), lookup_null_count(0) {}
+  explicit SetLookupState(const SetLookupOptions& options, MemoryPool* pool)
+      : options(options), lookup_table(pool, 0) {}
 
-  Status Init(const SetLookupOptions& options) {
+  Status Init() {
+    if (options.value_set.kind() == Datum::ARRAY) {
+      RETURN_NOT_OK(AddArrayValueSet(*options.value_set.array()));
+    } else if (options.value_set.kind() == Datum::CHUNKED_ARRAY) {
+      const ChunkedArray& value_set = *options.value_set.chunked_array();
+      for (const std::shared_ptr<Array>& chunk : value_set.chunks()) {

Review comment:
       We have a lot of code like this, maybe later we should add `Datum::chunks()` so we can write
   ```c++
   if (!options.values_set.is_arraylike()) {
     return Status::Invalid("value_set should be an array or chunked array");
   }
   for (const std::shared_ptr<ArrayData>& chunk : options.value_set.chunks()) {
     RETURN_NOT_OK(AddArrayValueSet(*chunk->data()));
   }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #9164:
URL: https://github.com/apache/arrow/pull/9164#issuecomment-758683097


   (BTW, since this is existing behaviour, and this PR doesn't change that, that should certainly not hold up this PR, and should probably move the discussion to a JIRA)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #9164:
URL: https://github.com/apache/arrow/pull/9164#issuecomment-758829080


   The CI failure are spurious. Green Travis-CI build: https://travis-ci.com/github/pitrou/arrow/builds/212852090


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9164:
URL: https://github.com/apache/arrow/pull/9164#issuecomment-758214764


   https://issues.apache.org/jira/browse/ARROW-10663


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #9164: ARROW-10663: [C++] Fix is_in and index_in behaviour

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #9164:
URL: https://github.com/apache/arrow/pull/9164#discussion_r555961225



##########
File path: cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
##########
@@ -36,53 +36,59 @@ namespace {
 
 template <typename Type>
 struct SetLookupState : public KernelState {
-  explicit SetLookupState(MemoryPool* pool)
-      : lookup_table(pool, 0), lookup_null_count(0) {}
+  explicit SetLookupState(const SetLookupOptions& options, MemoryPool* pool)
+      : options(options), lookup_table(pool, 0) {}
 
-  Status Init(const SetLookupOptions& options) {
+  Status Init() {
+    if (options.value_set.kind() == Datum::ARRAY) {
+      RETURN_NOT_OK(AddArrayValueSet(*options.value_set.array()));
+    } else if (options.value_set.kind() == Datum::CHUNKED_ARRAY) {
+      const ChunkedArray& value_set = *options.value_set.chunked_array();
+      for (const std::shared_ptr<Array>& chunk : value_set.chunks()) {

Review comment:
       If we're not too bothered by the cost of a temporary vector then it may be nice indeed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org