Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/07 21:46:53 UTC

[GitHub] [arrow] westonpace opened a new pull request #10272: ARROW-12677: [Python] Add a mask argument to pyarrow.StructArray.from_arrays

westonpace opened a new pull request #10272:
URL: https://github.com/apache/arrow/pull/10272


   This allows the user to supply an optional `mask` when creating a struct array.
   
    * The mask requirements are pretty strict (must be a boolean arrow array without nulls) compared with some of the other functions (e.g. `array.mask` accepts a wide variety of inputs).  I think this should be ok since this use case is probably rarer and there are plenty of other existing ways to convert other datatypes to an arrow array.
    * Unfortunately, StructArray::Make interprets the "null buffer" as more of a validity buffer (1 = valid, 0 = null).  This is the opposite of everywhere else a `mask` is used.  I was torn between inverting the input buffer to mimic the python API and passing through directly to the C interface for simplicity.  I chose the simpler option but could be convinced otherwise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10272: ARROW-12677: [Python] Add a mask argument to pyarrow.StructArray.from_arrays

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10272:
URL: https://github.com/apache/arrow/pull/10272#issuecomment-836283202


   If we add such a `mask` keyword, I would say that it should match the `mask` keyword of `pa.array(..)` (and thus be an inverted mask, not the validity mask used internally).
   
   That means some conversion of the data is always needed (to invert the mask), so you cannot create a StructArray 100% cheaply from existing arrays with `from_arrays`, but for such a case you can still use `from_buffers` if you want.
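
To make the inversion concrete, here is a minimal pure-Python sketch. It is not the pyarrow implementation, and `mask_to_validity_bitmap` is a hypothetical name used only for illustration. It assumes the `pa.array(..)` convention for the mask (True = null) and packs the result as an Arrow-style validity bitmap (bit set = valid, least-significant bit first).

```python
def mask_to_validity_bitmap(mask):
    """Pack a null mask (True = null) into a validity bitmap (bit 1 = valid).

    Hypothetical helper for illustration only. Bits are numbered
    least-significant first, so element i lives in byte i // 8, bit i % 8.
    """
    bitmap = bytearray((len(mask) + 7) // 8)
    for i, is_null in enumerate(mask):
        if not is_null:
            # Inversion happens here: a False mask entry sets a validity bit.
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


# Element 1 is masked (null); elements 0, 2 and 3 are valid.
mask = [False, True, False, False]
validity = mask_to_validity_bitmap(mask)
```

Because every bit has to be flipped, the mask cannot be reused as the validity buffer without a copy, which is the cost mentioned above.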
   
   





[GitHub] [arrow] westonpace commented on a change in pull request #10272: ARROW-12677: [Python] Add a mask argument to pyarrow.StructArray.from_arrays

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10272:
URL: https://github.com/apache/arrow/pull/10272#discussion_r632767457



##########
File path: python/pyarrow/array.pxi
##########
@@ -2189,6 +2227,18 @@ cdef class StructArray(Array):
         if names is not None and fields is not None:
             raise ValueError('Must pass either names or fields, not both')
 
+        if mask is None:
+            c_mask = shared_ptr[CBuffer]()
+        elif isinstance(mask, Array):
+            if mask.type != bool_():

Review comment:
       I couldn't figure out how to access `is_boolean` from the `pxi` file but I changed it to `mask.type.id != Type_BOOL` which is similar to the way other type comparisons are done in this file.







[GitHub] [arrow] lidavidm closed pull request #10272: ARROW-12677: [Python] Add a mask argument to pyarrow.StructArray.from_arrays

Posted by GitBox <gi...@apache.org>.
lidavidm closed pull request #10272:
URL: https://github.com/apache/arrow/pull/10272


   





[GitHub] [arrow] github-actions[bot] commented on pull request #10272: ARROW-12677: [Python] Add a mask argument to pyarrow.StructArray.from_arrays

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10272:
URL: https://github.com/apache/arrow/pull/10272#issuecomment-834802529


   https://issues.apache.org/jira/browse/ARROW-12677





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10272: ARROW-12677: [Python] Add a mask argument to pyarrow.StructArray.from_arrays

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10272:
URL: https://github.com/apache/arrow/pull/10272#discussion_r633344983



##########
File path: python/pyarrow/array.pxi
##########
@@ -2153,7 +2186,8 @@ cdef class StructArray(Array):
         return [pyarrow_wrap_array(arr) for arr in arrays]
 
     @staticmethod
-    def from_arrays(arrays, names=None, fields=None):
+    def from_arrays(arrays, names=None, fields=None, mask=None,
+                    memory_pool=None):

Review comment:
       It seems we are also inconsistent in the naming of this keyword: the `ListArray.from_arrays` above uses a `pool` keyword. But `memory_pool` is used more often, so this change is fine; I will open a JIRA to make this consistent.

##########
File path: python/pyarrow/array.pxi
##########
@@ -2153,7 +2186,8 @@ cdef class StructArray(Array):
         return [pyarrow_wrap_array(arr) for arr in arrays]
 
     @staticmethod
-    def from_arrays(arrays, names=None, fields=None):
+    def from_arrays(arrays, names=None, fields=None, mask=None,
+                    memory_pool=None):

Review comment:
       -> https://issues.apache.org/jira/browse/ARROW-12805







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10272: ARROW-12677: [Python] Add a mask argument to pyarrow.StructArray.from_arrays

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10272:
URL: https://github.com/apache/arrow/pull/10272#discussion_r633349425



##########
File path: python/pyarrow/array.pxi
##########
@@ -2153,7 +2186,8 @@ cdef class StructArray(Array):
         return [pyarrow_wrap_array(arr) for arr in arrays]
 
     @staticmethod
-    def from_arrays(arrays, names=None, fields=None):
+    def from_arrays(arrays, names=None, fields=None, mask=None,
+                    memory_pool=None):

Review comment:
       -> https://issues.apache.org/jira/browse/ARROW-12805







[GitHub] [arrow] westonpace commented on pull request #10272: ARROW-12677: [Python] Add a mask argument to pyarrow.StructArray.from_arrays

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10272:
URL: https://github.com/apache/arrow/pull/10272#issuecomment-840899285


   I've added the call to invert.  I went ahead and added a `memory_pool` parameter per @&res suggestion on the JIRA.  I also verified that we can create null elements in a `ListArray` and added a test for `ListArray.from_arrays` since there was none.
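
As a rough model of the null-element behaviour that the new `ListArray.from_arrays` test checks, here is a pure-Python sketch. It is not the Arrow implementation; `lists_from_offsets` is a hypothetical name, and it assumes the final offset is never null. A null offset marks that list as null, and the preceding list ends at the next non-null offset.

```python
def lists_from_offsets(offsets, values):
    """Model how offsets slice values into lists (hypothetical helper).

    A None offset at position i makes list i null. Assumes the last
    offset is not None.
    """
    lists = []
    for i in range(len(offsets) - 1):
        start = offsets[i]
        if start is None:
            lists.append(None)
            continue
        # The list ends at the next non-null offset.
        end = next(o for o in offsets[i + 1:] if o is not None)
        if start < 0 or end > len(values) or start > end:
            raise ValueError(f"offsets out of bounds: {start}..{end}")
        lists.append(values[start:end])
    return lists


values = [1, 2, 3, 4]
plain = lists_from_offsets([0, 2, 4], values)
with_null = lists_from_offsets([0, None, 2, 4], values)
```

The bounds check mirrors the two `pytest.raises(ValueError)` cases in the test, a negative first offset and a last offset past the end of `values`.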
   
   I believe I have addressed all PR comments and, pending CI, this is ready for review.





[GitHub] [arrow] lidavidm commented on a change in pull request #10272: ARROW-12677: [Python] Add a mask argument to pyarrow.StructArray.from_arrays

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10272:
URL: https://github.com/apache/arrow/pull/10272#discussion_r632622802



##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -932,6 +954,24 @@ def test_fixed_size_list_from_arrays():
         pa.FixedSizeListArray.from_arrays(values, 5)
 
 
+def test_variable_list_from_arrays():
+    values = pa.array([1, 2, 3, 4], pa.int64())
+    offsets = pa.array([0, 2, 4])
+    result = pa.ListArray.from_arrays(offsets, values)
+    assert result.to_pylist() == [[1, 2], [3, 4]]
+    assert result.type.equals(pa.list_(pa.int64()))
+
+    offsets = pa.array([0, None, 2, 4])
+    result = pa.ListArray.from_arrays(offsets, values)
+    assert result.to_pylist() == [[1, 2], None, [3, 4]]
+
+    # raise if offset out of bounds
+    with pytest.raises(ValueError):
+        pa.ListArray.from_arrays(pa.array([-1, 2, 4]), values)
+
+    with pytest.raises(ValueError):
+        pa.ListArray.from_arrays(pa.array([0, 2, 5]), values)
+

Review comment:
       ```suggestion
   
   
   ```

##########
File path: python/pyarrow/array.pxi
##########
@@ -2189,6 +2227,18 @@ cdef class StructArray(Array):
         if names is not None and fields is not None:
             raise ValueError('Must pass either names or fields, not both')
 
+        if mask is None:
+            c_mask = shared_ptr[CBuffer]()
+        elif isinstance(mask, Array):
+            if mask.type != bool_():

Review comment:
       nit: maybe pa.types.is_boolean?

##########
File path: python/pyarrow/array.pxi
##########
@@ -2189,6 +2227,18 @@ cdef class StructArray(Array):
         if names is not None and fields is not None:
             raise ValueError('Must pass either names or fields, not both')
 
+        if mask is None:
+            c_mask = shared_ptr[CBuffer]()
+        elif isinstance(mask, Array):
+            if mask.type != bool_():
+                raise ValueError('Mask must be a pyarray.Array of type bool')
+            if mask.null_count != 0:
+                raise ValueError('Mask must not contain nulls')
+            inverted_mask = _pc().invert(mask, memory_pool=memory_pool)
+            c_mask = pyarrow_unwrap_buffer(inverted_mask.buffers()[1])
+        else:
+            raise ValueError('Mask must be a pyarray.Array of type bool')

Review comment:
       ```suggestion
               raise ValueError('Mask must be a pyarrow.Array of type bool')
   ```

##########
File path: python/pyarrow/array.pxi
##########
@@ -2189,6 +2227,18 @@ cdef class StructArray(Array):
         if names is not None and fields is not None:
             raise ValueError('Must pass either names or fields, not both')
 
+        if mask is None:
+            c_mask = shared_ptr[CBuffer]()
+        elif isinstance(mask, Array):
+            if mask.type != bool_():
+                raise ValueError('Mask must be a pyarray.Array of type bool')
+            if mask.null_count != 0:
+                raise ValueError('Mask must not contain nulls')
+            inverted_mask = _pc().invert(mask, memory_pool=memory_pool)
+            c_mask = pyarrow_unwrap_buffer(inverted_mask.buffers()[1])
+        else:
+            raise ValueError('Mask must be a pyarray.Array of type bool')

Review comment:
       nit: maybe also include an `'(expected pyarrow.Array of type bool, got {type(mask)})'` (this is semi-consistently done in PyArrow)
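
A small sketch of what such a message could look like, in plain Python. The helper name `mask_type_error` and the exact wording are assumptions for illustration, not what the PR ended up with.

```python
def mask_type_error(mask):
    """Build an error that reports the type actually received.

    Hypothetical helper; the wording is only an example.
    """
    return ValueError(
        "Mask has an unsupported type "
        f"(expected pyarrow.Array of type bool, got {type(mask)})"
    )


# A plain Python list is not a pyarrow.Array, so it would be rejected.
err = mask_type_error([True, False])
```

Reporting `type(mask)` tells the caller what they passed, which is usually enough to spot a list or NumPy array that was not converted first.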

##########
File path: python/pyarrow/array.pxi
##########
@@ -2189,6 +2227,18 @@ cdef class StructArray(Array):
         if names is not None and fields is not None:
             raise ValueError('Must pass either names or fields, not both')
 
+        if mask is None:
+            c_mask = shared_ptr[CBuffer]()
+        elif isinstance(mask, Array):
+            if mask.type != bool_():
+                raise ValueError('Mask must be a pyarray.Array of type bool')

Review comment:
       ```suggestion
                   raise ValueError('Mask must be a pyarrow.Array of type bool')
   ```

##########
File path: python/pyarrow/tests/test_array.py
##########
@@ -932,6 +954,24 @@ def test_fixed_size_list_from_arrays():
         pa.FixedSizeListArray.from_arrays(values, 5)
 
 
+def test_variable_list_from_arrays():
+    values = pa.array([1, 2, 3, 4], pa.int64())
+    offsets = pa.array([0, 2, 4])
+    result = pa.ListArray.from_arrays(offsets, values)
+    assert result.to_pylist() == [[1, 2], [3, 4]]
+    assert result.type.equals(pa.list_(pa.int64()))
+
+    offsets = pa.array([0, None, 2, 4])
+    result = pa.ListArray.from_arrays(offsets, values)
+    assert result.to_pylist() == [[1, 2], None, [3, 4]]
+
+    # raise if offset out of bounds
+    with pytest.raises(ValueError):
+        pa.ListArray.from_arrays(pa.array([-1, 2, 4]), values)
+
+    with pytest.raises(ValueError):
+        pa.ListArray.from_arrays(pa.array([0, 2, 5]), values)
+

Review comment:
       (just to fix the lint error)



