You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/03/16 17:25:14 UTC
[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34570: GH-34568: [C++][Python] Expose Run-End Encoded arrays in Python Arrow

jorisvandenbossche commented on code in PR #34570:
URL: https://github.com/apache/arrow/pull/34570#discussion_r1139070633


##########
python/pyarrow/array.pxi:
##########
@@ -2851,6 +2851,103 @@ cdef class StructArray(Array):
         return self.take(indices)
 
 
+cdef class RunEndEncodedArray(Array):
+    """
+    Concrete class for Arrow run-end encoded arrays.
+    """
+
+    @staticmethod
+    def from_arrays(logical_length, run_ends, values, type=None):
+        """
+        Construct RunEndEncodedArray from run_ends and values arrays.
+
+        Parameters
+        ----------
+        logical_length : int
+            The logical length of the run-end encoded array.
+        run_ends : Array (int16, int32, or int64 type)
+            The run_ends array.
+        values : Array (any type)
+            The values array.
+        type : pyarrow.DataType, optional
+            The run_end_encoded(run_end_type, value_type) array type.
+
+        Returns
+        -------
+        RunEndEncodedArray
+        """
+        cdef:
+            int64_t _logical_length
+            Array _run_ends
+            Array _values
+            int64_t _logical_offset
+            shared_ptr[CDataType] c_type
+            shared_ptr[CRunEndEncodedArray] ree_array
+
+        _logical_length = <int64_t>logical_length
+        _run_ends = asarray(run_ends)
+        _values = asarray(values)
+        _logical_offset = <int64_t>0
+
+        type = ensure_type(type, allow_none=True)
+        if type is not None:
+            c_type = pyarrow_unwrap_data_type(type)
+            ree_array.reset(new CRunEndEncodedArray(
+                    c_type, _logical_length, _run_ends.sp_array,
+                    _values.sp_array, _logical_offset))

Review Comment:
   I suppose for this code path you need to ensure that run_ends and values have the correct matching type? (potentially the source of the segfault)



##########
python/pyarrow/array.pxi:
##########
@@ -2851,6 +2851,103 @@ cdef class StructArray(Array):
         return self.take(indices)
 
 
+cdef class RunEndEncodedArray(Array):
+    """
+    Concrete class for Arrow run-end encoded arrays.
+    """
+
+    @staticmethod
+    def from_arrays(logical_length, run_ends, values, type=None):
+        """
+        Construct RunEndEncodedArray from run_ends and values arrays.
+
+        Parameters
+        ----------
+        logical_length : int
+            The logical length of the run-end encoded array.

Review Comment:
   This can't be inferred from the run ends? Or could it be optional for non-offsetted arrays?



##########
python/pyarrow/tests/test_convert_builtin.py:
##########
@@ -2351,3 +2351,22 @@ def test_array_from_pylist_offset_overflow():
     assert isinstance(arr, pa.ChunkedArray)
     assert len(arr) == 2**31
     assert len(arr.chunks) > 1
+
+
+def check_run_end_encoded_with_type(ree_type=None):

Review Comment:
   Can you move this to `test_array.py` ? (I _think_ that's a better fit for testing the attributes of the array class and the from_arrays constructor)
   
   



##########
python/pyarrow/tests/test_types.py:
##########


Review Comment:
   There is a `get_many_types` at the top of this file that returns many types used in various generic tests, can you add one run end encoded type there as well?



##########
python/pyarrow/array.pxi:
##########
@@ -2851,6 +2851,103 @@ cdef class StructArray(Array):
         return self.take(indices)
 
 
+cdef class RunEndEncodedArray(Array):
+    """
+    Concrete class for Arrow run-end encoded arrays.
+    """
+
+    @staticmethod
+    def from_arrays(logical_length, run_ends, values, type=None):
+        """
+        Construct RunEndEncodedArray from run_ends and values arrays.
+
+        Parameters
+        ----------
+        logical_length : int
+            The logical length of the run-end encoded array.

Review Comment:
   It's also not super clear how this length is interpreted together with run_ends/values when it is smaller than the run_ends last value (it seems it is cut off at the end, which might be logical, but still something to make clearer I think)
   
   Essentially, for a "standard" array, you have (offsets=0, length=run_ends[-1]), and to have a shorter array both offsets could be > 1 or length could be < run_ends[-1]?



##########
python/pyarrow/tests/test_types.py:
##########
@@ -818,6 +818,19 @@ def test_fields_weakrefable():
     assert wr() is None
 
 
+def test_run_end_encoded_type():
+    ty = pa.run_end_encoded(pa.int64(), pa.utf8())
+    assert isinstance(ty, pa.RunEndEncodedType)
+    assert ty.run_end_type == pa.int64()
+    assert ty.value_type == pa.utf8()
+
+    with pytest.raises(TypeError):
+        pa.run_end_encoded(pa.int64(), None)
+
+    with pytest.raises(TypeError):
+        pa.run_end_encoded(None, pa.utf8())

Review Comment:
   Can you also test with an unsupported run_end type (eg uint)



##########
python/pyarrow/tests/test_convert_builtin.py:
##########
@@ -2351,3 +2351,22 @@ def test_array_from_pylist_offset_overflow():
     assert isinstance(arr, pa.ChunkedArray)
     assert len(arr) == 2**31
     assert len(arr.chunks) > 1
+
+
+def check_run_end_encoded_with_type(ree_type=None):

Review Comment:
   And could you also add a test for `RunEndEncodedArray.from_buffers`? (this gets inherited from the base class, but we should ensure this version actually works and not segfaults or so)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org