You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/24 13:01:49 UTC

[GitHub] [arrow-nanoarrow] jorisvandenbossche opened a new pull request, #62: [Python] Basic Array class wrapping C struct (with conversion to numpy)

jorisvandenbossche opened a new pull request, #62:
URL: https://github.com/apache/arrow-nanoarrow/pull/62

   Currently still on top https://github.com/apache/arrow-nanoarrow/pull/52
   
   This further explores possible Python bindings in cython. I added here an `Array` class that wraps an `ArrowArray*` and `ArrowSchema*` struct with some basic features: 1) ensuring to `release` the pointers when Array objects get deallocated, 2) method to construct it from raw pointers or from a pyarrow array (that last one is probably mostly useful for testing), 3) basic conversion to numpy which keeps this Array object as its base to keep it alive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-nanoarrow] jorisvandenbossche commented on a diff in pull request #62: [Python] Basic Array class wrapping C struct (with conversion to numpy)

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on code in PR #62:
URL: https://github.com/apache/arrow-nanoarrow/pull/62#discussion_r1005621109


##########
python/tests/test_nanoarrow.py:
##########
@@ -6,22 +8,64 @@
 import pytest
 
 
-def test_as_numpy_array():
-    
-    arr = pa.array([1, 2, 3])
-    result = nanoarrow.as_numpy_array(arr)
-    expected = arr.to_numpy()
+def test_array_from_pyarrow():
+    parr = pa.array([1, 2, 3])
+    result = nanoarrow.Array.from_pyarrow(parr)
+    assert result.format == "l"
+
+
+def test_array_to_numpy_lifetime():
+
+    parr = pa.array([1, 2, 3])
+    arr = nanoarrow.Array.from_pyarrow(parr)
+    refcount = sys.getrefcount(arr)
+    result = arr.to_numpy()
+    assert sys.getrefcount(arr) > refcount
+    assert result.base is arr
+    del arr
+    result
+    assert result.base
+
+
+def test_array_to_numpy():
+    parr = pa.array([1, 2, 3])
+    arr = nanoarrow.Array.from_pyarrow(parr)
+    result = arr.to_numpy()
+    expected = parr.to_numpy()
     np.testing.assert_array_equal(result, expected)
 
-    arr = pa.array([1, 2, 3], pa.uint8())
-    result = nanoarrow.as_numpy_array(arr)
-    expected = arr.to_numpy()
+    parr = pa.array([1, 2, 3], pa.uint8())
+    arr = nanoarrow.Array.from_pyarrow(parr)
+    result = arr.to_numpy()
+    expected = parr.to_numpy()
     np.testing.assert_array_equal(result, expected)
 
-    arr = pa.array([1, 2, None])
+    arr = nanoarrow.Array.from_pyarrow(pa.array([1, 2, None]))
     with pytest.raises(ValueError, match="Cannot convert array with nulls"):
-        nanoarrow.as_numpy_array(arr)
+        arr.to_numpy()
 
-    arr = pa.array([[1], [2, 3]])
+    arr = nanoarrow.Array.from_pyarrow(pa.array([[1], [2, 3]]))
     with pytest.raises(TypeError, match="Cannot convert a non-primitive array"):
-        nanoarrow.as_numpy_array(arr)
+       arr.to_numpy()
+
+
+def test_from_external_pointers():
+    pytest.importorskip("pyarrow.cffi")
+
+    from pyarrow.cffi import ffi
+
+    c_schema = ffi.new("struct ArrowSchema*")
+    ptr_schema = int(ffi.cast("uintptr_t", c_schema))
+    c_array = ffi.new("struct ArrowArray*")
+    ptr_array = int(ffi.cast("uintptr_t", c_array))
+
+    typ = pa.int32()
+    parr = pa.array([1, 2, 3], type=typ)
+    parr._export_to_c(ptr_array, ptr_schema)
+
+    arr = nanoarrow.Array.from_pointers(ptr_array, ptr_schema)
+    assert arr.to_numpy().tolist() == [1, 2, 3]
+
+    # trying to import second time should not cause a segfault? To enable

Review Comment:
   Yes, I think that's correct. Right now, with the raw pointers, it's the responsibility of the user passing around those pointers that they don't consume the pointers twice (as the consumer cannot know that there is another consumer of the same struct). 
   
   I think using PyCapsules instead of raw pointers could make this more robust. 
   But in practice to avoid this issue, I suppose we should "move" the array struct when constructing from external raw pointers, so the original pointer can be marked as released (release callback set to NULL)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-nanoarrow] paleolimbot commented on pull request #62: [Python] Basic Array class wrapping C struct (with conversion to numpy)

Posted by GitBox <gi...@apache.org>.

paleolimbot commented on PR #62:
URL: https://github.com/apache/arrow-nanoarrow/pull/62#issuecomment-1291997577

   Yes, the pattern of an object responsible for calling the release callback `struct ArrowWhatever` and doing a "move" whenever ownership changes is most consistent with how Arrow C++ does it. The user-allocated cffi structs would then be intermediaries (like they are when interacting with Arrow C++).
   
   You almost certainly don't want to expose the Array/ArrayView difference to actual users, but it's probably a useful concept in the internals (in the R package you can create an external pointer to an ArrayView that keeps the appropriate objects alive, but that's never exposed to the user). The interface that I exposed in the R package is more like (apologies for what is almost certainly invalid cython):
   
   ```cython
   cdef class Schema:
       cdef:
           ArrowArray array
           object parent
   
   cdef class Array:
       cdef:
           ArrowArray array
           ArrowSchema* schema_ptr # (might be NULL)
           object schema # (might be None or point to a Schema)
           object parent
   ```
   
   You can do a `nanoarrow_array_set_schema()` to attach a schema to an array (which I do internally when importing). You can in theory have an Array without a schema but this mostly happens at intermediary phases of passing the objects around.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-nanoarrow] jorisvandenbossche commented on a diff in pull request #62: [Python] Basic Array class wrapping C struct (with conversion to numpy)

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on code in PR #62:
URL: https://github.com/apache/arrow-nanoarrow/pull/62#discussion_r1005621109


##########
python/tests/test_nanoarrow.py:
##########
@@ -6,22 +8,64 @@
 import pytest
 
 
-def test_as_numpy_array():
-    
-    arr = pa.array([1, 2, 3])
-    result = nanoarrow.as_numpy_array(arr)
-    expected = arr.to_numpy()
+def test_array_from_pyarrow():
+    parr = pa.array([1, 2, 3])
+    result = nanoarrow.Array.from_pyarrow(parr)
+    assert result.format == "l"
+
+
+def test_array_to_numpy_lifetime():
+
+    parr = pa.array([1, 2, 3])
+    arr = nanoarrow.Array.from_pyarrow(parr)
+    refcount = sys.getrefcount(arr)
+    result = arr.to_numpy()
+    assert sys.getrefcount(arr) > refcount
+    assert result.base is arr
+    del arr
+    result
+    assert result.base
+
+
+def test_array_to_numpy():
+    parr = pa.array([1, 2, 3])
+    arr = nanoarrow.Array.from_pyarrow(parr)
+    result = arr.to_numpy()
+    expected = parr.to_numpy()
     np.testing.assert_array_equal(result, expected)
 
-    arr = pa.array([1, 2, 3], pa.uint8())
-    result = nanoarrow.as_numpy_array(arr)
-    expected = arr.to_numpy()
+    parr = pa.array([1, 2, 3], pa.uint8())
+    arr = nanoarrow.Array.from_pyarrow(parr)
+    result = arr.to_numpy()
+    expected = parr.to_numpy()
     np.testing.assert_array_equal(result, expected)
 
-    arr = pa.array([1, 2, None])
+    arr = nanoarrow.Array.from_pyarrow(pa.array([1, 2, None]))
     with pytest.raises(ValueError, match="Cannot convert array with nulls"):
-        nanoarrow.as_numpy_array(arr)
+        arr.to_numpy()
 
-    arr = pa.array([[1], [2, 3]])
+    arr = nanoarrow.Array.from_pyarrow(pa.array([[1], [2, 3]]))
     with pytest.raises(TypeError, match="Cannot convert a non-primitive array"):
-        nanoarrow.as_numpy_array(arr)
+       arr.to_numpy()
+
+
+def test_from_external_pointers():
+    pytest.importorskip("pyarrow.cffi")
+
+    from pyarrow.cffi import ffi
+
+    c_schema = ffi.new("struct ArrowSchema*")
+    ptr_schema = int(ffi.cast("uintptr_t", c_schema))
+    c_array = ffi.new("struct ArrowArray*")
+    ptr_array = int(ffi.cast("uintptr_t", c_array))
+
+    typ = pa.int32()
+    parr = pa.array([1, 2, 3], type=typ)
+    parr._export_to_c(ptr_array, ptr_schema)
+
+    arr = nanoarrow.Array.from_pointers(ptr_array, ptr_schema)
+    assert arr.to_numpy().tolist() == [1, 2, 3]
+
+    # trying to import second time should not cause a segfault? To enable

Review Comment:
   Yes, I think that's correct. Right now, with the raw pointers, it's the responsibility of the user passing around those pointers that they don't consume the pointers twice (as the consumer cannot know that there is another consumer of the same struct). 
   
   I think using PyCapsules instead of raw pointers could make this more robust for usage in Python. 
   But in practice to avoid this issue with raw pointers, I suppose we should "move" the array struct when constructing from external raw pointers, so the original pointer can be marked as released (release callback set to NULL). See my comment just below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-nanoarrow] paleolimbot commented on a diff in pull request #62: [Python] Basic Array class wrapping C struct (with conversion to numpy)

Posted by GitBox <gi...@apache.org>.

paleolimbot commented on code in PR #62:
URL: https://github.com/apache/arrow-nanoarrow/pull/62#discussion_r1004816765


##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -44,14 +47,82 @@ cdef dict _numpy_type_map = {
 }
 
 
-def as_numpy_array(arr):
-    cdef ArrowSchema schema
-    cdef ArrowArray array
+cdef class Array:

Review Comment:
   Semantics, but what you have here is more like a wrapper around the `ArrowArrayView` (that keeps the objects it points to alive). The R package has one of these too (and similarly it backs all the conversions to R vectors).



##########
python/tests/test_nanoarrow.py:
##########
@@ -6,22 +8,64 @@
 import pytest
 
 
-def test_as_numpy_array():
-    
-    arr = pa.array([1, 2, 3])
-    result = nanoarrow.as_numpy_array(arr)
-    expected = arr.to_numpy()
+def test_array_from_pyarrow():
+    parr = pa.array([1, 2, 3])
+    result = nanoarrow.Array.from_pyarrow(parr)
+    assert result.format == "l"
+
+
+def test_array_to_numpy_lifetime():
+
+    parr = pa.array([1, 2, 3])
+    arr = nanoarrow.Array.from_pyarrow(parr)
+    refcount = sys.getrefcount(arr)
+    result = arr.to_numpy()
+    assert sys.getrefcount(arr) > refcount
+    assert result.base is arr
+    del arr
+    result
+    assert result.base
+
+
+def test_array_to_numpy():
+    parr = pa.array([1, 2, 3])
+    arr = nanoarrow.Array.from_pyarrow(parr)
+    result = arr.to_numpy()
+    expected = parr.to_numpy()
     np.testing.assert_array_equal(result, expected)
 
-    arr = pa.array([1, 2, 3], pa.uint8())
-    result = nanoarrow.as_numpy_array(arr)
-    expected = arr.to_numpy()
+    parr = pa.array([1, 2, 3], pa.uint8())
+    arr = nanoarrow.Array.from_pyarrow(parr)
+    result = arr.to_numpy()
+    expected = parr.to_numpy()
     np.testing.assert_array_equal(result, expected)
 
-    arr = pa.array([1, 2, None])
+    arr = nanoarrow.Array.from_pyarrow(pa.array([1, 2, None]))
     with pytest.raises(ValueError, match="Cannot convert array with nulls"):
-        nanoarrow.as_numpy_array(arr)
+        arr.to_numpy()
 
-    arr = pa.array([[1], [2, 3]])
+    arr = nanoarrow.Array.from_pyarrow(pa.array([[1], [2, 3]]))
     with pytest.raises(TypeError, match="Cannot convert a non-primitive array"):
-        nanoarrow.as_numpy_array(arr)
+       arr.to_numpy()
+
+
+def test_from_external_pointers():
+    pytest.importorskip("pyarrow.cffi")
+
+    from pyarrow.cffi import ffi
+
+    c_schema = ffi.new("struct ArrowSchema*")
+    ptr_schema = int(ffi.cast("uintptr_t", c_schema))
+    c_array = ffi.new("struct ArrowArray*")
+    ptr_array = int(ffi.cast("uintptr_t", c_array))
+
+    typ = pa.int32()
+    parr = pa.array([1, 2, 3], type=typ)
+    parr._export_to_c(ptr_array, ptr_schema)
+
+    arr = nanoarrow.Array.from_pointers(ptr_array, ptr_schema)
+    assert arr.to_numpy().tolist() == [1, 2, 3]
+
+    # trying to import second time should not cause a segfault? To enable

Review Comment:
   My reading of cython code is not awesome, but I think importing twice causes a double free when the `release()` method to be called twice?



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -44,14 +47,82 @@ cdef dict _numpy_type_map = {
 }
 
 
-def as_numpy_array(arr):
-    cdef ArrowSchema schema
-    cdef ArrowArray array
+cdef class Array:
+
+    cdef:
+        ArrowArray* array_ptr
+        ArrowSchema* schema_ptr
+        bint own_data
+        bint own_ptrs
+        ArrowArrayView array_view
+        object parent
+
+    def __init__(self):
+        raise TypeError("Do not call constructor directly")
+
+    cdef void init(self, ArrowArray* array_ptr, ArrowSchema* schema_ptr) except *:
+        self.array_ptr = array_ptr
+        self.schema_ptr = schema_ptr
+        self.own_data = True
+
+    def __dealloc__(self):
+        if self.own_data:
+            self.array_ptr.release(self.array_ptr)
+            self.schema_ptr.release(self.schema_ptr)
+        if self.own_ptrs:
+            if self.array_ptr is not NULL:
+                free(self.array_ptr)
+                self.array_ptr = NULL
+            if self.schema_ptr is not NULL:
+                free(self.schema_ptr)
+                self.schema_ptr = NULL
+        self.parent = None
+
+    @property
+    def format(self):
+        cdef const char* format_string = deref(self.schema_ptr).format

Review Comment:
   With the latest nanoarrow you can do:
   
   https://github.com/apache/arrow-nanoarrow/blob/7395bf25275679f6a4fbc35c7cf9bc8fbd6b6150/r/src/schema.c#L146-L149



##########
python/src/nanoarrow/_lib.pyx:
##########
@@ -44,14 +47,82 @@ cdef dict _numpy_type_map = {
 }
 
 
-def as_numpy_array(arr):
-    cdef ArrowSchema schema
-    cdef ArrowArray array
+cdef class Array:
+
+    cdef:
+        ArrowArray* array_ptr
+        ArrowSchema* schema_ptr
+        bint own_data
+        bint own_ptrs
+        ArrowArrayView array_view
+        object parent
+
+    def __init__(self):
+        raise TypeError("Do not call constructor directly")
+
+    cdef void init(self, ArrowArray* array_ptr, ArrowSchema* schema_ptr) except *:
+        self.array_ptr = array_ptr
+        self.schema_ptr = schema_ptr
+        self.own_data = True
+
+    def __dealloc__(self):
+        if self.own_data:
+            self.array_ptr.release(self.array_ptr)
+            self.schema_ptr.release(self.schema_ptr)
+        if self.own_ptrs:
+            if self.array_ptr is not NULL:
+                free(self.array_ptr)
+                self.array_ptr = NULL
+            if self.schema_ptr is not NULL:
+                free(self.schema_ptr)
+                self.schema_ptr = NULL
+        self.parent = None
+
+    @property
+    def format(self):
+        cdef const char* format_string = deref(self.schema_ptr).format
+        if format_string == NULL:
+            return None
+        else:
+            return format_string.decode('utf8')
+
+    @staticmethod
+    cdef Array _from_ptrs(ArrowArray* array_ptr, ArrowSchema* schema_ptr, bint own_ptrs=False):
+        cdef Array self = Array.__new__(Array)
+        self.init(array_ptr, schema_ptr)
+        self.own_ptrs = own_ptrs
+        return self
+
+    @classmethod
+    def from_pointers(cls, array_ptr, schema_ptr):
+        cdef:
+            ArrowArray* c_array_ptr = <ArrowArray*> <uintptr_t > array_ptr
+            ArrowSchema* c_schema_ptr = <ArrowSchema*> <uintptr_t > schema_ptr
+        return Array._from_ptrs(c_array_ptr, c_schema_ptr)
+
+    @classmethod
+    def from_pyarrow(cls, arr):
+        cdef ArrowSchema *schema = <ArrowSchema *>malloc(sizeof(ArrowSchema))
+        cdef ArrowArray *array = <ArrowArray *>malloc(sizeof(ArrowArray))
+        arr._export_to_c(<uintptr_t> array, <uintptr_t> schema)
+        self = Array._from_ptrs(array, schema, own_ptrs=True)
+        self.parent = arr

Review Comment:
   You probably don't need this? (I think the release callback takes care of keeping the `shared_ptr` alive for the lifecycle of the schema/array)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-nanoarrow] paleolimbot closed pull request #62: [Python] Basic Array class wrapping C struct (with conversion to numpy)

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.

paleolimbot closed pull request #62: [Python] Basic Array class wrapping C struct (with conversion to numpy)
URL: https://github.com/apache/arrow-nanoarrow/pull/62


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-nanoarrow] paleolimbot commented on pull request #62: [Python] Basic Array class wrapping C struct (with conversion to numpy)

Posted by GitBox <gi...@apache.org>.

paleolimbot commented on PR #62:
URL: https://github.com/apache/arrow-nanoarrow/pull/62#issuecomment-1292015695

   Actually, I'm remembering that the R wrappers don't necessarily own the memory for the `struct ArrowWhatever`...for example when calling `schema$children[[1]]`, where the pointer is kept alive by `parent`. Perhaps `parent` could be one of those `PyCapsule` things for an object that really does need its `release` callback called?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-nanoarrow] jorisvandenbossche commented on pull request #62: [Python] Basic Array class wrapping C struct (with conversion to numpy)

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on PR #62:
URL: https://github.com/apache/arrow-nanoarrow/pull/62#issuecomment-1291955073

   Thanks for the comments!
   
   > Just spitballing, maybe what you could have is: ... that might simplify the `own_data`/`own_ptrs`/malloc lifecycle stuff you have here.
   
   The question is also to what extent we want to expose the Array / ArrayView difference to Python users. For nanoarrow C, this distinction makes total sense, as we want something on top of the raw ArrowArray struct, which we can't change. But in Python, the Array class is already something that wraps the raw struct, so I was thinking that therefore the Python API doesn't need to directly expose this distinction to the user.
   
   
   The `own_data` isn't actually used yet. I only added it (prematurely) because I was thinking that at some point the struct might be passed to something else (eg to pyarrow), and the nanoarrow.Array class shouldn't call the release callback. Although that can (should?) probably be done by copying the struct before passing it to another consumer and setting `array.release = NULL` (https://arrow.apache.org/docs/dev/format/CDataInterface.html#moving-an-array) and in `__dealloc__` check for the release callback not being NULL whether to call it or not. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-nanoarrow] paleolimbot commented on pull request #62: [Python] Basic Array class wrapping C struct (with conversion to numpy)

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.

paleolimbot commented on PR #62:
URL: https://github.com/apache/arrow-nanoarrow/pull/62#issuecomment-1603249650

   Closing for now...we definitely need to revisit the concept of the nanoarrow `Array` but I think now we have a better starting place!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org