You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/06/22 13:11:50 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request, #36242: GH-34886: [Python] Add correct __array__ numpy conversion for Table and RecordBatch

jorisvandenbossche opened a new pull request, #36242:
URL: https://github.com/apache/arrow/pull/36242

   ### Rationale for this change
   
   Currently, calling `np.asarray(table)` gives the wrong result (transpose of the expected result), because we don't implement an explicit `__array__` to numpy conversion, and then numpy falls back to iterate the object (but this iterates the columns, giving a transposed result).
   
   To fix this unexpected result, I added an actual `__array__` (currently with a naive implementation in python, potentially this could be optimized in our C++ conversion layer)
   
   ### Are there any user-facing changes?
   
   This changes the behaviour of `np.asarray(table/record_batch)`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #36242: GH-34886: [Python] Add correct __array__ numpy conversion for Table and RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #36242:
URL: https://github.com/apache/arrow/pull/36242#discussion_r1245514344


##########
python/pyarrow/table.pxi:
##########
@@ -1487,6 +1487,16 @@ cdef class _Tabular(_PandasConvertible):
 
         return _PyArrowDataFrame(self, nan_as_null, allow_copy)
 
+    def __array__(self, dtype=None):
+        column_arrays = [
+            np.asarray(self.column(i), dtype=dtype) for i in range(self.num_columns)
+        ]
+        if column_arrays:
+            arr = np.stack(column_arrays, axis=1)
+        else:
+            arr = np.empty((self.num_rows, 0), dtype=dtype)

Review Comment:
   Yes, that is possible, you can have a table with a know number of rows but without columns:
   
   ```
   In [16]: table = pa.table({'a': [1, 2, 3]}).select([])
   
   In [17]: table
   Out[17]: 
   pyarrow.Table
   
   ----
   
   In [18]: table.num_rows
   Out[18]: 3
   
   In [19]: table.num_columns
   Out[19]: 0
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche merged pull request #36242: GH-34886: [Python] Add correct __array__ numpy conversion for Table and RecordBatch

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche merged PR #36242:
URL: https://github.com/apache/arrow/pull/36242


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #36242: GH-34886: [Python] Add correct __array__ numpy conversion for Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #36242:
URL: https://github.com/apache/arrow/pull/36242#discussion_r1245568173


##########
python/pyarrow/table.pxi:
##########
@@ -1487,6 +1487,16 @@ cdef class _Tabular(_PandasConvertible):
 
         return _PyArrowDataFrame(self, nan_as_null, allow_copy)
 
+    def __array__(self, dtype=None):
+        column_arrays = [
+            np.asarray(self.column(i), dtype=dtype) for i in range(self.num_columns)
+        ]
+        if column_arrays:
+            arr = np.stack(column_arrays, axis=1)
+        else:
+            arr = np.empty((self.num_rows, 0), dtype=dtype)

Review Comment:
   Very interesting! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #36242: GH-34886: [Python] Add correct __array__ numpy conversion for Table and RecordBatch

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #36242:
URL: https://github.com/apache/arrow/pull/36242#issuecomment-1619234389

   Conbench analyzed the 6 benchmark runs on commit `f9b619b7`.
   
   There were 5 benchmark results indicating a performance regression:
   
   - Commit Run on `ursa-thinkcentre-m75q` at [2023-07-02 04:24:36Z](http://conbench.ursa.dev/compare/runs/96ade3beb566490291ddb77c829b99e8...1e496f0bb8634c33844de1f7fb01daa5/)
     - [params=<Add, Int8Type>/size:524288/inverse_null_proportion:0, source=cpp-micro, suite=arrow-compute-scalar-arithmetic-benchmark](http://conbench.ursa.dev/compare/benchmarks/064a0d3fdee07692800004c8350d9f77...064a0fc54a097d5680008f93dadcc1a8)
     - [params=<Add, Int8Type>/size:524288/inverse_null_proportion:100, source=cpp-micro, suite=arrow-compute-scalar-arithmetic-benchmark](http://conbench.ursa.dev/compare/benchmarks/064a0d3fe08a75e28000585e152eaf0d...064a0fc54b1776c580002dd7208a333f)
   - and 3 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/14751432546) has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on a diff in pull request #36242: GH-34886: [Python] Add correct __array__ numpy conversion for Table and RecordBatch

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on code in PR #36242:
URL: https://github.com/apache/arrow/pull/36242#discussion_r1239024807


##########
python/pyarrow/table.pxi:
##########
@@ -1487,6 +1487,16 @@ cdef class _Tabular(_PandasConvertible):
 
         return _PyArrowDataFrame(self, nan_as_null, allow_copy)
 
+    def __array__(self, dtype=None):
+        column_arrays = [
+            np.asarray(self.column(i), dtype=dtype) for i in range(self.num_columns)
+        ]
+        if column_arrays:
+            arr = np.stack(column_arrays, axis=1)
+        else:
+            arr = np.empty((self.num_rows, 0), dtype=dtype)

Review Comment:
   Just out of curiosity, is it possible to have `self.num_rows > 0` in this case? I would imagine that if there are no columns then there are no rows.



##########
python/pyarrow/table.pxi:
##########
@@ -1487,6 +1487,16 @@ cdef class _Tabular(_PandasConvertible):
 
         return _PyArrowDataFrame(self, nan_as_null, allow_copy)
 
+    def __array__(self, dtype=None):

Review Comment:
   nit: should we maintain alphabetization of methods (ignoring the `__init__` method)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36242: GH-34886: [Python] Add correct __array__ numpy conversion for Table and RecordBatch

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36242:
URL: https://github.com/apache/arrow/pull/36242#issuecomment-1602616437

   :warning: GitHub issue #34886 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org