Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/30 14:36:53 UTC

[GitHub] [arrow] kszucs opened a new pull request #7868: ARROW-9429: [Python] ChunkedArray.to_numpy

kszucs opened a new pull request #7868:
URL: https://github.com/apache/arrow/pull/7868


   Although the conversion technically still goes through pandas, this PR exposes a to_numpy() method on ChunkedArray.
   
   It also refactors the chunked array construction to support more flexible input; see the test case.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kszucs commented on pull request #7868: ARROW-9429: [Python] ChunkedArray.to_numpy

Posted by GitBox <gi...@apache.org>.
kszucs commented on pull request #7868:
URL: https://github.com/apache/arrow/pull/7868#issuecomment-671324848


   @pitrou rebased and addressed your comments





[GitHub] [arrow] pitrou commented on pull request #7868: ARROW-9429: [Python] ChunkedArray.to_numpy

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7868:
URL: https://github.com/apache/arrow/pull/7868#issuecomment-669257856


   @kszucs Can you rebase this PR? Hopefully it will fix AppVeyor.





[GitHub] [arrow] pitrou closed pull request #7868: ARROW-9429: [Python] ChunkedArray.to_numpy

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #7868:
URL: https://github.com/apache/arrow/pull/7868


   





[GitHub] [arrow] pitrou commented on a change in pull request #7868: ARROW-9429: [Python] ChunkedArray.to_numpy

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7868:
URL: https://github.com/apache/arrow/pull/7868#discussion_r465809999



##########
File path: python/pyarrow/table.pxi
##########
@@ -226,24 +226,34 @@ cdef class ChunkedArray(_PandasConvertible):
     def _to_pandas(self, options, **kwargs):
         return _array_like_to_pandas(self, options)
 
-    def __array__(self, dtype=None):
+    def to_numpy(self):
+        """
+        Return a NumPy copy of this array (experimental).
+
+        Returns
+        -------
+        array : numpy.ndarray
+        """
         cdef:
             PyObject* out
             PandasOptions c_options
             object values
 
         if self.type.id == _Type_EXTENSION:
-            return (
-                chunked_array(
-                    [self.chunk(i).storage for i in range(self.num_chunks)]
-                ).__array__(dtype)
-            )
+            storage_array = chunked_array([
+                chunk.storage for chunk in self.iterchunks()
+            ])

Review comment:
       Does this work if there are zero chunks, or do you need to pass the type explicitly?







[GitHub] [arrow] github-actions[bot] commented on pull request #7868: ARROW-9429: [Python] ChunkedArray.to_numpy

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7868:
URL: https://github.com/apache/arrow/pull/7868#issuecomment-666442636


   https://issues.apache.org/jira/browse/ARROW-9429





[GitHub] [arrow] kszucs commented on a change in pull request #7868: ARROW-9429: [Python] ChunkedArray.to_numpy

Posted by GitBox <gi...@apache.org>.
kszucs commented on a change in pull request #7868:
URL: https://github.com/apache/arrow/pull/7868#discussion_r467847573



##########
File path: python/pyarrow/table.pxi
##########
@@ -226,24 +226,34 @@ cdef class ChunkedArray(_PandasConvertible):
     def _to_pandas(self, options, **kwargs):
         return _array_like_to_pandas(self, options)
 
-    def __array__(self, dtype=None):
+    def to_numpy(self):
+        """
+        Return a NumPy copy of this array (experimental).
+
+        Returns
+        -------
+        array : numpy.ndarray
+        """
         cdef:
             PyObject* out
             PandasOptions c_options
             object values
 
         if self.type.id == _Type_EXTENSION:
-            return (
-                chunked_array(
-                    [self.chunk(i).storage for i in range(self.num_chunks)]
-                ).__array__(dtype)
-            )
+            storage_array = chunked_array([
+                chunk.storage for chunk in self.iterchunks()
+            ])

Review comment:
       The type needs to be passed explicitly; fixing it.
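
The point of this exchange can be illustrated with a small pure-Python model. This is only a sketch, not pyarrow code: `ExtType` and `storage_view` are hypothetical names, and types are modelled as plain strings. It shows why taking the storage type from the extension type itself, rather than inferring it from the chunks, keeps the zero-chunk case well defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtType:
    # Hypothetical stand-in for an Arrow extension type.
    name: str
    storage_type: str


def storage_view(ext_type, chunks):
    """Return (storage_type, storage_chunks) for a list of chunks.

    The type comes from `ext_type`, never from the chunks, so an
    empty list of chunks still yields a well-defined type.
    """
    storage_chunks = [list(chunk) for chunk in chunks]
    return ext_type.storage_type, storage_chunks


uuid_type = ExtType(name="uuid", storage_type="binary")
empty = storage_view(uuid_type, [])
filled = storage_view(uuid_type, [[b"a"], [b"b", b"c"]])
```

With inference from the chunks alone, the `empty` case would have nothing to look at, which is the situation the review comment asks about.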







[GitHub] [arrow] pitrou commented on a change in pull request #7868: ARROW-9429: [Python] ChunkedArray.to_numpy

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7868:
URL: https://github.com/apache/arrow/pull/7868#discussion_r465810996



##########
File path: python/pyarrow/table.pxi
##########
@@ -425,31 +438,35 @@ def chunked_array(arrays, type=None):
         Array arr
         vector[shared_ptr[CArray]] c_arrays
         shared_ptr[CChunkedArray] sp_chunked_array
-        shared_ptr[CDataType] sp_data_type
+
+    type = ensure_type(type, allow_none=True)
 
     if isinstance(arrays, Array):
         arrays = [arrays]
 
     for x in arrays:
-        if isinstance(x, Array):
-            arr = x
-            if type is not None:
-                assert x.type == type
+        arr = x if isinstance(x, Array) else array(x, type=type)
+
+        if type is None:
+            # inferring the type from the first array makes construction more
+            # flexible: subsequent inputs are coerced to that type, which also
+            # spares the inference overhead after the first chunk
+            type = arr.type
         else:
-            arr = array(x, type=type)
+            if arr.type != type:
+                raise ArrowInvalid(
+                    "Each array chunks must have type {}".format(type)

Review comment:
       Either "All array chunks" (plural) or "Each array chunk" (singular).
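
The loop in the hunk above can be modelled in plain Python. This is a simplified sketch, not the pyarrow implementation: `resolve_chunk_type` is a hypothetical helper, each chunk is modelled as a `(type_name, values)` pair, and `TypeError` stands in for `ArrowInvalid`. The error message uses the singular wording suggested in the review.

```python
def resolve_chunk_type(chunks, type=None):
    """Return the single type shared by all chunks.

    `chunks` is a list of (type_name, values) pairs, a hypothetical
    stand-in for Arrow arrays. If `type` is None, the first chunk
    decides the type that every later chunk must match.
    """
    for chunk_type, _values in chunks:
        if type is None:
            # The first chunk fixes the type for all following chunks.
            type = chunk_type
        elif chunk_type != type:
            raise TypeError(
                "Each array chunk must have type {}".format(type)
            )
    return type


inferred = resolve_chunk_type([("int64", [1, 2]), ("int64", [3])])
explicit = resolve_chunk_type([], type="int64")
```

Passing chunks of different types, for example `[("int64", [1]), ("string", ["a"])]`, raises the error in this model.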



