Posted to issues@arrow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/04/12 19:59:00 UTC

[jira] [Commented] (ARROW-2451) Handle more dtypes efficiently in custom numpy array serializer.

    [ https://issues.apache.org/jira/browse/ARROW-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436218#comment-16436218 ] 

ASF GitHub Bot commented on ARROW-2451:
---------------------------------------

robertnishihara opened a new pull request #1887: ARROW-2451: [Python] Handle non-object arrays more efficiently in custom serializer.
URL: https://github.com/apache/arrow/pull/1887
 
 
   **Before this PR**
   
   ```python
   # Timings use IPython's %time magic, so run this in IPython or Jupyter.
   import numpy as np
   import pyarrow as pa
   
   x = np.array(10 ** 8 * [True, False])
   %time serialized_obj = pa.serialize(x).to_buffer()  # 10.4s
   %time new_x = pa.deserialize(serialized_obj)  # 8.94s
   
   x = np.array([str(i) for i in range(10 ** 7)])
   %time serialized_obj = pa.serialize(x).to_buffer()  # 2.24s
   %time new_x = pa.deserialize(serialized_obj)  # 2.03s
   ```
   
   **After this PR**
   
   ```python
   import numpy as np
   import pyarrow as pa
   
   x = np.array(10 ** 8 * [True, False])
   %time serialized_obj = pa.serialize(x).to_buffer()  # 117ms
   %time new_x = pa.deserialize(serialized_obj)  # 265us
   
   x = np.array([str(i) for i in range(10 ** 7)])
   %time serialized_obj = pa.serialize(x).to_buffer()  # 174ms
   %time new_x = pa.deserialize(serialized_obj)  # 13.2ms
   ```
   
   cc @devin-petersohn @mitar @pcmoritz 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Handle more dtypes efficiently in custom numpy array serializer.
> ----------------------------------------------------------------
>
>                 Key: ARROW-2451
>                 URL: https://issues.apache.org/jira/browse/ARROW-2451
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Robert Nishihara
>            Assignee: Robert Nishihara
>            Priority: Major
>              Labels: pull-request-available
>
> Right now, certain dtypes such as bool or fixed-length strings are serialized as lists, which is inefficient. We can handle these more efficiently by casting them to uint8 and saving the original dtype as additional data.
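
The idea described above can be sketched with NumPy alone. This is a minimal illustration, not the code from the pull request: the helper names `to_uint8_with_dtype` and `from_uint8_with_dtype` are hypothetical, and the sketch assumes a simple (non-object, non-structured) dtype whose `dtype.str` is enough to restore it.

```python
import numpy as np


def to_uint8_with_dtype(arr):
    """Hypothetical helper: expose a non-object array as raw uint8 bytes.

    Returns the byte view plus the metadata needed to restore the array.
    """
    if arr.dtype.hasobject:
        # Object arrays hold pointers, so their bytes cannot be reinterpreted.
        raise TypeError("object arrays cannot be viewed as raw bytes")
    shape = arr.shape
    # A byte view requires contiguous memory; this copies only if needed.
    flat = np.ascontiguousarray(arr).reshape(-1)
    return flat.view(np.uint8), arr.dtype.str, shape


def from_uint8_with_dtype(raw, dtype_str, shape):
    """Hypothetical helper: reinterpret the bytes with the saved dtype."""
    return raw.view(np.dtype(dtype_str)).reshape(shape)


# Round trip for the two dtypes mentioned in the issue.
bools = np.array([True, False, True])
strings = np.array([str(i) for i in range(5)])

for original in (bools, strings):
    raw, dtype_str, shape = to_uint8_with_dtype(original)
    restored = from_uint8_with_dtype(raw, dtype_str, shape)
    assert restored.dtype == original.dtype
    assert np.array_equal(restored, original)
```

Because the uint8 view shares memory with the original array, no per-element Python objects are created, which is consistent with the speedups reported in the pull request.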



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)