Posted to issues@arrow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/11/26 18:22:01 UTC

[jira] [Commented] (ARROW-1783) [Python] Convert SerializedPyObject to/from sequence of component buffers with minimal memory allocation / copying

    [ https://issues.apache.org/jira/browse/ARROW-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266124#comment-16266124 ] 

ASF GitHub Bot commented on ARROW-1783:
---------------------------------------

wesm opened a new pull request #1362: ARROW-1783: [Python] Provide a "component" dict representation of a serialized Python object with minimal allocation
URL: https://github.com/apache/arrow/pull/1362
 
 
   For systems (like Dask) that prefer to handle their own framed buffer transport, this provides a list of memoryview-compatible objects, produced with minimal copying / allocation from the input data structure, which can similarly be reconstructed zero-copy into the original object.
   
   To motivate the use case, consider a dict of ndarrays:
   
   ```
   data = {i: np.random.randn(1000, 1000) for i in range(50)}
   ```
   
   Here, we have:
   
   ```
   >>> %timeit serialized = pa.serialize(data)
   52.7 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
   ```
   
   This is about 400 MB of data. Some systems may not want to double their memory use by assembling this into a single large buffer, as the `to_buffer` method does:
   
   ```
   >>> written = serialized.to_buffer()
   >>> written.size
   400015456
   ```
   
   We provide a `to_components` method which returns a dict whose `'data'` field contains a list of `pyarrow.Buffer` objects. This can be converted back to the original Python object using `pyarrow.deserialize_components`:
   
   ```
   >>> %timeit components = serialized.to_components()
   73.8 µs ± 812 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
   
   >>> list(components.keys())
   ['num_buffers', 'data', 'num_tensors']
   
   >>> len(components['data'])
   101
   
   >>> type(components['data'][0])
   pyarrow.lib.Buffer
   ```
   
   The reason there are 101 data components (1 + 2 * 50) is that:
   
   * 1 buffer for the serialized Union stream representing the object
   * 2 buffers for each of the 50 tensors: 1 for the metadata and 1 for the tensor body. The body is kept separate so that it remains zero-copy from the input
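
   To close the loop, here is a minimal round-trip sketch based on the `to_components` / `deserialize_components` API described above (the variable names and the final `assert` are illustrative only):

   ```
   import numpy as np
   import pyarrow as pa

   data = {i: np.random.randn(1000, 1000) for i in range(50)}

   # Decompose into framing-friendly components; the tensor bodies reference
   # the original ndarray memory rather than copying it
   components = pa.serialize(data).to_components()

   # Reassemble the original object from the component buffers
   restored = pa.deserialize_components(components)
   assert np.array_equal(restored[0], data[0])
   ```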
   
   The next step after this is ARROW-1784, which is to transport a pandas.DataFrame using this mechanism.
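
   To make the framed-transport use case mentioned at the top concrete, here is a hedged sketch of how a Dask-like system might ship the component buffers as frames. `send`, `receive`, and `write_frame` are hypothetical names, and rewrapping the received frames with `pa.py_buffer` on the other side is an assumption (the exact helper may vary by pyarrow version), not something this patch prescribes:

   ```
   import pyarrow as pa

   def send(obj, write_frame):
       # write_frame is a hypothetical transport hook accepting any
       # memoryview-compatible frame; pyarrow.Buffer supports the buffer
       # protocol, so memoryview() does not copy the payload
       components = pa.serialize(obj).to_components()
       for buf in components['data']:
           write_frame(memoryview(buf))
       return {k: components[k] for k in ('num_buffers', 'num_tensors')}

   def receive(header, frames):
       # Assumption: rewrap each received frame as a pyarrow buffer without
       # copying, then rebuild the components dict and deserialize
       components = dict(header, data=[pa.py_buffer(f) for f in frames])
       return pa.deserialize_components(components)
   ```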
   
   cc @pitrou @jcrist @mrocklin

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> [Python] Convert SerializedPyObject to/from sequence of component buffers with minimal memory allocation / copying
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1783
>                 URL: https://issues.apache.org/jira/browse/ARROW-1783
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> See discussion on Dask org:
> https://github.com/dask/distributed/pull/931
> It would be valuable for downstream users to compute the serialized payload as a sequence of memoryview-compatible objects without having to allocate new memory on write. This means that the component tensor messages must have their metadata and bodies in separate buffers. This will require a bit of work internally to reassemble the object from a collection of {{pyarrow.Buffer}} objects.
> see also ARROW-1509



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)