Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/09/24 08:42:00 UTC

[jira] [Updated] (ARROW-1382) [Python] Deduplicate non-scalar Python objects when using pyarrow.serialize

     [ https://issues.apache.org/jira/browse/ARROW-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-1382:
-----------------------------------------
    Labels: pyarrow-serialization  (was: pull-request-available)

> [Python] Deduplicate non-scalar Python objects when using pyarrow.serialize
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-1382
>                 URL: https://issues.apache.org/jira/browse/ARROW-1382
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Robert Nishihara
>            Priority: Major
>              Labels: pyarrow-serialization
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> If a Python object appears multiple times within a list/tuple/dictionary, pyarrow serializes a separate copy of it for each occurrence. This can lead to a huge expansion in the size of the serialized object (e.g., the serialized version of {{100 * [np.zeros(10 ** 6)]}} will be 100 times bigger than it needs to be).
> {code}
> import pyarrow as pa
> l = [0]
> original_object = [l, l]
> # Serialize and deserialize the object.
> buf = pa.serialize(original_object).to_buffer()
> new_object = pa.deserialize(buf)
> # This works.
> assert original_object[0] is original_object[1]
> # This fails.
> assert new_object[0] is new_object[1]
> {code}
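> As a rough illustration of the size blow-up described above (a sketch, assuming a pyarrow version that still provides {{pa.serialize}}), the duplication can be measured by comparing buffer sizes:
> {code}
> import numpy as np
> import pyarrow as pa
> arr = np.zeros(10 ** 6)
> # One occurrence vs. a list of 100 references to the same array.
> one_copy = pa.serialize([arr]).to_buffer().size
> hundred_refs = pa.serialize(100 * [arr]).to_buffer().size
> # Each reference is serialized separately, so the second buffer is roughly
> # 100x the size of the first even though every entry is the same object.
> print(hundred_refs / one_copy)
> {code}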
> One potential way to address this is to use Arrow's dictionary encoding.
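> As a sketch of that direction (this only illustrates dictionary encoding on an Arrow array, not the hypothetical deduplication inside {{pa.serialize}} itself):
> {code}
> import pyarrow as pa
> # Dictionary encoding stores each distinct value once, plus integer indices
> # pointing into that dictionary. That is the same idea as deduplicating
> # repeated objects before serializing them.
> arr = pa.array(["spam", "eggs", "spam", "spam"])
> dict_arr = arr.dictionary_encode()
> print(dict_arr.dictionary)  # ["spam", "eggs"]  (each value stored once)
> print(dict_arr.indices)     # [0, 1, 0, 0]
> {code}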



--
This message was sent by Atlassian Jira
(v8.3.4#803005)