Posted to jira@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2022/03/02 17:28:00 UTC

[jira] [Commented] (ARROW-15826) Allow serializing arbitrary Python objects to parquet

    [ https://issues.apache.org/jira/browse/ARROW-15826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500298#comment-17500298 ] 

Will Jones commented on ARROW-15826:
------------------------------------

Have you considered using a binary column? That should save to parquet just fine.

{code:python}
import pandas as pd
import pyarrow as pa
import pickle

class Foo:
    pass

df = pd.DataFrame({"a": [pickle.dumps(Foo()) for _ in range(3)], "b": [1, 2, 3]})
table = pa.Table.from_pandas(df)
table
# pyarrow.Table
# a: binary
# b: int64
# ----
# a: [[80049517000000000000008C085F5F6D61696E5F5F948C03466F6F9493942981942E,80049517000000000000008C085F5F6D61696E5F5F948C03466F6F9493942981942E,80049517000000000000008C085F5F6D61696E5F5F948C03466F6F9493942981942E]]
# b: [[1,2,3]]
{code}
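To round-trip, write the table with pyarrow.parquet and unpickle the column on read. A minimal sketch continuing the example above (the file name is arbitrary):

{code:python}
import pickle

import pyarrow.parquet as pq

# Write the table from above; column "a" is stored as a plain binary
# column that any Parquet reader can load.
pq.write_table(table, "objects.parquet")

# Read it back and recover the Python objects.
restored = pq.read_table("objects.parquet")
objs = [pickle.loads(v.as_py()) for v in restored.column("a")]
{code}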



> Allow serializing arbitrary Python objects to parquet
> -----------------------------------------------------
>
>                 Key: ARROW-15826
>                 URL: https://issues.apache.org/jira/browse/ARROW-15826
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Parquet, Python
>            Reporter: Michael Milton
>            Priority: Major
>
> I'm trying to serialize a pandas DataFrame containing custom objects to parquet. Here is some example code:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> class Foo:
>     pass
> df = pd.DataFrame({"a": [Foo(), Foo(), Foo()], "b": [1, 2, 3]})
> table = pa.Table.from_pandas(df)
> {code}
> Gives me:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 1782, in pyarrow.lib.Table.from_pandas
>   File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays
>     arrays = [convert_column(c, f)
>   File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 594, in <listcomp>
>     arrays = [convert_column(c, f)
>   File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column
>     raise e
>   File "/home/migwell/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column
>     result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>   File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
>   File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: ('Could not convert <__main__.Foo object at 0x7fc23e38bfd0> with type Foo: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column a with type object')
> {code}
> Now, I realise that there's this disclaimer about arbitrary object serialization: [https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization]. However, it isn't clear how this applies to parquet. In my case, I want to have a well-formed parquet file that has binary blobs in one column that _can_ be deserialized to my class, but can otherwise be read by general parquet tools without failing. Using pickle doesn't solve this use case, since other languages like R may not be able to read the pickle file.
> Alternatively, if there is a well-defined protocol for telling pyarrow how to translate a given type to and from arrow types, I would be happy to use that instead.
>  
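Regarding the last question: pyarrow does expose an extension-type protocol for backing a logical type with a standard storage type such as binary, so readers that don't know the extension still see an ordinary binary column. A minimal sketch (the PickleType class and the "example.pickle" name are illustrative, not from the issue):

{code:python}
import pickle

import pyarrow as pa


class PickleType(pa.ExtensionType):
    """Hypothetical logical type: pickled Python objects stored as binary."""

    def __init__(self):
        super().__init__(pa.binary(), "example.pickle")

    def __arrow_ext_serialize__(self):
        # This type has no parameters to persist.
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()


pa.register_extension_type(PickleType())

# Wrap pickled payloads in the extension type; readers that don't know
# "example.pickle" still see a plain binary column.
storage = pa.array([pickle.dumps(object()) for _ in range(3)], pa.binary())
arr = pa.ExtensionArray.from_storage(PickleType(), storage)
{code}

Whether the extension metadata survives a given Parquet writer/reader pair should be verified against the pyarrow version in use.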



--
This message was sent by Atlassian Jira
(v8.20.1#820001)