Posted to dev@arrow.apache.org by Andreas Heider <an...@heider.io> on 2018/07/18 18:41:59 UTC

(De)serialising schemas in pyarrow

Hi,

I'm using Arrow together with dask to quickly write lots of Parquet files. Pandas has a tendency to forget column types (in my case it's a string column that might be completely null in some splits), so I build a Schema once and pass it explicitly to both pa.Table.from_pandas and pq.ParquetWriter, so that all resulting files have consistent types.
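
For concreteness, the pattern looks roughly like this (the column names, types and file path below are just placeholders):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder schema: "value" is the string column that can be entirely null in some splits
schema = pa.schema([("id", pa.int64()), ("value", pa.string())])

# pandas on its own would not infer string for an all-null column
df = pd.DataFrame({"id": [1, 2], "value": [None, None]})

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
writer = pq.ParquetWriter("part-0.parquet", schema)
writer.write_table(table)
writer.close()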

However, because dask is distributed, passing that Schema around means serialising it and sending it to other processes, and this turned out to be a bit harder than expected.

Simple pickling fails with "No type alias for double" on unpickling.

Schema does have a .serialize(), but I can't find how to deserialize it again. pa.deserialize fails with "Expected to read 923444752 metadata bytes but only read 11284", and it also looks like pa.deserialize is meant for Python objects.

So I've settled on this for now:

import pyarrow as pa

def serialize_schema(schema):
    # Open and immediately close a stream writer: the stream header carries the schema
    sink = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(sink, schema)
    writer.close()
    return sink.get_result().to_pybytes()

def deserialize_schema(buf):
    # Read the schema back out of the (batch-free) stream
    buf_reader = pa.BufferReader(buf)
    reader = pa.RecordBatchStreamReader(buf_reader)
    return reader.schema

This works, but is a bit more involved than I hoped it'd be.
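
For completeness, the round trip with these helpers is just (the schema here is only an example):

payload = serialize_schema(pa.schema([("value", pa.string())]))
# payload is plain bytes, so pickling it and shipping it to dask workers is no problem
restored = deserialize_schema(payload)
assert restored.equals(pa.schema([("value", pa.string())]))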

Do you have any advice on how this is meant to work?

Thanks,
Andreas

Re: (De)serialising schemas in pyarrow

Posted by Wes McKinney <we...@gmail.com>.
hi Andreas,

You want the read_schema method -- much simpler. Check out the unit tests:

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_ipc.py#L502
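
Roughly, something like this (untested as written here, and assuming a pyarrow version where read_schema is exposed under pa.ipc):

import pyarrow as pa

schema = pa.schema([("value", pa.string())])  # example schema
buf = schema.serialize()                      # pa.Buffer holding the schema as an IPC message
restored = pa.ipc.read_schema(buf)
assert restored.equals(schema)

# To ship it between processes, convert to plain bytes and wrap again on the other side
raw = buf.to_pybytes()
assert pa.ipc.read_schema(pa.py_buffer(raw)).equals(schema)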

Would be nice to have this in the Sphinx docs

- Wes
