Posted to issues@beam.apache.org by "Wenbing Bai (Jira)" <ji...@apache.org> on 2021/02/03 00:21:00 UTC
[jira] [Created] (BEAM-11742) Use schema when creating record batch in ParquetSink
Wenbing Bai created BEAM-11742:
----------------------------------
Summary: Use schema when creating record batch in ParquetSink
Key: BEAM-11742
URL: https://issues.apache.org/jira/browse/BEAM-11742
Project: Beam
Issue Type: Improvement
Components: io-py-parquet
Reporter: Wenbing Bai
Assignee: Wenbing Bai
Before pyarrow 0.15, it was not possible to create a pyarrow record batch with a schema.
So in apache_beam.io.parquetio._ParquetSink, the record batch is currently created from the field names only:
{code:python}
rb = pa.RecordBatch.from_arrays(arrays, self._schema.names){code}
As a result, an error is raised when the schema of the parquet table to be written (the record batch schema) differs from the specified schema (self._schema).
For example, when a field in the specified schema is declared "not null", the record batch schema does not carry that constraint, so the two schemas do not match and the error is raised.
The fix is to pass the schema instead of the names to pa.RecordBatch.from_arrays:
{code:python}
rb = pa.RecordBatch.from_arrays(arrays, schema=self._schema){code}