Posted to issues@beam.apache.org by "Wenbing Bai (Jira)" <ji...@apache.org> on 2021/02/03 00:21:00 UTC

[jira] [Created] (BEAM-11742) Use schema when creating record batch in ParquetSink

Wenbing Bai created BEAM-11742:
----------------------------------

             Summary: Use schema when creating record batch in ParquetSink
                 Key: BEAM-11742
                 URL: https://issues.apache.org/jira/browse/BEAM-11742
             Project: Beam
          Issue Type: Improvement
          Components: io-py-parquet
            Reporter: Wenbing Bai
            Assignee: Wenbing Bai


Before pyarrow 0.15, it was not possible to create a pyarrow record batch with a schema.

So in apache_beam.io.parquetio._ParquetSink, when creating a pyarrow record batch, we use:

 
{code:python}
rb = pa.RecordBatch.from_arrays(arrays, self._schema.names)
{code}
An error is raised because the schema of the Parquet table being written (derived from the record batch) differs from the specified schema (self._schema).

For example, when a field in the specified schema is declared "is not null", the record batch schema built from names alone does not carry that information, so the error is raised.

 

The fix is to pass the schema instead of the names to pa.RecordBatch.from_arrays:
{code:python}
rb = pa.RecordBatch.from_arrays(arrays, schema=self._schema)
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)