Posted to issues@arrow.apache.org by "Sergii Mikhtoniuk (Jira)" <ji...@apache.org> on 2020/08/14 00:58:00 UTC
[jira] [Created] (ARROW-9735) [Rust] [Parquet] Corrupt footer error on files produced by AvroParquetWriter
Sergii Mikhtoniuk created ARROW-9735:
----------------------------------------
Summary: [Rust] [Parquet] Corrupt footer error on files produced by AvroParquetWriter
Key: ARROW-9735
URL: https://issues.apache.org/jira/browse/ARROW-9735
Project: Apache Arrow
Issue Type: Bug
Components: Rust, Rust - DataFusion
Reporter: Sergii Mikhtoniuk
Attachments: data.snappy.parquet
I started using the Rust parquet library for some basic reading of files produced by Spark and was very happy with the performance. However, when I try to read any Parquet file produced by my Flink app, I get a panic:
{code:java}
General("Invalid Parquet file. Corrupt footer") {code}
I'm attaching the sample file: [^data.snappy.parquet]
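For context, here is a minimal sketch of what a "corrupt footer" check amounts to structurally. This is NOT the parquet crate's actual code, just an illustration of the standard Parquet file layout the reader validates: the file must end with the 4-byte magic "PAR1", preceded by a 4-byte little-endian metadata length.

```rust
// Illustrative only (assumed layout per the Parquet format spec, not the
// parquet crate's implementation). A Parquet file ends with:
//   [metadata][4-byte little-endian metadata length]["PAR1"]
fn footer_looks_valid(tail: &[u8]) -> bool {
    // A readable file is at least 8 bytes long and its last 4 bytes
    // are the magic "PAR1"; otherwise the reader reports a corrupt footer.
    tail.len() >= 8 && &tail[tail.len() - 4..] == b"PAR1"
}

fn metadata_len(tail: &[u8]) -> u32 {
    // The 4 bytes immediately before the magic hold the footer
    // (metadata) length, little-endian.
    let n = tail.len();
    u32::from_le_bytes([tail[n - 8], tail[n - 7], tail[n - 6], tail[n - 5]])
}

fn main() {
    // Fake tail: an arbitrary metadata length (291) followed by the magic.
    let mut tail = 291u32.to_le_bytes().to_vec();
    tail.extend_from_slice(b"PAR1");
    assert!(footer_looks_valid(&tail));
    assert_eq!(metadata_len(&tail), 291);
    // A tail without the magic fails the check.
    assert!(!footer_looks_valid(&[0u8; 8]));
    println!("footer check ok");
}
```

Since `parquet-meta` reads the attached file fine, the footer bytes are presumably intact, which suggests the failure is on the reader's side rather than an actually truncated file.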
Output of the 'parquet-meta' command (from the parquet-tools package):
{code:java}
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra: parquet.avro.schema = {"type":"record","name":"Row","fields":[{"name":"system_time","type":[{"type":"long","logicalType":"timestamp-millis"},"null"]},{"name":"event_time","type":[{"type":"long","logicalType":"timestamp-millis"},"null"]},{"name":"city","type":["string","null"]},{"name":"population_x10","type":["int","null"]}]}
extra: writer.model.name = avro

file schema: Row
-------------------------------------------------------------------------------------------------------------------
system_time: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
event_time: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
city: OPTIONAL BINARY L:STRING R:0 D:1
population_x10: OPTIONAL INT32 R:0 D:1

row group 1: RC:3 TS:291 OFFSET:4
-------------------------------------------------------------------------------------------------------------------
system_time: INT64 SNAPPY DO:0 FPO:4 SZ:94/90/0.96 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: 2020-08-14T00:36:12.413+0000, max: 2020-08-14T00:36:12.413+0000, num_nulls: 0]
event_time: INT64 SNAPPY DO:0 FPO:98 SZ:94/90/0.96 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: 2020-08-14T00:35:42.709+0000, max: 2020-08-14T00:35:42.709+0000, num_nulls: 0]
city: BINARY SNAPPY DO:0 FPO:192 SZ:50/48/0.96 VC:3 ENC:PLAIN,BIT_PACKED,RLE ST:[min: A, max: C, num_nulls: 0]
population_x10: INT32 SNAPPY DO:0 FPO:242 SZ:65/63/0.97 VC:3 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 10000, max: 30000, num_nulls: 0] {code}
The file is produced by an Apache Flink + Scala app using the (fairly recent) dependency:
{code:java}
"org.apache.parquet" % "parquet-avro" % "1.10.0" {code}
The code that produces the file looks like this:
{code:java}
val writer = AvroParquetWriter
  .builder[GenericRecord](new Path(path))
  .withSchema(avroSchema)
  .withDataModel(model)
  .withCompressionCodec(CompressionCodecName.SNAPPY)
  .build()

for (row <- rows) { writer.write(row) }
writer.close() {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)