Posted to issues@arrow.apache.org by "Sergii Mikhtoniuk (Jira)" <ji...@apache.org> on 2020/08/14 00:58:00 UTC

[jira] [Created] (ARROW-9735) [Rust] [Parquet] Corrupt footer error on files produced by AvroParquetWriter

Sergii Mikhtoniuk created ARROW-9735:
----------------------------------------

             Summary: [Rust] [Parquet] Corrupt footer error on files produced by AvroParquetWriter
                 Key: ARROW-9735
                 URL: https://issues.apache.org/jira/browse/ARROW-9735
             Project: Apache Arrow
          Issue Type: Bug
          Components: Rust, Rust - DataFusion
            Reporter: Sergii Mikhtoniuk
         Attachments: data.snappy.parquet

I started using the Rust Parquet library for some basic reading of files produced by Spark and was very happy with the performance. However, when I try to read any Parquet file produced by my Flink app, I get a panic:
{code:java}
General("Invalid Parquet file. Corrupt footer") {code}
I'm attaching the sample file: [^data.snappy.parquet]
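For reference, a minimal sketch of the reading side that triggers the panic, assuming the parquet crate's {{SerializedFileReader}} API (my actual app does a bit more, but this is enough to reproduce):
{code:java}
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() {
    // Path to the attached sample file
    let file = File::open("data.snappy.parquet").expect("cannot open file");

    // Constructing the reader parses and validates the footer; this is
    // where General("Invalid Parquet file. Corrupt footer") comes from
    let reader = SerializedFileReader::new(file).expect("cannot create reader");

    // Never reached for the attached file
    println!("num row groups: {}", reader.metadata().num_row_groups());
} {code}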

Output of the {{parquet-meta}} command (from the parquet-tools package):
{code:java}
creator:        parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a) 
extra:          parquet.avro.schema = {"type":"record","name":"Row","fields":[{"name":"system_time","type":[{"type":"long","logicalType":"timestamp-millis"},"null"]},{"name":"event_time","type":[{"type":"long","logicalType":"timestamp-millis"},"null"]},{"name":"city","type":["string","null"]},{"name":"population_x10","type":["int","null"]}]} 
extra:          writer.model.name = avro 

file schema:    Row 
-------------------------------------------------------------------------------------------------------------------
system_time:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
event_time:     OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
city:           OPTIONAL BINARY L:STRING R:0 D:1
population_x10: OPTIONAL INT32 R:0 D:1

row group 1:    RC:3 TS:291 OFFSET:4 
-------------------------------------------------------------------------------------------------------------------
system_time:     INT64 SNAPPY DO:0 FPO:4 SZ:94/90/0.96 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: 2020-08-14T00:36:12.413+0000, max: 2020-08-14T00:36:12.413+0000, num_nulls: 0]
event_time:      INT64 SNAPPY DO:0 FPO:98 SZ:94/90/0.96 VC:3 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: 2020-08-14T00:35:42.709+0000, max: 2020-08-14T00:35:42.709+0000, num_nulls: 0]
city:            BINARY SNAPPY DO:0 FPO:192 SZ:50/48/0.96 VC:3 ENC:PLAIN,BIT_PACKED,RLE ST:[min: A, max: C, num_nulls: 0]
population_x10:  INT32 SNAPPY DO:0 FPO:242 SZ:65/63/0.97 VC:3 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 10000, max: 30000, num_nulls: 0] {code}
The file is produced by an Apache Flink + Scala app using a fairly recent parquet-avro:
{code:java}
"org.apache.parquet" % "parquet-avro" % "1.10.0" {code}
The code that produces the file looks like this:
{code:java}
val writer = AvroParquetWriter
  .builder[GenericRecord](new Path(path))
  .withSchema(avroSchema)
  .withDataModel(model)
  .withCompressionCodec(CompressionCodecName.SNAPPY)
  .build()

for (row <- rows) { writer.write(row) }
writer.close() {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)