Posted to user@beam.apache.org by "Thomas Fredriksen(External)" <th...@cognite.com> on 2021/06/09 09:20:01 UTC

ParquetIO - ClassCastException when reading parquet-files

Hi there,

We are seeing issues when reading Parquet files that were previously
written by another Beam pipeline.

The initial pipeline converts a protobuf object to a `GenericRecord` and
then passes it to FileIO with the ParquetIO sink. The conversion is done
using the following process:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.Encoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.protobuf.ProtobufDatumReader;
    import org.apache.avro.protobuf.ProtobufDatumWriter;

    // Writer/reader pair derived from the generated protobuf class.
    ProtobufDatumWriter<InputT> datumWriter = new ProtobufDatumWriter<>(cls);
    ProtobufDatumReader<InputT> datumReader = new ProtobufDatumReader<>(cls);

    /* ... */

    // Serialize the protobuf message to Avro binary.
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    Encoder e = EncoderFactory.get().binaryEncoder(os, null);
    datumWriter.write(element, e);
    e.flush();

    // Decode the bytes back into a GenericRecord using the protobuf-derived
    // Avro schema, and emit it downstream.
    GenericDatumReader<GenericRecord> genericDatumReader =
            new GenericDatumReader<>(datumReader.getSchema());
    out.output(genericDatumReader.read(
            null,
            DecoderFactory.get().binaryDecoder(new ByteArrayInputStream(os.toByteArray()), null)));

Upon reading the files back, we see the following exception:

    Caused by: java.lang.ClassCastException: required double value is not a group
        at org.apache.parquet.schema.Type.asGroupType(Type.java:250)
        at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
        at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:232)
        at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:78)
        at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:536)
        at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:486)
        at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
        at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
        at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
        at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
        at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
        at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
        at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:186)
        at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
        at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
        at org.apache.beam.sdk.io.parquet.ParquetIO$ReadFiles$ReadFn.processElement(ParquetIO.java:1096)

    Process finished with exit code 1

The proto files in question use the Google wrapper types for primitives
(e.g. `google.protobuf.DoubleValue`). The error seems to stem from this: the
reader appears to treat the `value` field inside `DoubleValue` as a record
(a Parquet group) rather than a primitive.
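
To illustrate the suspected mismatch, the Avro schema that avro-protobuf
derives from the message class can be inspected as follows (a minimal
sketch; `MyMessage` is a placeholder for the actual generated class):

    // Minimal sketch: print the Avro schema derived from the protobuf class.
    // In our understanding, a google.protobuf.DoubleValue field shows up here
    // as a nested record with a single "value" field, i.e. a group in
    // Parquet terms.
    ProtobufDatumReader<MyMessage> reader = new ProtobufDatumReader<>(MyMessage.class);
    System.out.println(reader.getSchema().toString(true));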

The files, however, are read just fine using Parquet with Python.

Any advice on how to solve this?