Posted to dev@parquet.apache.org by Luca Pireddu <pi...@gmail.com> on 2016/03/04 12:58:34 UTC

org.apache.parquet.io.ParquetDecodingException with AvroParquetInputFormat

Hello all,

I'm using AvroParquetOutputFormat and AvroParquetInputFormat in a
pair of Hadoop applications -- one writes avro-parquet, the other
reads it.  Strictly speaking I'm going through Pydoop (
https://github.com/crs4/pydoop), but the actual I/O is done by the
AvroParquet classes.

The writer seems to succeed.  The reader, however, crashes with a
ParquetDecodingException when processing the writer's output.
Here's the syslog output with the stack trace:


2016-03-04 12:46:50,075 INFO [main] org.apache.hadoop.mapred.MapTask:
Processing split: ParquetInputSplit{part:
hdfs://localhost:9000/user/pireddu/seqal_mini_ref_bwamem_avo_output/tmp/part-m-00000.parquet
start: 0 end: 16916 length: 16916 hosts: []}
2016-03-04 12:46:50,846 WARN [main]
org.apache.hadoop.mapred.YarnChild: Exception running child :
org.apache.parquet.io.ParquetDecodingException: Can not read value at
1 in block 0 in file
hdfs://localhost:9000/user/pireddu/seqal_mini_ref_bwamem_avo_output/tmp/part-m-00000.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at it.crs4.pydoop.mapreduce.pipes.PydoopAvroBridgeReaderBase.initialize(PydoopAvroBridgeReaderBase.java:66)
at it.crs4.pydoop.mapreduce.pipes.PydoopAvroBridgeValueReader.initialize(PydoopAvroBridgeValueReader.java:38)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:545)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:783)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassCastException:
org.bdgenomics.formats.avro.Contig cannot be cast to java.lang.Integer
at org.bdgenomics.formats.avro.AlignmentRecord.put(AlignmentRecord.java:258)
at org.apache.parquet.avro.AvroIndexedRecordConverter.set(AvroIndexedRecordConverter.java:168)
at org.apache.parquet.avro.AvroIndexedRecordConverter.access$000(AvroIndexedRecordConverter.java:46)
at org.apache.parquet.avro.AvroIndexedRecordConverter$1.add(AvroIndexedRecordConverter.java:95)
at org.apache.parquet.avro.AvroIndexedRecordConverter.end(AvroIndexedRecordConverter.java:189)
at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:413)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:218)
... 11 more


I'm using parquet 1.8.1 and avro 1.7.6. I'm able to read the parquet
file with parquet-tools-1.8.1, so I'm inclined to think that the file
is valid.

Contig is the first class defined in my avro schema:

file schema:
org.bdgenomics.formats.avro.AlignmentRecord
--------------------------------------------------------------------------------
contig:                               OPTIONAL F:6
.contigName:                          OPTIONAL BINARY O:UTF8 R:0 D:2
.contigLength:                        OPTIONAL INT64 R:0 D:2
...and so on.
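For reference, the contig field in the dump above corresponds to an Avro schema fragment roughly like the following. This is a reconstruction inferred from the parquet-tools output (OPTIONAL maps to a nullable union, BINARY/UTF8 to string, INT64 to long), not the exact bdg-formats definition:

```python
import json

# Rough reconstruction of the schema fragment shown by parquet-tools.
alignment_record = {
    "type": "record",
    "name": "AlignmentRecord",
    "namespace": "org.bdgenomics.formats.avro",
    "fields": [
        {
            "name": "contig",
            "type": ["null", {
                "type": "record",
                "name": "Contig",
                "fields": [
                    {"name": "contigName", "type": ["null", "string"], "default": None},
                    {"name": "contigLength", "type": ["null", "long"], "default": None},
                ],
            }],
            "default": None,
        },
        # ...remaining fields omitted, as in the dump above
    ],
}

print(json.dumps(alignment_record, indent=2))
```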

Can someone suggest what might be causing the problem when reading?
Any help would be appreciated!

Thanks,

Luca

Re: org.apache.parquet.io.ParquetDecodingException with AvroParquetInputFormat

Posted by Luca Pireddu <pi...@gmail.com>.
That's the tip I needed!

I had modified an existing schema without removing the original,
leaving two record definitions with the same name in the same
namespace.  My applications somehow picked up one version while
writing and the other while reading, hence the mismatch.
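To illustrate the failure mode: the stack trace shows AvroIndexedRecordConverter calling AlignmentRecord.put with a field index, i.e. assigning decoded values by position. If two same-named schema versions list their fields in a different order, a value lands in the wrong slot and the cast blows up. A hypothetical sketch in Python (the field orders are invented for illustration; the real field lists aren't shown in the thread):

```python
# Hypothetical field orders for two same-named versions of AlignmentRecord.
writer_fields = ["contig", "start", "mapq"]   # version the writer serialized with
reader_fields = ["start", "contig", "mapq"]   # stale version the reader compiled

# The converter assigns by position (record.put(index, value)), so values
# decoded in writer order land in the reader's slots at the same indexes:
decoded_values = [{"contigName": "chr1"}, 7, 60]  # produced in writer order
record = {}
for index, value in enumerate(decoded_values):
    record[reader_fields[index]] = value

# record["start"] now holds the Contig-like dict -- the Python analogue of
# "Contig cannot be cast to java.lang.Integer".
print(type(record["start"]).__name__)  # -> dict
```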

Thanks for the help,

Luca



On 4 March 2016 at 18:42, Ryan Blue <rb...@netflix.com.invalid> wrote:
> Luca,
>
> What are your reader and writer schemas? It looks like they may not match
> because the reader expects an Integer but is deserializing a Contig object.
>
> rb
>
> On Fri, Mar 4, 2016 at 3:58 AM, Luca Pireddu <pi...@gmail.com> wrote:
>
>> [original message quoted in full; trimmed]

Re: org.apache.parquet.io.ParquetDecodingException with AvroParquetInputFormat

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Luca,

What are your reader and writer schemas? It looks like they may not match
because the reader expects an Integer but is deserializing a Contig object.
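One quick way to check is to diff the two schema definitions field by field. A hedged sketch (the schema fragments below are invented to mimic the mismatch; in the real case both would come from duplicate AlignmentRecord definitions with the same full name):

```python
import json

# Hypothetical writer/reader schemas for the same record name; the field
# types disagree on "contig".
writer_schema = {
    "type": "record", "name": "AlignmentRecord",
    "fields": [
        {"name": "contig", "type": ["null", {
            "type": "record", "name": "Contig",
            "fields": [{"name": "contigName", "type": ["null", "string"]}]}]},
        {"name": "start", "type": ["null", "long"]},
    ],
}
reader_schema = {
    "type": "record", "name": "AlignmentRecord",
    "fields": [
        {"name": "contig", "type": ["null", "int"]},  # stale definition
        {"name": "start", "type": ["null", "long"]},
    ],
}

def field_types(schema):
    # Map field name -> canonical JSON of its type, for easy comparison.
    return {f["name"]: json.dumps(f["type"], sort_keys=True)
            for f in schema["fields"]}

w, r = field_types(writer_schema), field_types(reader_schema)
mismatched = sorted(name for name in w.keys() & r.keys() if w[name] != r[name])
print(mismatched)  # -> ['contig']
```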

rb

On Fri, Mar 4, 2016 at 3:58 AM, Luca Pireddu <pi...@gmail.com> wrote:

> [original message quoted in full; trimmed]



-- 
Ryan Blue
Software Engineer
Netflix