Posted to dev@parquet.apache.org by Vikas Gandham <g....@gmail.com> on 2017/11/14 17:15:08 UTC
Parquet files from spark not readable in Cascading
Hi,
When I try to read Parquet data that was generated by Spark in Cascading,
it throws the following error:
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file ""
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
    at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:103)
    at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:47)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
    at cascading.util.Util.retry(Util.java:1044)
    at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
    at java.util.ArrayList.elementData(ArrayList.java:418)
    at java.util.ArrayList.get(ArrayList.java:431)
    at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
    at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
    at org.apache.parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:83)
    at org.apache.parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:77)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:293)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
This mostly happens when the Parquet data contains nested structures.
I didn't find any solution to this.
I see related JIRA issues such as
https://issues.apache.org/jira/browse/SPARK-10434 (Parquet compatibility/
interoperability issues), where Parquet files generated by Spark 1.5 could
not be read in Spark 1.4. That was fixed in later Spark versions, but was it
ever fixed in Cascading?
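One workaround I have seen suggested for this kind of nested-type
incompatibility (untested on my side, so treat it as an assumption) is to
have Spark write Parquet using its pre-1.5 "legacy" layout for nested
types, which older readers tend to understand:

# spark-defaults.conf (hypothetical workaround, untested):
# write Parquet nested types in the pre-Spark-1.5 legacy layout
spark.sql.parquet.writeLegacyFormat  true

This only helps if the files can be regenerated; it does not fix reading
already-written data.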
I am not sure whether this is a Parquet version issue, a Cascading bug, or
something Spark does when writing Parquet files that Cascading cannot
handle.
Note: I am trying to read Parquet with an Avro schema in Cascading.
I have posted on the Cascading and Spark mailing lists too.
--
Thanks
Vikas Gandham