Posted to dev@parquet.apache.org by Vikas Gandham <g....@gmail.com> on 2017/11/14 17:15:08 UTC

Parquet files from spark not readable in Cascading

Hi,



When I tried reading Parquet data generated by Spark in Cascading, it throws
the following error:







Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file ""
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
    at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:103)
    at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:47)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:253)
    at cascading.tap.hadoop.io.MultiInputFormat$1.operate(MultiInputFormat.java:248)
    at cascading.util.Util.retry(Util.java:1044)
    at cascading.tap.hadoop.io.MultiInputFormat.getRecordReader(MultiInputFormat.java:247)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:394)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
    at java.util.ArrayList.elementData(ArrayList.java:418)
    at java.util.ArrayList.get(ArrayList.java:431)
    at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
    at org.apache.parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:98)
    at org.apache.parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:83)
    at org.apache.parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:77)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:293)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)


This is mostly seen when the Parquet data has nested structures.
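
Roughly, the Spark side looks like the sketch below (not my exact job; the
input/output paths and column layout are simplified for illustration). The
point is just that the written Parquet schema ends up with a nested group:

// Illustration only: a minimal Spark (Java API) job that writes Parquet with
// a nested struct column. Paths and column names here are made up.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WriteNestedParquet {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("write-nested-parquet")
        .getOrCreate();

    // JSON records with a nested object, e.g.
    // {"id": 1, "address": {"city": "X", "zip": "10001"}}
    Dataset<Row> df = spark.read().json("people.json");

    // Writing this out produces a Parquet schema with a nested group
    // ("address"), which is the shape that later fails to read in Cascading.
    df.write().parquet("people.parquet");

    spark.stop();
  }
}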



I didn't find any solution to this.



I see some JIRA issues like
https://issues.apache.org/jira/browse/SPARK-10434 (Parquet compatibility /
interoperability issues), where Parquet files generated by Spark 1.5 could not
be read in Spark 1.4. That was fixed in later Spark versions, but was it fixed
in Cascading?



I am not sure whether this is something to do with the Parquet version, whether
Cascading has a bug, or whether Spark is writing the Parquet files in a way
that Cascading does not accept.



Note: I am trying to read Parquet with an Avro schema in Cascading.
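
For reference, a simplified sketch of the kind of Cascading read I mean is
below. This is not my exact code: it uses parquet-cascading's
ParquetTupleScheme rather than our Avro-based scheme, and the path and field
names are made up, but as far as I understand it reads through the same
DeprecatedParquetInputFormat shown in the stack trace:

// Illustration only: reading the Spark-written Parquet directory in a
// Cascading flow. Field names and paths are hypothetical.
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import org.apache.parquet.cascading.ParquetTupleScheme;

public class ReadParquetInCascading {
  public static void main(String[] args) {
    // Source tap over the Parquet output written by Spark; projecting the
    // nested column is where the ArrayIndexOutOfBoundsException(-1) shows up.
    Tap source = new Hfs(
        new ParquetTupleScheme(new Fields("id", "address")), "people.parquet");

    // Dump the tuples back out as tab-separated text.
    Tap sink = new Hfs(new TextDelimited(Fields.ALL, "\t"), "people-text");

    Pipe pipe = new Pipe("copy");
    new HadoopFlowConnector().connect(source, sink, pipe).complete();
  }
}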



I have posted this to the Cascading and Spark mailing lists too.





-- 
Thanks
Vikas Gandham