Posted to dev@parquet.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2017/02/23 07:19:44 UTC
[jira] [Updated] (PARQUET-893) GroupColumnIO.getFirst() doesn't check for empty groups
[ https://issues.apache.org/jira/browse/PARQUET-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cheng Lian updated PARQUET-893:
-------------------------------
Description:
The following Spark snippet reproduces this issue with Spark 2.1 (with parquet-mr 1.8.1) and Spark 2.2-SNAPSHOT (with parquet-mr 1.8.2):
{code}
import org.apache.spark.sql.types._
val path = "/tmp/parquet-test"
case class Inner(f00: Int)
case class Outer(f0: Inner, f1: Int)
val df = Seq(Outer(Inner(1), 1)).toDF()
df.printSchema()
// root
//  |-- f0: struct (nullable = true)
//  |    |-- f00: integer (nullable = false)
//  |-- f1: integer (nullable = false)
df.write.mode("overwrite").parquet(path)
val requestedSchema =
  new StructType().
    add("f0", new StructType().
      // This nested field name differs from the original one
      add("f01", IntegerType)).
    add("f1", IntegerType)
println(requestedSchema.treeString)
// root
//  |-- f0: struct (nullable = true)
//  |    |-- f01: integer (nullable = true)
//  |-- f1: integer (nullable = true)
spark.read.schema(requestedSchema).parquet(path).show()
{code}
In the above snippet, {{requestedSchema}} is compatible with the schema of the written Parquet file, but the following exception is thrown:
{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/parquet-test/part-00007-d2b0bec1-7be5-4b51-8d53-3642680bc9c2.snappy.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
... 21 more
{noformat}
According to this stack trace, it seems that {{GroupColumnIO.getFirst()}} [doesn't check for empty groups|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/io/GroupColumnIO.java#L103] properly: the {{IndexOutOfBoundsException}} ({{Index: 0, Size: 0}}) is raised by {{ArrayList.get}} called from within it.
I haven't tried parquet-mr 1.9.0, but it probably suffers from the same issue.
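To illustrate the suspected failure mode, here is a minimal, self-contained sketch. It does not use the real parquet-mr classes: {{Node}}, {{Leaf}}, {{Group}}, {{firstUnchecked()}}, {{firstChecked()}} and {{prunedSchema()}} are hypothetical stand-ins, and the assumption that the requested {{f0}} group ends up with no children (because {{f01}} matches nothing in the file) is inferred from the stack trace rather than confirmed. The guarded lookup is only one possible way to handle an empty group, not necessarily the right fix for parquet-mr.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a column tree; NOT the parquet-mr ColumnIO classes.
public class EmptyGroupSketch {

    /** A node in the column tree: either a leaf column or a group of children. */
    static abstract class Node {
        final String name;

        Node(String name) {
            this.name = name;
        }

        /** Unguarded lookup of the first leaf, mirroring the failure in the stack trace. */
        abstract Leaf firstUnchecked();

        /** Guarded lookup: returns null when the subtree contains no leaf. */
        abstract Leaf firstChecked();
    }

    static final class Leaf extends Node {
        Leaf(String name) {
            super(name);
        }

        @Override
        Leaf firstUnchecked() {
            return this;
        }

        @Override
        Leaf firstChecked() {
            return this;
        }
    }

    static final class Group extends Node {
        final List<Node> children = new ArrayList<>();

        Group(String name, Node... kids) {
            super(name);
            for (Node kid : kids) {
                children.add(kid);
            }
        }

        @Override
        Leaf firstUnchecked() {
            // Throws IndexOutOfBoundsException when this group has no children.
            return children.get(0).firstUnchecked();
        }

        @Override
        Leaf firstChecked() {
            // Skip empty subtrees instead of assuming child 0 exists.
            for (Node child : children) {
                Leaf leaf = child.firstChecked();
                if (leaf != null) {
                    return leaf;
                }
            }
            return null;
        }
    }

    /** Builds root { f0 { }, f1 }: f0 is assumed empty because "f01" is not in the file. */
    static Group prunedSchema() {
        return new Group("root", new Group("f0"), new Leaf("f1"));
    }

    public static void main(String[] args) {
        Group root = prunedSchema();
        try {
            root.firstUnchecked();
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unchecked lookup failed: " + e.getMessage());
        }
        System.out.println("checked lookup found: " + root.firstChecked().name);
    }
}
```

In this model the unguarded lookup fails on the empty {{f0}} group in the same way as the trace above, while the guarded one skips it and reaches {{f1}}.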
was:
The following Spark 2.1 snippet reproduces this issue:
{code}
import org.apache.spark.sql.types._
val path = "/tmp/parquet-test"
case class Inner(f00: Int)
case class Outer(f0: Inner, f1: Int)
val df = Seq(Outer(Inner(1), 1)).toDF()
df.printSchema()
// root
//  |-- f0: struct (nullable = true)
//  |    |-- f00: integer (nullable = false)
//  |-- f1: integer (nullable = false)
df.write.mode("overwrite").parquet(path)
val requestedSchema =
  new StructType().
    add("f0", new StructType().
      // This nested field name differs from the original one
      add("f01", IntegerType)).
    add("f1", IntegerType)
println(requestedSchema.treeString)
// root
//  |-- f0: struct (nullable = true)
//  |    |-- f01: integer (nullable = true)
//  |-- f1: integer (nullable = true)
spark.read.schema(requestedSchema).parquet(path).show()
{code}
In the above snippet, {{requestedSchema}} is compatible with the schema of the written Parquet file, but the following exception is thrown:
{noformat}
org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/parquet-test/part-00007-d2b0bec1-7be5-4b51-8d53-3642680bc9c2.snappy.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:243)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:227)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:102)
at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:102)
at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:97)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:277)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:135)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:101)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:101)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:140)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:214)
... 21 more
{noformat}
According to this stack trace, it seems that {{GroupColumnIO.getFirst()}} [doesn't check for empty groups|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/io/GroupColumnIO.java#L103] properly.
> GroupColumnIO.getFirst() doesn't check for empty groups
> -------------------------------------------------------
>
> Key: PARQUET-893
> URL: https://issues.apache.org/jira/browse/PARQUET-893
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.8.1
> Reporter: Cheng Lian
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)