Posted to dev@parquet.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2015/08/13 10:21:45 UTC

[jira] [Updated] (PARQUET-136) NPE thrown in StatisticsFilter when all values in a string/binary column chunk are null

     [ https://issues.apache.org/jira/browse/PARQUET-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated PARQUET-136:
-------------------------------
    Description: 
For a string or binary column, if all values in a single column chunk are null, the min and max values recorded in the column chunk statistics are null as well. However, the statistics check performed for column chunk pruning is missing a null guard, which causes an NPE. The corresponding code can be found [here|https://github.com/apache/parquet-mr/blob/251a495d2a72de7e892ade7f64980f51f2fcc0dd/parquet-hadoop/src/main/java/parquet/filter2/statisticslevel/StatisticsFilter.java#L97-L100].
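
For reference, the comparison at the linked lines dereferences the chunk's min/max statistics directly, so an all-null chunk trips the NPE. Below is a minimal sketch of the kind of guard that avoids it, not the actual patch; the helper name {{canDropEq}} is hypothetical, while {{genericGetMin()}}, {{genericGetMax()}}, {{getNumNulls()}} and {{getValueCount()}} are existing parquet-mr accessors:
{code}
import parquet.column.statistics.Statistics;
import parquet.hadoop.metadata.ColumnChunkMetaData;

// Sketch only: guard the Eq comparison against a column chunk whose values
// are all null, in which case the chunk's min/max statistics are null too.
private static <T extends Comparable<T>> boolean canDropEq(
    T value, ColumnChunkMetaData columnChunk) {
  Statistics<T> stats = columnChunk.getStatistics();

  // If every value in the chunk is null, no row can equal a non-null value,
  // so the chunk is droppable without ever touching the (null) min/max.
  if (stats.getNumNulls() == columnChunk.getValueCount()) {
    return true;
  }

  // At least one non-null value exists, so min/max are populated and the
  // original range comparison is safe.
  return value.compareTo(stats.genericGetMin()) < 0
      || value.compareTo(stats.genericGetMax()) > 0;
}
{code}
The same guard would presumably be needed in the other comparison visitors ({{NotEq}}, {{Lt}}, {{LtEq}}, {{Gt}}, {{GtEq}}), which compare values against the same statistics.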

This issue can be reliably reproduced with the following Spark shell snippet against Spark 1.2.0-SNAPSHOT ([013089794d|https://github.com/apache/spark/tree/013089794ddfffbae8b913b72c1fa6375774207a]):
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

case class StringCol(value: String)

// Write a single-partition Parquet file whose only value is null, so the
// resulting column chunk carries null min/max statistics.
sc.parallelize(StringCol(null) :: Nil, 1).saveAsParquetFile("/tmp/empty.parquet")
parquetFile("/tmp/empty.parquet").registerTempTable("null_table")

// Enable filter pushdown so the predicate below is checked against the
// column chunk statistics, hitting the unguarded comparison.
sql("SET spark.sql.parquet.filterPushdown=true")
sql("SELECT * FROM null_table WHERE value = 'foo'").collect()
{code}
Exception thrown:
{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
        at parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206)
        at parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162)
        at parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100)
        at parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
        at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
        at parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52)
        at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46)
        at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
        at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
        at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
        at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
        at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}

> NPE thrown in StatisticsFilter when all values in a string/binary column chunk are null
> ---------------------------------------------------------------------------------------
>
>                 Key: PARQUET-136
>                 URL: https://issues.apache.org/jira/browse/PARQUET-136
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: Cheng Lian
>             Fix For: 1.6.0
>


