Posted to issues@spark.apache.org by "pin_zhang (JIRA)" <ji...@apache.org> on 2018/02/09 09:10:00 UTC

[jira] [Created] (SPARK-23371) Parquet Footer data is wrong on window in parquet format partition table

pin_zhang created SPARK-23371:
---------------------------------

             Summary: Parquet Footer data is wrong on window in parquet format partition table 
                 Key: SPARK-23371
                 URL: https://issues.apache.org/jira/browse/SPARK-23371
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.2, 2.1.1
            Reporter: pin_zhang


On Windows, run the following SQL in the Spark shell:
 spark.sql("create table part_test (id string) partitioned by (index int) stored as parquet")
 spark.sql("insert into part_test partition (index = 1) values ('1')")

Querying the table with spark.sql("select * from part_test").show() then logs an exception (stack trace below).

This happens because the parquet.Version class in parquet-hadoop-bundle-1.6.0.jar cannot load the version info in Spark on Windows. The classloader tries to get the version from parquet-format-2.3.0-incubating.jar.
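
One way to check which jar a class was actually loaded from is to ask its code source. The following is a minimal sketch, not part of the original report: the class name ClassOrigin and the helper originOf are hypothetical, and on an affected setup the class name to pass would presumably be the Parquet Version class, with Spark's jars on the classpath.

```java
import java.security.CodeSource;

public class ClassOrigin {
    /**
     * Returns the jar or directory a class was loaded from.
     * Classes loaded by the bootstrap loader usually have no code source,
     * in which case a placeholder string is returned.
     */
    static String originOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "unknown (no code source)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws Exception {
        // Pass a fully qualified class name; defaults to this class itself.
        String name = args.length > 0 ? args[0] : "ClassOrigin";
        System.out.println(name + " -> " + originOf(Class.forName(name)));
    }
}
```

If the printed location is not the jar expected to supply the version info, that would be consistent with the classloader behaviour described above.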

18/02/09 16:58:48 WARN CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr using format: (.+) version ((.*) )?(build ?(.*))
 at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
 at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
 at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
 at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
 at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
 at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
 at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
 at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
 at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:99)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
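
The warning reports that the footer's created_by value is the bare string "parquet-mr", which does not fit the format quoted in the message. The following is a minimal sketch of that check, not the Parquet implementation itself. It assumes the parentheses around "build ..." are literal (the log prints them unescaped), and the class name CreatedByCheck, the helper parses, and the sample string with build id "abc123" are hypothetical.

```java
import java.util.regex.Pattern;

public class CreatedByCheck {
    // Format taken from the warning text. Assumption: the parentheses around
    // "build ..." are literal characters, so they are escaped here.
    static final Pattern FORMAT =
        Pattern.compile("(.+) version ((.*) )?\\(build ?(.*)\\)");

    /** Returns true if the whole created_by string fits the format. */
    static boolean parses(String createdBy) {
        return FORMAT.matcher(createdBy).matches();
    }

    public static void main(String[] args) {
        // Value seen in the warning: no " version " part, so it cannot match.
        System.out.println(parses("parquet-mr"));                              // prints false
        // Hypothetical well-formed value, for comparison.
        System.out.println(parses("parquet-mr version 1.6.0 (build abc123)")); // prints true
    }
}
```

Under these assumptions, a footer written without the version and build parts would trigger the warning on every read, which matches the symptom reported here.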


