You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Stephen Boesch <ja...@gmail.com> on 2017/02/18 19:50:12 UTC

Avalance of warnings trying to read Spark 1.6.X Parquet into Spark 2.X

The following JIRA mentions that a fix made to read parquet 1.6.2 into
2.X  STILL leaves an "avalanche" of warnings:


https://issues.apache.org/jira/browse/SPARK-17993

Here is the text inside one of the last comments before it was merged:

  I have built the code from the PR and it indeed succeeds reading the data.
  I have tried doing df.count() and now I'm swarmed with warnings like
this (they are just keep getting printed endlessly in the terminal):
  16/08/11 12:18:51 WARN CorruptStatistics: Ignoring statistics
because created_by could not be parsed (see PARQUET-251): parquet-mr
version 1.6.0
  org.apache.parquet.VersionParser$VersionParseException: Could not
parse created_by: parquet-mr version 1.6.0 using format: (.+) version
((.*) )?\(build ?(.*)\)
    at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
    at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
    at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:369)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:343)
    at

I am running 2.1.0 release and there are multitudes of these warnings.
Is there any way - short of changing logging level to ERROR - to
suppress these?

Re: Avalance of warnings trying to read Spark 1.6.X Parquet into Spark 2.X

Posted by Stephen Boesch <ja...@gmail.com>.
For now I have added to the log4j.properties:

log4j.logger.org.apache.parquet=ERROR

2017-02-18 11:50 GMT-08:00 Stephen Boesch <ja...@gmail.com>:

> The following JIRA mentions that a fix made to read parquet 1.6.2 into 2.X  STILL leaves an "avalanche" of warnings:
>
>
> https://issues.apache.org/jira/browse/SPARK-17993
>
> Here is the text inside one of the last comments before it was merged:
>
>   I have built the code from the PR and it indeed succeeds reading the data.
>   I have tried doing df.count() and now I'm swarmed with warnings like this (they are just keep getting printed endlessly in the terminal):
>   16/08/11 12:18:51 WARN CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr version 1.6.0
>   org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) )?\(build ?(.*)\)
>     at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>     at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
>     at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
>     at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
>     at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
>     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
>     at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
>     at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:369)
>     at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:343)
>     at
>
> I am running 2.1.0 release and there are multitudes of these warnings. Is there any way - short of changing logging level to ERROR - to suppress these?
>
>
>