Posted to dev@parquet.apache.org by Pradeep Gollakota <pr...@gmail.com> on 2017/02/09 02:49:30 UTC

Missing min/max statistics in file footer

Hi folks,

I generated a bunch of Parquet files using Spark and
ParquetThriftOutputFormat. The Thrift model has a string column called
"deviceId" and an int64 column called "timestamp". After the files were
generated, I inspected the file footers and noticed that only the
"timestamp" field has min/max statistics. My primary filter will be on
deviceId, and the data is partitioned and sorted by deviceId, but since the
statistics are missing, the reader cannot prune blocks from being read.
Am I missing a configuration setting that enables writing the stats?
The following code shows how an RDD[Thrift] is saved to Parquet; the
configuration is the default.

implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: RDD[T]) {
  def saveAsParquet(output: String,
                    conf: Configuration = rdd.context.hadoopConfiguration): Unit = {
    val job = Job.getInstance(conf)
    val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
    ParquetThriftOutputFormat.setThriftClass(job, clazz)
    rdd.map[(Void, T)](x => (null, x))
      .saveAsNewAPIHadoopFile(
        output,
        classOf[Void],
        clazz,
        classOf[ParquetThriftOutputFormat[T]],
        job.getConfiguration)
  }
}
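
For reference, this is how I intend to use those stats on the read side; a
minimal sketch using parquet-mr's filter2 API, assuming a top-level binary
column named "deviceId" (the device id value is made up):

import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;
import org.apache.parquet.io.api.Binary;

// Build a predicate on the string column; row groups whose [min, max] range
// for deviceId cannot contain the value are skipped when stats are present.
FilterPredicate onlyOneDevice =
    FilterApi.eq(FilterApi.binaryColumn("deviceId"), Binary.fromString("device-42"));
ParquetInputFormat.setFilterPredicate(job.getConfiguration(), onlyOneDevice);  // job: the Hadoop Job above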


Thanks,
Pradeep

Re: Missing min/max statistics in file footer

Posted by Julien Le Dem <ju...@ledem.net>.
When the reader ignores the stats, you should see a warning in the logs.
If you have a local build you can easily modify the logic to verify:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L347
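
The gist of the logic there (paraphrased from memory, so the exact source may
differ) is that min/max are only converted when the createdBy string passes the
PARQUET-251 check; otherwise they are dropped and the warning is logged:

// Paraphrased sketch of ParquetMetadataConverter.fromParquetStatistics, not the exact source.
// formatStats: org.apache.parquet.format.Statistics from the footer; type: the column's
// PrimitiveTypeName; createdBy: the file's created_by string.
Statistics stats = Statistics.getStatsBasedOnType(type);
if (formatStats != null) {
  if (formatStats.isSetMin() && formatStats.isSetMax()
      && !CorruptStatistics.shouldIgnoreStatistics(createdBy, type)) {
    stats.setMinMaxFromBytes(formatStats.min.array(), formatStats.max.array());
  }
  stats.setNumNulls(formatStats.null_count);
}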

> On Feb 10, 2017, at 12:39 PM, Lars Volker <lv...@cloudera.com> wrote:
> 
> In that case I don't see why reading the stats shouldn't work, assuming
> they are in the file in the first place. I don't know why writing them
> would fail, so unless someone else can help you, you may have to debug the
> code that writes them.
> 
> On Fri, Feb 10, 2017 at 8:31 PM, Pradeep Gollakota <pr...@gmail.com>
> wrote:
> 
>> metadata.getFileMetadata().createdBy() shows this "parquet-mr version
>> 1.9.1-SNAPSHOT (build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)"
>> 
>> Ignore the 1.9.1-SNAPSHOT... that's my local build as I'm trying to work on
>> PARQUET-869 <https://issues.apache.org/jira/browse/PARQUET-869>
>> 
>> On Fri, Feb 10, 2017 at 10:17 AM, Lars Volker <lv...@cloudera.com> wrote:
>> 
>>> Can you check the value of ParquetMetaData.created_by? Once you have
>> that,
>>> you should see if it gets filtered by the code in CorruptStatistics.java.
>>> 
>>> On Fri, Feb 10, 2017 at 7:11 PM, Pradeep Gollakota <pradeepg26@gmail.com
>>> 
>>> wrote:
>>> 
>>>> Data was written with Spark but I'm using the parquet APIs directly for
>>>> reads. I checked the stats in the footer with the following code.
>>>> 
>>>> ParquetMetadata metadata = ParquetFileReader.readFooter(conf, path,
>>>> ParquetMetadataConverter.NO_FILTER);
>>>> ColumnPath deviceId = ColumnPath.get("deviceId");
>>>> metadata.getBlocks().forEach(b -> {
>>>>    if (b.getTotalByteSize() > 4 * 1024 * 1024L) {
>>>>        System.out.println("\nBlockSize = " + b.getTotalByteSize());
>>>>        System.out.println("ComprSize = " + b.getCompressedSize());
>>>>        System.out.println("Num Rows  = " + b.getRowCount());
>>>>        b.getColumns().forEach(c -> {
>>>>            if (c.getPath().equals(deviceId)) {
>>>>                Comparable max = c.getStatistics().genericGetMax();
>>>>                Comparable min = c.getStatistics().genericGetMin();
>>>>                System.out.println("\t" + c.getPath() + " [" + min +
>>>> ", " + max + "]");
>>>>            }
>>>>        });
>>>>    }
>>>> });
>>>> 
>>>> 
>>>> Thanks,
>>>> Pradeep
>>>> 
>>>> On Fri, Feb 10, 2017 at 9:08 AM, Lars Volker <lv...@cloudera.com> wrote:
>>>> 
>>>>> Hi Pradeep,
>>>>> 
>>>>> I don't have any experience with using Parquet APIs through Spark.
>> That
>>>>> being said, there are currently several issues around column
>>> statistics,
>>>>> both in the format and in the parquet-mr implementation (PARQUET-686,
>>>>> PARQUET-839, PARQUET-840).
>>>>> 
>>>>> However, in your case and depending on the versions involved, you
>> might
>>>>> also hit PARQUET-251, which can cause statistics for some files to be
>>>>> ignored. In this context it may be worth to have a look at this file:
>>>>> https://github.com/apache/parquet-mr/blob/master/
>>>>> parquet-column/src/main/java/org/apache/parquet/
>> CorruptStatistics.java
>>>>> 
>>>>> How did you check that the statistics are not written to the footer?
>> If
>>>> you
>>>>> used parquet-mr, they may be there but be ignored.
>>>>> 
>>>>> Cheers, Lars
>>>>> 
>>>>> On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <
>>> pradeepg26@gmail.com
>>>>> 
>>>>> wrote:
>>>>> 
>>>>>> Bumping the thread to see if I get any responses.
>>>>>> 
>>>>>> On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <
>>>> pradeepg26@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi folks,
>>>>>>> 
>>>>>>> I generated a bunch of parquet files using spark and
>>>>>>> ParquetThriftOutputFormat. The thirft model has a column called
>>>>>> "deviceId"
>>>>>>> which is a string column. It also has a "timestamp" column of
>>> int64.
>>>>>> After
>>>>>>> the files have been generated, I inspected the file footers and
>>>> noticed
>>>>>>> that only the "timestamp" field has min/max statistics. My
>> primary
>>>>> filter
>>>>>>> will be deviceId, the data is partitioned and sorted by deviceId,
>>> but
>>>>>> since
>>>>>>> the statistics data is missing, it's not able to prune blocks
>> from
>>>>> being
>>>>>>> read. Am I missing some configuration setting that allows it to
>>>>> generate
>>>>>>> the stats data? The following is code is how an RDD[Thrift] is
>>> being
>>>>>> saved
>>>>>>> to parquet. The configuration is default configuration.
>>>>>>> 
>>>>>>> implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] :
>>>>>> ClassTag](rdd: RDD[T]) {
>>>>>>>  def saveAsParquet(output: String,
>>>>>>>                    conf: Configuration = rdd.context.
>>>>> hadoopConfiguration):
>>>>>> Unit = {
>>>>>>>    val job = Job.getInstance(conf)
>>>>>>>    val clazz: Class[T] = classTag[T].runtimeClass.
>>>>>> asInstanceOf[Class[T]]
>>>>>>>    ParquetThriftOutputFormat.setThriftClass(job, clazz)
>>>>>>>    val r = rdd.map[(Void, T)](x => (null, x))
>>>>>>>      .saveAsNewAPIHadoopFile(
>>>>>>>        output,
>>>>>>>        classOf[Void],
>>>>>>>        clazz,
>>>>>>>        classOf[ParquetThriftOutputFormat[T]],
>>>>>>>        job.getConfiguration)
>>>>>>>  }
>>>>>>> }
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Pradeep
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: Missing min/max statistics in file footer

Posted by Lars Volker <lv...@cloudera.com>.
In that case I don't see why reading the stats shouldn't work, assuming
they are in the file in the first place. I don't know why writing them
would fail, so unless someone else can help you, you may have to debug the
code that writes them.
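
If you go down that route, the write path keeps one Statistics object per
column chunk and updates it per value; roughly like this (a from-memory sketch,
not the exact writer source):

// From-memory sketch of what the column writer does per value (not the exact source):
// a Statistics object accumulates min/max and is later serialized into the footer.
Statistics stats = Statistics.getStatsBasedOnType(PrimitiveType.PrimitiveTypeName.BINARY);
stats.updateStats(Binary.fromString("device-a"));   // one call per written value
stats.updateStats(Binary.fromString("device-b"));
System.out.println(stats.genericGetMin() + " .. " + stats.genericGetMax());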

On Fri, Feb 10, 2017 at 8:31 PM, Pradeep Gollakota <pr...@gmail.com>
wrote:

> metadata.getFileMetadata().createdBy() shows this "parquet-mr version
> 1.9.1-SNAPSHOT (build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)"
>
> Ignore the 1.9.1-SNAPSHOT... that's my local build as I'm trying to work on
> PARQUET-869 <https://issues.apache.org/jira/browse/PARQUET-869>
>
> On Fri, Feb 10, 2017 at 10:17 AM, Lars Volker <lv...@cloudera.com> wrote:
>
> > Can you check the value of ParquetMetaData.created_by? Once you have
> that,
> > you should see if it gets filtered by the code in CorruptStatistics.java.
> >
> > On Fri, Feb 10, 2017 at 7:11 PM, Pradeep Gollakota <pradeepg26@gmail.com
> >
> > wrote:
> >
> > > Data was written with Spark but I'm using the parquet APIs directly for
> > > reads. I checked the stats in the footer with the following code.
> > >
> > > ParquetMetadata metadata = ParquetFileReader.readFooter(conf, path,
> > > ParquetMetadataConverter.NO_FILTER);
> > > ColumnPath deviceId = ColumnPath.get("deviceId");
> > > metadata.getBlocks().forEach(b -> {
> > >     if (b.getTotalByteSize() > 4 * 1024 * 1024L) {
> > >         System.out.println("\nBlockSize = " + b.getTotalByteSize());
> > >         System.out.println("ComprSize = " + b.getCompressedSize());
> > >         System.out.println("Num Rows  = " + b.getRowCount());
> > >         b.getColumns().forEach(c -> {
> > >             if (c.getPath().equals(deviceId)) {
> > >                 Comparable max = c.getStatistics().genericGetMax();
> > >                 Comparable min = c.getStatistics().genericGetMin();
> > >                 System.out.println("\t" + c.getPath() + " [" + min +
> > > ", " + max + "]");
> > >             }
> > >         });
> > >     }
> > > });
> > >
> > >
> > > Thanks,
> > > Pradeep
> > >
> > > On Fri, Feb 10, 2017 at 9:08 AM, Lars Volker <lv...@cloudera.com> wrote:
> > >
> > > > Hi Pradeep,
> > > >
> > > > I don't have any experience with using Parquet APIs through Spark.
> That
> > > > being said, there are currently several issues around column
> > statistics,
> > > > both in the format and in the parquet-mr implementation (PARQUET-686,
> > > > PARQUET-839, PARQUET-840).
> > > >
> > > > However, in your case and depending on the versions involved, you
> might
> > > > also hit PARQUET-251, which can cause statistics for some files to be
> > > > ignored. In this context it may be worth to have a look at this file:
> > > > https://github.com/apache/parquet-mr/blob/master/
> > > > parquet-column/src/main/java/org/apache/parquet/
> CorruptStatistics.java
> > > >
> > > > How did you check that the statistics are not written to the footer?
> If
> > > you
> > > > used parquet-mr, they may be there but be ignored.
> > > >
> > > > Cheers, Lars
> > > >
> > > > On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <
> > pradeepg26@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Bumping the thread to see if I get any responses.
> > > > >
> > > > > On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <
> > > pradeepg26@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi folks,
> > > > > >
> > > > > > I generated a bunch of parquet files using spark and
> > > > > > ParquetThriftOutputFormat. The thirft model has a column called
> > > > > "deviceId"
> > > > > > which is a string column. It also has a "timestamp" column of
> > int64.
> > > > > After
> > > > > > the files have been generated, I inspected the file footers and
> > > noticed
> > > > > > that only the "timestamp" field has min/max statistics. My
> primary
> > > > filter
> > > > > > will be deviceId, the data is partitioned and sorted by deviceId,
> > but
> > > > > since
> > > > > > the statistics data is missing, it's not able to prune blocks
> from
> > > > being
> > > > > > read. Am I missing some configuration setting that allows it to
> > > > generate
> > > > > > the stats data? The following is code is how an RDD[Thrift] is
> > being
> > > > > saved
> > > > > > to parquet. The configuration is default configuration.
> > > > > >
> > > > > > implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] :
> > > > > ClassTag](rdd: RDD[T]) {
> > > > > >   def saveAsParquet(output: String,
> > > > > >                     conf: Configuration = rdd.context.
> > > > hadoopConfiguration):
> > > > > Unit = {
> > > > > >     val job = Job.getInstance(conf)
> > > > > >     val clazz: Class[T] = classTag[T].runtimeClass.
> > > > > asInstanceOf[Class[T]]
> > > > > >     ParquetThriftOutputFormat.setThriftClass(job, clazz)
> > > > > >     val r = rdd.map[(Void, T)](x => (null, x))
> > > > > >       .saveAsNewAPIHadoopFile(
> > > > > >         output,
> > > > > >         classOf[Void],
> > > > > >         clazz,
> > > > > >         classOf[ParquetThriftOutputFormat[T]],
> > > > > >         job.getConfiguration)
> > > > > >   }
> > > > > > }
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Pradeep
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Missing min/max statistics in file footer

Posted by Pradeep Gollakota <pr...@gmail.com>.
metadata.getFileMetaData().getCreatedBy() returns "parquet-mr version
1.9.1-SNAPSHOT (build 2fd62ee4d524c270764e9b91dca72e5cf1a005b7)"

Ignore the 1.9.1-SNAPSHOT... that's my local build as I'm trying to work on
PARQUET-869 <https://issues.apache.org/jira/browse/PARQUET-869>

On Fri, Feb 10, 2017 at 10:17 AM, Lars Volker <lv...@cloudera.com> wrote:

> Can you check the value of ParquetMetaData.created_by? Once you have that,
> you should see if it gets filtered by the code in CorruptStatistics.java.
>
> On Fri, Feb 10, 2017 at 7:11 PM, Pradeep Gollakota <pr...@gmail.com>
> wrote:
>
> > Data was written with Spark but I'm using the parquet APIs directly for
> > reads. I checked the stats in the footer with the following code.
> >
> > ParquetMetadata metadata = ParquetFileReader.readFooter(conf, path,
> > ParquetMetadataConverter.NO_FILTER);
> > ColumnPath deviceId = ColumnPath.get("deviceId");
> > metadata.getBlocks().forEach(b -> {
> >     if (b.getTotalByteSize() > 4 * 1024 * 1024L) {
> >         System.out.println("\nBlockSize = " + b.getTotalByteSize());
> >         System.out.println("ComprSize = " + b.getCompressedSize());
> >         System.out.println("Num Rows  = " + b.getRowCount());
> >         b.getColumns().forEach(c -> {
> >             if (c.getPath().equals(deviceId)) {
> >                 Comparable max = c.getStatistics().genericGetMax();
> >                 Comparable min = c.getStatistics().genericGetMin();
> >                 System.out.println("\t" + c.getPath() + " [" + min +
> > ", " + max + "]");
> >             }
> >         });
> >     }
> > });
> >
> >
> > Thanks,
> > Pradeep
> >
> > On Fri, Feb 10, 2017 at 9:08 AM, Lars Volker <lv...@cloudera.com> wrote:
> >
> > > Hi Pradeep,
> > >
> > > I don't have any experience with using Parquet APIs through Spark. That
> > > being said, there are currently several issues around column
> statistics,
> > > both in the format and in the parquet-mr implementation (PARQUET-686,
> > > PARQUET-839, PARQUET-840).
> > >
> > > However, in your case and depending on the versions involved, you might
> > > also hit PARQUET-251, which can cause statistics for some files to be
> > > ignored. In this context it may be worth to have a look at this file:
> > > https://github.com/apache/parquet-mr/blob/master/
> > > parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java
> > >
> > > How did you check that the statistics are not written to the footer? If
> > you
> > > used parquet-mr, they may be there but be ignored.
> > >
> > > Cheers, Lars
> > >
> > > On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <
> pradeepg26@gmail.com
> > >
> > > wrote:
> > >
> > > > Bumping the thread to see if I get any responses.
> > > >
> > > > On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <
> > pradeepg26@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi folks,
> > > > >
> > > > > I generated a bunch of parquet files using spark and
> > > > > ParquetThriftOutputFormat. The thirft model has a column called
> > > > "deviceId"
> > > > > which is a string column. It also has a "timestamp" column of
> int64.
> > > > After
> > > > > the files have been generated, I inspected the file footers and
> > noticed
> > > > > that only the "timestamp" field has min/max statistics. My primary
> > > filter
> > > > > will be deviceId, the data is partitioned and sorted by deviceId,
> but
> > > > since
> > > > > the statistics data is missing, it's not able to prune blocks from
> > > being
> > > > > read. Am I missing some configuration setting that allows it to
> > > generate
> > > > > the stats data? The following is code is how an RDD[Thrift] is
> being
> > > > saved
> > > > > to parquet. The configuration is default configuration.
> > > > >
> > > > > implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] :
> > > > ClassTag](rdd: RDD[T]) {
> > > > >   def saveAsParquet(output: String,
> > > > >                     conf: Configuration = rdd.context.
> > > hadoopConfiguration):
> > > > Unit = {
> > > > >     val job = Job.getInstance(conf)
> > > > >     val clazz: Class[T] = classTag[T].runtimeClass.
> > > > asInstanceOf[Class[T]]
> > > > >     ParquetThriftOutputFormat.setThriftClass(job, clazz)
> > > > >     val r = rdd.map[(Void, T)](x => (null, x))
> > > > >       .saveAsNewAPIHadoopFile(
> > > > >         output,
> > > > >         classOf[Void],
> > > > >         clazz,
> > > > >         classOf[ParquetThriftOutputFormat[T]],
> > > > >         job.getConfiguration)
> > > > >   }
> > > > > }
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Pradeep
> > > > >
> > > >
> > >
> >
>

Re: Missing min/max statistics in file footer

Posted by Lars Volker <lv...@cloudera.com>.
Can you check the value of ParquetMetaData.created_by? Once you have that,
you should see if it gets filtered by the code in CorruptStatistics.java.
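
Something like this should tell you (a sketch using the parquet-mr metadata
APIs; conf and path as in your snippet):

// Sketch: read created_by from the footer and ask CorruptStatistics whether parquet-mr
// would discard binary min/max written by that version (the PARQUET-251 check).
ParquetMetadata footer =
    ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
String createdBy = footer.getFileMetaData().getCreatedBy();
boolean ignoredForBinary =
    CorruptStatistics.shouldIgnoreStatistics(createdBy, PrimitiveType.PrimitiveTypeName.BINARY);
System.out.println(createdBy + " -> binary stats ignored: " + ignoredForBinary);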

On Fri, Feb 10, 2017 at 7:11 PM, Pradeep Gollakota <pr...@gmail.com>
wrote:

> Data was written with Spark but I'm using the parquet APIs directly for
> reads. I checked the stats in the footer with the following code.
>
> ParquetMetadata metadata = ParquetFileReader.readFooter(conf, path,
> ParquetMetadataConverter.NO_FILTER);
> ColumnPath deviceId = ColumnPath.get("deviceId");
> metadata.getBlocks().forEach(b -> {
>     if (b.getTotalByteSize() > 4 * 1024 * 1024L) {
>         System.out.println("\nBlockSize = " + b.getTotalByteSize());
>         System.out.println("ComprSize = " + b.getCompressedSize());
>         System.out.println("Num Rows  = " + b.getRowCount());
>         b.getColumns().forEach(c -> {
>             if (c.getPath().equals(deviceId)) {
>                 Comparable max = c.getStatistics().genericGetMax();
>                 Comparable min = c.getStatistics().genericGetMin();
>                 System.out.println("\t" + c.getPath() + " [" + min +
> ", " + max + "]");
>             }
>         });
>     }
> });
>
>
> Thanks,
> Pradeep
>
> On Fri, Feb 10, 2017 at 9:08 AM, Lars Volker <lv...@cloudera.com> wrote:
>
> > Hi Pradeep,
> >
> > I don't have any experience with using Parquet APIs through Spark. That
> > being said, there are currently several issues around column statistics,
> > both in the format and in the parquet-mr implementation (PARQUET-686,
> > PARQUET-839, PARQUET-840).
> >
> > However, in your case and depending on the versions involved, you might
> > also hit PARQUET-251, which can cause statistics for some files to be
> > ignored. In this context it may be worth to have a look at this file:
> > https://github.com/apache/parquet-mr/blob/master/
> > parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java
> >
> > How did you check that the statistics are not written to the footer? If
> you
> > used parquet-mr, they may be there but be ignored.
> >
> > Cheers, Lars
> >
> > On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <pradeepg26@gmail.com
> >
> > wrote:
> >
> > > Bumping the thread to see if I get any responses.
> > >
> > > On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <
> pradeepg26@gmail.com>
> > > wrote:
> > >
> > > > Hi folks,
> > > >
> > > > I generated a bunch of parquet files using spark and
> > > > ParquetThriftOutputFormat. The thirft model has a column called
> > > "deviceId"
> > > > which is a string column. It also has a "timestamp" column of int64.
> > > After
> > > > the files have been generated, I inspected the file footers and
> noticed
> > > > that only the "timestamp" field has min/max statistics. My primary
> > filter
> > > > will be deviceId, the data is partitioned and sorted by deviceId, but
> > > since
> > > > the statistics data is missing, it's not able to prune blocks from
> > being
> > > > read. Am I missing some configuration setting that allows it to
> > generate
> > > > the stats data? The following is code is how an RDD[Thrift] is being
> > > saved
> > > > to parquet. The configuration is default configuration.
> > > >
> > > > implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] :
> > > ClassTag](rdd: RDD[T]) {
> > > >   def saveAsParquet(output: String,
> > > >                     conf: Configuration = rdd.context.
> > hadoopConfiguration):
> > > Unit = {
> > > >     val job = Job.getInstance(conf)
> > > >     val clazz: Class[T] = classTag[T].runtimeClass.
> > > asInstanceOf[Class[T]]
> > > >     ParquetThriftOutputFormat.setThriftClass(job, clazz)
> > > >     val r = rdd.map[(Void, T)](x => (null, x))
> > > >       .saveAsNewAPIHadoopFile(
> > > >         output,
> > > >         classOf[Void],
> > > >         clazz,
> > > >         classOf[ParquetThriftOutputFormat[T]],
> > > >         job.getConfiguration)
> > > >   }
> > > > }
> > > >
> > > >
> > > > Thanks,
> > > > Pradeep
> > > >
> > >
> >
>

Re: Missing min/max statistics in file footer

Posted by Pradeep Gollakota <pr...@gmail.com>.
Data was written with Spark but I'm using the parquet APIs directly for
reads. I checked the stats in the footer with the following code.

ParquetMetadata metadata =
    ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
ColumnPath deviceId = ColumnPath.get("deviceId");
metadata.getBlocks().forEach(b -> {
    if (b.getTotalByteSize() > 4 * 1024 * 1024L) {
        System.out.println("\nBlockSize = " + b.getTotalByteSize());
        System.out.println("ComprSize = " + b.getCompressedSize());
        System.out.println("Num Rows  = " + b.getRowCount());
        b.getColumns().forEach(c -> {
            if (c.getPath().equals(deviceId)) {
                Comparable max = c.getStatistics().genericGetMax();
                Comparable min = c.getStatistics().genericGetMin();
                System.out.println("\t" + c.getPath() + " [" + min + ", " + max + "]");
            }
        });
    }
});
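
I can also print whether the stats object is merely empty rather than absent;
a small variant, assuming ColumnChunkMetaData.getStatistics() returns an empty
Statistics object rather than null when nothing was recorded:

// Sketch: list every column chunk and whether its Statistics object holds any values.
metadata.getBlocks().forEach(b ->
    b.getColumns().forEach(c ->
        System.out.println(c.getPath() + " stats empty? " + c.getStatistics().isEmpty())));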


Thanks,
Pradeep

On Fri, Feb 10, 2017 at 9:08 AM, Lars Volker <lv...@cloudera.com> wrote:

> Hi Pradeep,
>
> I don't have any experience with using Parquet APIs through Spark. That
> being said, there are currently several issues around column statistics,
> both in the format and in the parquet-mr implementation (PARQUET-686,
> PARQUET-839, PARQUET-840).
>
> However, in your case and depending on the versions involved, you might
> also hit PARQUET-251, which can cause statistics for some files to be
> ignored. In this context it may be worth to have a look at this file:
> https://github.com/apache/parquet-mr/blob/master/
> parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java
>
> How did you check that the statistics are not written to the footer? If you
> used parquet-mr, they may be there but be ignored.
>
> Cheers, Lars
>
> On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <pr...@gmail.com>
> wrote:
>
> > Bumping the thread to see if I get any responses.
> >
> > On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <pr...@gmail.com>
> > wrote:
> >
> > > Hi folks,
> > >
> > > I generated a bunch of parquet files using spark and
> > > ParquetThriftOutputFormat. The thirft model has a column called
> > "deviceId"
> > > which is a string column. It also has a "timestamp" column of int64.
> > After
> > > the files have been generated, I inspected the file footers and noticed
> > > that only the "timestamp" field has min/max statistics. My primary
> filter
> > > will be deviceId, the data is partitioned and sorted by deviceId, but
> > since
> > > the statistics data is missing, it's not able to prune blocks from
> being
> > > read. Am I missing some configuration setting that allows it to
> generate
> > > the stats data? The following is code is how an RDD[Thrift] is being
> > saved
> > > to parquet. The configuration is default configuration.
> > >
> > > implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] :
> > ClassTag](rdd: RDD[T]) {
> > >   def saveAsParquet(output: String,
> > >                     conf: Configuration = rdd.context.
> hadoopConfiguration):
> > Unit = {
> > >     val job = Job.getInstance(conf)
> > >     val clazz: Class[T] = classTag[T].runtimeClass.
> > asInstanceOf[Class[T]]
> > >     ParquetThriftOutputFormat.setThriftClass(job, clazz)
> > >     val r = rdd.map[(Void, T)](x => (null, x))
> > >       .saveAsNewAPIHadoopFile(
> > >         output,
> > >         classOf[Void],
> > >         clazz,
> > >         classOf[ParquetThriftOutputFormat[T]],
> > >         job.getConfiguration)
> > >   }
> > > }
> > >
> > >
> > > Thanks,
> > > Pradeep
> > >
> >
>

Re: Missing min/max statistics in file footer

Posted by Lars Volker <lv...@cloudera.com>.
Hi Pradeep,

I don't have any experience with using Parquet APIs through Spark. That
being said, there are currently several issues around column statistics,
both in the format and in the parquet-mr implementation (PARQUET-686,
PARQUET-839, PARQUET-840).

However, in your case and depending on the versions involved, you might
also be hitting PARQUET-251, which can cause the statistics of some files to be
ignored. In this context it may be worth having a look at this file:
https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/CorruptStatistics.java

How did you check that the statistics are not written to the footer? If you
used parquet-mr to check, they may be there but ignored.
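
If you want to see the PARQUET-251 filtering in isolation, something like this
should show it (a sketch; I believe the cutoff encoded in CorruptStatistics is
the 1.8.0 fix version, and the created_by strings below are made up):

// Sketch: probe the check with made-up created_by strings.
String oldWriter = "parquet-mr version 1.6.0 (build abc)";
String newWriter = "parquet-mr version 1.8.1 (build abc)";
System.out.println(CorruptStatistics.shouldIgnoreStatistics(
    oldWriter, PrimitiveType.PrimitiveTypeName.BINARY));   // expected: true (stats ignored)
System.out.println(CorruptStatistics.shouldIgnoreStatistics(
    newWriter, PrimitiveType.PrimitiveTypeName.BINARY));   // expected: false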

Cheers, Lars

On Fri, Feb 10, 2017 at 5:31 PM, Pradeep Gollakota <pr...@gmail.com>
wrote:

> Bumping the thread to see if I get any responses.
>
> On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <pr...@gmail.com>
> wrote:
>
> > Hi folks,
> >
> > I generated a bunch of parquet files using spark and
> > ParquetThriftOutputFormat. The thirft model has a column called
> "deviceId"
> > which is a string column. It also has a "timestamp" column of int64.
> After
> > the files have been generated, I inspected the file footers and noticed
> > that only the "timestamp" field has min/max statistics. My primary filter
> > will be deviceId, the data is partitioned and sorted by deviceId, but
> since
> > the statistics data is missing, it's not able to prune blocks from being
> > read. Am I missing some configuration setting that allows it to generate
> > the stats data? The following is code is how an RDD[Thrift] is being
> saved
> > to parquet. The configuration is default configuration.
> >
> > implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] :
> ClassTag](rdd: RDD[T]) {
> >   def saveAsParquet(output: String,
> >                     conf: Configuration = rdd.context.hadoopConfiguration):
> Unit = {
> >     val job = Job.getInstance(conf)
> >     val clazz: Class[T] = classTag[T].runtimeClass.
> asInstanceOf[Class[T]]
> >     ParquetThriftOutputFormat.setThriftClass(job, clazz)
> >     val r = rdd.map[(Void, T)](x => (null, x))
> >       .saveAsNewAPIHadoopFile(
> >         output,
> >         classOf[Void],
> >         clazz,
> >         classOf[ParquetThriftOutputFormat[T]],
> >         job.getConfiguration)
> >   }
> > }
> >
> >
> > Thanks,
> > Pradeep
> >
>

Re: Missing min/max statistics in file footer

Posted by Pradeep Gollakota <pr...@gmail.com>.
Bumping the thread to see if I get any responses.

On Wed, Feb 8, 2017 at 6:49 PM, Pradeep Gollakota <pr...@gmail.com>
wrote:

> Hi folks,
>
> I generated a bunch of parquet files using spark and
> ParquetThriftOutputFormat. The thirft model has a column called "deviceId"
> which is a string column. It also has a "timestamp" column of int64. After
> the files have been generated, I inspected the file footers and noticed
> that only the "timestamp" field has min/max statistics. My primary filter
> will be deviceId, the data is partitioned and sorted by deviceId, but since
> the statistics data is missing, it's not able to prune blocks from being
> read. Am I missing some configuration setting that allows it to generate
> the stats data? The following is code is how an RDD[Thrift] is being saved
> to parquet. The configuration is default configuration.
>
> implicit class ThriftRDD[T <: TBase[T, _ <: TFieldIdEnum] : ClassTag](rdd: RDD[T]) {
>   def saveAsParquet(output: String,
>                     conf: Configuration = rdd.context.hadoopConfiguration): Unit = {
>     val job = Job.getInstance(conf)
>     val clazz: Class[T] = classTag[T].runtimeClass.asInstanceOf[Class[T]]
>     ParquetThriftOutputFormat.setThriftClass(job, clazz)
>     val r = rdd.map[(Void, T)](x => (null, x))
>       .saveAsNewAPIHadoopFile(
>         output,
>         classOf[Void],
>         clazz,
>         classOf[ParquetThriftOutputFormat[T]],
>         job.getConfiguration)
>   }
> }
>
>
> Thanks,
> Pradeep
>