Posted to dev@parquet.apache.org by David Mollitor <da...@gmail.com> on 2020/03/09 14:42:14 UTC

Finding Max Value of Column

Hello Gang,

I am trying to build an application.  One of its functions is to scan a
directory of Parquet files and determine the maximum "sequence number"
(id) across all files.  This is the solution I came up with, but is it
correct?  How would you do such a thing?

I wrote the files with the parquet-avro writer.

import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

// Visit only Parquet files (my filter; I left its definition out earlier).
DirectoryStream.Filter<java.nio.file.Path> filter =
    p -> p.toString().endsWith(".parquet");

try (DirectoryStream<java.nio.file.Path> directoryStream =
    Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {

  // Empty statistics for the INT64 "seq" column; chunk stats merge into it.
  PrimitiveType type = Types.required(PrimitiveTypeName.INT64).named("seq");
  Statistics<?> stats = Statistics.getBuilderForReading(type).build();

  for (java.nio.file.Path path : directoryStream) {
    // Only the footer metadata is read; try-with-resources closes the reader.
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(path.toUri()), new Configuration()))) {
      for (final BlockMetaData rowGroup : reader.getRowGroups()) {
        for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
          if ("seq".equals(column.getPath().toDotString())) {
            stats.mergeStatistics(column.getStatistics());
          }
        }
      }
    }
  }
}
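
For completeness, reading the merged result back would be something like this
(a sketch; for an INT64 column, genericGetMax() returns the boxed Long):

if (!stats.isEmpty() && stats.hasNonNullValue()) {
  // Global maximum "seq" across every row group of every file scanned.
  long maxSeq = (Long) stats.genericGetMax();
  System.out.println("max seq = " + maxSeq);
}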

Thanks.

Re: Finding Max Value of Column

Posted by Gabor Szadovszky <ga...@cloudera.com.INVALID>.
Hi David,

Unfortunately, I don't have a better solution. If you think finding a
global min/max value in a file would be frequently used by our clients,
you may create a JIRA for this feature.

Regards,
Gabor

On Tue, Mar 10, 2020 at 2:08 PM David Mollitor <da...@gmail.com> wrote:

Re: Finding Max Value of Column

Posted by David Mollitor <da...@gmail.com>.
Hey Gabor,

I appreciate you sharing your knowledge with me.

As I understand it, my solution is acceptable but is not the generalized
solution.  What would that solution look like?

Thanks.

On Tue, Mar 10, 2020, 4:55 AM Gabor Szadovszky
<ga...@cloudera.com.invalid> wrote:

Re: Finding Max Value of Column

Posted by Gabor Szadovszky <ga...@cloudera.com.INVALID>.
Hi,

Statistics objects are mainly created for internal use. The check you
mentioned is there to ensure that only the corresponding column's
statistics are summarized.
The code you've written works properly because you create and use the
Statistics object the same way we use it internally. However, it is quite
easy to misuse.
It is also worth mentioning that the code works properly because your type
is INT64. For some other types (e.g. FLOAT, DOUBLE, BINARY) it would not
always be so trivial.
So, if this code works for your case you may use it, but I would not
suggest generalizing it to other cases, nor would I suggest extending the
existing code to support that.
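
To illustrate the DOUBLE case: the Parquet format recommends keeping NaN
values out of min/max statistics, so a correct merge needs guards that the
INT64 path never has to worry about. A sketch only (mergeMax is a
hypothetical helper, not a parquet-mr method):

// NaN-aware max merge for DOUBLE statistics; NaN must never win.
static Double mergeMax(Double current, Double candidate) {
  if (candidate == null || candidate.isNaN()) {
    return current;
  }
  if (current == null || current.isNaN()) {
    return candidate;
  }
  return Math.max(current, candidate);
}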

Regards,
Gabor

On Mon, Mar 9, 2020 at 4:12 PM David Mollitor <da...@gmail.com> wrote:

Re: Finding Max Value of Column

Posted by David Mollitor <da...@gmail.com>.
Hello,

One thing that would have made this even easier: the 'mergeStatistics'
method throws an exception if the statistics on the two sides of the merge
do not match. I had to add that toDotString check to avoid this scenario. I
could have just caught (and ignored) that exception to remove the extra
check, but the overhead would have been heavy, and it would have added even
more code.

The 'mergeStatistics' method already performs a comparison check internally
(that's why it throws an exception). Is there any interest in adding a new
method signature that returns true/false indicating whether the merge was
successful, instead of throwing an exception?

Then the code just becomes:

for (final BlockMetaData rowGroup : reader.getRowGroups()) {
  for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
    boolean success = stats.mergeStatistics(column.getStatistics());
  }
}
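
In the meantime, a small wrapper gives the same effect from the outside. A
sketch only (tryMergeStatistics is a hypothetical helper, not an existing
parquet-mr method; the pre-check leans on Statistics.type()):

// Hypothetical helper: returns false instead of throwing when the two
// statistics objects are not mergeable.
static boolean tryMergeStatistics(Statistics<?> target, Statistics<?> source) {
  if (source == null || !target.type().equals(source.type())) {
    return false;
  }
  target.mergeStatistics(source);
  return true;
}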



On Mon, Mar 9, 2020 at 10:58 AM Gabor Szadovszky
<ga...@cloudera.com.invalid> wrote:

Re: Finding Max Value of Column

Posted by Gabor Szadovszky <ga...@cloudera.com.INVALID>.
Hi David,

Your code looks good to me. As you are using INT64, min/max truncation does
not apply. I think it should work fine.
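
For context on the truncation remark: for BINARY columns, newer parquet-mr
versions can truncate the stored min/max to keep footers small, so a merged
max would only be an upper bound there. A sketch of the writer-side knob
(property name assumed from recent releases; please verify against yours):

// Assumed property: caps the length of stored BINARY min/max values at
// write time. Truncation never applies to INT64 statistics.
Configuration conf = new Configuration();
conf.setInt("parquet.statistics.truncate.length", 64);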

Cheers,
Gabor

On Mon, Mar 9, 2020 at 3:42 PM David Mollitor <da...@gmail.com> wrote: