Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/20 03:03:46 UTC

[GitHub] [arrow] romgrk-comparative opened a new issue #10753: Retrieve min/max values from parquet files

romgrk-comparative opened a new issue #10753:
URL: https://github.com/apache/arrow/issues/10753


   Hey,
   
   I'm unfamiliar with both Parquet and Arrow, but I've been playing with the `parquet-meta` tool, and when it dumps the metadata I can see min/max values for each row group listed. Is it possible to access those min/max values from Arrow? I've been computing them by looping through all entries, but it would be nice to read them from the metadata if possible.
   
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romgrk-comparative commented on issue #10753: Retrieve min/max values from parquet files

Posted by GitBox <gi...@apache.org>.
romgrk-comparative commented on issue #10753:
URL: https://github.com/apache/arrow/issues/10753#issuecomment-883040287


   Haven't found those min/max values :/ Any help appreciated.





[GitHub] [arrow] westonpace commented on issue #10753: Retrieve min/max values from parquet files

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #10753:
URL: https://github.com/apache/arrow/issues/10753#issuecomment-883308382


   Things get a little tricky in C++ because of all the static typing, and the classes there aren't as well documented.  For a robust solution you'd probably want to use templates, but this example should get you started:
   
   ```
     // Fragment: place inside a function that returns arrow::Status. Needs <iostream>,
     // <arrow/api.h>, <arrow/filesystem/localfs.h>, <parquet/arrow/reader.h>,
     // <parquet/metadata.h> and <parquet/statistics.h>.
     arrow::fs::LocalFileSystem file_system;
     ARROW_ASSIGN_OR_RAISE(auto input, file_system.OpenInputFile("data.parquet"));
   
     parquet::ArrowReaderProperties arrow_reader_properties =
         parquet::default_arrow_reader_properties();
   
     arrow_reader_properties.set_pre_buffer(true);
     arrow_reader_properties.set_use_threads(true);
   
     parquet::ReaderProperties reader_properties =
         parquet::default_reader_properties();
   
     // Open Parquet file reader
     std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
     auto reader_builder = parquet::arrow::FileReaderBuilder();
     reader_builder.properties(arrow_reader_properties);
     ARROW_RETURN_NOT_OK(reader_builder.Open(std::move(input), reader_properties));
     ARROW_RETURN_NOT_OK(reader_builder.Build(&arrow_reader));
   
     std::shared_ptr<arrow::Schema> schema;
     ARROW_RETURN_NOT_OK(arrow_reader->GetSchema(&schema));
     // The statistics live in the file metadata, so no row data is read here.
     auto metadata = arrow_reader->parquet_reader()->metadata();
     for (int i = 0; i < metadata->num_row_groups(); i++) {
       auto row_group = metadata->RowGroup(i);
       std::cout << "Row group: " << i << " (" << row_group->num_rows() << " rows)"
                 << std::endl;
       for (int j = 0; j < row_group->num_columns(); j++) {
         auto column = row_group->ColumnChunk(j);
         // Assumes a flat schema, where Parquet column j maps to Arrow field j.
         auto field = schema->field(j);
         std::cout << "  Column: " << field->name() << " ("
                   << field->type()->ToString() << ")" << std::endl;
         // statistics() returns nullptr when the writer stored no statistics.
         auto stats = column->statistics();
         if (stats != nullptr && stats->HasMinMax()) {
           if (field->type()->id() == arrow::Type::DOUBLE) {
             auto double_stats = std::dynamic_pointer_cast<
                 parquet::TypedStatistics<parquet::DoubleType>>(stats);
             std::cout << "    Minimum: " << double_stats->min() << std::endl;
             std::cout << "    Maximum: " << double_stats->max() << std::endl;
           } else if (field->type()->id() == arrow::Type::TIMESTAMP) {
             // Matches any time unit. Assumes the timestamp is stored as INT64
             // (not the legacy INT96), so min/max are raw integer values.
             auto int_stats = std::dynamic_pointer_cast<
                 parquet::TypedStatistics<parquet::Int64Type>>(stats);
             std::cout << "    Minimum: " << int_stats->min() << std::endl;
             std::cout << "    Maximum: " << int_stats->max() << std::endl;
           } else {
             // Fallback: encoded bytes, which may not be printable.
             std::cout << "    Minimum: " << stats->EncodeMin() << std::endl;
             std::cout << "    Maximum: " << stats->EncodeMax() << std::endl;
           }
         } else {
           std::cout << "    Minimum: unknown" << std::endl;
           std::cout << "    Maximum: unknown" << std::endl;
         }
       }
     }
   ```
   
   The base `parquet::Statistics` class only exposes `EncodeMin` and `EncodeMax`, which encode the min/max into a byte array (not necessarily a printable string).  If you want the typed value, you need to cast the statistics object to the matching `parquet::TypedStatistics<...>` subclass.
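   
   As a rough illustration of the template idea, here is a minimal, library-free sketch (C++17) that turns the encoded bytes back into a value. It assumes the bytes are the PLAIN encoding of a fixed-width type (little-endian, which is what the Parquet format specifies for statistics) and that the host is little-endian too. `PhysicalType`, `DecodePlain`, `FormatAs` and `FormatEncodedStat` are hypothetical names made up for this example, not part of Arrow or Parquet; a real program would switch on the column's Parquet physical type instead of the stand-in enum.
   
   ```cpp
   #include <cstdint>
   #include <cstring>
   #include <optional>
   #include <string>
   #include <type_traits>
   
   // Hypothetical stand-in for the Parquet physical types handled here.
   enum class PhysicalType { kInt32, kInt64, kFloat, kDouble, kOther };
   
   // Reinterpret PLAIN-encoded bytes as a fixed-width value.
   // Assumes a little-endian host. Returns std::nullopt when the byte count
   // does not match sizeof(T).
   template <typename T>
   std::optional<T> DecodePlain(const std::string& bytes) {
     static_assert(std::is_trivially_copyable<T>::value,
                   "T must be trivially copyable");
     if (bytes.size() != sizeof(T)) return std::nullopt;
     T value;
     std::memcpy(&value, bytes.data(), sizeof(T));
     return value;
   }
   
   // Decode the bytes as T and render the result as text.
   template <typename T>
   std::string FormatAs(const std::string& bytes) {
     auto value = DecodePlain<T>(bytes);
     return value ? std::to_string(*value) : std::string("<undecodable>");
   }
   
   // Pick the decoder from the physical type; anything else is reported as raw bytes.
   std::string FormatEncodedStat(PhysicalType type, const std::string& bytes) {
     switch (type) {
       case PhysicalType::kInt32: return FormatAs<int32_t>(bytes);
       case PhysicalType::kInt64: return FormatAs<int64_t>(bytes);
       case PhysicalType::kFloat: return FormatAs<float>(bytes);
       case PhysicalType::kDouble: return FormatAs<double>(bytes);
       default: return "<" + std::to_string(bytes.size()) + " raw bytes>";
     }
   }
   ```
   
   With something like this, the result of `EncodeMin()` for an INT64 column could be passed to `FormatEncodedStat(PhysicalType::kInt64, ...)`. Casting to the typed statistics class is still the more direct route when the type is known at compile time.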





[GitHub] [arrow] romgrk-comparative commented on issue #10753: Retrieve min/max values from parquet files

Posted by GitBox <gi...@apache.org>.
romgrk-comparative commented on issue #10753:
URL: https://github.com/apache/arrow/issues/10753#issuecomment-883017339


   I'm in C++. Got it, I'll dig into the metadata. If you have any hints on how to retrieve min/max for each column, they'd be much appreciated :]





[GitHub] [arrow] westonpace commented on issue #10753: Retrieve min/max values from parquet files

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #10753:
URL: https://github.com/apache/arrow/issues/10753#issuecomment-883015751


   Yes, it is.  It's part of the Parquet metadata, which is accessible from a `parquet::ParquetFileReader`.  What language are you working with?  For Python, see https://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata





[GitHub] [arrow] romgrk-comparative commented on issue #10753: Retrieve min/max values from parquet files

Posted by GitBox <gi...@apache.org>.
romgrk-comparative commented on issue #10753:
URL: https://github.com/apache/arrow/issues/10753#issuecomment-883389589


   Yay, the joys of C++ ^^ Thanks a lot for the example, I would have spent quite a while finding that out.





[GitHub] [arrow] romgrk-comparative closed issue #10753: Retrieve min/max values from parquet files

Posted by GitBox <gi...@apache.org>.
romgrk-comparative closed issue #10753:
URL: https://github.com/apache/arrow/issues/10753


   

