Posted to issues@arrow.apache.org by "ei-grad (via GitHub)" <gi...@apache.org> on 2023/04/25 10:00:21 UTC

[GitHub] [arrow] ei-grad opened a new issue, #35331: [pyarrow] Expose `sorting_columns` in RowGroupMetaData for Parquet files

ei-grad opened a new issue, #35331:
URL: https://github.com/apache/arrow/issues/35331

   ### Describe the enhancement requested
   
   ## Summary
   
   Currently, the `pyarrow.parquet.RowGroupMetaData` class does not expose the `sorting_columns` information available in the Parquet format's `RowGroup` struct. This information is useful for users who need to understand the local sorting order of columns within each RowGroup. It would be beneficial to expose this information in the `RowGroupMetaData` class.
   
   ## Details
   
   The Parquet format includes an optional `sorting_columns` field in the `RowGroup` struct, which stores information about the sorting order of columns within the RowGroup. This information is defined in the `SortingColumn` struct in the `parquet.thrift` file:
   
   ```
   struct SortingColumn {
     1: required i32 column_idx;
     2: required bool descending;
     3: required bool nulls_first;
   }
   ```
   
   In the `RowGroup` struct, the `sorting_columns` field is defined as follows:
   
   ```
   struct RowGroup {
     1: required list<ColumnChunk> columns;
     2: required i64 total_byte_size;
     3: required i64 num_rows;
     4: optional list<SortingColumn> sorting_columns;
   }
   ```
   
   However, the `pyarrow.parquet.RowGroupMetaData` class does not expose this information. As a result, users cannot access the local sorting information of columns within RowGroups.
   
   ## Proposal
   
   I propose adding a new method or property in the `RowGroupMetaData` class to expose the `sorting_columns` information. This could be implemented as a new method, such as `get_sorting_columns()`, or as a property, such as `sorting_columns`. The output should include the column index, sorting order (ascending or descending), and whether null values appear first or last in the sorted order.
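To make the proposal concrete, here is a minimal sketch of what the property form could look like. It is pure Python and does not touch pyarrow: `SortingColumn` and `RowGroupMetaDataSketch` are hypothetical stand-ins invented for illustration, and the real class would read the values from the file footer rather than take them in a constructor.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SortingColumn:
    """Hypothetical Python mirror of the Thrift `SortingColumn` struct."""
    column_idx: int    # index of the column within the row group
    descending: bool   # True if the column is sorted in descending order
    nulls_first: bool  # True if nulls sort before non-null values


class RowGroupMetaDataSketch:
    """Stand-in for `RowGroupMetaData`; this is not the real pyarrow class."""

    def __init__(self, num_rows: int,
                 sorting_columns: Optional[Tuple[SortingColumn, ...]] = None):
        self.num_rows = num_rows
        # The Thrift field is optional, so "not recorded" becomes an empty tuple.
        self._sorting_columns = tuple(sorting_columns or ())

    @property
    def sorting_columns(self) -> Tuple[SortingColumn, ...]:
        """Sort keys in priority order; empty if the writer recorded none."""
        return self._sorting_columns


rg = RowGroupMetaDataSketch(
    num_rows=1000,
    sorting_columns=(
        SortingColumn(column_idx=0, descending=False, nulls_first=True),
        SortingColumn(column_idx=2, descending=True, nulls_first=False),
    ),
)

# Report each sort key in a human-readable form.
for sc in rg.sorting_columns:
    order = "descending" if sc.descending else "ascending"
    nulls = "first" if sc.nulls_first else "last"
    print(f"column {sc.column_idx}: {order}, nulls {nulls}")
```

A read-only property returning an immutable tuple seemed the least surprising shape here, since the metadata describes a file that has already been written. A `get_sorting_columns()` method would carry the same information.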
   
   ## Use Case
   
   Users working with sorted Parquet files can benefit from understanding the local sorting order of columns within RowGroups. This information is particularly useful when analyzing large datasets or performing operations that require knowledge of the sort order, such as range queries, filtering, or merging.
   
   By exposing the `sorting_columns` information in the `RowGroupMetaData` class, users can more easily work with sorted Parquet files and perform advanced data processing operations.
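As one illustration of the merge use case, a consumer could check whether a row group's recorded sort keys already cover the ordering it needs and skip its own sort when they do. The sketch below is an assumption-laden example, not pyarrow code: `SortKey` and `satisfies_order` are hypothetical names, and the check trusts whatever the writer recorded in the metadata.

```python
from typing import NamedTuple, Sequence


class SortKey(NamedTuple):
    """Hypothetical stand-in for one `SortingColumn` entry."""
    column_idx: int
    descending: bool
    nulls_first: bool


def satisfies_order(recorded: Sequence[SortKey],
                    wanted: Sequence[SortKey]) -> bool:
    """Return True if `wanted` is a leading prefix of the recorded sort keys.

    Data sorted by (a, b) is also sorted by (a), so a prefix match means
    the consumer does not need to re-sort this row group.
    """
    if len(wanted) > len(recorded):
        return False
    return all(r == w for r, w in zip(recorded, wanted))


recorded = [SortKey(0, False, False), SortKey(2, True, False)]

# Sorted by column 0 ascending: covered by the recorded keys.
print(satisfies_order(recorded, [SortKey(0, False, False)]))
# Sorted by column 2 alone: not covered, column 2 is only a secondary key.
print(satisfies_order(recorded, [SortKey(2, True, False)]))
```

Only a leading prefix counts as a match, because a secondary key is ordered only within runs of equal primary-key values.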
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on issue #35331: [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35331:
URL: https://github.com/apache/arrow/issues/35331#issuecomment-1525105068

   Drafted a PR, will test it later: https://github.com/apache/arrow/pull/35351/files


[GitHub] [arrow] mapleFU commented on issue #35331: [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35331:
URL: https://github.com/apache/arrow/issues/35331#issuecomment-1525074933

   I can help with the C++ part. I will finish it tonight.


[GitHub] [arrow] mapleFU commented on issue #35331: [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35331:
URL: https://github.com/apache/arrow/issues/35331#issuecomment-1529818576

   @wjones127 Hi Will, I've only solved the C++ part. Should we reuse this issue, or should I create another issue for C++/Python and use that one?


[GitHub] [arrow] jorisvandenbossche commented on issue #35331: [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35331:
URL: https://github.com/apache/arrow/issues/35331#issuecomment-1522991033

   @ei-grad thanks for opening the issue! I think that would be a welcome enhancement. 
   
   Note that this is also not yet exposed in the C++ `parquet::RowGroupMetaData` (which wraps the raw `parquet::format::RowGroup`, which is the auto-generated equivalent of the thrift struct you mentioned). So a first step would be to expose it in C++, and then the Python bindings can expose it as well.


Re: [I] [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files [arrow]

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF closed issue #35331: [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files
URL: https://github.com/apache/arrow/issues/35331


[GitHub] [arrow] wjones127 closed issue #35331: [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 closed issue #35331: [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files
URL: https://github.com/apache/arrow/issues/35331


[GitHub] [arrow] wjones127 commented on issue #35331: [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #35331:
URL: https://github.com/apache/arrow/issues/35331#issuecomment-1530129499

   Ah sorry, I didn't notice that. I think we generally want a one-to-one correspondence between issues and pull requests, so I should have created a separate sub-issue for the C++ part. I think it's fine for now if we re-use the issue, right @jorisvandenbossche?


[GitHub] [arrow] jorisvandenbossche commented on issue #35331: [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35331:
URL: https://github.com/apache/arrow/issues/35331#issuecomment-1530164832

   Yes, let's just re-use it.


[GitHub] [arrow] mapleFU commented on issue #35331: [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35331:
URL: https://github.com/apache/arrow/issues/35331#issuecomment-1530180033

   I'm not familiar with the Python part. Would you mind doing this, or pointing me to some code I can use as a reference?


Re: [I] [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files [arrow]

Posted by "judahrand (via GitHub)" <gi...@apache.org>.
judahrand commented on issue #35331:
URL: https://github.com/apache/arrow/issues/35331#issuecomment-1864151370

   Sure!


[GitHub] [arrow] wjones127 commented on issue #35331: [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #35331:
URL: https://github.com/apache/arrow/issues/35331#issuecomment-1531848979

   I can work on that soon.


Re: [I] [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35331:
URL: https://github.com/apache/arrow/issues/35331#issuecomment-1864145233

   @judahrand would you mind replying here?


Re: [I] [C++][Python] Expose `sorting_columns` in RowGroupMetaData for Parquet files [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #35331:
URL: https://github.com/apache/arrow/issues/35331#issuecomment-1864144508

   Please don't assign this to me =_=

