Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/10 18:23:04 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #4169: Add ability to specify external sort information for ParquetExec

alamb opened a new issue, #4169:
URL: https://github.com/apache/arrow-datafusion/issues/4169

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   IOx stores parquet files in a particular sort order, and then uses the fact that the data is sorted for a variety of sort-related optimizations.
   
   The new `BasicEnforcement` rule added in https://github.com/apache/arrow-datafusion/pull/4122 by @mingmwang is (correctly) deciding that, since the `ParquetExec` declares its output is not sorted, it needs to add a `SortExec`, which is unnecessary in our case and will slow performance dramatically.
   
   I think the way to avoid this is to teach DataFusion that the output of the `ParquetExec` is actually sorted (which it is), and then everything will work out.
   
   
   **Describe the solution you'd like**
   I would like a way for someone constructing a `ParquetExec` manually to be able to specify that the data is already sorted. 
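
To make the problem concrete, here is a minimal toy model of the behaviour described above. It is a sketch only: `Plan`, `declared_order`, and `enforce_ordering` are hypothetical names invented for illustration and are not DataFusion's API or the actual `BasicEnforcement` logic. It shows that a scan which declares no ordering gets wrapped in a sort, while a scan that declares a matching ordering is left alone.

```rust
// Toy model only: these types are hypothetical and are not DataFusion's API.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// A file scan that may declare the order its rows come out in.
    Scan { declared_order: Option<Vec<String>> },
    /// An explicit sort of the input on the given columns.
    Sort { columns: Vec<String>, input: Box<Plan> },
}

impl Plan {
    /// The ordering this node claims for its output, if any.
    fn output_ordering(&self) -> Option<&[String]> {
        match self {
            Plan::Scan { declared_order } => declared_order.as_deref(),
            Plan::Sort { columns, .. } => Some(columns.as_slice()),
        }
    }
}

/// Wrap `input` in a `Sort` unless its declared ordering already
/// starts with the `required` columns.
fn enforce_ordering(input: Plan, required: &[String]) -> Plan {
    let satisfied = input
        .output_ordering()
        .map_or(false, |have| have.starts_with(required));
    if satisfied {
        input
    } else {
        Plan::Sort {
            columns: required.to_vec(),
            input: Box::new(input),
        }
    }
}
```

In this model, the requested feature amounts to letting the caller set `declared_order` when constructing the scan, so that the extra `Sort` node is never added.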
   
   
   **Describe alternatives you've considered**
   It might be possible to figure out the sort order of the data from the parquet metadata, but I haven't looked into that carefully.
   
   **Additional context**
   
   As a bonus, I think at least some of the plan construction logic in IOx that adds `SortExec`s to sort the data could potentially be removed, as it is now covered by the DataFusion optimizer.
   
   See more detail at https://github.com/influxdata/influxdb_iox/pull/6108#discussion_r1019387151
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #4169: Add ability to specify external sort information for ParquetExec

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4169:
URL: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311590143

   My plan is to implement the "manual override" initially (both to unblock IOx work and to allow users to provide other externally known sort information) and to file tickets describing the "encode sort order in the parquet files" approach as a follow-on.




[GitHub] [arrow-datafusion] mingmwang commented on issue #4169: Add ability to specify external sort information for ParquetExec

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #4169:
URL: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311388859

   One question about `SortPreservingMergeExec`: does it always require the inputs to be sorted? If the inputs are not sorted,
   what kind of output stream can `SortPreservingMergeExec` generate?




[GitHub] [arrow-datafusion] alamb commented on issue #4169: Add ability to specify external sort information for ParquetExec

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4169:
URL: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311852555

   So my current proposal is:
   1. Allow users to specify the sort order of their data in ListingTableOptions (e.g. https://github.com/apache/arrow-datafusion/pull/4170)
   2. Track potentially determining this information automatically from parquet metadata in the tickets I filed: https://github.com/apache/arrow-rs/issues/3090 https://github.com/apache/arrow-datafusion/issues/4177




[GitHub] [arrow-datafusion] alamb closed issue #4169: Add ability to specify external sort information for ParquetExec

Posted by GitBox <gi...@apache.org>.
alamb closed issue #4169: Add ability to specify external sort information for ParquetExec
URL: https://github.com/apache/arrow-datafusion/issues/4169




[GitHub] [arrow-datafusion] alamb commented on issue #4169: Add ability to specify external sort information for ParquetExec

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4169:
URL: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1312095585

   PR https://github.com/apache/arrow-datafusion/pull/4170 contains the proposed implementation.




[GitHub] [arrow-datafusion] alamb commented on issue #4169: Add ability to specify external sort information for ParquetExec

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4169:
URL: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311578293

   > Is there a way we can generalize sort order over different formats?
   > I think sort order information should be quite generalizable over formats, CSV/JSON/Avro could be sorted as well.
   
   That is an excellent point -- since they all use the "ListingTable" API, perhaps that is the most appropriate place to allow users to specify externally known sorting.
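
As a rough sketch of why a table-level option generalizes across formats: if the sort order lives next to the format choice rather than inside any one format's reader, every scan can report it the same way. All names below (`TableOptions`, `SortKey`, `with_file_sort_order`, `scan_output_ordering`) are hypothetical and only illustrate the idea; they are not the actual ListingTable API.

```rust
// Hypothetical types for illustration; not the actual ListingTable API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FileFormat {
    Parquet,
    Csv,
    Json,
    Avro,
}

/// One column of an externally known sort order.
#[derive(Debug, Clone, PartialEq)]
struct SortKey {
    column: String,
    descending: bool,
}

#[derive(Debug, Clone)]
struct TableOptions {
    format: FileFormat,
    /// Ordering the user asserts for every file; `None` means unknown.
    file_sort_order: Option<Vec<SortKey>>,
}

impl TableOptions {
    fn new(format: FileFormat) -> Self {
        Self { format, file_sort_order: None }
    }

    /// Record a sort order the user knows from outside the files themselves.
    fn with_file_sort_order(mut self, order: Vec<SortKey>) -> Self {
        self.file_sort_order = Some(order);
        self
    }

    /// What a scan built from these options would report, whatever the format.
    fn scan_output_ordering(&self) -> Option<&[SortKey]> {
        self.file_sort_order.as_deref()
    }
}
```

Because the ordering is supplied by the user rather than read from the file, it must be correct: nothing in this sketch verifies it.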




[GitHub] [arrow-datafusion] crepererum commented on issue #4169: Add ability to specify external sort information for ParquetExec

Posted by GitBox <gi...@apache.org>.
crepererum commented on issue #4169:
URL: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311347788

   One alternative would be to encode the "sorted by" property into the parquet file itself. Sure that's more effort, but I kinda think that it would be nicer for the ecosystem. This metadata would be optional and solely help optimization (although if specified, it must be correct). This is very similar to statistics. 




[GitHub] [arrow-datafusion] alamb commented on issue #4169: Add ability to specify external sort information for ParquetExec

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4169:
URL: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311576873

   > One question about `SortPreservingMergeExec`: does it always require the inputs to be sorted? If the inputs are not sorted,
   > what kind of output stream can `SortPreservingMergeExec` generate?
   
   `SortPreservingMergeExec` is a fairly classic multi-column merge operator. This comment describes it well:
   
   https://github.com/apache/arrow-datafusion/blob/5883e43db6c16d3ac3616302606849abbfbc86eb/datafusion/core/src/physical_plan/sorts/sort_preserving_merge.rs#L54-L80
   
   If the inputs are not sorted, the output will be incorrect (specifically, some of the rows may be lost).
   
   The `SortPreservingMergeExec` produces a sorted output stream without re-sorting its input.
   
   One use case for this operator outside of IOx might be to implement `UNION` (which removes duplicates) of two subqueries that were already sorted.
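
The merge idea can be sketched in a few lines. This is a simplified illustration over single `i64` columns using a binary heap, not DataFusion's implementation (which works on multi-column record batches); `merge_sorted` is a name made up for this example. Only the current head of each input is compared, which is why the inputs must already be sorted.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted inputs into one sorted output without re-sorting them.
/// Only the current head of each input is ever compared.
fn merge_sorted(inputs: Vec<Vec<i64>>) -> Vec<i64> {
    let mut cursors: Vec<_> = inputs.into_iter().map(|v| v.into_iter()).collect();

    // Min-heap of (value, input index), seeded with the first row of each input.
    let mut heap = BinaryHeap::new();
    for (idx, cursor) in cursors.iter_mut().enumerate() {
        if let Some(value) = cursor.next() {
            heap.push(Reverse((value, idx)));
        }
    }

    let mut output = Vec::new();
    while let Some(Reverse((value, idx))) = heap.pop() {
        output.push(value);
        // Refill from the input that just produced a row.
        if let Some(next) = cursors[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    output
}
```

With sorted inputs such as `[1, 4, 7]`, `[2, 5]`, and `[3, 6, 8]`, the result is fully sorted. With an unsorted input such as `[3, 1]` merged with `[2]`, this sketch returns `[2, 3, 1]`: every row is still emitted, but the output is no longer sorted. How a real operator misbehaves on unsorted input depends on its implementation.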




[GitHub] [arrow-datafusion] Dandandan commented on issue #4169: Add ability to specify external sort information for ParquetExec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #4169:
URL: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311416838

   Is there a way we can generalize sort order over different formats?
   I think sort order information should be quite generalizable over formats, CSV/JSON/Avro could be sorted as well.




[GitHub] [arrow-datafusion] crepererum commented on issue #4169: Add ability to specify external sort information for ParquetExec

Posted by GitBox <gi...@apache.org>.
crepererum commented on issue #4169:
URL: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311455081

   > Is there a way we can generalize sort order over different formats?
   
   Depends on the source of this information, hence my comment. Should we place this information into the parquet file (similar to stats) or in the catalog (this is what IOx does, and it would generalize to other storage formats as well)? Both can make sense under different circumstances.




[GitHub] [arrow-datafusion] alamb commented on issue #4169: Add ability to specify external sort information for ParquetExec

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #4169:
URL: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311572149

   >  One alternative would be to encode the "sorted by" property into the parquet file itself. Sure that's more effort, but I kinda think that it would be nicer for the ecosystem. This metadata would be optional and solely help optimization (although if specified, it must be correct). This is very similar to statistics.
   
   Yes, I agree this would be nice if there was some standard way to do so. 
   
   I poked around in the format definition and it seems like there is a standard way to encode the sort order in each Row Group's metadata:
   
   There is a "SortingColumn" in the format
   https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L685-L698
   
   Which is then in the RowGroup metadata:
   https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L829-L832
   
   However, I did not find any code to read/write this in the parquet crate
   https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+SortingColumn&patternType=standard
   
   I will file some follow-on tickets to properly support this in parquet and in datafusion.
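
For reference, a rough sketch of how a reader might use this metadata once it is exposed. The `SortingColumn` fields below mirror the three fields of the Thrift struct (`column_idx`, `descending`, `nulls_first`); `RowGroupMeta` and `common_sort_order` are hypothetical stand-ins for illustration and are not part of the parquet crate.

```rust
/// Mirrors the three fields of the Thrift `SortingColumn` struct.
#[derive(Debug, Clone, PartialEq)]
struct SortingColumn {
    column_idx: i32,
    descending: bool,
    nulls_first: bool,
}

/// Hypothetical stand-in for a row group's metadata.
/// `None` means the optional `sorting_columns` list was not written.
struct RowGroupMeta {
    sorting_columns: Option<Vec<SortingColumn>>,
}

/// Returns a sort order only if every row group declares the same one.
fn common_sort_order(row_groups: &[RowGroupMeta]) -> Option<&[SortingColumn]> {
    let (first, rest) = row_groups.split_first()?;
    let order = first.sorting_columns.as_deref()?;
    rest.iter()
        .all(|rg| rg.sorting_columns.as_deref() == Some(order))
        .then_some(order)
}
```

Note that the format describes the ordering of rows within each row group, so agreement across row groups does not by itself prove that the file as a whole is globally ordered; a planner would need an additional guarantee (or a merge across row groups) before relying on it.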
   
   

