Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/11 13:38:20 UTC

[GitHub] [arrow-rs] tustvold opened a new issue #284: RecordBatch Sort Order

tustvold opened a new issue #284:
URL: https://github.com/apache/arrow-rs/issues/284


   It is often the case that a RecordBatch is sorted lexicographically on one or more columns. Knowing this allows redundant sorts to be eliminated, lookups to be made more efficient, and so on.
   
   To give a concrete use-case, within [IOx](https://github.com/influxdata/influxdb_iox/) data is compacted into sorted, read-only blocks that must periodically be merged together into new sorted, read-only blocks. The result is that we perform a lot of operations on blocks of sorted data that would benefit from being able to express that they are sorted.
   
   It should be noted that the parquet format already has a similar concept stored in its metadata (see [here](https://github.com/apache/parquet-format/blob/2e23a1168f50e83cacbbf970259a947e430ebe3a/src/main/thrift/parquet.thrift#L827)), although I've yet to find an implementation that actually makes use of it.
   
   In the short term I can work around this with IOx-specific logic, but I thought it worthwhile to start a conversation about introducing some standardised way to represent this in an arrow schema, as I imagine crates like Datafusion would also stand to benefit from it.
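
As a rough illustration of what such a representation could look like, the sketch below stores a sort order in a string-to-string metadata map of the kind an arrow schema carries. Everything named here is an assumption made for illustration: the `sort_order` key, the `SortField` struct, the text encoding and both helper functions are hypothetical and not part of arrow-rs or the arrow specification. The struct's fields loosely mirror the column, direction and null-placement information that parquet's sorting metadata records.

```rust
use std::collections::HashMap;

/// Hypothetical metadata key; not defined by any arrow specification.
pub const SORT_ORDER_KEY: &str = "sort_order";

/// Hypothetical description of one sort column.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortField {
    pub column: String,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Stores the sort order as comma-separated `name:asc|desc:first|last` entries.
/// Assumes column names contain neither ':' nor ','.
pub fn write_sort_order(metadata: &mut HashMap<String, String>, order: &[SortField]) {
    let encoded: Vec<String> = order
        .iter()
        .map(|f| {
            let direction = if f.descending { "desc" } else { "asc" };
            let nulls = if f.nulls_first { "first" } else { "last" };
            format!("{}:{}:{}", f.column, direction, nulls)
        })
        .collect();
    metadata.insert(SORT_ORDER_KEY.to_string(), encoded.join(","));
}

/// Reads the sort order back. Returns `None` if the key is absent or any
/// entry is malformed, so callers fall back to treating the data as unsorted.
pub fn read_sort_order(metadata: &HashMap<String, String>) -> Option<Vec<SortField>> {
    let raw = metadata.get(SORT_ORDER_KEY)?;
    if raw.is_empty() {
        return Some(Vec::new());
    }
    raw.split(',')
        .map(|entry| {
            let mut parts = entry.split(':');
            let column = parts.next()?.to_string();
            let descending = match parts.next()? {
                "asc" => false,
                "desc" => true,
                _ => return None,
            };
            let nulls_first = match parts.next()? {
                "first" => true,
                "last" => false,
                _ => return None,
            };
            if parts.next().is_some() {
                return None;
            }
            Some(SortField { column, descending, nulls_first })
        })
        .collect()
}
```

Treating a missing or unparseable entry as "no known order" keeps the annotation purely advisory: a reader that does not understand it simply loses the optimisation rather than producing wrong results.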
   
   There are some areas that I can see being pretty gnarly, however. Datafusion, and I imagine other systems, use a single schema to refer to a collection of RecordBatches. Some logic would therefore be needed to compute the common sort order "prefix" between the record batches. A similar but more complex issue would arise when merging schemas.
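
To make the "prefix" idea concrete, here is a minimal sketch of how the shared sort order of several batches might be computed. The `SortKey` struct and the `common_sort_prefix` function are hypothetical names introduced for illustration only; nothing here is an existing arrow-rs or Datafusion API.

```rust
/// Hypothetical description of one sort key: column, direction and
/// null placement.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortKey {
    pub column: String,
    pub descending: bool,
    pub nulls_first: bool,
}

/// Returns the longest leading run of sort keys that every batch shares.
/// An empty input, or any batch with no declared order, yields an empty
/// prefix, meaning nothing can be assumed about the collection.
pub fn common_sort_prefix(orders: &[Vec<SortKey>]) -> Vec<SortKey> {
    let (first, rest) = match orders.split_first() {
        Some(split) => split,
        None => return Vec::new(),
    };
    let mut len = first.len();
    for order in rest {
        // Count how many leading keys of this batch match the first batch.
        let shared = first
            .iter()
            .zip(order.iter())
            .take_while(|(a, b)| a == b)
            .count();
        len = len.min(shared);
    }
    first[..len].to_vec()
}
```

For example, a batch sorted on `(host, time)` and another sorted on `(host, region)` share only the prefix `(host)`. Note that this describes the order within each batch only; it says nothing about whether the batches are ordered relative to one another.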
   
   I'm not sure if this is even the right place to be raising this issue, but thought it couldn't hurt to do so :smile: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #284: RecordBatch Sort Order

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #284:
URL: https://github.com/apache/arrow-rs/issues/284#issuecomment-838894369


   I will start a thread on dev@arrow.apache.org





[GitHub] [arrow-rs] tustvold commented on issue #284: Encoding RecordBatch Sort Order

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #284:
URL: https://github.com/apache/arrow-rs/issues/284#issuecomment-1062854331


   For reference, https://github.com/apache/arrow-datafusion/pull/1776 added a notion of sortedness to DataFusion's ExecutionPlan.





[GitHub] [arrow-rs] alamb edited a comment on issue #284: RecordBatch Sort Order

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on issue #284:
URL: https://github.com/apache/arrow-rs/issues/284#issuecomment-838894369


   I will start a thread on dev@arrow.apache.org
   
   https://lists.apache.org/thread.html/r851827e166cf1bdd0197b22e2d993ea1f7fb79c911f5a34689b92ae4%40%3Cdev.arrow.apache.org%3E
   
   





[GitHub] [arrow-rs] jorgecarleitao commented on issue #284: RecordBatch Sort Order

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #284:
URL: https://github.com/apache/arrow-rs/issues/284#issuecomment-838500891


   This does not seem specific to the Rust implementation, but rather concerns the arrow format itself and how semantics are stored in the metadata. Could you raise this on the mailing list, where the arrow specification is discussed?
   

