Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/03/07 16:42:26 UTC

[GitHub] [arrow] alamb commented on issue #34451: [C++][Python] A metadata standard for sorted datasets.

alamb commented on issue #34451:
URL: https://github.com/apache/arrow/issues/34451#issuecomment-1458488544

   > this is super helpful. This is the relevant PR for datafusion: https://github.com/apache/arrow-datafusion/pull/1776. @alamb , if you have extra input it'd be nice to hear.
   
   >  Currently, the node expects you to declare which columns are sorted ahead of time and, if they aren't, it will give you bad data.
   
   Yes, I think this is the standard situation. (I have debugged many sortedness-related bugs in various past lives.)
   
   DataFusion has become considerably more sophisticated in how it tracks sortedness and removes Sorts that the metadata shows are not required. See, for example, https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/sort_enforcement.rs#L18 and https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/global_sort_selection.rs
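
   To illustrate the general idea of dropping a Sort based on metadata, here is a minimal sketch. It is not DataFusion's implementation: the `Ordering` encoding and the `sort_is_redundant` helper are hypothetical names made up for this example, and it only handles the simple case where the required ordering is a leading prefix of the ordering the input already declares.

```python
from typing import List, Tuple

# Hypothetical encoding of an ordering: one (column, descending, nulls_first)
# tuple per sort key, most significant key first.
Ordering = List[Tuple[str, bool, bool]]


def sort_is_redundant(input_ordering: Ordering, required: Ordering) -> bool:
    """Return True when the input's declared ordering already satisfies `required`.

    Assumption: an ordering is satisfied only if it is a leading prefix of the
    declared one, with identical direction and null placement on every key.
    """
    if len(required) > len(input_ordering):
        return False
    return input_ordering[: len(required)] == required


declared = [("ts", False, False), ("host", False, False)]
needed = [("ts", False, False)]
if sort_is_redundant(declared, needed):
    print("Sort can be removed")
else:
    print("Sort must be kept")
```

   Note that this only works if the declared ordering is trustworthy, which is exactly why wrongly declared sortedness produces bad data.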
   
   In terms of metadata, I recommend adding something to the Arrow standard, since sorting is so important (and doesn't really vary from system to system).
   
   Things that should be covered:
   1. where do nulls sort (first or last)
   2. ASC / DESC
   3. Any collation considerations  (ideally we would keep it as simple as possible)
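
   As a rough sketch of what such metadata might carry, the three items above could be captured per column as shown here. Everything in it is an assumption for illustration: `SortKey` and `is_sorted` are hypothetical names, not part of Arrow or DataFusion, and the `collation` field is only recorded, not applied. The checker also shows how a consumer could validate declared sortedness instead of silently trusting it.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass(frozen=True)
class SortKey:
    """Hypothetical per-column sort description (not an Arrow API)."""

    column: str
    descending: bool = False          # ASC / DESC
    nulls_first: bool = False         # where nulls sort
    collation: Optional[str] = None   # None means plain value ordering


def is_sorted(values: Sequence[Any], key: SortKey) -> bool:
    """Check that `values` (None stands for null) obey the declared ordering.

    Collation is ignored here; values are compared with Python's `<` and `>`.
    """

    def group(v: Any) -> int:
        # Nulls form their own group so they are never compared to real values.
        if v is None:
            return 0 if key.nulls_first else 2
        return 1

    for prev, cur in zip(values, values[1:]):
        gp, gc = group(prev), group(cur)
        if gp != gc:
            if gp > gc:
                return False  # a null appeared on the wrong side
            continue
        if prev is None:
            continue  # two nulls in a row are always fine
        if key.descending and prev < cur:
            return False
        if not key.descending and prev > cur:
            return False
    return True


key = SortKey(column="ts", descending=False, nulls_first=False)
print(is_sorted([1, 2, 2, None], key))
print(is_sorted([None, 1, 2], key))
```

   With the key above (ascending, nulls last), the first call reports a valid ordering and the second does not, because the null comes before the non-null values.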

