You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by "coady (via GitHub)" <gi...@apache.org> on 2023/03/04 02:45:14 UTC

[GitHub] [arrow] coady opened a new issue, #34451: [C++][Python] A metadata standard for sorted datasets.

coady opened a new issue, #34451:
URL: https://github.com/apache/arrow/issues/34451

   ### Describe the enhancement requested
   
   Split off from #34153.
   
   In order to take advantage of sorted columns, it would be necessary for arrow to standardize on a way to represent sorting in datasets/table metadata. Akin to to pandas' `index_columns`.
   
   ### Component(s)
   
   C++, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] drin commented on issue #34451: [C++][Python] A metadata standard for sorted datasets.

Posted by "drin (via GitHub)" <gi...@apache.org>.

drin commented on issue #34451:
URL: https://github.com/apache/arrow/issues/34451#issuecomment-1457213440

   I won't commit yet to proposing a particular standard, but maybe I can help guide the discussion to consider a couple ways sorted information is represented and which aspects we would like to preserve in a standard.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] drin commented on issue #34451: [C++][Python] A metadata standard for sorted datasets.

Posted by "drin (via GitHub)" <gi...@apache.org>.

drin commented on issue #34451:
URL: https://github.com/apache/arrow/issues/34451#issuecomment-1462828023

   wanted to mention that #32884 likely has some relevance 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] alamb commented on issue #34451: [C++][Python] A metadata standard for sorted datasets.

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #34451:
URL: https://github.com/apache/arrow/issues/34451#issuecomment-1458488544

> this is super helpful. This is the relevant PR for datafusion: https://github.com/apache/arrow-datafusion/pull/1776. @alamb , if you have extra input it'd be nice to hear.

> Currently, the node expects you to declare which columns are sorted ahead of time and, if they aren't, if will give you bad data.

Yes, I think this is the standard situation (I have debugged many bugs in various past lives related to sortedness)

DataFusion has gotten quite a bit more sophisticated in its sortedness handling / removing Sorts if not required based on metadata such as https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/sort_enforcement.rs#L18 and https://github.com/apache/arrow-datafusion/blob/928662bb12d915aef83abba1312392d25770f68f/datafusion/core/src/physical_optimizer/global_sort_selection.rs

In terms of metadata, I recommend adding something to the Arrow standard as sorting is so important (and doesn't really vary from system to system)

Things that should be covered:
1. where do nulls sort (first or last)
2. ASC / DESC
3. Any collation considerations (ideally we would keep it as simple as possible)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] drin commented on issue #34451: [C++][Python] A metadata standard for sorted datasets.

Posted by "drin (via GitHub)" <gi...@apache.org>.

drin commented on issue #34451:
URL: https://github.com/apache/arrow/issues/34451#issuecomment-1456970474

   In some work I'm doing I plan on putting metadata keys that are system-specific. It seems that part of the requirement here would be a way of "namespacing" metadata attributes for cooperative system design. I would also think that, at minimum, the following namespaces would always potentially co-exist:
   1. acero (or arrow, generally)
   2. substrait
   3. "the system" (top-level system managing arrow data)
   4. _some_ application (whatever is interacting with the data stored in an arrow object)
   
   With the following possible caveats:
   * (3) and (4) could be the same (maybe some large application such as duckdb that doesn't share the data with any other system).
   * (2) and (4) could be the same (maybe an application funnels all intent through a substrait producer).
   * If acero is never actively used, then conflicts with its namespace are moot (though prone to refactoring if acero is every adopted in the future)
   
   Not sure if there has been any other proposal of metadata management in arrow that should be leveraged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #34451: [C++][Python] A metadata standard for sorted datasets.

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #34451:
URL: https://github.com/apache/arrow/issues/34451#issuecomment-1457178512

   > Not sure if there has been any other proposal of metadata management in arrow that should be leveraged.
   
   [Sortedness metadata has been discussed on the ML before](https://lists.apache.org/thread/xo7337dz73mx5lmfwrfw5bx5lp5mnzcf).  The reception seemed generally favorable though no proposal was ever put forward.  [Similarly with min/max metadata](https://lists.apache.org/thread/r4v876hdprzyqgttdy0ntvoj5dpdnykg).
   
   In many systems I believe sortedness is often recorded outside of the files themselves as part of some catalog or table format (e.g. [Iceberg](https://iceberg.apache.org/spec/#sorting))
   
   As for namespacing, there is some precedent in the [spec](https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata):
   
   > The colon symbol : is to be used as a namespace separator. It can be used multiple times in a key.
   > 
   > The ARROW pattern is a reserved namespace for internal Arrow use in the custom_metadata fields. For example, ARROW:extension:name.
   
   I don't think there is a need at the moment for an Acero namespace as the only thing (so far) that Acero would be interested in would be sortedness and min/max statistics and/or index information.  All of this should be universally applicable and ideally agreed upon across all implementations and not just in Acero (though we could start there while doing initial work with the hope of making a proposal for the wider community).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #34451: [C++][Python] A metadata standard for sorted datasets.

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #34451:
URL: https://github.com/apache/arrow/issues/34451#issuecomment-1456922143

   We're pretty close in Acero to the point where we can use this for real advantage internally.  For example, there is a PR (https://github.com/apache/arrow/pull/34311) up for a more RAM-efficient aggregation if we know the data is segmented / sorted.  Currently, the node expects you to declare which columns are sorted ahead of time and, if they aren't, if will give you bad data.
   
   However, if we had a metadata standard in place then pyarrow/R (for in-memory tables) and datasets (for on-disk tables) could automatically detect this condition and apply the more efficient aggregation.  There are probably a few unknowns about how exactly that should happen (An optimization pass based on data statistics (e.g. orderedness)?  Detected on the fly while running a plan?) but getting a standard in for representing this information would be a good first step.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] drin commented on issue #34451: [C++][Python] A metadata standard for sorted datasets.

Posted by "drin (via GitHub)" <gi...@apache.org>.

drin commented on issue #34451:
URL: https://github.com/apache/arrow/issues/34451#issuecomment-1457211006

   > [Sortedness metadata has been discussed on the ML before](https://lists.apache.org/thread/xo7337dz73mx5lmfwrfw5bx5lp5mnzcf)
   
   this is super helpful. This is the relevant PR for datafusion: https://github.com/apache/arrow-datafusion/pull/1776. @alamb , if you have extra input it'd be nice to hear.
   
   
   > As for namespacing, there is some precedent in the [spec](https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata)
   
   I think this addresses the aspect I was most concerned with, though namespace reservation hadn't totally occurred to me (awkward if multiple projects have conflicting namespaces). Probably in the short term this can be assumed to not be problematic (so acero can claim an arbitrary namespace when it's reasonable to do so).
   
   
   > In many systems I believe sortedness is often recorded outside of the files themselves as part of some catalog or table format
   
   The interesting thing here seems to be that for substrait plans that reference a table (e.g. `NamedTable`), perhaps there will/should be an operator that actually just merges metadata (perhaps metadata, managed by different systems, will be maintained in separate catalogs/locations and want to be merged at some point in query execution).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org