Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/29 15:00:24 UTC

[GitHub] [arrow] rjzamora commented on pull request #7546: ARROW-8733: [C++][Dataset][Python] Expose RowGroupInfo statistics values

rjzamora commented on pull request #7546:
URL: https://github.com/apache/arrow/pull/7546#issuecomment-651177576


   Thanks for the great work here @bkietz !
   
   This is wonderful - Dask uses the min/max statistics to calculate `divisions`, so this functionality is definitely necessary.
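For context, here is a rough sketch of how per-row-group min/max values could be turned into `divisions`. It is only an illustration: the `divisions_from_stats` helper and the `(min, max)` tuple layout are hypothetical, not Dask's actual code or the API added in this PR.

```python
def divisions_from_stats(stats):
    """Derive divisions from per-row-group (min, max) pairs.

    `stats` is a list of (min, max) tuples for one column, in file order.
    Returns a list with len(stats) + 1 entries, or None when the row
    groups are not sorted, so that no divisions can be inferred.
    """
    if not stats:
        return None
    # Missing or inconsistent statistics make the result unusable.
    for lo, hi in stats:
        if lo is None or hi is None or lo > hi:
            return None
    # Each row group must start at or after the end of the previous one.
    for (_, prev_hi), (next_lo, _) in zip(stats, stats[1:]):
        if next_lo < prev_hi:
            return None
    # Lower bound of every row group, plus the upper bound of the last.
    return [lo for lo, _ in stats] + [stats[-1][1]]


sorted_stats = [(0, 9), (10, 19), (20, 29)]
overlapping_stats = [(0, 15), (10, 19)]
```

With the sorted input above the helper yields one boundary per row group plus the final maximum; with the overlapping input it gives up and returns `None`.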
   
   
*A note on other (less-critical, but useful) statistics*:
   Dask also uses the `"total_byte_size"` statistic (for the full row-group, not each column) to aggregate partitions before reading in any data.  There is also a plan to use the `"num-rows"` statistic when the user executes `len(ddf)` (to avoid loading any data).   **How difficult would it be to add/expose these additional row-group statistics?**  Again, this is much less of a “blocker” for initial integration with Dask, but these are likely things we will want to add eventually.  cc @jorisvandenbossche 
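
To make the two use cases concrete, here is a minimal sketch. The helper names, the `max_bytes` threshold, and the list-of-dicts layout are assumptions for illustration only; they are not the interface proposed in this PR.

```python
def plan_partitions(row_groups, max_bytes):
    """Greedily pack consecutive row groups into partitions.

    `row_groups` is a list of dicts carrying a "total_byte_size" entry.
    Returns a list of partitions, each a list of row-group indices.
    A row group larger than `max_bytes` still gets its own partition.
    """
    partitions, current, current_bytes = [], [], 0
    for index, row_group in enumerate(row_groups):
        size = row_group["total_byte_size"]
        # Close the current partition if this row group would overflow it.
        if current and current_bytes + size > max_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions


def total_rows(row_groups):
    """Row count from metadata alone, as `len(ddf)` could use it."""
    return sum(row_group["num-rows"] for row_group in row_groups)


example_row_groups = [
    {"num-rows": 100, "total_byte_size": 40},
    {"num-rows": 50, "total_byte_size": 40},
    {"num-rows": 25, "total_byte_size": 40},
]
```

Neither helper touches column data; both rely only on row-group-level metadata, which is why exposing these two values would be enough.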

