Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/13 13:19:38 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new issue #717: Adapt column statistics API

Dandandan opened a new issue #717:
URL: https://github.com/apache/arrow-datafusion/issues/717


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   While looking at adding support for more statistics to the Delta Lake `TableProvider` implementation, I bumped into a limitation in our statistics API.
   
   Currently, the column statistics are an `Option<Vec<ColumnStatistics>>`.
   
   https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/datasource.rs#L37
   
   So a provider has to return the statistics at the correct schema index, regardless of the order of the columns in the files.
   
   **Describe the solution you'd like**
   Either:
   * Return a `HashMap<String, ColumnStatistics>` rather than an `Option<Vec<ColumnStatistics>>`
   * Pass a `Schema` parameter to `TableProvider::statistics` so the position can be looked up.
   
   FWIW, Delta Lake / delta-rs takes the first approach, which seems straightforward to implement and use.
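
To illustrate how the two options relate, here is a minimal sketch of turning name-keyed statistics into the positional vector, given a schema. The `ColumnStatistics` and `Schema` types below are simplified, hypothetical stand-ins and not the actual DataFusion or Arrow definitions, and `align_to_schema` is an illustrative helper name rather than an existing API.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical stand-in for DataFusion's `ColumnStatistics`.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ColumnStatistics {
    pub null_count: Option<usize>,
}

/// Hypothetical stand-in for an Arrow schema: column names in table order.
pub struct Schema {
    pub field_names: Vec<String>,
}

/// Reorder name-keyed statistics into schema order.
/// Columns with no entry in the map get `ColumnStatistics::default()`,
/// i.e. "nothing known" about that column.
pub fn align_to_schema(
    by_name: &HashMap<String, ColumnStatistics>,
    schema: &Schema,
) -> Vec<ColumnStatistics> {
    schema
        .field_names
        .iter()
        .map(|name| by_name.get(name).cloned().unwrap_or_default())
        .collect()
}
```

With a helper like this, a source that stores statistics by column name could keep doing so and only convert to positional form once the schema is known.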
   
   **Describe alternatives you've considered**
   
   **Additional context**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] rdettai commented on issue #717: Adapt column statistics API

Posted by GitBox <gi...@apache.org>.
rdettai commented on issue #717:
URL: https://github.com/apache/arrow-datafusion/issues/717#issuecomment-917991869


   @Dandandan in #965 I used the schema from the `ExecutionPlan` trait and it worked fine. But I do agree that it might be better to come up with a data structure that helps assert that the `column_statistics` vector is well aligned with the schema `fields` vector (same size, same types...). I'm adding this as an item in #997, so if you want to close this for now that's fine by me 😃
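
As a rough sketch of what such an alignment-checking structure could look like: the wrapper below validates length and per-column type at construction time. All types here (`DataType`, `Field`, `ColumnStats`, `AlignedStatistics`) are simplified and hypothetical, made up for illustration; they are not the DataFusion or Arrow types, and this is not what #997 specifies.

```rust
/// Hypothetical, simplified data type tag.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Int64,
    Utf8,
}

/// Hypothetical, simplified schema field.
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

/// Hypothetical per-column statistics carrying the type they were computed for.
pub struct ColumnStats {
    pub data_type: DataType,
    pub null_count: Option<usize>,
}

/// Statistics that are known to line up with a schema's fields.
pub struct AlignedStatistics {
    columns: Vec<ColumnStats>,
}

impl AlignedStatistics {
    /// Fails if the vector length or any column type disagrees with `fields`.
    pub fn try_new(fields: &[Field], columns: Vec<ColumnStats>) -> Result<Self, String> {
        if fields.len() != columns.len() {
            return Err(format!(
                "expected {} column statistics, got {}",
                fields.len(),
                columns.len()
            ));
        }
        for (i, (field, stats)) in fields.iter().zip(&columns).enumerate() {
            if field.data_type != stats.data_type {
                return Err(format!(
                    "column {} ({}): schema type {:?} but statistics type {:?}",
                    i, field.name, field.data_type, stats.data_type
                ));
            }
        }
        Ok(Self { columns })
    }

    /// Positional lookup; safe because alignment was checked in `try_new`.
    pub fn column(&self, index: usize) -> Option<&ColumnStats> {
        self.columns.get(index)
    }
}
```

Checking once at construction means later consumers can index by position without re-validating against the schema.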








[GitHub] [arrow-datafusion] Dandandan commented on issue #717: Adapt column statistics API

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #717:
URL: https://github.com/apache/arrow-datafusion/issues/717#issuecomment-879099018


   Closing, since this can be done with the schema on the table provider instead.

