You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/27 14:40:43 UTC

[GitHub] [arrow-datafusion] thomas-k-cameron opened a new issue, #2976: /// TODO fix case where `num_rows` and `total_byte_size` are not defined (stat should be None instead of Some(0))

thomas-k-cameron opened a new issue, #2976:
URL: https://github.com/apache/arrow-datafusion/issues/2976

   **Describe the bug**
   Some fields of `Statistics` could return `Some(0)` when the value is not available.
   
   TODO comment says that some fields are supposed to be `None` instead of `Some(0)` under some circumstances but the value is set to `Some(0)` even when it is supposed to have `None`.
   
   ```rust
   /// Get all files as well as the file level summary statistics (no statistic for partition columns).
   /// If the optional `limit` is provided, includes only sufficient files.
   /// Needed to read up to `limit` number of rows.
   /// TODO fix case where `num_rows` and `total_byte_size` are not defined (stat should be None instead of Some(0))
   pub async fn get_statistics_with_limit(
       all_files: impl Stream<Item = Result<(PartitionedFile, Statistics)>>,
       file_schema: SchemaRef,
       limit: Option<usize>,
   ) -> Result<(Vec<PartitionedFile>, Statistics)>
   ```
   
   **To Reproduce**
   I haven't ran into any breaking bugs yet, however, the functions is used in a method provided by one of the method implemented on `ListingTable` struct and `Some(0)` is returned when the `options.collect_stat` is set to `false` (aka when it is not supposed to be available).
   
   And it appears that it is generating 
   
   `Statistics` is passed to a method implemented on `Arc<dyn FileFormat>` so chance of someone running into a bug is not zero.
   
   Also, `pub async fn get_statistics_with_limit` is a exported function.
   
   **Expected behavior**
   returns `None` instead of `Some(0)`.
   
   **Alternatives Considered**
   1. Turn `Some(0)` to `None`.
   However, this solution would be rather labour intensive as the field is being used by everywhere (over 200 occurrence)
   Also anyone using it cannot determine is the size is 0 or the value is just not available
   2. Assign `None` in ListingTable::async fn list_files_for_scan<'a>
   `ListingTable::async fn list_files_for_scan<'a>` is the only function  calling `get_statistics_with_limit`. It can be fixed by assigning `None` to statistics.
   However, `get_statistics_with_limit` is being exported so I didn't think it really solves the problem.
   3. Return a number that was available
   The `Statistics` struct is basically an aggregate of all `Statistics` struct passed onto the function.
   Some values may have fields that is not `None`. 
   Considering that `Statistics` is not meant to always return a accurate value, I think this could work too.
   
   **Additional context**
   Is it suppose to run get_statistics_with_limit when `options.collect_stat` is set to `false`?
   
   Thank you very much for reading my issue and please let me know if I'm getting it wrong.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org