Posted to github@arrow.apache.org by "hengfeiyang (via GitHub)" <gi...@apache.org> on 2023/09/15 16:55:36 UTC

[GitHub] [arrow-datafusion] hengfeiyang opened a new issue, #7573: Should parallel collecting statistics like infer schema?

hengfeiyang opened a new issue, #7573:
URL: https://github.com/apache/arrow-datafusion/issues/7573

   ### Is your feature request related to a problem or challenge?
   
   When I searched data on S3, I found that DataFusion fetches Parquet file metadata one file at a time, which is slow when there are many files.
   
   The code is here:
   
   https://github.com/apache/arrow-datafusion/blob/31.0.0/datafusion/core/src/datasource/listing/table.rs#L960C1-L985
   
   This code uses `StreamExt::then` on the file stream, which awaits each future before starting the next, so the files are fetched one by one.
   
   But I found something here:
   
   https://github.com/apache/arrow-datafusion/blob/31.0.0/datafusion/core/src/datasource/file_format/parquet.rs#L170-L175
   
   When fetching the schema, it uses concurrent requests:
   
   ```
   iter::map().boxed().buffered(SCHEMA_INFERENCE_CONCURRENCY)
   ```
   
   Is it possible to do the same thing here and use concurrent requests for collecting statistics?
   
   ### Describe the solution you'd like
   
   I tried changing this code:
   
   https://github.com/apache/arrow-datafusion/blob/31.0.0/datafusion/core/src/datasource/listing/table.rs#L960C1-L985
   
   ```Rust
           let files = file_list.then(|part_file| async {
                   let part_file = part_file?;
                   let statistics = if self.options.collect_stat {
                       match self.collected_statistics.get(&part_file.object_meta) {
                           Some(statistics) => statistics,
                           None => {
                               let statistics = self
                                   .options
                                   .format
                                   .infer_stats(
                                       ctx,
                                       &store,
                                       self.file_schema.clone(),
                                       &part_file.object_meta,
                                   )
                                   .await?;
                               self.collected_statistics
                                   .save(part_file.object_meta.clone(), statistics.clone());
                               statistics
                           }
                       }
                   } else {
                       Statistics::default()
                   };
                   Ok((part_file, statistics)) as Result<(PartitionedFile, Statistics)>
               });
   ```
   
   To this:
   
   ```Rust
             let files = file_list.map(|part_file| async {
                   let part_file = part_file?;
                   let statistics = if self.options.collect_stat {
                       match self.collected_statistics.get(&part_file.object_meta) {
                           Some(statistics) => statistics,
                           None => {
                               let statistics = self
                                   .options
                                   .format
                                   .infer_stats(
                                       ctx,
                                       &store,
                                       self.file_schema.clone(),
                                       &part_file.object_meta,
                                   )
                                   .await?;
                               self.collected_statistics
                                   .save(part_file.object_meta.clone(), statistics.clone());
                               statistics
                           }
                       }
                   } else {
                       Statistics::default()
                   };
                   Ok((part_file, statistics)) as Result<(PartitionedFile, Statistics)>
               })
               .boxed()
               .buffered(COLLECT_STATISTICS_CONCURRENCY);
   ```
   
   And define a constant:
   
   ```Rust
   const COLLECT_STATISTICS_CONCURRENCY: usize = 32;
   ```
   
   Search is much faster in my local environment because statistics are now collected by fetching Parquet files concurrently; previously the files were requested one by one.
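   
   As a rough illustration of why bounding the number of in-flight requests helps, here is a minimal sketch using only the standard library. It is a thread-based analogue, not the `futures` stream API used in the snippets above: `fetch_stats`, `collect_sequential`, and `collect_bounded` are hypothetical names, and the sleep stands in for an object-store round trip. Unlike `buffered`, which keeps a sliding window of pending futures, this sketch works in fixed batches, so each batch waits for its slowest request.
   
   ```rust
   use std::thread;
   use std::time::Duration;
   
   /// Hypothetical stand-in for fetching one file's statistics from object storage.
   fn fetch_stats(file_id: u64) -> u64 {
       thread::sleep(Duration::from_millis(20)); // simulated network round trip
       file_id * 10 // pretend this is a row count
   }
   
   /// One request at a time, comparable to the `then` version.
   fn collect_sequential(files: &[u64]) -> Vec<u64> {
       files.iter().map(|&f| fetch_stats(f)).collect()
   }
   
   /// At most `concurrency` requests in flight; results stay in input order.
   fn collect_bounded(files: &[u64], concurrency: usize) -> Vec<u64> {
       let mut out = Vec::with_capacity(files.len());
       // `max(1)` guards against a zero batch size, which `chunks` rejects.
       for batch in files.chunks(concurrency.max(1)) {
           thread::scope(|s| {
               // Start every request in the batch before waiting on any of them.
               let handles: Vec<_> = batch
                   .iter()
                   .map(|&f| s.spawn(move || fetch_stats(f)))
                   .collect();
               // Joining in spawn order preserves the input order.
               for h in handles {
                   out.push(h.join().expect("fetch thread panicked"));
               }
           });
       }
       out
   }
   ```
   
   With 8 files and a concurrency of 4, `collect_bounded(&files, 4)` needs roughly two simulated round trips instead of eight, and it returns the same values in the same order as `collect_sequential(&files)`.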
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #7573: Should parallel collecting statistics like infer schema?

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #7573:
URL: https://github.com/apache/arrow-datafusion/issues/7573#issuecomment-1724504600

   Hi @hengfeiyang -- this sounds like a great idea to me
   
   Rather than hard-coding the concurrency, perhaps you could add a config parameter, something like `ExecutionOptions::meta_fetch_concurrency`, following the model of https://docs.rs/datafusion/latest/datafusion/config/struct.ExecutionOptions.html#structfield.planning_concurrency
   
   The rationale for a configuration setting is that the optimal value will likely depend on the network configuration of the system, so there is no good constant that would work in all cases. 
   
   Perhaps it could default to 10 or 32?
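   
   A minimal sketch of what such a setting could look like, assuming a standalone options struct. `FetchOptions` and `with_meta_fetch_concurrency` are hypothetical names used only for illustration; this is not DataFusion's actual `ExecutionOptions`, and the default of 32 is just one of the values floated above.
   
   ```rust
   /// Hypothetical options struct; not DataFusion's real `ExecutionOptions`.
   #[derive(Debug, Clone, PartialEq)]
   struct FetchOptions {
       /// Maximum number of metadata requests in flight at once.
       meta_fetch_concurrency: usize,
   }
   
   impl Default for FetchOptions {
       fn default() -> Self {
           // Assumed default, taken from the values suggested in this thread.
           Self { meta_fetch_concurrency: 32 }
       }
   }
   
   impl FetchOptions {
       /// Override the concurrency; zero is clamped to 1 so work can still progress.
       fn with_meta_fetch_concurrency(mut self, n: usize) -> Self {
           self.meta_fetch_concurrency = n.max(1);
           self
       }
   }
   ```
   
   A caller on a slow or rate-limited network could then lower the value with `FetchOptions::default().with_meta_fetch_concurrency(10)` instead of recompiling with a different constant.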
   
   cc @Ted-Jiang, who I think was working on some other settings to cache statistics for multiple queries in the same session. See https://github.com/apache/arrow-datafusion/pull/7570
   




Re: [I] Should parallel collecting statistics like infer schema? [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #7573: Should parallel collecting statistics like infer schema?
URL: https://github.com/apache/arrow-datafusion/issues/7573




Re: [I] Should parallel collecting statistics like infer schema? [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #7573:
URL: https://github.com/apache/arrow-datafusion/issues/7573#issuecomment-1828505792

   Closed in https://github.com/apache/arrow-datafusion/pull/7595




[GitHub] [arrow-datafusion] alamb commented on issue #7573: Should parallel collecting statistics like infer schema?

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #7573:
URL: https://github.com/apache/arrow-datafusion/issues/7573#issuecomment-1724686126

   Thank you @hengfeiyang  -- most appreciated




[GitHub] [arrow-datafusion] alamb commented on issue #7573: Should parallel collecting statistics like infer schema?

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #7573:
URL: https://github.com/apache/arrow-datafusion/issues/7573#issuecomment-1724671474

   @Ted-Jiang's PR is merged, so this change would need a follow-on PR




[GitHub] [arrow-datafusion] hengfeiyang commented on issue #7573: Should parallel collecting statistics like infer schema?

Posted by "hengfeiyang (via GitHub)" <gi...@apache.org>.
hengfeiyang commented on issue #7573:
URL: https://github.com/apache/arrow-datafusion/issues/7573#issuecomment-1724669411

   @Ted-Jiang So you will improve this part in your PR, and I don't need to create a PR for this, right?




[GitHub] [arrow-datafusion] hengfeiyang commented on issue #7573: Should parallel collecting statistics like infer schema?

Posted by "hengfeiyang (via GitHub)" <gi...@apache.org>.
hengfeiyang commented on issue #7573:
URL: https://github.com/apache/arrow-datafusion/issues/7573#issuecomment-1724646981

   @alamb You are right, we should add an option for it. I just copied the constant `32` from here: https://github.com/apache/arrow-datafusion/blob/31.0.0/datafusion/core/src/datasource/file_format/parquet.rs#L68
   
   Maybe we should change both.




[GitHub] [arrow-datafusion] hengfeiyang commented on issue #7573: Should parallel collecting statistics like infer schema?

Posted by "hengfeiyang (via GitHub)" <gi...@apache.org>.
hengfeiyang commented on issue #7573:
URL: https://github.com/apache/arrow-datafusion/issues/7573#issuecomment-1724677808

   @alamb Okay, I will do it. Thanks.

