Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/04 19:14:08 UTC

[GitHub] [arrow] Dandandan commented on pull request #9086: [Rust] [DataFusion] [Experiment] Blocking threads filter

Dandandan commented on pull request #9086:
URL: https://github.com/apache/arrow/pull/9086#issuecomment-754162220


   @jorgecarleitao 
   
   This is really cool, thanks for creating this experiment!
   I am not very deep into the Rust way of doing parallelism yet, but the tokio documentation makes sense to me.
   
   Some ideas:
   
   * In general, I think it is best if the parallelism is at as high a level as possible, to reduce the overhead of scheduling, context switching, etc.
   * But in order to utilize parallelism best it should be fine-grained enough.
   * I think there is a balance to strike in how much control you have over parallelism. Spark's concurrency via partitions is, I think, an example where you have a larger amount of control over it, but it is not always fine-grained enough, e.g. if you have only one or a couple of files as input.
   * I think filtering batches is relatively fine-grained, so I am wondering if this is a good level for parallelism.
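
To make the granularity trade-off a bit more concrete, here is a rough sketch of the coarser alternative: each thread gets a contiguous group of whole batches, so the scheduling cost is paid once per thread instead of once per batch. This is only an illustration and not DataFusion code. `filter_batches_coarse` is a hypothetical name, `Vec<i64>` stands in for an Arrow record batch, and it uses only the standard library (`std::thread::scope`, Rust 1.63+).

```rust
use std::thread;

/// Hypothetical helper: filters every batch with `keep`, using at most
/// `max_threads` OS threads. `Vec<i64>` is a stand-in for a record batch.
fn filter_batches_coarse(
    batches: &[Vec<i64>],
    max_threads: usize,
    keep: fn(&i64) -> bool,
) -> Vec<Vec<i64>> {
    if batches.is_empty() {
        return Vec::new();
    }
    // Never use more threads than there are batches, and always at least one.
    let threads = max_threads.clamp(1, batches.len());
    // Ceiling division: number of whole batches handed to each thread.
    let per_thread = (batches.len() + threads - 1) / threads;

    thread::scope(|scope| {
        // Spawn one scoped thread per group of batches. Collecting the
        // handles first ensures all threads are started before any join.
        let handles: Vec<_> = batches
            .chunks(per_thread)
            .map(|group| {
                scope.spawn(move || {
                    group
                        .iter()
                        .map(|batch| {
                            batch
                                .iter()
                                .copied()
                                .filter(|v| keep(v))
                                .collect::<Vec<i64>>()
                        })
                        .collect::<Vec<_>>()
                })
            })
            .collect();

        // Join in spawn order so the output keeps the input batch order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("filter thread panicked"))
            .collect()
    })
}
```

With one or two input batches this degenerates to one or two threads, which is the "not fine-grained enough" case mentioned above.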
   
   * Tokio's default for `max_blocking_threads` is 512 (https://docs.rs/tokio/1.0.1/tokio/runtime/struct.Builder.html#method.max_blocking_threads). I think this is very large for CPU-intensive work and would have a negative effect on performance. Maybe, if we use different "scopes", it makes sense to use a separate runtime for CPU-intensive work, with a different `max_blocking_threads` setting?
   * Tokio's documentation seems to hint that Rayon would be a better choice for CPU intensive work?
   * `ParquetExec` uses `thread::spawn`. `task::spawn_blocking` seems a better choice there, as I guess it handles errors in a better way and, unlike `thread::spawn`, can limit the number of threads?
   
   * I think, just like the statistics @andygrove started to add for `Exec`s, it would be good to have something here as well, to debug issues and make sure we are not doing things inefficiently.

