You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/13 04:43:07 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

jorgecarleitao opened a new pull request #8453:
URL: https://github.com/apache/arrow/pull/8453


   Currently, `mergeExec` uses `tokio::spawn` to parallelize the work, by calling `tokio::spawn` once per logical thread. However, `tokio::spawn` returns a task / future, which `tokio` runtime will then schedule on its thread pool.
   
   Therefore, there is no need to limit the number of tasks to the number of logical threads, as tokio's runtime itself is responsible for that work. In particular, since we are using [`rt-threaded`](https://docs.rs/tokio/0.2.22/tokio/runtime/index.html#threaded-scheduler), tokio already declares a thread pool from the number of logical threads available.
   
   This PR removes the coupling, in `mergeExec`, between the number of logical threads (`max_concurrency`) and the number of created tasks. I observe no change in performance:
   
   <details>
    <summary>Benchmark results</summary>
   
   ```
   Switched to branch 'simplify_merge'
   Your branch is up to date with 'origin/simplify_merge'.
      Compiling datafusion v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/datafusion)
       Finished bench [optimized] target(s) in 38.02s
        Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/aggregate_query_sql-5241a705a1ff29ae
   Gnuplot not found, using plotters backend
   aggregate_query_no_group_by 15 12                                                                            
                           time:   [715.17 us 722.60 us 730.19 us]
                           change: [-8.3167% -5.2253% -2.2675%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     1 (1.00%) high mild
     2 (2.00%) high severe
   
   aggregate_query_group_by 15 12                                                                            
                           time:   [5.6538 ms 5.6695 ms 5.6892 ms]
                           change: [+0.1012% +0.5308% +0.9913%] (p = 0.02 < 0.05)
                           Change within noise threshold.
   Found 10 outliers among 100 measurements (10.00%)
     4 (4.00%) high mild
     6 (6.00%) high severe
   
   aggregate_query_group_by_with_filter 15 12                                                                             
                           time:   [2.6598 ms 2.6665 ms 2.6751 ms]
                           change: [-0.5532% -0.1446% +0.2679%] (p = 0.51 > 0.05)
                           No change in performance detected.
   Found 7 outliers among 100 measurements (7.00%)
     3 (3.00%) high mild
     4 (4.00%) high severe
   ```
   
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove edited a comment on pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
andygrove edited a comment on pull request #8453:
URL: https://github.com/apache/arrow/pull/8453#issuecomment-707783648


   This looks great, but how do we control the concurrency now? The `concurrency` attribute in `ExecutionConfig` is now effectively unused. I think we should either create the tokio thread pool explicitly using the `ExecutionConfig.concurrency` value or remove this config entirely since it is no longer used.
   
   Without being able to control the concurrency level it will be difficult (but not impossible) to run scalability benchmarks to see how we scale with an increasing concurrency level.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8453:
URL: https://github.com/apache/arrow/pull/8453#issuecomment-707906851


   @andygrove , just to understand, we should avoid merging to master until the release, right? (I have this feeling, but I am not super sure xD)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8453:
URL: https://github.com/apache/arrow/pull/8453#issuecomment-707792251


   I wonder if the scope of this is too much. I really like this change and would like to see if merged. Perhaps we just remove the concurrency config and let users of DataFusion create their own tokio runtimes?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8453:
URL: https://github.com/apache/arrow/pull/8453#issuecomment-707793695


   It would make sense to create the tokio runtime specifically in the benchmark crate and in the CLI but for other cases, we are being called as a library and should assume that the user has created the tokio runtime, I think.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8453:
URL: https://github.com/apache/arrow/pull/8453#issuecomment-707790241


   Ah, I see the use-case now.
   
   Ok, I think that [tokio's recommendation is](https://docs.rs/tokio/0.2.22/tokio/runtime/struct.Runtime.html):
   
   > Instances of Runtime can be created using new or Builder. However, most users will use the #[tokio::main] annotation on their entry point instead.
   
   Thus, IMO we should remove the `#[tokio:main]` and initialize the runtime using its associated [builder](https://docs.rs/tokio/0.2.22/tokio/runtime/struct.Builder.html), to configure the size of the thread pool, whenever we want to expose the concurrency configuration.
   
   Note that in the benchmarks, we can even specify different concurrency parameter depending on the benchmark (in case we want to perform some scaling).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8453:
URL: https://github.com/apache/arrow/pull/8453#issuecomment-707914506


   @jorgecarleitao I see no reason to wait before merging this (unless you want to remove the `concurrency` attribute from `ExecutionConfig` as part of this PR since it no longer has any effect).
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8453:
URL: https://github.com/apache/arrow/pull/8453#issuecomment-707917466


   I think there are _some_ things we would want to hold off on merging, like
   changing the Rust nightly version for example, or anything else we feel
   would be risky at the last minute.
   
   On Tue, Oct 13, 2020 at 12:07 PM Jorge Leitao <no...@github.com>
   wrote:
   
   > Ah, ok, I mistakenly understood that we were in a "feature freeze" until
   > the release (which is why I have not been merging stuff these days).
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/arrow/pull/8453#issuecomment-707916680>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAHEBRDSPNVFQAQLRA2BPVTSKSJO5ANCNFSM4SOBNGPA>
   > .
   >
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8453:
URL: https://github.com/apache/arrow/pull/8453#issuecomment-708015832


   > Note that in the benchmarks, we can even specify different concurrency parameter depending on the benchmark (in case we want to perform some scaling).
   
   
   For what it is worth, one thing I plan to do as part of my internal project, is to limit "concurrent CPU bound tasks" by using a `tokio::sync::mpsc::Channel` as a way to limit the maximum number of concurrent tasks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8453:
URL: https://github.com/apache/arrow/pull/8453#issuecomment-707483047


   Note that there is an important note in [Tokio's documentation](https://docs.rs/tokio/0.2.22/tokio/index.html#cpu-bound-tasks-and-blocking-code):
   
   > If your code is CPU-bound and you wish to limit the number of threads used to run it, you should run it on another thread pool such as rayon. You can use an oneshot channel to send the result back to Tokio when the rayon task finishes.
   
   I.e. Tokio's `spawn` is suitable for non-blocking, non-CPU-bound tasks. This may explain why we do not see a significant performance increase when using `tokio::spawn`: most of our tasks are CPU-bounded.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8453:
URL: https://github.com/apache/arrow/pull/8453#issuecomment-707783648


   This looks great, but how do we control the concurrency now? The `concurrency` attribute in `ExecutionConfig` is now effectively unused.  think we should either create the tokio thread pool explicitly using the `ExecutionConfig.concurrency` value or remove this config entirely since it is no longer used.
   
   Without being able to control the concurrency level it will be difficult (but not impossible) to run scalability benchmarks to see how we scale with an increasing concurrency level.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8453:
URL: https://github.com/apache/arrow/pull/8453#issuecomment-707483947


   https://issues.apache.org/jira/browse/ARROW-10292


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao closed pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #8453:
URL: https://github.com/apache/arrow/pull/8453


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8453: ARROW-10292: [Rust] [DataFusion] Simplify merge

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8453:
URL: https://github.com/apache/arrow/pull/8453#issuecomment-707916680


   Ah, ok, I mistakenly understood that we were in a "feature freeze" until the release (which is why I have not been merging stuff these days).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org