You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/30 10:50:12 UTC

[GitHub] [arrow] alamb opened a new pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

alamb opened a new pull request #8079:
URL: https://github.com/apache/arrow/pull/8079


   This is a continuation of the work started in https://github.com/apache/arrow/pull/8031 to clean up the `ExecutionContext` interface
   
   Specifically, the `ExecutionContext` is used during *planning time* (not *execution time*) and thus using a `Mutex/Arc` seemed to make the code more complicated as well as suggest that they could be shared by multiple threads (which they are not). 
   
   This PR also removes several deep copies of structures, which helps readability (and likely performance in some tiny amount)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove closed pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #8079:
URL: https://github.com/apache/arrow/pull/8079


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8079:
URL: https://github.com/apache/arrow/pull/8079#discussion_r479754097



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -255,26 +253,17 @@ impl ExecutionContext {
 
     /// Register a table using a custom TableProvider so that it can be referenced from SQL
     /// statements executed against this context.
-    pub fn register_table(
-        &mut self,
-        name: &str,
-        provider: Box<dyn TableProvider + Send + Sync>,
-    ) {
-        let mut ctx_state = self.state.lock().expect("failed to lock mutex");
-        ctx_state.datasources.insert(name.to_string(), provider);
+    pub fn register_table(&mut self, name: &str, provider: Box<dyn TableProvider>) {

Review comment:
       The table provider is used during single-thread planning (not multi-threaded execution) and thus there is no reason for the table provider to be `Send` and `Sync` anymore

##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -94,7 +96,7 @@ use crate::sql::{
 /// ```
 pub struct ExecutionContext {
     /// Internal state for the context
-    pub state: Arc<Mutex<ExecutionContextState>>,

Review comment:
       The main aim of this PR is to remove this `Arc<Mutex<>>` -- the rest of the PR is a result of that change




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8079:
URL: https://github.com/apache/arrow/pull/8079#discussion_r480041532



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -255,26 +253,17 @@ impl ExecutionContext {
 
     /// Register a table using a custom TableProvider so that it can be referenced from SQL
     /// statements executed against this context.
-    pub fn register_table(
-        &mut self,
-        name: &str,
-        provider: Box<dyn TableProvider + Send + Sync>,
-    ) {
-        let mut ctx_state = self.state.lock().expect("failed to lock mutex");
-        ctx_state.datasources.insert(name.to_string(), provider);
+    pub fn register_table(&mut self, name: &str, provider: Box<dyn TableProvider>) {

Review comment:
       I tried to capture the desire for ExecutionContext to be shared across threads in this ticket:  https://issues.apache.org/jira/browse/ARROW-9888 (and it has a test that we could add to ensure that execution context remains shareable across threads, if we want to go that route)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8079:
URL: https://github.com/apache/arrow/pull/8079#discussion_r480045512



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -255,26 +253,17 @@ impl ExecutionContext {
 
     /// Register a table using a custom TableProvider so that it can be referenced from SQL
     /// statements executed against this context.
-    pub fn register_table(
-        &mut self,
-        name: &str,
-        provider: Box<dyn TableProvider + Send + Sync>,
-    ) {
-        let mut ctx_state = self.state.lock().expect("failed to lock mutex");
-        ctx_state.datasources.insert(name.to_string(), provider);
+    pub fn register_table(&mut self, name: &str, provider: Box<dyn TableProvider>) {

Review comment:
       I'll get a PR up to fix the thread safety issue shortly




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8079:
URL: https://github.com/apache/arrow/pull/8079#issuecomment-684735292


   > @alamb I think this is next in line to be merged, but needs a rebase.
   
   Done!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #8079:
URL: https://github.com/apache/arrow/pull/8079#discussion_r479872161



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -255,26 +253,17 @@ impl ExecutionContext {
 
     /// Register a table using a custom TableProvider so that it can be referenced from SQL
     /// statements executed against this context.
-    pub fn register_table(
-        &mut self,
-        name: &str,
-        provider: Box<dyn TableProvider + Send + Sync>,
-    ) {
-        let mut ctx_state = self.state.lock().expect("failed to lock mutex");
-        ctx_state.datasources.insert(name.to_string(), provider);
+    pub fn register_table(&mut self, name: &str, provider: Box<dyn TableProvider>) {

Review comment:
       I was not able to find time to look at this. @jorgecarleitao Could you summarize the use case for sharing an ExecutionContext betweeen threads?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #8079:
URL: https://github.com/apache/arrow/pull/8079#discussion_r479790490



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -255,26 +253,17 @@ impl ExecutionContext {
 
     /// Register a table using a custom TableProvider so that it can be referenced from SQL
     /// statements executed against this context.
-    pub fn register_table(
-        &mut self,
-        name: &str,
-        provider: Box<dyn TableProvider + Send + Sync>,
-    ) {
-        let mut ctx_state = self.state.lock().expect("failed to lock mutex");
-        ctx_state.datasources.insert(name.to_string(), provider);
+    pub fn register_table(&mut self, name: &str, provider: Box<dyn TableProvider>) {

Review comment:
       I agree that `TableProvider` no longer needs `Send` and `Sync` now that it is just used during planning and no longer performs execution.
   
   I don't think changing this will affect whether ExecutionContext can be sent between threads but I don't have time today to actually confirm that.
   
   If we do want to be sure that `ExecutionContext` can be shared between threads, we should add a unit test for that.
   
   Hopefully, I'll be able to find some time this evening to look at this some more.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8079:
URL: https://github.com/apache/arrow/pull/8079#discussion_r479779945



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -255,26 +253,17 @@ impl ExecutionContext {
 
     /// Register a table using a custom TableProvider so that it can be referenced from SQL
     /// statements executed against this context.
-    pub fn register_table(
-        &mut self,
-        name: &str,
-        provider: Box<dyn TableProvider + Send + Sync>,
-    ) {
-        let mut ctx_state = self.state.lock().expect("failed to lock mutex");
-        ctx_state.datasources.insert(name.to_string(), provider);
+    pub fn register_table(&mut self, name: &str, provider: Box<dyn TableProvider>) {

Review comment:
       I think that we aimed for `TableProvider` to be `Send + Sync` on ARROW-9425 , so that others could use `ExecutionContext` on multi-threaded environments such as Python.
   
   See e.g. the usage [here](https://github.com/jorgecarleitao/datafusion-python/blob/master/src/lib.rs#L57). I think that that code only compiles if `ExecutionContext` is sharable between threads.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8079:
URL: https://github.com/apache/arrow/pull/8079#discussion_r480060296



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -255,26 +253,17 @@ impl ExecutionContext {
 
     /// Register a table using a custom TableProvider so that it can be referenced from SQL
     /// statements executed against this context.
-    pub fn register_table(
-        &mut self,
-        name: &str,
-        provider: Box<dyn TableProvider + Send + Sync>,
-    ) {
-        let mut ctx_state = self.state.lock().expect("failed to lock mutex");
-        ctx_state.datasources.insert(name.to_string(), provider);
+    pub fn register_table(&mut self, name: &str, provider: Box<dyn TableProvider>) {

Review comment:
       https://github.com/apache/arrow/pull/8082/




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8079:
URL: https://github.com/apache/arrow/pull/8079#discussion_r479896964



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -255,26 +253,17 @@ impl ExecutionContext {
 
     /// Register a table using a custom TableProvider so that it can be referenced from SQL
     /// statements executed against this context.
-    pub fn register_table(
-        &mut self,
-        name: &str,
-        provider: Box<dyn TableProvider + Send + Sync>,
-    ) {
-        let mut ctx_state = self.state.lock().expect("failed to lock mutex");
-        ctx_state.datasources.insert(name.to_string(), provider);
+    pub fn register_table(&mut self, name: &str, provider: Box<dyn TableProvider>) {

Review comment:
       TL;DR: let's ship this, as master is already not thread safe.
   
   The two use-cases I though at the time:
   
   1. when a project is planning a complex number of plans that depend on a common set of sources and UDFs, it would be nice to be able to multi-thread the planning. This is particularly important when planning requires reading remote metadata to formulate themselves (e.g. when the source is in `s3` with many partitions). Metadata reading is often slow and network bounded, which makes threads suitable for these workloads. If multi-threading is not possible, either each plan needs to read the metadata independently (one context per plan) or planning must be sequential (with lots of network waiting).
   
   2. when creating bindings to programming languages that support multi-threading, it would be nice for the `ExecutionContext` to be thread safe, so that we can more easily integrate with those languages.
   
   The reason for 2 is described in [`pyo3`'s documentation](https://pyo3.rs/master/class.html#defining-a-new-class):
   
   > Because Python objects are freely shared between threads by the Python interpreter, all structs annotated with #[pyclass] must implement Send
   
   Thread safety is not a strict requirement, though `pyo3` offers the option to `panic!` if the instance is used across threads. Not a great user experience, but definitely better than undefined behavior :)
   
   The code below works in DataFusion 1.0.0, and no longer works in master:
   
   ```
   #[pyclass]
   struct ExecutionContext {
       ctx: datafusion::execution::context::ExecutionContext,
   }
   ```
   
   > `(dyn datafusion::execution::physical_plan::PhysicalPlanner + 'static)` cannot be shared between threads safely
   
   Since this is already an issue on master, I think that we just ship this, that is a great simplification.
   
   However, I think that we would enjoy some clarity of our stance wrt to thread safety of the `ExecutionContext`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8079:
URL: https://github.com/apache/arrow/pull/8079#issuecomment-683406785


   https://issues.apache.org/jira/browse/ARROW-9815


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8079: ARROW-9815: [Rust][DataFusion] Remove the use of Arc/Mutex to protect plan time structures

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8079:
URL: https://github.com/apache/arrow/pull/8079#issuecomment-684154323


   @alamb I think this is next in line to be merged, but needs a rebase.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org