You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/20 19:55:19 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by locking access to the scalar registry.

jorgecarleitao opened a new pull request #8018:
URL: https://github.com/apache/arrow/pull/8018


   @andygrove and @alamb , I have no formal training in thread and mutex management, so I am not certain about this proposal or the following explanation:
   
   My understanding is that because the result of
   
   ```
   ctx_state
                   .lock()
                   .expect("failed to lock mutex")
                   .scalar_functions
                   .lock()
                   .expect("failed to lock mutex")
                   .get(name)
   ```
   
   is of temporary lifetime, using this in `match` blocks any access to `scalar_functions` until we leave the match, which deadlocks when we recursively call the function. Here I just cloned `.scalar_functions` so that we allow the lock to be released.
   
   I may also be dead wrong in every word that I wrote above.
   
   This does work, but if you could validate my reasoning above, I would appreciate very much!
   
   Note that we are also doing the same for `.datasources` in this file, which I suspect will also deadlock if when we have a plan with two sources. I did not touch that as I do not know the idiom/pattern to address this (locking within recursions).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#issuecomment-678428145


   You agree, I agree, we all agree!!! 🎉🎉🎉🎉 I will re-visit this and re-submit a PR with this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#issuecomment-677894563


   I left a second commit, which is *in alternative* to the first one. The main difference between them is that the latter does not lock mutexes inside the recursion, which often helps avoiding deadlocks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#issuecomment-678281123


   I had another thought on this. We do need the `ExecutionContextState` to be mutable, but only up until execution begins. Perhaps we should rename `ExecutionContextState` to `MutableExecutionContextState` and create a new `ExecutionContextState` that does not have mutexes and just contains a read-only copy of state that is needed during execution?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#issuecomment-678609613


   Hey both, I think I implemented what we discussed, but the devil is always in the details.
   
   The main changes are:
   
   * `ExecutionContextState::datasources` and `ExecutionContextState::scalar_functions` are now boxed, not `Arc`. This is so that when we lock `ExecutionContextState`, both get locked, which means that we only need to lock `ExecutionContextState` (but threads can't change each independently).
   * `ExecutionContextState` is no longer clonable.
   * optimizers now expect references, not `Arc<Mutex<_>>`.
   * `SqlToRel` is now lifetimed, as it is now bound to `ExecutionContextState`.
   
   I feel that we should probably have a test to demonstrate the that `ExecutionContext` is thread safe (e.g. run two queries from two different threads).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#issuecomment-678406791


   That definitely sounds cleaner to me
   
   On Fri, Aug 21, 2020 at 12:07 PM Andy Grove <no...@github.com>
   wrote:
   
   > @alamb <https://github.com/alamb> I think you are right. Rather than
   > having Arc<Mutex<_>> within ExecutionContextState, we should have
   > Arc<Mutex<ExecutionContextState>> where we need mutability and then
   > clone() the state to get a read-only snapshot when we need it for
   > execution.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/arrow/pull/8018#issuecomment-678369613>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AADXZMNL3AKJJYNL7NO7ZLLSB2LTTANCNFSM4QGRFXAQ>
   > .
   >
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#issuecomment-677958379


   > I left a second commit, which is _in alternative_ to the first one. The main difference between them is that the latter does not lock mutexes inside the recursion, which often helps avoiding deadlocks.
   
   I found reviewing with `w=1` easier for the second commit -- link https://github.com/apache/arrow/pull/8018/commits/2ff23f67e3d9de9419faea89d7702af449d82791?w=1


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#issuecomment-678369613


   @alamb I think you are right. Rather than having `Arc<Mutex<_>>` within `ExecutionContextState`, we should have `Arc<Mutex<ExecutionContextState>>` where we need mutability and then `clone()` the state to get a read-only snapshot when we need it for execution.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#issuecomment-678333145


   > I had another thought on this. We do need the `ExecutionContextState` to be mutable, but only up until execution begins. Perhaps we should rename `ExecutionContextState` to `MutableExecutionContextState` and create a new `ExecutionContextState` that does not have mutexes and just contains a read-only copy of state that is needed during execution?
   
   @andygrove  that sounds like a better idea to me. What I don't get is why a separate type would be needed to differentiate between mutable and non mutable versions of state -- the Rust type system has the notion of mutable/non-mutable built in deeply.
   
   So maybe, for example, anything that needs a mutable `ExecutionContextState` could be passed a `&mut ExecutionContextState` (the conversion to physical plan, for example), and anything during execution that needs a read only `ExecutionContextState` could be passed an `Arc<ExecutionContextState`.
   
   I am probably missing something obvious


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#discussion_r474329092



##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -201,6 +201,18 @@ fn csv_query_avg_sqrt() -> Result<()> {
     Ok(())
 }
 
+#[test]
+fn csv_query_sqrt_sqrt() -> Result<()> {

Review comment:
       ```suggestion
   // ARROW-9851: This query deadlocked when processing a
   // udf that called another udf
   fn csv_query_sqrt_sqrt() -> Result<()> {
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by locking access to the scalar registry.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#issuecomment-677876253


   https://issues.apache.org/jira/browse/ARROW-9815


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#issuecomment-678032001


   Thank you both for the help.
   
   IMO having this in pure functions make it more explicit about the functionality's dependency on "self" (via the arguments it needs). For example, it is now clear that `create_physical_expr` and `create_aggregate_expr` only need the scalar functions, not everything from "self". If we `pub` those functions, others can also re-use them without having to use the struct ExecutionContext from DataFusion.
   
   I agree with you @alamb that the create_* may end up clustered with arguments. I think that there may be a change that would place all those in a struct a-la contextState that does not need to lock (e.g. just `HashMap`), so that they can be easily reused on a recursion without locking on every recursion, both of the plan and every expression on it.
   
   @andygrove , feel free to pick either commit to push, or if you prefer the third option you mentioned, that I will work it out and push.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#issuecomment-678609613


   Hey both, I think I implemented what we discussed, but the devil is always in the details.
   
   The main changes are:
   
   * `ExecutionContextState::datasources` and `ExecutionContextState::scalar_functions` are now boxed, not `Arc`. This is so that when we lock `ExecutionContextState`, both get locked, which means that we only need to lock `ExecutionContextState` (but threads can't change each independently).
   * `ExecutionContextState` is no longer clonable.
   * optimizers now expect references, not `Arc<Mutex<_>>`.
   * `SqlToRel` is now lifetimed, as it is now bound to `ExecutionContextState`.
   
   I feel that we should probably have a test to demonstrate that the `ExecutionContext` is thread safe (e.g. run two queries from two different threads).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove closed pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #8018:
URL: https://github.com/apache/arrow/pull/8018


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8018: ARROW-9815 [Rust] [DataFusion] Fixed deadlock caused by accessing the scalar functions' registry.

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8018:
URL: https://github.com/apache/arrow/pull/8018#issuecomment-677999442


   Thanks @jorgecarleitao I think both approaches look good too. Another potential solution is just to put the mutex lock in a smaller scope by putting it in curly braces e.g.
   
   ```rust
   let scalar_functions = {
       ctx_state
                   .expect("failed to lock mutex")
                       .lock()
                   .scalar_functions
                       .expect("failed to lock mutex")
                   .lock()
                       .scalar_functions
                   .expect("failed to lock mutex")
                       .lock()
                   .get(name)
                       .expect("Unable to lock mutex")
               {
                       .clone()
   };
   ```
   
   This way, the mutex gets released within this block instead of being kept open.
   
   @alamb Yes, the locking is yet more tech debt to be addressed. These things need to be mutable due to the dynamic/interactive nature of the context, but I'm sure there is a cleaner solution.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org