You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Mirza Aliev (Jira)" <ji...@apache.org> on 2020/09/22 15:30:00 UTC

[jira] [Comment Edited] (IGNITE-12451) Introduce deadlock detection for cache entry reentrant locks

    [ https://issues.apache.org/jira/browse/IGNITE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200154#comment-17200154 ] 

Mirza Aliev edited comment on IGNITE-12451 at 9/22/20, 3:29 PM:
----------------------------------------------------------------

I see that your solution is based on a _tryLock_ idea, we also came to that because real-time locks graph analyzing brought a significant performance drop. 

I see that you introduced _hasLockCollisions_ method, but IMHO it gives a lot of false-positive results, this method does not search for cycles in locks acquisition order. Also, it does not guarantee that system won't hang because there is a chance that the retry-lock-acquisition loop also will lead to deadlocks. 

So, I believe that the right way to handle deadlock and let the system continue to work is the following: we need to throw an exception when _tryLock_ timeouts and a user will see _CachePartialUpdateCheckedException_. This solution is almost developed in my PR and will be ready in a few days.

 

[~alex_pl] what do you think? 

 

 

 


was (Author: maliev):
I see that your solution is based on a _tryLock_ idea, we also came to that because real-time locks graph analyzing brought a significant performance drop. 

I see that you introduced _hasLockCollisions_ method, but IMHO it gives a lot of false-positive results, this method does not search for cycles in locks acquisition order. Also, it does not guarantee that system won't hang because there is a chance that the retry-lock-acquisition loop also will lead to deadlocks. 

So, I believe that the right way to handle deadlock and let the system continue to work is the following: we need to throw an exception when _tryLock_ timeouts and a user will see _CachePartialUpdateCheckedException_. __ This solution is almost developed in my PR and will be ready in a few days.

 

[~alex_pl] what do you think? 

 

 

 

> Introduce deadlock detection for cache entry reentrant locks
> ------------------------------------------------------------
>
>                 Key: IGNITE-12451
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12451
>             Project: Ignite
>          Issue Type: Improvement
>    Affects Versions: 2.7.6
>            Reporter: Ivan Rakov
>            Assignee: Mirza Aliev
>            Priority: Major
>             Fix For: 2.10
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Aside from IGNITE-12365, we still have possible threat of cache-entry-level deadlock in case of careless usage of JCache mass operations (putAll, removeAll):
> 1. If two different user threads will perform putAll on the same two keys in reverse order (primary node for which is the same), there's a chance that sys-stripe threads will be deadlocked.
> 2. Even without direct contract violation from user side, HashMap can be passed as argument for putAll. Even if user threads have called mass operations with two keys in the same order, HashMap iteration order is not strictly defined, which may cause the same deadlock. 
> Local deadlock detection should mitigate this issue. We can create a wrapper for ReentrantLock with logic that performs cycle detection in wait-for graph in case we are waiting for lock acquisition for too long. Exception will be thrown from one of the threads in such case, failing user operation, but letting the system make progress.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)