Posted to issues@impala.apache.org by "Zoram Thanga (JIRA)" <ji...@apache.org> on 2018/03/06 18:35:00 UTC
[jira] [Resolved] (IMPALA-2642) Potential deadlock in statestore error path

     [ https://issues.apache.org/jira/browse/IMPALA-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoram Thanga resolved IMPALA-2642.
----------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.12.0


{noformat}
commit f97634fcf9a2e48d307f19595e70bdcee9c1980e
Author: Zoram Thanga <zo...@cloudera.com>
Date:   Tue Jan 16 12:01:09 2018 -0800

    IMPALA-2642: Fix a potential deadlock in statestore
    
    The statestored can deadlock if the number of subscribers has
    reached STATESTORE_MAX_SUBSCRIBERS, because DoSubscriberUpdate()
    calls OfferUpdate() while holding subscribers_lock_, and
    OfferUpdate() tries to take the same lock in this situation.
    
    Fix the issue by moving the acquisition of subscribers_lock_ out of
    OfferUpdate() and relying on the callers to take it. We also make
    the maximum number of statestore subscribers a startup-time tunable,
    to allow us to test the limit more easily.
    
    Testing: The problem is easily reproduced by lowering the value of
    STATESTORE_MAX_SUBSCRIBERS to 3, and then launching a mini cluster
    with 3 impalads. Without the fix, the statestored becomes completely
    deadlocked.
    
    A new EE test has been added to exercise this scenario. The test
    verifies that statestored correctly rejects new subscription
    requests when the limit is reached.
    
    Change-Id: I5d49dede221ce1f50ec299643b5532c61f93f0c6
    Reviewed-on: http://gerrit.cloudera.org:8080/9038
    Reviewed-by: Sailesh Mukil <sa...@cloudera.com>
    Tested-by: Impala Public Jenkins

{noformat}
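
The locking change described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual Impala code: `MiniStatestore`, `OfferUpdateLocked()`, `max_queued_`, and the two counter accessors are hypothetical names, and dropping the subscriber when the queue is full is an assumption made only so that the helper touches state guarded by `subscribers_lock_`. The point is that the helper no longer acquires the non-recursive mutex itself; it documents that the caller must already hold it.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the statestore. Only the locking pattern is
// meant to mirror the fix; the bodies are illustrative assumptions.
class MiniStatestore {
 public:
  explicit MiniStatestore(std::size_t max_queued) : max_queued_(max_queued) {
    assert(max_queued > 0);
  }

  void RegisterSubscriber(const std::string& id) {
    std::lock_guard<std::mutex> l(subscribers_lock_);
    subscribers_.insert(id);
  }

  // Takes subscribers_lock_ exactly once, then calls the helper, which
  // relies on the lock already being held and never locks again.
  bool DoSubscriberUpdate(const std::string& id) {
    std::lock_guard<std::mutex> l(subscribers_lock_);
    if (subscribers_.count(id) == 0) return false;
    return OfferUpdateLocked(id);
  }

  std::size_t NumSubscribers() {
    std::lock_guard<std::mutex> l(subscribers_lock_);
    return subscribers_.size();
  }

  std::size_t NumQueued() {
    std::lock_guard<std::mutex> l(subscribers_lock_);
    return queue_.size();
  }

 private:
  // Precondition: the caller holds subscribers_lock_.
  // Before the fix, the equivalent of this function locked the mutex itself
  // on the queue-full path, which self-deadlocks when the caller holds it.
  bool OfferUpdateLocked(const std::string& id) {
    if (queue_.size() >= max_queued_) {
      subscribers_.erase(id);  // assumed error handling: drop the subscriber
      return false;
    }
    queue_.push_back(id);
    return true;
  }

  const std::size_t max_queued_;
  std::mutex subscribers_lock_;  // guards subscribers_ and queue_
  std::unordered_set<std::string> subscribers_;
  std::deque<std::string> queue_;
};
```

With a limit of 1, the first `DoSubscriberUpdate()` call queues an update and the second hits the queue-full path and returns `false` instead of hanging. The trade-off of this design is that every caller of the helper must remember to take the lock first, so the precondition needs to be stated clearly at the declaration.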


> Potential deadlock in statestore error path
> -------------------------------------------
>
>                 Key: IMPALA-2642
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2642
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 2.3.0
>            Reporter: Henry Robinson
>            Assignee: Zoram Thanga
>            Priority: Minor
>              Labels: hang, statestore
>             Fix For: Impala 2.12.0
>
>
> I just noticed this while reading the statestore code: {{OfferUpdate()}} takes {{subscribers_lock_}} if the update queue is full, but in one place that lock has already been taken by the caller:
> {code}
>  {
>     lock_guard<mutex> l(subscribers_lock_);
>     // ... snip ...
>     if (state == FailureDetector::FAILED) {
>       if (is_heartbeat) {
>         // TODO: Consider if a metric to track the number of failures would be useful.
>         LOG(INFO) << "Subscriber '" << subscriber->id() << "' has failed, disconnected "
>                   << "or re-registered (last known registration ID: " << update.second
>                   << ")";
>         UnregisterSubscriber(subscriber.get());
>       }
>     } else {
>       // Schedule the next message.
>       VLOG(3) << "Next " << (is_heartbeat ? "heartbeat" : "update") << " deadline for: "
>               << subscriber->id() << " is in " << deadline_ms << "ms";
>       // vvvvvvvvvvvvvvvvvv oops vvvvvvvvvvvvvvvvvv
>       OfferUpdate(make_pair(deadline_ms, subscriber->id()), is_heartbeat ?
>           &subscriber_heartbeat_threadpool_ : &subscriber_topic_update_threadpool_);
>     }
>   }
> {code}
> It's not as scary as it sounds, because if the update queue has > 10000 entries there's something wrong anyway, but we should fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)