You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@impala.apache.org by "Zoram Thanga (JIRA)" <ji...@apache.org> on 2018/03/06 18:35:00 UTC
[jira] [Resolved] (IMPALA-2642) Potential deadlock in statestore
error path
[ https://issues.apache.org/jira/browse/IMPALA-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zoram Thanga resolved IMPALA-2642.
----------------------------------
Resolution: Fixed
Fix Version/s: Impala 2.12.0
{noformat}
commit f97634fcf9a2e48d307f19595e70bdcee9c1980e
Author: Zoram Thanga <zo...@cloudera.com>
Date: Tue Jan 16 12:01:09 2018 -0800
IMPALA-2642: Fix a potential deadlock in statestore
The statestored can deadlock if the number of subscribers has
reached STATESTORE_MAX_SUBSCRIBERS, because the DoSubscriberUpdate()
method calls OfferUpdate(), while holding subscribers_lock_, which
also tries to take the same lock in this situation.
Fix the issue by moving out the call to acquire subscribers_lock_ from
OfferUpdate(), and depend on the callers to take it. We also make
the maximum number of statestore subscribers a start-up time tuneable,
to allow us to test the limit more easily.
Testing: The problem is easily reproduced by lowering the value of
STATESTORE_MAX_SUBSCRIBERS to 3, and then launching a mini cluster
with 3 impalads. Without the fix, the statestored becomes completely
deadlocked.
A new EE test has been added to exercise this scenario. The test
verifies that statestored correctly rejects new subscription
requests when the limit it reached.
Change-Id: I5d49dede221ce1f50ec299643b5532c61f93f0c6
Reviewed-on: http://gerrit.cloudera.org:8080/9038
Reviewed-by: Sailesh Mukil <sa...@cloudera.com>
Tested-by: Impala Public Jenkins
{noformat}
> Potential deadlock in statestore error path
> -------------------------------------------
>
> Key: IMPALA-2642
> URL: https://issues.apache.org/jira/browse/IMPALA-2642
> Project: IMPALA
> Issue Type: Bug
> Components: Distributed Exec
> Affects Versions: Impala 2.3.0
> Reporter: Henry Robinson
> Assignee: Zoram Thanga
> Priority: Minor
> Labels: hang, statestore
> Fix For: Impala 2.12.0
>
>
> I just noticed this while reading the statestore code: {{OfferUpdate()}} takes {{subscribers_lock_}} if the update queue is full, but in one place that lock has already been taken by the caller:
> {code}
> {
> lock_guard<mutex> l(subscribers_lock_);
> // ... snip ...
> if (state == FailureDetector::FAILED) {
> if (is_heartbeat) {
> // TODO: Consider if a metric to track the number of failures would be useful.
> LOG(INFO) << "Subscriber '" << subscriber->id() << "' has failed, disconnected "
> << "or re-registered (last known registration ID: " << update.second
> << ")";
> UnregisterSubscriber(subscriber.get());
> }
> } else {
> // Schedule the next message.
> VLOG(3) << "Next " << (is_heartbeat ? "heartbeat" : "update") << " deadline for: "
> << subscriber->id() << " is in " << deadline_ms << "ms";
> // vvvvvvvvvvvvvvvvvv oops vvvvvvvvvvvvvvvvvv
> OfferUpdate(make_pair(deadline_ms, subscriber->id()), is_heartbeat ?
> &subscriber_heartbeat_threadpool_ : &subscriber_topic_update_threadpool_);
> }
> }
> {code}
> It's not as scary as it sounds, because if the update queue has > 10000 entries there's something wrong anyway, but we should fix this.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)