Posted to issues@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2020/12/21 18:49:01 UTC

[jira] [Resolved] (IMPALA-3160) Queries may not get cancelled if cancellation pool hits MAX_CANCELLATION_QUEUE_SIZE

     [ https://issues.apache.org/jira/browse/IMPALA-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-3160.
-----------------------------------
    Resolution: Won't Fix

Seems like this is largely an academic issue at this point so no point keeping it open.

> Queries may not get cancelled if cancellation pool hits MAX_CANCELLATION_QUEUE_SIZE
> -----------------------------------------------------------------------------------
>
>                 Key: IMPALA-3160
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3160
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 2.5.0
>            Reporter: Sailesh Mukil
>            Assignee: Thomas Tauber-Marshall
>            Priority: Minor
>
> ImpalaServer::MembershipCallback() uses the topic updates from the statestore to determine which backends are down. It then cancels every query still in flight on those backends by checking each entry in the query_locations_ map, which maps a backend to the queries running on it, against the current membership.
> If the cancellation queue would grow past MAX_CANCELLATION_QUEUE_SIZE, we skip the cancellation, hoping that the queue will have drained by the next heartbeat so that we can retry the cancellation of these queries.
> However, by that point we have already removed the failed backend from the query_locations_ map. So the next heartbeat never finds this backend, and the queries that were running on it are never cancelled.
> {code:java}
>     // Maps from query id (to be cancelled) to a list of failed Impalads that are
>     // the cause of the cancellation.
>     map<TUniqueId, vector<TNetworkAddress> > queries_to_cancel; // VVVV: LOCAL MAP
>     {
>       // Build a list of queries that are running on failed hosts (as evidenced by their
>       // absence from the membership list).
>       // TODO: crash-restart failures can give false negatives for failed Impala daemons.
>       lock_guard<mutex> l(query_locations_lock_);
>       QueryLocations::const_iterator loc_entry = query_locations_.begin();
>       while (loc_entry != query_locations_.end()) {
>         if (current_membership.find(loc_entry->first) == current_membership.end()) {
>           unordered_set<TUniqueId>::const_iterator query_id = loc_entry->second.begin();
>           // Add failed backend locations to all queries that ran on that backend.
>           for (; query_id != loc_entry->second.end(); ++query_id) {
>             vector<TNetworkAddress>& failed_hosts = queries_to_cancel[*query_id];
>             failed_hosts.push_back(loc_entry->first);
>           }
>           exec_env_->impalad_client_cache()->CloseConnections(loc_entry->first);
>           // We can remove the location wholesale once we know backend's failed. To do so
>           // safely during iteration, we have to be careful not to invalidate the current
>           // iterator, so copy the iterator to do the erase(..) and advance the original.
>           QueryLocations::const_iterator failed_backend = loc_entry;
>           ++loc_entry;
>           // VVVV: WE ERASE THE ENTRY FROM THE GLOBAL MAP HERE.
>           query_locations_.erase(failed_backend);
>         } else {
>           ++loc_entry;
>         }
>       }
>     }
>     if (cancellation_thread_pool_->GetQueueSize() + queries_to_cancel.size() >
>         MAX_CANCELLATION_QUEUE_SIZE) {
>       // Ignore the cancellations - we'll be able to process them on the next heartbeat
>       // instead.
>       LOG_EVERY_N(WARNING, 60) << "Cancellation queue is full";
>       // VVVV: WE DON'T CANCEL HERE AND BY THE NEXT HEARTBEAT, WE WON'T FIND THE FAILED BACKEND AGAIN.
>     }
> {code}
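
The ordering problem described above can be reproduced in isolation. The following is a minimal sketch, not Impala code: Backend, QueryId, Sim and OnMembershipUpdate are hypothetical stand-ins, and the cancellation pool is reduced to a plain vector. With erase_eagerly set to true it mirrors the reported behaviour (the location entry is erased before the queue-size check). With erase_eagerly set to false it shows one possible alternative, assumed here for illustration only, in which the entry is erased only after the cancellations have been queued, so a later heartbeat can retry.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for Impala's types; none of these names are Impala's API.
using Backend = std::string;
using QueryId = int;
using QueryLocations = std::map<Backend, std::set<QueryId>>;

struct Sim {
  QueryLocations query_locations;     // plays the role of query_locations_
  std::vector<QueryId> cancel_queue;  // plays the role of the cancellation pool's queue
  std::size_t max_queue_size = 0;     // plays the role of MAX_CANCELLATION_QUEUE_SIZE
};

// Handles one membership update. Returns the number of queries queued for
// cancellation during this call (0 if the queue was considered full).
std::size_t OnMembershipUpdate(Sim& s, const std::set<Backend>& membership,
                               bool erase_eagerly) {
  assert(s.max_queue_size > 0);

  // Collect the queries running on backends that are absent from the membership.
  std::map<QueryId, std::vector<Backend>> queries_to_cancel;
  std::vector<Backend> failed_backends;
  for (const auto& entry : s.query_locations) {
    if (membership.count(entry.first) == 0) {
      for (QueryId q : entry.second) queries_to_cancel[q].push_back(entry.first);
      failed_backends.push_back(entry.first);
    }
  }

  // Reported behaviour: forget the failed backends before checking the queue.
  if (erase_eagerly) {
    for (const Backend& b : failed_backends) s.query_locations.erase(b);
  }

  if (s.cancel_queue.size() + queries_to_cancel.size() > s.max_queue_size) {
    // Queue full: skip. With eager erasure the failed backends are already
    // gone, so no later call can rediscover these queries.
    return 0;
  }

  for (const auto& entry : queries_to_cancel) s.cancel_queue.push_back(entry.first);

  // Illustrative alternative: forget the backends only once their queries are queued.
  if (!erase_eagerly) {
    for (const Backend& b : failed_backends) s.query_locations.erase(b);
  }
  return queries_to_cancel.size();
}
```

In this sketch, a call that hits the full queue with erase_eagerly set to true leaves query_locations empty, and a second call with a larger limit still returns 0. With erase_eagerly set to false, the second call queues the pending queries. The deferred variant rescans the failed backend on every heartbeat until the queue has room, and it leaves out the CloseConnections() call from the original snippet.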


