Posted to issues@impala.apache.org by "Thomas Tauber-Marshall (JIRA)" <ji...@apache.org> on 2017/08/23 16:11:00 UTC

[jira] [Resolved] (IMPALA-5749) Race in coordinator hits DCHECK on 'num_remaining_backends_ > 0'

     [ https://issues.apache.org/jira/browse/IMPALA-5749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Tauber-Marshall resolved IMPALA-5749.
--------------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.10.0

commit b98c621a801d75dc1e8f9603c858335548d54cfb
Author: Thomas Tauber-Marshall <tm...@cloudera.com>
Date:   Thu Aug 3 10:09:46 2017 -0700

    IMPALA-5749: coordinator race hits DCHECK 'num_remaining_backends_ > 0'
    
    In Coordinator::UpdateBackendExecStatus(), we call
    BackendState::IsDone() to check whether the backend has already
    completed and, if so, return without applying the update, to avoid
    updating num_remaining_backends_ twice for the same completed backend.
    
    The problem is that the value of BackendState::IsDone() is updated by
    the call to BackendState::ApplyExecStatusReport() that comes after it,
    but the two operations are not performed atomically. If there are
    two simultaneous calls to UpdateBackendExecStatus(), both can call
    IsDone(), both get 'false', and both then erroneously update
    num_remaining_backends_, hitting a DCHECK.
    
    This patch modifies ApplyExecStatusReport to return true iff this
    report transitioned the backend to a done status, and then only
    updates num_remaining_backends_ in this case, ensuring it is only
    updated once per backend.
    
    Testing:
    - Ran test_finst_cancel_when_query_complete 10,000 times without
      hitting the DCHECK (previously, it would hit about once per 300
      runs).
    
    Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
    Reviewed-on: http://gerrit.cloudera.org:8080/7577
    Reviewed-by: Dan Hecht <dh...@cloudera.com>
    Tested-by: Impala Public Jenkins
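
The fix described in the commit message can be illustrated with a minimal, self-contained sketch. This is not the actual Impala code: the classes below are simplified stand-ins that reuse the names from the commit message, and the 'report_is_final' parameter, the integer backend index, the locking layout, and the RemainingAfterDuplicateFinalReports() helper are assumptions made for illustration. The point it shows is that the done check and the done update happen under one lock inside ApplyExecStatusReport(), and that only the caller that receives 'true' decrements the counter.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Simplified stand-in for a per-backend state object (hypothetical layout).
class BackendState {
 public:
  // Returns true only for the single call that moves this backend from
  // "not done" to "done". The check and the update share one lock, so two
  // concurrent final reports cannot both observe "not done".
  bool ApplyExecStatusReport(bool report_is_final) {
    std::lock_guard<std::mutex> l(lock_);
    if (is_done_) return false;          // duplicate or late report
    if (!report_is_final) return false;  // backend is still running
    is_done_ = true;
    return true;
  }

 private:
  std::mutex lock_;
  bool is_done_ = false;
};

// Simplified stand-in for the coordinator (hypothetical layout).
class Coordinator {
 public:
  explicit Coordinator(int num_backends)
      : backend_states_(num_backends),
        num_remaining_backends_(num_backends) {}

  void UpdateBackendExecStatus(int backend_idx, bool report_is_final) {
    // Decrement only if this report caused the transition to done.
    if (backend_states_[backend_idx].ApplyExecStatusReport(report_is_final)) {
      std::lock_guard<std::mutex> l(lock_);
      assert(num_remaining_backends_ > 0);  // plays the role of the DCHECK
      --num_remaining_backends_;
    }
  }

  int num_remaining_backends() {
    std::lock_guard<std::mutex> l(lock_);
    return num_remaining_backends_;
  }

 private:
  std::vector<BackendState> backend_states_;
  std::mutex lock_;
  int num_remaining_backends_;
};

// Hypothetical helper: sends several concurrent final reports per backend
// and returns the number of backends still counted as remaining.
int RemainingAfterDuplicateFinalReports(int num_backends,
                                        int reports_per_backend) {
  Coordinator coord(num_backends);
  std::vector<std::thread> threads;
  for (int b = 0; b < num_backends; ++b) {
    for (int r = 0; r < reports_per_backend; ++r) {
      threads.emplace_back(
          [&coord, b] { coord.UpdateBackendExecStatus(b, true); });
    }
  }
  for (auto& t : threads) t.join();
  return coord.num_remaining_backends();
}
```

With the earlier pattern (a separate IsDone() call followed by an unconditional decrement), the duplicate reports in the helper could each pass the check and trip the assertion. In this sketch each backend is counted exactly once regardless of how many final reports arrive.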

> Race in coordinator hits DCHECK on 'num_remaining_backends_ > 0'
> ----------------------------------------------------------------
>
>                 Key: IMPALA-5749
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5749
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.10.0
>            Reporter: Thomas Tauber-Marshall
>            Assignee: Thomas Tauber-Marshall
>            Priority: Blocker
>             Fix For: Impala 2.10.0
>
>
> While running 'test_finst_cancel_when_query_complete' in a loop to reproduce a different issue, I found a race in Coordinator::UpdateBackendExecStatus that crashes Impala on the 'DCHECK_GT(num_remaining_backends_, 0)'.
> Only the first exec report received for a particular backend after it has completed is supposed to reach line 992, where we decrement 'num_remaining_backends_'. Per the comments, the BackendState::IsDone check on line 945 is supposed to ensure this.
> However, the check and the update aren't performed atomically, so two threads can enter UpdateBackendExecStatus at the same time, both check BackendState::IsDone and find it false, and then both proceed to decrement num_remaining_backends_, with the second one hitting the DCHECK.
