You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@impala.apache.org by "Joe McDonnell (JIRA)" <ji...@apache.org> on 2017/09/07 16:46:00 UTC

[jira] [Resolved] (IMPALA-5750) Handle uncaught exceptions in thread creation

     [ https://issues.apache.org/jira/browse/IMPALA-5750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell resolved IMPALA-5750.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.11.0

commit e993b9712c81dfb66fdf65bb5269cdc38a8eef18
Author: Joe McDonnell <jo...@cloudera.com>
Date:   Wed Aug 16 17:16:45 2017 -0700

    IMPALA-5750: Catch exceptions from boost thread creation
    
    The boost thread constructor will throw boost::thread_resource_error
    if it is unable to spawn a thread on the system
    (e.g. due to a ulimit). This uncaught exception crashes
    Impala. Systems with a large number of nodes and threads
    are hitting this limit.
    
    This change catches the exception from the thread
    constructor and converts it to a Status. This requires
    several changes:
    1. util/thread.h's Thread constructor is now private
    and all Threads are constructed via a new Create()
    static factory method.
    2. util/thread-pool.h's ThreadPool requires that Init()
    be called after the ThreadPool is constructed.
    3. To propagate the Status, Threads cannot be created in
    constructors, so this is moved to initialization methods
    that can return Status.
    4. Threads now use unique_ptr's for management in all
    cases. Threads cannot be used as stack-allocated local
    variables or direct declarations in classes.
    
    Query execution code paths will now handle the error:
    1. If the scan node fails to spawn any scanner thread,
    it will abort the query.
    2. Failing to spawn a fragment instance from the query
    state in StartFInstances() will correctly report the error
    to the coordinator and tear down the query.
    
    Testing:
    This introduces the parameter thread_creation_fault_injection,
    which will cause Thread::Create() calls in eligible
    locations to fail randomly roughly 1% of the time.
    Quite a few locations of Thread::Create() and
    ThreadPool::Init() are necessary for startup and cannot
    be eligible. However, all the locations used for query
    execution are marked as eligible and governed by this
    parameter. The code was tested by setting this parameter
    to true and running queries to verify that queries either
    run to completion with the correct result or fail with
    appropriate status.
    
    Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
    Reviewed-on: http://gerrit.cloudera.org:8080/7730
    Reviewed-by: Alex Behm <al...@cloudera.com>
    Tested-by: Impala Public Jenkins


> Handle uncaught exceptions in thread creation
> ---------------------------------------------
>
>                 Key: IMPALA-5750
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5750
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.9.0
>            Reporter: Silvius Rus
>            Assignee: Joe McDonnell
>            Priority: Critical
>              Labels: crash
>             Fix For: Impala 2.11.0
>
>
> Impalad uses noexcept API of boost library whenever possible. However, there are certain API which don't implement the noexcept variant. One example of this is the thread creation interface:
> {noformat}
> void Thread::StartThread(const ThreadFunctor& functor) {
>   DCHECK(thread_manager.get() != nullptr)
>       << "Thread created before InitThreading called";
>   DCHECK(tid_ == UNINITIALISED_THREAD_ID) << "StartThread called twice";
>   Promise<int64_t> thread_started;
>   thread_.reset(
>       new thread(&Thread::SuperviseThread, name_, category_, functor, &thread_started));
>   // TODO: This slows down thread creation although not enormously. To make this faster,
>   // consider delaying thread_started.Get() until the first call to tid(), but bear in
>   // mind that some coordination is required between SuperviseThread() and this to make
>   // sure that the thread is still available to have its tid set.
>   tid_ = thread_started.Get();
>   VLOG(2) << "Started thread " << tid_ << " - " << category_ << ":" << name_;
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)