You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Joe McDonnell (Code Review)" <ge...@cloudera.org> on 2017/08/18 22:20:17 UTC

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Joe McDonnell has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7730

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................

IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw
boost::thread_resource_error if it is unable to spawn
a thread on the system (e.g. due to a ulimit). This
uncaught exception crashes Impala. Several customers
are seeing this.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool constructor is now
private and all ThreadPools are constructed via a
new Create() static factory method.
3. To propagate the Status, Threads and ThreadPools
cannot be created in constructors, so this is moved to
initialization methods that can return Status.
4. Since Threads and ThreadPools need to be passed
as arguments to the Create method, this eliminates
the ability to use them as stack-allocated local
variables or direct declarations in classes. These
have been replaced with scoped_ptr's.

Remaining TODO:
1. QueryState::StartFInstances() needs appropriate
cleanup if thread create fails. This is intricate.
2. There is some duplicate code between ThreadPool
and CallableThreadPool to try to eliminate.
3. The Thread::Create and ThreadPool::Create have
both pointer and scoped_ptr versions. It would be
nice to only have one version.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/kudu-scan-node.cc
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-thread.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/scheduling/admission-controller.cc
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
38 files changed, 371 insertions(+), 177 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/7730/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 5: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7730/5/be/src/util/thread.cc
File be/src/util/thread.cc:

Line 48: DEFINE_bool(thread_creation_fault_injection, false, "Causes calls to Thread::Create() "
Nice!

It seems like this should probably be grouped with other stress options in common/global-flags.cc. Those are also only enabled for debug builds. I guess it could be interesting to test on release builds since the timing will be different but ensure if it's worth making this a special case.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 2:

(8 comments)

Did an initial pass.

http://gerrit.cloudera.org:8080/#/c/7730/1//COMMIT_MSG
Commit Message:

Line 8: 
nit: You can make it up to 72 chars per line  to make this more readable.


http://gerrit.cloudera.org:8080/#/c/7730/2/be/src/rpc/TAcceptQueueServer.cpp
File be/src/rpc/TAcceptQueueServer.cpp:

Line 226:     stop_ = true;
Should we throw an exception here? What's the behavior when it continues executing?


http://gerrit.cloudera.org:8080/#/c/7730/2/be/src/statestore/statestored-main.cc
File be/src/statestore/statestored-main.cc:

Line 71:   ABORT_IF_ERROR(RegisterMemoryMetrics(metrics.get(), false, nullptr, nullptr));
You also need to do:
ABORT_IF_ERROR(StartMemoryMaintenanceThread());

in impalad-main.cc and catalogd-main.cc


PS2, Line 72: StartMemoryMaintenanceThread();
ABORT_IF_ERROR(StartMemoryMaintenanceThread());


http://gerrit.cloudera.org:8080/#/c/7730/2/be/src/util/thread-pool.h
File be/src/util/thread-pool.h:

Line 46:   ///  -- work_function: the function to run every time an item is consumed from the queue
Update comment to include 'tpool' arg.


PS2, Line 184: /// TODO: add comment
Comment?


PS2, Line 188:     DCHECK(pool == nullptr);
             :     boost::scoped_ptr<CallableThreadPool> local_tpool;
             :     local_tpool.reset(new CallableThreadPool(queue_size, &CallableThreadPool::Worker));
             :     RETURN_IF_ERROR(local_tpool->CreateThreads(group, thread_prefix, num_threads));
             :     pool.swap(local_tpool);
             :     return Status::OK();
Why not just call the parent's Create() here?

i.e. ThreadPool::Create()


http://gerrit.cloudera.org:8080/#/c/7730/2/be/src/util/thread.cc
File be/src/util/thread.cc:

PS2, Line 319: local_thread
Just call 't' for consistency with the above StartThread(). 'local_thread' sounds a bit confusing.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 6: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7730

to look at the new patch set (#9).

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................

IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.

Query execution code paths will now handle the error:
1. If the scan node fails to spawn any scanner thread,
it will abort the query.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.

Testing:
This introduces the parameter thread_creation_fault_injection,
which will cause Thread::Create() calls in eligible
locations to fail randomly roughly 1% of the time.
Quite a few locations of Thread::Create() and
ThreadPool::Init() are necessary for startup and cannot
be eligible. However, all the locations used for query
execution are marked as eligible and governed by this
parameter. The code was tested by setting this parameter
to true and running queries to verify that queries either
run to completion with the correct result or fail with
appropriate status.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/catalog/catalogd-main.cc
M be/src/common/global-flags.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scan-node.h
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/auth-provider.h
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M be/src/rpc/thrift-thread.cc
M be/src/rpc/thrift-thread.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr-handle-cache.h
M be/src/runtime/disk-io-mgr-handle-cache.inline.h
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/thread-resource-mgr.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/impalad-main.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/testutil/in-process-servers.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
M common/thrift/generate_error_codes.py
55 files changed, 471 insertions(+), 238 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/7730/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 8:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7730/7/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:

PS7, Line 171: ++num_active_scanners_;
> num_active_scanners_ is protected by lock_. We get lock_ at the top of this
My bad, I didn't see that lock.


PS7, Line 173: scanner_threads_.AddThread(move(t));
There may be another race here where if the KuduScanNode is closed before we call scanner_threads_.AddThread(), we may not join a running thread.

It's a little hard to follow, but I'll try to explain it as best as I can.

ThreadAvailableCb() can get called in the context of a ScannerThread (from ReleaseThreadToken() in RunScannerThread()), whereas Close() gets called from the fragment instance execution thread.

ReleaseThreadToken() (L253) calls InvokeCallbacks() which calls ThreadAvailableCb() in the KuduScanNode. In ThreadAvailableCb(), if there's still remaining work, we'll try to spawn a new scanner thread again.

So if Close() gets called when a scanner thread calls ReleaseThreadToken(), we'll end up calling scanner_threads_.JoinAll() (L133) before we call scanner_threads_.AddThread(). We also don't synchronize with lock_ in Close() if done_ == true.

If this is a race, it looks like it existed before your code change, except that now, maybe the window for failure is slightly larger.


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/query-exec-mgr.cc
File be/src/runtime/query-exec-mgr.cc:

PS8, Line 63: qs->ReleaseInitialReservationRefcount();
nit: Add a comment saying that it was taken in Init().


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has uploaded a new patch set (#2).

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................

IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw
boost::thread_resource_error if it is unable to spawn
a thread on the system (e.g. due to a ulimit). This
uncaught exception crashes Impala. Systems with a large
number of nodes and threads are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool constructor is now
private and all ThreadPools are constructed via a
new Create() static factory method.
3. To propagate the Status, Threads and ThreadPools
cannot be created in constructors, so this is moved to
initialization methods that can return Status.
4. Since Threads and ThreadPools need to be passed
as arguments to the Create method, this eliminates
the ability to use them as stack-allocated local
variables or direct declarations in classes. These
have been replaced with scoped_ptr's.

Remaining TODO:
1. QueryState::StartFInstances() needs appropriate
cleanup if thread create fails. This is intricate.
2. There is some duplicate code between ThreadPool
and CallableThreadPool to try to eliminate.
3. The Thread::Create and ThreadPool::Create have
both pointer and scoped_ptr versions. It would be
nice to only have one version.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/kudu-scan-node.cc
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-thread.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/scheduling/admission-controller.cc
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/util/hdfs-bulk-ops.cc
M be/src/util/hdfs-bulk-ops.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
38 files changed, 371 insertions(+), 177 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/7730/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 6:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:

PS6, Line 159:     Status status;
             :     status =
> nit: combine these lines
Done


PS6, Line 163: num_scanner_threads_started_counter_
> here and other scan node: I guess we undo this rather than just waiting unt
I checked, and this is only used as a statistic and for making the names of the threads unique. I changed this to increment after successful thread create (here and for hdfs-scan-node.cc). (A side effect is that the thread indexes start from zero.)


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/runtime/query-exec-mgr.cc
File be/src/runtime/query-exec-mgr.cc:

PS6, Line 59:   RETURN_IF_ERROR(Thread::Create("query-exec-mgr",
            :       Substitute("start-query-finstances-$0", PrintId(query_id)),
            :           &QueryExecMgr::StartQueryHelper, this, qs, &t, true));
> Yes, that looks like a bug. I will fix in the next upload.
Done


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS6, Line 199:     // If this fragment instance is not prepared, it must be reporting an error.
             :     DCHECK(fis->IsPrepared() || !status.ok());
> If fis==null case doesn't report the error, then that's a bug we need to fi
Looking into a fix. I will also do fault injection at that point to see what happens. The two might be slightly different, because in that case no fragment instances have been started.


PS6, Line 331:       // Remove fis from fis_map_ and fragment_map_
             :       fis_map_.erase(fis->instance_id());
             :       fis_list.pop_back();
> why don't we just do this on the success path? is it to try to be consisten
I checked and all other uses of these two things check that the instances_preapred_promise_ is set. So, I moved it to the success path, since that is safe.


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/runtime/thread-resource-mgr.h
File be/src/runtime/thread-resource-mgr.h:

PS6, Line 115: bool in_callback = false);
> this seems complicated. why is it needed now?
ReleaseThreadToken calls InvokeCallbacks. The two locations where we use this (hdfs-scan-node.cc and kudu-scan-node.cc) are in functions that are registered as callbacks. If we try to release inside the callback, this is mutual recursion.


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/service/child-query.h
File be/src/service/child-query.h:

Line 159:   Status ExecAsync(std::vector<ChildQuery>&& child_queries);
> WARN_UNUSED_RESULT (or are we relying on the Status annotation now?)
Added WARN_UNUSED_RESULT.


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/service/client-request-state.cc
File be/src/service/client-request-state.cc:

PS6, Line 598: Thread::Create("query-exec-state", "wait-thread",
             :       &ClientRequestState::Wait, this, &wait_thread_);
> Good point. I will add it to the list of locations that are eligible for fa
Done


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

PS6, Line 71:   Status status;
            :   status = request_state->WaitAsync();
> nit: combine
Done


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/util/thread.cc
File be/src/util/thread.cc:

Line 304:     if ((nanos % 100) == 1) {
> Good point, will make that change.
Changed to use rand()


Line 327:   thread->swap(t);
> *thread = move(t) seems clearer
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 11:

(1 comment)

Rebased all the way and incorporated changes from IMPALA-5892.

http://gerrit.cloudera.org:8080/#/c/7730/11/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

PS11, Line 361: This
> The "this" was confusing the first time I read it (I thought it was referri
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7730

to look at the new patch set (#7).

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................

IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.

Query execution code paths will now handle the error:
1. If the scan node fails to spawn any scanner thread,
it will abort the query.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.

Testing:
This introduces the parameter thread_creation_fault_injection,
which will cause Thread::Create() calls in eligible
locations to fail randomly roughly 1% of the time.
Quite a few locations of Thread::Create() and
ThreadPool::Init() are necessary for startup and cannot
be eligible. However, all the locations used for query
execution are marked as eligible and governed by this
parameter. The code was tested by setting this parameter
to true and running queries to verify that queries either
run to completion with the correct result or fail with
appropriate status.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/catalog/catalogd-main.cc
M be/src/common/global-flags.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scan-node.h
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/auth-provider.h
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M be/src/rpc/thrift-thread.cc
M be/src/rpc/thrift-thread.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr-handle-cache.h
M be/src/runtime/disk-io-mgr-handle-cache.inline.h
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/thread-resource-mgr.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/impalad-main.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/testutil/in-process-servers.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
M common/thrift/generate_error_codes.py
55 files changed, 464 insertions(+), 235 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/7730/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7730

to look at the new patch set (#6).

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................

IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.

Query execution code paths will now handle the error:
1. If the scan node fails to spawn any scanner thread,
it will abort the query.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.

Testing:
This introduces the parameter thread_creation_fault_injection,
which will cause Thread::Create() calls in eligible
locations to fail randomly roughly 1% of the time.
Quite a few locations of Thread::Create() and
ThreadPool::Init() are necessary for startup and cannot
be eligible. However, all the locations used for query
execution are marked as eligible and governed by this
parameter. The code was tested by setting this parameter
to true and running queries to verify that queries either
run to completion with the correct result or fail with
appropriate status.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/catalog/catalogd-main.cc
M be/src/common/global-flags.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scan-node.h
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/auth-provider.h
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M be/src/rpc/thrift-thread.cc
M be/src/rpc/thrift-thread.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr-handle-cache.h
M be/src/runtime/disk-io-mgr-handle-cache.inline.h
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/thread-resource-mgr.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/impalad-main.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/testutil/in-process-servers.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
M common/thrift/generate_error_codes.py
55 files changed, 460 insertions(+), 227 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/7730/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7730/3//COMMIT_MSG
Commit Message:

Line 28: variables or direct declarations in classes.
> Yeah, it does seem like there's an important distinction there.
I labeled the locations that are eligible for fault injection, and then added a parameter to turn on thread creation fault injection. This should make testing much easier.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

Line 361:       COUNTER_ADD(&active_scanner_thread_counter_, -1);
> Oh I see. Seems ok. I'm a little skeptical that it's worth trying to recove
There are arguments in either direction. Given that we think this should be rare, I think it makes sense to go ahead and abort the query. I changed this code and kudu-scan-node.cc to do that.


http://gerrit.cloudera.org:8080/#/c/7730/4/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:

Line 164:       UndoGetNextScanToken(token);
> In this case it does seem simpler to fail the query, since it we mess this 
Changed this to fail the query.


http://gerrit.cloudera.org:8080/#/c/7730/4/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

Line 330:       // Either get this to work or remove before merge.
> Still relevant?
Removed


Line 350:     Status instance_status = entry.second->WaitForPrepare();
> I think we should preserve the postcondition that all fragment instances ha
That makes sense. One thing that I noticed when making this change is that in some circumstances the fragment instance will get into an RPC call before the query starts cancelling. In this case, the RPC call needs to timeout for the query to complete its cancel. The query does eventually end and nothing crashes.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/util/thread-pool.h
File be/src/util/thread-pool.h:

Line 61: 
> Yeah I think that would make most sense to me - reduces the number of diffe
Fixed this by doing Shutdown() + Join() in the error condition.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/util/thread.cc
File be/src/util/thread.cc:

Line 301:   unique_ptr<Thread> t(new Thread(category, name));
> Sounds like a good idea to me.
Added.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

Line 226:     if (fis->IsPrepared()) {
> should this check that the prepare status is ok? if Prepare() failed, then 
The first thing that Prepare() does is allocate the RuntimeState. It has a comment saying not to return before that. See https://github.com/apache/incubator-impala/blob/master/be/src/runtime/fragment-instance-state.cc#L119
We also check whether fis->profile() is null. So, I think this should be safe.

When IMPALA-5892 goes in, I will just use fis==nullptr. That will make this part of the change unnecessary, as nothing needs to send status for an unprepared fragment instance.


Line 328:       ExecEnv::GetInstance()->query_exec_mgr()->ReleaseQueryState(this);
> this is a little weird since ReleaseQueryState() can delete this. Except it
Added a comment.


PS8, Line 332: true
> passing instances_started=true can lead to a deadlock if the RPC send fails
Good point, changed this to instances_started=false. If the RPC to the coordinator fails, I think there are other ways to hang, but this is clearly wrong. I'll think about whether there is a better way to handle this case.


PS8, Line 336: update fis_map_
> nit: seems excessive
Done


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/thread-resource-mgr.h
File be/src/runtime/thread-resource-mgr.h:

Line 288:   if (!in_callback) InvokeCallbacks();
> As we discussed in person, I don't think it's safe even in that case genera
I went with option #1. I added a comment with a warning to thread-resource-mgr.h and noted why we were using skip_callbacks in kudu-scan-node.cc and hdfs-scan-node.cc.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7730/5/be/src/util/thread.cc
File be/src/util/thread.cc:

Line 48: DEFINE_bool(thread_creation_fault_injection, false, "Causes calls to Thread::Create() "
> Nice!
That makes sense. It is good for it to be close to the other fault injection options. I made this only for DEBUG builds. I think that is sufficient for now.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 2:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7730/2//COMMIT_MSG
Commit Message:

Line 34: 1. QueryState::StartFInstances() needs appropriate
> I did that for query-state.cc, but I think that is probably the most common
Yeah makes sense.


http://gerrit.cloudera.org:8080/#/c/7730/3//COMMIT_MSG
Commit Message:

Line 28: as arguments to the Create method, this eliminates
For testing, it might be worth adding a stress startup option to make thread creation randomly fail. It seems hard to make a non-flaky automated test but it would be interesting to run some queries against it and see what the behaviour is.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

Line 361:       COUNTER_ADD(num_scanner_threads_started_counter_, -1);
It seems ok to omit fixing up num_scanner_threads_started_counter_. I don't think it makes a difference whether it counts the number of successful starts versus number of attempted starts. Although this way is fine too.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:

Line 154:     ++num_active_scanners_;
This is protected by lock_ so we could just wait until the thread is created successfully to update it


Line 170:       COUNTER_ADD(num_scanner_threads_started_counter_, -1);
Same as the HDFS scan node - it seems ok to omit this fixup.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/rpc/thrift-thread.cc
File be/src/rpc/thrift-thread.cc:

Line 37:   if (!status.ok()) throw atc::SystemResourceException("Thread::Create() failed");
This could use some explanation, since it's an unusual idiom in the code. I.e. why do we need to throw an exception and how do we expect thrift to handle it.


http://gerrit.cloudera.org:8080/#/c/7730/2/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

Line 359:   Status prepare_status;
It seems like there's a fair bit of commonality between the case when Prepare() fails and the case when thread creating fails. Maybe we can share more of the code.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

Line 469:       HS2_RETURN_ERROR(return_val, status.GetDetail(), SQLSTATE_GENERAL_ERROR);
It looks like we have identical error-handling code in three places. May be cleaner to combine into an error handling block at the end of the function + gotos.


PS3, Line 476: (void) 
discard_result


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/util/thread-pool.h
File be/src/util/thread-pool.h:

Line 61:     Shutdown();
How do things get cleaned up on an error? Does Shutdown() handle that case?


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/util/thread.cc
File be/src/util/thread.cc:

Line 301:   } catch (boost::thread_resource_error& e) {
Might be worth adding a status code to common/thrift/generate_error_codes.py for this error


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/util/thread.h
File be/src/util/thread.h:

PS3, Line 69: 
we'd normally use pointers for output arguments - https://google.github.io/styleguide/cppguide.html#Reference_Arguments


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 6:

(9 comments)

Still going through the comments, but I thought I'd put up some quick replies.

http://gerrit.cloudera.org:8080/#/c/7730/6//COMMIT_MSG
Commit Message:

Line 28: variables or direct declarations in classes.
> any reason Thread and ThreadPool don't follow the same pattern? i.e. both u
ThreadPool uses the constructor+init pattern because we subclass it for the CallableThreadPool. Subclassing gets awkward with the static create pattern. If someone forgets to run Init(), they find out when they offer a piece of work to the ThreadPool.

For Thread, I think it would be feasible to use the constructor+init pattern. The difference is that if someone forgets an Init(), they might be expecting a thread and it just isn't there.


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

PS6, Line 363: pool->ReleaseThreadToken(false, true);
> Do you need to drop the lock before calling this? Based on the comment from
Yes, that is what in_callback is for. ReleaseThreadToken can call these thread availability callbacks, so we could end up back in this code. The in_callback=true prevents that.


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/runtime/query-exec-mgr.cc
File be/src/runtime/query-exec-mgr.cc:

PS6, Line 59:   RETURN_IF_ERROR(Thread::Create("query-exec-mgr",
            :       Substitute("start-query-finstances-$0", PrintId(query_id)),
            :           &QueryExecMgr::StartQueryHelper, this, qs, &t, true));
> Shouldn't we call ReleaseQueryState(qs); if the thread failed to start here
Yes, that looks like a bug. I will fix in the next upload.


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS6, Line 199:     // If this fragment instance is not prepared, it must be reporting an error.
             :     DCHECK(fis->IsPrepared() || !status.ok());
> Is this necessarily true?
The race might exist. I'm looking at the code to check.

The old DCHECK is stricter than the new one. Right now, we require that fis is nullptr or fis->IsPrepared(). I'm carving out an exception for failure during thread spawn when the fis won't have a chance to be prepared.


PS6, Line 199:     // If this fragment instance is not prepared, it must be reporting an error.
             :     DCHECK(fis->IsPrepared() || !status.ok());
> I don't really understand this DCHECK.  oh, because we pass 'fis' in the ne
In testing, I noticed that fis==nullptr doesn't actually convey the status. If I use fis==nullptr, the coordinator never gets notified of the status and the query hangs. When I went through ReportExecStatusAux, I noticed that apart from some DCHECKs, we don't use the status argument unless fis is non-null. TReportExecStatusParams doesn't really have a place for status apart from in the TFragmentInstanceExecStatus's.


PS6, Line 328: prepare_status
> why is this called "prepare_status"?  the old 'prepare_status' was the stat
The idea is to share the codepath below, which is waiting for the fragment instance states to be prepared. To incorporate the thread create status into the instances_prepared_promise_, the status needs to be available outside the loop.


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/service/client-request-state.cc
File be/src/service/client-request-state.cc:

PS6, Line 598: Thread::Create("query-exec-state", "wait-thread",
             :       &ClientRequestState::Wait, this, &wait_thread_);
> Why not inject the thread failures here as well? I think it's an important 
Good point. I will add it to the list of locations that are eligible for fault injection. I will be doing more test runs as I go, so this will now be tested.


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/util/thread.cc
File be/src/util/thread.cc:

Line 304:     if ((nanos % 100) == 1) {
> on some platforms, the monotonic clock may not actually have nanosecond res
Good point, will make that change.


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/util/thread.h
File be/src/util/thread.h:

PS6, Line 158: thread
> nit: 'thread'
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 3:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/7730/3//COMMIT_MSG
Commit Message:

Line 28: variables or direct declarations in classes.
> For testing, it might be worth adding a stress startup option to make threa
I have done hand testing on a variety of codepaths. I'm looking into automating the testing. It is useful to make a distinction between the thread creates that are necessary for startup and/or instance life and those that are only needed for query success. I think tagging the query locations can lead to a repeatable test: either the queries return the right answer or they return this error.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

Line 361:       COUNTER_ADD(num_scanner_threads_started_counter_, -1);
> It seems ok to omit fixing up num_scanner_threads_started_counter_. I don't
I'm ok with either way. My thought is that in cases where the query doesn't abort we may want to know how many threads actually ran.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:

Line 154:     ++num_active_scanners_;
> This is protected by lock_ so we could just wait until the thread is create
Done


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/rpc/thrift-thread.cc
File be/src/rpc/thrift-thread.cc:

Line 37:   if (!status.ok()) throw atc::SystemResourceException("Thread::Create() failed");
> This could use some explanation, since it's an unusual idiom in the code. I
Added a comment for this, but it is a bit unclear to me how thrift handles the exception. It seems like it gets sent back to the client, but I'm not 100% sure.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

Line 469:       HS2_RETURN_ERROR(return_val, status.GetDetail(), SQLSTATE_GENERAL_ERROR);
> It looks like we have identical error-handling code in three places. May be
Changed to gotos and shared code.


PS3, Line 476: (void) 
> discard_result
Done


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/util/thread-pool.h
File be/src/util/thread-pool.h:

Line 61:   /// error spawning the threads.
> How do things get cleaned up on an error? Does Shutdown() handle that case?
When we get an error, only successfully started threads are in the threads_ ThreadGroup. ~ThreadPool calls Shutdown() and Join(), so we know that the threads will be cleaned up eventually. The owner of the ThreadPool might also call Shutdown() and Join(), and this is fine, because both of those functions are idempotent. The only thing the owner can do wrong is to Join() without calling Shutdown(), which makes no sense.

It might make sense to have the error case automatically call Shutdown() + Join().


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/util/thread.cc
File be/src/util/thread.cc:

Line 301:     return Status(e.what());
> Might be worth adding a status code to common/thrift/generate_error_codes.p
Added this. I don't think the category or name would be useful to the customer, but if they would be useful to us, we can add it to the error message.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/util/thread.h
File be/src/util/thread.h:

PS3, Line 69: std::unique_ptr<Thread>&
> we'd normally use pointers for output arguments - https://google.github.io/
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7730/6//COMMIT_MSG
Commit Message:

Line 28: variables or direct declarations in classes.
> ThreadPool uses the constructor+init pattern because we subclass it for the
Okay, let's leave that alone then.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Sailesh Mukil, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7730

to look at the new patch set (#11).

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................

IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.

Query execution code paths will now handle the error:
1. If the scan node fails to spawn any scanner thread,
it will abort the query.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.

Testing:
This introduces the parameter thread_creation_fault_injection,
which will cause Thread::Create() calls in eligible
locations to fail randomly roughly 1% of the time.
Quite a few locations of Thread::Create() and
ThreadPool::Init() are necessary for startup and cannot
be eligible. However, all the locations used for query
execution are marked as eligible and governed by this
parameter. The code was tested by setting this parameter
to true and running queries to verify that queries either
run to completion with the correct result or fail with
appropriate status.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/catalog/catalogd-main.cc
M be/src/common/global-flags.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scan-node.h
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/auth-provider.h
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M be/src/rpc/thrift-thread.cc
M be/src/rpc/thrift-thread.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr-handle-cache.h
M be/src/runtime/disk-io-mgr-handle-cache.inline.h
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/thread-resource-mgr.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/impalad-main.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/testutil/in-process-servers.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
M common/thrift/generate_error_codes.py
55 files changed, 491 insertions(+), 239 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/7730/11
-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 8:

(2 comments)

I need to go through query-state one more time, but otherwise here's my remaining comments.

http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/rpc/thrift-thread.cc
File be/src/rpc/thrift-thread.cc:

PS8, Line 40: throw atc::SystemResourceException("Thread::Create() failed");
is it worth incorporating the status msg into the exception message?


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/thread-resource-mgr.h
File be/src/runtime/thread-resource-mgr.h:

Line 288:   if (!in_callback) InvokeCallbacks();
This explains why we need to skip the callbacks. But why is it safe to do so? The callbacks are invoked here to give another thread a chance to start up, so why is it safe to skip that just because we're inside the callback? (Given the current callsite, it is safe, but it doesn't seem intuitively true when inside a callback generally).


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has uploaded a new patch set (#4).

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................

IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.

Query execution code paths will now handle the error:
1. Non-MT scan nodes, such as HDFS scan node and Kudu
scan node require at least one scanner thread to
make progress. This patch will abort the query if
the scan node cannot obtain at least one scanner thread.
However, if the scan node has at least one scanner
thread, the query will continue.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.

Testing:
Thread::Create() has an optional argument to randomly
inject thread creation errors at any single point in
the code. It causes roughly 1% of the calls to fail.
The code was tested by setting this option in different
code points and verifying that the queries either
run to completion with the correct result or fail with
the appropriate status.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/catalog/catalogd-main.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/kudu-scan-node-base.cc
M be/src/exec/kudu-scan-node-base.h
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scan-node.h
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/auth-provider.h
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M be/src/rpc/thrift-thread.cc
M be/src/rpc/thrift-thread.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr-handle-cache.h
M be/src/runtime/disk-io-mgr-handle-cache.inline.h
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/thread-resource-mgr.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/impalad-main.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/testutil/in-process-servers.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
M common/thrift/generate_error_codes.py
56 files changed, 444 insertions(+), 225 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/7730/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

Line 226:     if (fis->IsPrepared()) {
should this check that the prepare status is ok? if Prepare() failed, then these fields aren't guaranteed to be set, right? alternatively, we coudl check for state != NULL ptr, like we are doing for profile() and like in FragmentInstanceState::Close().


Line 328:       ExecEnv::GetInstance()->query_exec_mgr()->ReleaseQueryState(this);
this is a little weird since ReleaseQueryState() can delete this. Except it won't since the the QueryExcMgr is holding a refcnt.  Could you augment line 287 to make it a bit more obvious, like line 288 does:

<< "Taken in QueryExecMgr::StartQuery()"


PS8, Line 332: true
passing instances_started=true can lead to a deadlock if the RPC send fails (it's used to decide whether to call Cancel(), which waits for instances_prepared_promise_ to be set but that doesn't happen until later...)


PS8, Line 336: update fis_map_
nit: seems excessive


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/thread-resource-mgr.h
File be/src/runtime/thread-resource-mgr.h:

Line 288:   if (!in_callback) InvokeCallbacks();
> You're right. This is only safe from a callback if a thread did TryAcquireT
As we discussed in person, I don't think it's safe even in that case generally, because while the callback held the token, another client may have wanted to start a thread, and now won't get a callback.

I think these two options would be okay:

1) Rename 'in_callback' to 'skip_callbacks'.  That way, the side-effect is more obvious to the caller, and it's a decision the caller needs to make.  In the callsite you are adding, we should then document that the callbacks need to be skipped to avoid recursion, but also that it is safe to do so since the system has hit the thread limit and so there's no point in doing further callbacks right now.

2) Alternatively, you could pass back an "id" when registering the callback (say, the index of the callback), and then pass that in here so that the callback is skipped. 

Given that the ThreadResourceMgr has a limited lifespan (with MT work), option #1 seems okay and easier to reason about right now.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7730/7/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:

PS7, Line 173: scanner_threads_.AddThread(move(t));
> There may be another race here where if the KuduScanNode is closed before w
Here is my understanding:
SetDone() in Close() gets lock_ before calling SetDoneInternal(), so it holds lock_ regardless of whether done_ is true. The thread in ThreadAvailableCb() also gets lock_ before reading done_. So, they agree about done_ and either the thread in Close waits for the thread in ThreadAvailableCb() to add the thread and will need to join with it, or the thread in ThreadAvailableCb() will see done_==true and bail. 

I might be missing something.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has uploaded a new patch set (#3).

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................

IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/catalog/catalogd-main.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/kudu-scan-node.cc
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/auth-provider.h
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M be/src/rpc/thrift-thread.cc
M be/src/rpc/thrift-thread.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/impalad-main.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/testutil/in-process-servers.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
47 files changed, 297 insertions(+), 149 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/7730/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 11: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7730/11/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

PS11, Line 361: This
The "this" was confusing the first time I read it (I thought it was referring to the ReleaseThreadToken() call, but it's really referring to the "and skip").  Maybe say "Skipping serves two purposes." to clarify.  Here and the kudu scanner.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS8, Line 332: true
> instances_started only impacts the case when there are 3 repeated RPC failu
Ah right. So I think the assumption is that the coordinator may not be able to talk back to us to cancel, and that's why we want to cancel locally.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS6, Line 199:     // If this fragment instance is not prepared, it must be reporting an error.
             :     DCHECK(fis->IsPrepared() || !status.ok());
> The race might exist. I'm looking at the code to check.
If fis==null case doesn't report the error, then that's a bug we need to fix since we need to report errors that can occur before we have instances (e.g. the call from StartFInstances() to ReportExecStatusAux()).


PS6, Line 328: prepare_status
> The idea is to share the codepath below, which is waiting for the fragment 
hmm, okay. the flow of all this is pretty complicated but I suppose we'll be reworking this entirely pretty soon anyway.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 8:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/catalog/catalog-server.h
File be/src/catalog/catalog-server.h:

PS8, Line 39: class TGetAllCatalogObjectsResponse;
nit: Not your change, but no longer required.


http://gerrit.cloudera.org:8080/#/c/7730/7/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:

PS7, Line 173: scanner_threads_.AddThread(move(t));
> Here is my understanding:
Good point, I was looking at the base code for Close() instead of your updated version where you added SetDone(). Looks like you may have fixed this previously existing subtle race by adding that.


PS7, Line 240: SetDoneInternal();
We previously didn't call materialized_row_batches_->Shutdown() here, but now we will, is that acceptable?


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/rpc/auth-provider.h
File be/src/rpc/auth-provider.h:

PS8, Line 22: #include <boost/scoped_ptr.hpp>
No longer needed.


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/disk-io-mgr-handle-cache.h
File be/src/runtime/disk-io-mgr-handle-cache.h:

Line 28: #include "common/hdfs.h"
nit: Include what you use, status.h (it's already being pulled in through one of the other headers anyway)


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/exec-env.cc
File be/src/runtime/exec-env.cc:

PS8, Line 341: admission_controller_->Init()
The admission controller dequeue thread starts a little later than before now. I just want to confirm if that's okay.


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/service/child-query.h
File be/src/service/child-query.h:

PS8, Line 22: <boost/thread.hpp>
boost/thread/mutex.hpp


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

Line 30: 
nit: include what you use: status.h


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 12: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 1:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/7730/1//COMMIT_MSG
Commit Message:

Line 38: 3. The Thread::Create and ThreadPool::Create have
> Would unique_ptr do the job? I think you will need to support unique_ptr an
Changed to unique_ptr


http://gerrit.cloudera.org:8080/#/c/7730/2//COMMIT_MSG
Commit Message:

Line 18: 1. util/thread.h's Thread constructor is now private
> I think it's a matter of taste, but in some ways I prefer the constructor +
Yes, the constructor + Init() also makes subclassing easier. I switched ThreadPool over to this approach. I think it makes more sense for that. Thread seems like either would work, so I'm sticking with a static factory method for now.


Line 34: 1. QueryState::StartFInstances() needs appropriate
> One option is to defer the tricky cases til a follow-up patch and bring dow
I did that for query-state.cc, but I think that is probably the most common place for new threads to hit this exception, so it would be really nice to get it working.


http://gerrit.cloudera.org:8080/#/c/7730/1/be/src/exec/blocking-join-node.cc
File be/src/exec/blocking-join-node.cc:

Line 197:     RETURN_IF_ERROR(Thread::Create(FragmentInstanceState::FINST_THREAD_GROUP_NAME,
> Do we need to release the thread token on this error path?
Good point, this definitely needs to release the thread token. Fixed.


http://gerrit.cloudera.org:8080/#/c/7730/2/be/src/rpc/TAcceptQueueServer.cpp
File be/src/rpc/TAcceptQueueServer.cpp:

Line 226:     stop_ = true;
> Should we throw an exception here? What's the behavior when it continues ex
Setting stop_=true means we will not execute the loop and instead will do shutdown cleanup in the "if (stop_)" block and exit the function. Exiting the function stops the server.

At various points in this file, we catch exceptions, so I think we don't want an exception.


http://gerrit.cloudera.org:8080/#/c/7730/1/be/src/runtime/parallel-executor.cc
File be/src/runtime/parallel-executor.cc:

Line 39:     RETURN_IF_ERROR(Thread::Create("parallel-executor", ss.str(),
> If we created some threads I think we still need to call JoinAll() to avoid
Yes, that definitely needs to happen. Those threads reference stack variables, so it would be very dangerous to just leave. Changed to do the JoinAll().


http://gerrit.cloudera.org:8080/#/c/7730/1/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

Line 339:     // TODO: This error handling needs work.
> I think worst-case we could do a LOG(FATAL) for now and come back to fix th
Changed to LOG(FATAL) for the moment, but I am still looking into getting this to work.


http://gerrit.cloudera.org:8080/#/c/7730/1/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

PS1, Line 74: (void) U
> should use discard_result() once you rebase onto my change (GCC 7 does not 
Done


http://gerrit.cloudera.org:8080/#/c/7730/1/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

Line 404:   boost::scoped_ptr<ThreadPool<ScheduledSubscriberUpdate>> subscriber_topic_update_threadpool_;
> long lines
Done


http://gerrit.cloudera.org:8080/#/c/7730/2/be/src/statestore/statestored-main.cc
File be/src/statestore/statestored-main.cc:

Line 71:   ABORT_IF_ERROR(RegisterMemoryMetrics(metrics.get(), false, nullptr, nullptr));
> You also need to do:
Done


PS2, Line 72: StartMemoryMaintenanceThread();
> ABORT_IF_ERROR(StartMemoryMaintenanceThread());
Done


http://gerrit.cloudera.org:8080/#/c/7730/1/be/src/util/thread-pool.h
File be/src/util/thread-pool.h:

Line 123:   Status CreateThreads(const std::string& group, const std::string& thread_prefix,
> Should add WARN_UNUSED_RESULT to make sure it's not dropped anywhere.
Done


Line 185:   static Status Create(const std::string& group, const std::string& thread_prefix,
> Should add WARN_UNUSED_RESULT to make sure it's not dropped anywhere.
Done


http://gerrit.cloudera.org:8080/#/c/7730/2/be/src/util/thread-pool.h
File be/src/util/thread-pool.h:

Line 46:   ///  -- work_function: the function to run every time an item is consumed from the queue
> Update comment to include 'tpool' arg.
Changed approach on ThreadPool to make it follow the constructor + Init() pattern, so this change is gone.


PS2, Line 184: /// TODO: add comment
> Comment?
Removed


PS2, Line 188:     DCHECK(pool == nullptr);
             :     boost::scoped_ptr<CallableThreadPool> local_tpool;
             :     local_tpool.reset(new CallableThreadPool(queue_size, &CallableThreadPool::Worker));
             :     RETURN_IF_ERROR(local_tpool->CreateThreads(group, thread_prefix, num_threads));
             :     pool.swap(local_tpool);
             :     return Status::OK();
> Why not just call the parent's Create() here?
Switching to the constructor + Init() pattern eliminated the need for this code. Now CallableThreadPool is the same as before.


http://gerrit.cloudera.org:8080/#/c/7730/2/be/src/util/thread.cc
File be/src/util/thread.cc:

PS2, Line 319: local_thread
> Just call 't' for consistency with the above StartThread(). 'local_thread' 
This code has been removed.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS8, Line 332: true
> What's the consequence of using instances_started=false?  Did you verify th
instances_started only impacts the case when there are 3 repeated RPC failures to the coordinator. I will run a test to see what happens.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 6:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:

PS6, Line 159:     Status status;
             :     status =
nit: combine these lines


PS6, Line 163: num_scanner_threads_started_counter_
here and other scan node: I guess we undo this rather than just waiting until success to increment it so there's no window where we've started a scannerthread before the counter is incremented? is that important?


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS6, Line 199:     // If this fragment instance is not prepared, it must be reporting an error.
             :     DCHECK(fis->IsPrepared() || !status.ok());
> Is this necessarily true?
I don't really understand this DCHECK.  oh, because we pass 'fis' in the new case. why do we do that?  the error wasn't on the instance itself, it was in the creation of the instance state, so I think we should just pass nullptr for 'fis'.

the old dcheck might be racy still, as Sailesh points out.


PS6, Line 328: prepare_status
why is this called "prepare_status"?  the old 'prepare_status' was the status of fis->Prepare(), but this is not. i think we should have a different Status for this (and it need not be outside the loop scope).


PS6, Line 331:       // Remove fis from fis_map_ and fragment_map_
             :       fis_map_.erase(fis->instance_id());
             :       fis_list.pop_back();
why don't we just do this on the success path? is it to try to be consistent with undoing the refcnts, or some other reason?


PS6, Line 341: fis
I think this should be nullptr, since the error isn't on the instance itself.


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/runtime/thread-resource-mgr.h
File be/src/runtime/thread-resource-mgr.h:

PS6, Line 115: bool in_callback = false);
this seems complicated. why is it needed now?


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/service/child-query.h
File be/src/service/child-query.h:

Line 159:   Status ExecAsync(std::vector<ChildQuery>&& child_queries);
WARN_UNUSED_RESULT (or are we relying on the Status annotation now?)


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

PS6, Line 71:   Status status;
            :   status = request_state->WaitAsync();
nit: combine


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/util/thread.cc
File be/src/util/thread.cc:

Line 304:     if ((nanos % 100) == 1) {
on some platforms, the monotonic clock may not actually have nanosecond resolution, and so this may not work as expected. I think it would be safer to use a random number generator.


Line 327:   thread->swap(t);
*thread = move(t) seems clearer


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 9: Code-Review+1

LGTM after checking whether not calling StartServices() (and therefore not starting the admission controller's dequeue thread) from FeSupport causes a change in behavior.

Thanks for doing this patch and being patient with the review!

-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 12: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1206/

-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 2:

(10 comments)

I had some initial comments. Need to digest a bit more but I thought I'd push out the comments so you knew there would be some conflicts with my patch.

http://gerrit.cloudera.org:8080/#/c/7730/1//COMMIT_MSG
Commit Message:

Line 38: 3. The Thread::Create and ThreadPool::Create have
Would unique_ptr do the job? I think you will need to support unique_ptr anyway once you rebase since ThreadGroup::AddThread() now takes a unique_ptr.


http://gerrit.cloudera.org:8080/#/c/7730/2//COMMIT_MSG
Commit Message:

Line 18: 1. util/thread.h's Thread constructor is now private
I think it's a matter of taste, but in some ways I prefer the constructor + Init() approach because it gives the caller control over how the object is managed. The Create() approach seems to work out fine though.


Line 34: 1. QueryState::StartFInstances() needs appropriate
One option is to defer the tricky cases til a follow-up patch and bring down the process with LOG(FATAL) or similar, since that's still a strict improvement on the status quo.


http://gerrit.cloudera.org:8080/#/c/7730/1/be/src/exec/blocking-join-node.cc
File be/src/exec/blocking-join-node.cc:

Line 197:     RETURN_IF_ERROR(Thread::Create(FragmentInstanceState::FINST_THREAD_GROUP_NAME,
Do we need to release the thread token on this error path?


http://gerrit.cloudera.org:8080/#/c/7730/1/be/src/runtime/parallel-executor.cc
File be/src/runtime/parallel-executor.cc:

Line 39:     RETURN_IF_ERROR(Thread::Create("parallel-executor", ss.str(),
If we created some threads I think we still need to call JoinAll() to avoid leaving a bunch of orphaned threads.


http://gerrit.cloudera.org:8080/#/c/7730/1/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

Line 339:     // TODO: This error handling needs work.
I think worst-case we could do a LOG(FATAL) for now and come back to fix this in a follow-up patch. That would still be easier to diagnose than the current behaviour.


http://gerrit.cloudera.org:8080/#/c/7730/1/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

PS1, Line 74: (void) U
should use discard_result() once you rebase onto my change (GCC 7 does not consider (void) as actually discarding the status).


http://gerrit.cloudera.org:8080/#/c/7730/1/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

Line 404:   boost::scoped_ptr<ThreadPool<ScheduledSubscriberUpdate>> subscriber_topic_update_threadpool_;
long lines


http://gerrit.cloudera.org:8080/#/c/7730/1/be/src/util/thread-pool.h
File be/src/util/thread-pool.h:

Line 123:   Status CreateThreads(const std::string& group, const std::string& thread_prefix,
Should add WARN_UNUSED_RESULT to make sure it's not dropped anywhere.


Line 185:   static Status Create(const std::string& group, const std::string& thread_prefix,
Should add WARN_UNUSED_RESULT to make sure it's not dropped anywhere.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 8:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/catalog/catalog-server.h
File be/src/catalog/catalog-server.h:

PS8, Line 39: class TGetAllCatalogObjectsResponse;
> nit: Not your change, but no longer required.
Done


http://gerrit.cloudera.org:8080/#/c/7730/7/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:

PS7, Line 240: SetDoneInternal();
> We previously didn't call materialized_row_batches_->Shutdown() here, but n
This previously called Shutdown() when the last scanner thread is done.

I looked into this, and there doesn't seem to be any reason to wait. It is safe to call Shutdown() when the other scanner threads are still around. We do that for the HDFS version of this. We also do it if we reach a limit. I looked at the different methods on the queue, and I don't see any problem.


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/rpc/auth-provider.h
File be/src/rpc/auth-provider.h:

PS8, Line 22: #include <boost/scoped_ptr.hpp>
> No longer needed.
Done


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/rpc/thrift-thread.cc
File be/src/rpc/thrift-thread.cc:

PS8, Line 40: throw atc::SystemResourceException("Thread::Create() failed");
> is it worth incorporating the status msg into the exception message?
I think that makes sense. The status has the category and thread name, which could be useful to us.


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/disk-io-mgr-handle-cache.h
File be/src/runtime/disk-io-mgr-handle-cache.h:

Line 28: #include "common/hdfs.h"
> nit: Include what you use, status.h (it's already being pulled in through o
Done


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/exec-env.cc
File be/src/runtime/exec-env.cc:

PS8, Line 341: admission_controller_->Init()
> The admission controller dequeue thread starts a little later than before n
I think it is ok. ExecEnv::StartServices() is called from impalad-main.cc soon after constructing the ExecEnv and before starting the backend or beeswax or hs2 servers. There should be nothing to admit.

One thing I noticed is that StartServices() is not called from FeSupport. I think that should be ok, but I'm checking.


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/thread-resource-mgr.h
File be/src/runtime/thread-resource-mgr.h:

Line 288:   if (!in_callback) InvokeCallbacks();
> This explains why we need to skip the callbacks. But why is it safe to do s
You're right. This is only safe from a callback if a thread did TryAcquireThreadToken and then wants to return it without using it. This correct usage is equivalent to just passing on picking up a thread token. If the callback did a release of an unrelated thread, then that would be a problem, because we won't find a replacement for that thread. There is no real way to tell if something has called TryAcquireThreadToken. We don't give it an object or anything.

I can augment the comment to detail that it must be used in a very targeted way, or there could be separate functions.


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/service/child-query.h
File be/src/service/child-query.h:

PS8, Line 22: <boost/thread.hpp>
> boost/thread/mutex.hpp
Done


http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

Line 30: 
> nit: include what you use: status.h
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 6:

(2 comments)

Havent looked through all the code yet but had a high level question.

http://gerrit.cloudera.org:8080/#/c/7730/6//COMMIT_MSG
Commit Message:

Line 28: variables or direct declarations in classes.
any reason Thread and ThreadPool don't follow the same pattern? i.e. both use the private-constructor-static-create pattern, or both use the public-constructor-but-init-required pattern?

I think either patterns are fine, but if there's not a good reason for different patterns, maybe pick one and use it everywhere?


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/util/thread.h
File be/src/util/thread.h:

PS6, Line 158: thread
nit: 'thread'


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS8, Line 332: true
> Good point, changed this to instances_started=false. If the RPC to the coor
What's the consequence of using instances_started=false?  Did you verify that the coordinator will call back to us and trigger the cancellation of those other FIS eventually? (I think it should).

The alternative would be to defer the call to ReportExecStatusAux() until the very end of StartFInstances(), so that we can safely cancel all the other local instances.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 7:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7730/7/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

PS7, Line 371: COUNTER_ADD(num_scanner_threads_started_counter_, 1);
> The problem with moving this here is when the node is under heavy stress, t
Yes, this could be off. I suppose there is no way around it.


http://gerrit.cloudera.org:8080/#/c/7730/7/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:

PS7, Line 171: ++num_active_scanners_;
> This can cause quite a few races. It can race with L220, L244 and L245.
num_active_scanners_ is protected by lock_. We get lock_ at the top of this loop and the other locations get lock_.


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/runtime/query-exec-mgr.cc
File be/src/runtime/query-exec-mgr.cc:

PS6, Line 59:   status = Thread::Create("query-exec-mgr",
            :       Substitute("start-query-finstances-$0", PrintId(query_id)),
            :           &QueryExecMgr::StartQueryHelper, this, qs, &t, true);
> Done
In testing, I found that this also needs qs->ReleaseInitialReservationRefcount(), which was taken in Init(). Added.


http://gerrit.cloudera.org:8080/#/c/7730/7/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS7, Line 335:     // Fragment instance successfully started
             :     // update fis_map_
             :     fis_map_.emplace(fis->instance_id(), fis);
             :     // update fragment_map_
             :     vector<FragmentInstanceState*>& fis_list = fragment_map_[instance_ctx.fragment_idx];
             :     fis_list.push_back(fis);
> Is it safe to update the map with the Fragment instance state after already
My thought process is something like this:
fis is a local variable. Both fis_map_ and fragment_map_ are private variables on QueryState. None of these variables are referenced outside this QueryState except through APIs protected by the promise. It looks like the fragment instance doesn't care about using the functions that would access these variables.

Worth considering if there is any way around it, but I think it is safe.


http://gerrit.cloudera.org:8080/#/c/7730/7/be/src/util/thread.cc
File be/src/util/thread.cc:

PS7, Line 303: rand()
> Not to be pedantic, but a rand() without seeding the PRNG first, will cause
I was thinking it might make sense for us to do a single srand() at startup. I will look into the appropriate place to do this.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.

Query execution code paths will now handle the error:
1. If the scan node fails to spawn any scanner thread,
it will abort the query.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.

Testing:
This introduces the parameter thread_creation_fault_injection,
which will cause Thread::Create() calls in eligible
locations to fail randomly roughly 1% of the time.
Quite a few locations of Thread::Create() and
ThreadPool::Init() are necessary for startup and cannot
be eligible. However, all the locations used for query
execution are marked as eligible and governed by this
parameter. The code was tested by setting this parameter
to true and running queries to verify that queries either
run to completion with the correct result or fail with
appropriate status.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Reviewed-on: http://gerrit.cloudera.org:8080/7730
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/catalog/catalogd-main.cc
M be/src/common/global-flags.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scan-node.h
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/auth-provider.h
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M be/src/rpc/thrift-thread.cc
M be/src/rpc/thrift-thread.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr-handle-cache.h
M be/src/runtime/disk-io-mgr-handle-cache.inline.h
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/thread-resource-mgr.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/impalad-main.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/testutil/in-process-servers.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
M common/thrift/generate_error_codes.py
55 files changed, 455 insertions(+), 215 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Alex Behm: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Sailesh Mukil, Dan Hecht, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7730

to look at the new patch set (#12).

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................

IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.

Query execution code paths will now handle the error:
1. If the scan node fails to spawn any scanner thread,
it will abort the query.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.

Testing:
This introduces the parameter thread_creation_fault_injection,
which will cause Thread::Create() calls in eligible
locations to fail randomly roughly 1% of the time.
Quite a few locations of Thread::Create() and
ThreadPool::Init() are necessary for startup and cannot
be eligible. However, all the locations used for query
execution are marked as eligible and governed by this
parameter. The code was tested by setting this parameter
to true and running queries to verify that queries either
run to completion with the correct result or fail with
appropriate status.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/catalog/catalogd-main.cc
M be/src/common/global-flags.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scan-node.h
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/auth-provider.h
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M be/src/rpc/thrift-thread.cc
M be/src/rpc/thrift-thread.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr-handle-cache.h
M be/src/runtime/disk-io-mgr-handle-cache.inline.h
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/thread-resource-mgr.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/impalad-main.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/testutil/in-process-servers.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
M common/thrift/generate_error_codes.py
55 files changed, 455 insertions(+), 215 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/7730/12
-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 7:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7730/7/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

PS7, Line 371: COUNTER_ADD(num_scanner_threads_started_counter_, 1);
The problem with moving this here is when the node is under heavy stress, there may be a window where the scanner thread starts running, but this thread never gets scheduled for a while causing the counter to be slightly misleading.

Not that it's a huge problem, but I just wanted to point it out.


http://gerrit.cloudera.org:8080/#/c/7730/7/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:

PS7, Line 171: ++num_active_scanners_;
This can cause quite a few races. It can race with L220, L244 and L245.


http://gerrit.cloudera.org:8080/#/c/7730/7/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS7, Line 335:     // Fragment instance successfully started
             :     // update fis_map_
             :     fis_map_.emplace(fis->instance_id(), fis);
             :     // update fragment_map_
             :     vector<FragmentInstanceState*>& fis_list = fragment_map_[instance_ctx.fragment_idx];
             :     fis_list.push_back(fis);
Is it safe to update the map with the Fragment instance state after already starting the fragment instance?

I tried going through some scenarios, and they all checked out fine since critical RPCs like Cancel() are protected by 'instances_prepared_promise_', but I'm not sure if I'm missing some other failure case.


http://gerrit.cloudera.org:8080/#/c/7730/7/be/src/util/thread.cc
File be/src/util/thread.cc:

PS7, Line 303: rand()
Not to be pedantic, but a rand() without seeding the PRNG first, will cause the same series of numbers to be generated on a particular node for multiple different runs. Causing the failure injections to be fairly deterministic if we're running the same queries over these test runs.

Some thing closer to actual pseudo-randomness would require something like the following:
https://github.com/apache/incubator-impala/blob/master/be/src/rpc/authentication.cc#L493

But getting a new random device every time could be expensive.

Do you know of a better but cheap way to do this?


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS8, Line 332: true
> Ah right. So I think the assumption is that the coordinator may not be able
My test shows that the fragments that are not on the coordinator node will exit. (This test is the equivalent of  instances_started=false with complete RPC failure.) However, if the message doesn't get to the coordinator, then the coordinator fragments will hang (seemingly indefinitely). I'm using a simple query, so it could be that this is not representative.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7730

to look at the new patch set (#8).

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................

IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.

Query execution code paths will now handle the error:
1. If the scan node fails to spawn any scanner thread,
it will abort the query.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.

Testing:
This introduces the parameter thread_creation_fault_injection,
which will cause Thread::Create() calls in eligible
locations to fail randomly roughly 1% of the time.
Quite a few locations of Thread::Create() and
ThreadPool::Init() are necessary for startup and cannot
be eligible. However, all the locations used for query
execution are marked as eligible and governed by this
parameter. The code was tested by setting this parameter
to true and running queries to verify that queries either
run to completion with the correct result or fail with
appropriate status.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/catalog/catalogd-main.cc
M be/src/common/global-flags.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scan-node.h
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/auth-provider.h
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M be/src/rpc/thrift-thread.cc
M be/src/rpc/thrift-thread.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr-handle-cache.h
M be/src/runtime/disk-io-mgr-handle-cache.inline.h
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/thread-resource-mgr.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/impalad-main.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/testutil/in-process-servers.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
M common/thrift/generate_error_codes.py
55 files changed, 465 insertions(+), 235 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/7730/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 10:

(1 comment)

Also rebased.

http://gerrit.cloudera.org:8080/#/c/7730/8/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS8, Line 332: ery 
> My test shows that the fragments that are not on the coordinator node will 
Moved the call to ReportExecStatusAux to after all fragment instances are prepared and verified that this works. The code to do this is ugly (because it relies on calling ReportExecStatusAux using the failed fis), and it will require IMPALA-5892 to become cleaner (where we can pass fis=nullptr to ReportExecStatusAux).


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Sailesh Mukil, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7730

to look at the new patch set (#10).

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................

IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.

Query execution code paths will now handle the error:
1. If the scan node fails to spawn any scanner thread,
it will abort the query.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.

Testing:
This introduces the parameter thread_creation_fault_injection,
which will cause Thread::Create() calls in eligible
locations to fail randomly roughly 1% of the time.
Quite a few locations of Thread::Create() and
ThreadPool::Init() are necessary for startup and cannot
be eligible. However, all the locations used for query
execution are marked as eligible and governed by this
parameter. The code was tested by setting this parameter
to true and running queries to verify that queries either
run to completion with the correct result or fail with
appropriate status.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/catalog/catalogd-main.cc
M be/src/common/global-flags.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scan-node.h
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/auth-provider.h
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M be/src/rpc/thrift-thread.cc
M be/src/rpc/thrift-thread.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr-handle-cache.h
M be/src/runtime/disk-io-mgr-handle-cache.inline.h
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/thread-resource-mgr.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/impalad-main.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/testutil/in-process-servers.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
M common/thrift/generate_error_codes.py
55 files changed, 480 insertions(+), 238 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/7730/10
-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

PS6, Line 363: pool->ReleaseThreadToken(false, true);
Do you need to drop the lock before calling this? Based on the comment from L405.

Or is that why you added the 'in_callback' parameter?


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/runtime/query-exec-mgr.cc
File be/src/runtime/query-exec-mgr.cc:

PS6, Line 59:   RETURN_IF_ERROR(Thread::Create("query-exec-mgr",
            :       Substitute("start-query-finstances-$0", PrintId(query_id)),
            :           &QueryExecMgr::StartQueryHelper, this, qs, &t, true));
Shouldn't we call ReleaseQueryState(qs); if the thread failed to start here?


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

PS6, Line 199:     // If this fragment instance is not prepared, it must be reporting an error.
             :     DCHECK(fis->IsPrepared() || !status.ok());
Is this necessarily true?

We start the report thread in FragmentInstanceState::Prepare(), and we set the prepared promise after that. What if the report thread asynchronously attempts to send a status report before we set the prepared promise?


http://gerrit.cloudera.org:8080/#/c/7730/6/be/src/service/client-request-state.cc
File be/src/service/client-request-state.cc:

PS6, Line 598: Thread::Create("query-exec-state", "wait-thread",
             :       &ClientRequestState::Wait, this, &wait_thread_);
Why not inject the thread failures here as well? I think it's an important test site since the thread failure will be injected after the query starts executing.

i.e. WaitAsync() fails after a call to Execute().


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7730/3//COMMIT_MSG
Commit Message:

Line 28: variables or direct declarations in classes.
> I have done hand testing on a variety of codepaths. I'm looking into automa
Yeah, it does seem like there's an important distinction there.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

Line 361:       COUNTER_ADD(num_scanner_threads_started_counter_, -1);
> I'm ok with either way. My thought is that in cases where the query doesn't
Oh I see. Seems ok. I'm a little skeptical that it's worth trying to recover from this instead of just failing the query.


http://gerrit.cloudera.org:8080/#/c/7730/4/be/src/exec/kudu-scan-node.cc
File be/src/exec/kudu-scan-node.cc:

Line 164:     auto fn = [this, token, name]() { this->RunScannerThread(name, token); };
In this case it does seem simpler to fail the query, since it we mess this up we might drop a scan range and give incorrect results.

Your change looks correct but it seems like it would be easy for someone to accidentally break it.


http://gerrit.cloudera.org:8080/#/c/7730/4/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

Line 330:     initial_reservation_refcnt_.Add(1);  // decremented in ExecFInstance()
Still relevant?


Line 350:       // Undo refcnts
I think we should preserve the postcondition that all fragment instances have finished Prepare() when this function returns, just to reduce the possible divergence in behaviour from the normal path.


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/util/thread-pool.h
File be/src/util/thread-pool.h:

Line 61:   /// error spawning the threads.
> When we get an error, only successfully started threads are in the threads_
Yeah I think that would make most sense to me - reduces the number of different things that can happen. I guess it would also be good to reduce the number of threads running sooner rather than later (although it does seem like tearing things down in the destructor would mostly be ok).


http://gerrit.cloudera.org:8080/#/c/7730/3/be/src/util/thread.cc
File be/src/util/thread.cc:

Line 301:     return Status(e.what());
> Added this. I don't think the category or name would be useful to the custo
Sounds like a good idea to me.


-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5750: Catch exceptions from boost thread creation

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has uploaded a new patch set (#5).

Change subject: IMPALA-5750: Catch exceptions from boost thread creation
......................................................................

IMPALA-5750: Catch exceptions from boost thread creation

The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.

Query execution code paths will now handle the error:
1. If the scan node fails to spawn any scanner thread,
it will abort the query.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.

Testing:
This introduces the parameter thread_creation_fault_injection,
which will cause Thread::Create() calls in eligible
locations to fail randomly roughly 1% of the time.
Quite a few locations of Thread::Create() and
ThreadPool::Init() are necessary for startup and cannot
be eligible. However, all the locations used for query
execution are marked as eligible and governed by this
parameter. The code was tested by setting this parameter
to true and running queries to verify that queries either
run to completion with the correct result or fail with
appropriate status.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
---
M be/src/benchmarks/thread-create-benchmark.cc
M be/src/catalog/catalog-server.cc
M be/src/catalog/catalog-server.h
M be/src/catalog/catalogd-main.cc
M be/src/common/init.cc
M be/src/common/init.h
M be/src/exec/blocking-join-node.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scan-node.h
M be/src/exec/kudu-scan-node.cc
M be/src/exec/kudu-scan-node.h
M be/src/rpc/TAcceptQueueServer.cpp
M be/src/rpc/auth-provider.h
M be/src/rpc/authentication.cc
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M be/src/rpc/thrift-thread.cc
M be/src/rpc/thrift-thread.h
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/disk-io-mgr-handle-cache.h
M be/src/runtime/disk-io-mgr-handle-cache.inline.h
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/exec-env.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/parallel-executor.cc
M be/src/runtime/query-exec-mgr.cc
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/thread-resource-mgr.h
M be/src/scheduling/admission-controller.cc
M be/src/scheduling/admission-controller.h
M be/src/service/child-query.cc
M be/src/service/child-query.h
M be/src/service/client-request-state.cc
M be/src/service/client-request-state.h
M be/src/service/impala-beeswax-server.cc
M be/src/service/impala-hs2-server.cc
M be/src/service/impala-server.cc
M be/src/service/impala-server.h
M be/src/service/impalad-main.cc
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M be/src/statestore/statestore.h
M be/src/statestore/statestored-main.cc
M be/src/testutil/in-process-servers.cc
M be/src/testutil/in-process-servers.h
M be/src/util/thread-pool-test.cc
M be/src/util/thread-pool.h
M be/src/util/thread.cc
M be/src/util/thread.h
M common/thrift/generate_error_codes.py
54 files changed, 454 insertions(+), 227 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/7730/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7730
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>