Posted to issues@impala.apache.org by "Michael Ho (JIRA)" <ji...@apache.org> on 2017/12/08 08:08:01 UTC

[jira] [Resolved] (IMPALA-6285) Avoid printing the stack as part of DoTransmitDataRpc as it leads to burning lots of kernel CPU

     [ https://issues.apache.org/jira/browse/IMPALA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Ho resolved IMPALA-6285.
--------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.11.0

https://github.com/apache/impala/commit/d60eb192a959afd5e1a7062b360ade2ef8a8f4f4

IMPALA-6285: Don't print stack trace on RPC errors.
There is not much benefit in printing the stack trace when a
Thrift RPC hits an error. Printing enough information about the
error and identifying the caller should be sufficient. In fact,
stack crawls have been observed to cause unnecessary CPU spikes
in the past. This change replaces Status() with Status::Expected()
in DoRpc(), RetryRpc(), RetryRpcRecv() and
Coordinator::BackendState::Exec() to avoid unnecessary stack crawls.
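
The difference between the two construction paths can be illustrated with a
minimal sketch. This is not Impala's actual Status class: the class body, the
CaptureStackTrace() helper, the g_stack_crawls counter and the two DoRpc*
wrappers are all hypothetical stand-ins, and the sketch assumes that the
expected-error path skips logging altogether. It only shows why switching to
Status::Expected() removes the stack crawl from a frequently hit error path.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical counter: how many times the "expensive" stack crawl ran.
int g_stack_crawls = 0;

// Hypothetical stand-in for a stack walker. A real one symbolizes every
// frame, which is where the CPU cost comes from.
std::string CaptureStackTrace() {
  ++g_stack_crawls;
  return "    @ <frames elided in this sketch>";
}

// Simplified stand-in for a status class (assumption, not Impala's code).
class Status {
 public:
  // General error path: logs the message together with a stack trace.
  explicit Status(std::string msg) : msg_(std::move(msg)) {
    std::clog << msg_ << "\n" << CaptureStackTrace() << "\n";
  }

  // Expected error path: stores the message without logging or crawling.
  static Status Expected(std::string msg) {
    return Status(std::move(msg), NoLogTag{});
  }

  const std::string& msg() const { return msg_; }

 private:
  struct NoLogTag {};
  Status(std::string msg, NoLogTag) : msg_(std::move(msg)) {}

  std::string msg_;
};

// Builds the error text; it still identifies the peer that timed out.
std::string TimeoutMessage(const std::string& peer) {
  assert(!peer.empty());
  return "RPC recv timed out: Client " + peer + " timed-out during recv call.";
}

// Hypothetical RPC wrapper before the change: every failure crawls the stack.
Status DoRpcWithTrace(const std::string& peer) {
  return Status(TimeoutMessage(peer));
}

// Hypothetical RPC wrapper after the change: same message, no stack crawl.
Status DoRpcExpected(const std::string& peer) {
  return Status::Expected(TimeoutMessage(peer));
}
```

In this sketch both wrappers return the same error string, so the caller
still learns which peer timed out; only DoRpcWithTrace() increments
g_stack_crawls.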

Testing done: private core build. Verified error strings with
test_rpc_timeout.py and test_rpc_exception.py.

Change-Id: Ia83294494442ef21f7934f92ba9112e80d81fa58
Reviewed-on: http://gerrit.cloudera.org:8080/8788
Reviewed-by: Michael Ho <kw...@cloudera.com>
Tested-by: Impala Public Jenkins

> Avoid printing the stack as part of DoTransmitDataRpc as it leads to burning lots of kernel CPU
> -----------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-6285
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6285
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 2.11.0
>            Reporter: David Rorke
>            Assignee: Michael Ho
>            Priority: Blocker
>              Labels: cloud
>             Fix For: Impala 2.11.0
>
>
> When running 32 concurrent TPC-DS queries against 20 r4.8xlarge instances, some of the RPCs time out but don't fail the query:
> {code}
> I1206 12:44:14.925405 25274 status.cc:58] RPC recv timed out: Client foo-17.domain.com:22000 timed-out during recv call.
>     @           0x957a6a  impala::Status::Status()
>     @          0x11dd5fe  impala::DataStreamSender::Channel::DoTransmitDataRpc()
>     @          0x11ddcd4  impala::DataStreamSender::Channel::TransmitDataHelper()
>     @          0x11de080  impala::DataStreamSender::Channel::TransmitData()
>     @          0x11e1004  impala::ThreadPool<>::WorkerThread()
>     @           0xd10063  impala::Thread::SuperviseThread()
>     @           0xd107a4  boost::detail::thread_data<>::run()
>     @          0x128997a  (unknown)
>     @     0x7f68c5bc7e25  start_thread
>     @     0x7f68c58f534d  __clone
> {code}
> {code}
> I1206 12:44:15.152775 25296 status.cc:58] RPC recv timed out: Client foo-5.domain.com:22000 timed-out during recv call.
>     @           0x957a6a  impala::Status::Status()
>     @          0x11dd5fe  impala::DataStreamSender::Channel::DoTransmitDataRpc()
>     @          0x11ddcd4  impala::DataStreamSender::Channel::TransmitDataHelper()
>     @          0x11de080  impala::DataStreamSender::Channel::TransmitData()
>     @          0x11e1004  impala::ThreadPool<>::WorkerThread()
>     @           0xd10063  impala::Thread::SuperviseThread()
>     @           0xd107a4  boost::detail::thread_data<>::run()
>     @          0x128997a  (unknown)
>     @     0x7f68c5bc7e25  start_thread
>     @     0x7f68c58f534d  __clone
> {code}
> The status can be changed to Status::Expected(), but it is worth verifying that this timeout can be tolerated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)