Posted to issues@kudu.apache.org by "Will Berkeley (JIRA)" <ji...@apache.org> on 2019/03/01 01:03:00 UTC

[jira] [Comment Edited] (KUDU-1865) Create fast path for RespondSuccess() in KRPC

    [ https://issues.apache.org/jira/browse/KUDU-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781146#comment-16781146 ] 

Will Berkeley edited comment on KUDU-1865 at 3/1/19 1:02 AM:
-------------------------------------------------------------

Just wanted to bump this JIRA to indicate that this is still an issue, especially (unsurprisingly) in workloads that send many small RPCs. I see it holding up many threads executing transactions during TabletServerService queue overflows in some workloads I am running.


was (Author: wdberkeley):
Just wanted to bump this JIRA to indicate that this is still an issue, especially (unsurprisingly) in workloads that send many small RPCs. I see it being the main contributor to many TabletServerService queue overflows in some workloads I am running.

> Create fast path for RespondSuccess() in KRPC
> ---------------------------------------------
>
>                 Key: KUDU-1865
>                 URL: https://issues.apache.org/jira/browse/KUDU-1865
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>            Reporter: Sailesh Mukil
>            Priority: Major
>              Labels: performance, rpc
>         Attachments: alloc-pattern.py, cross-thread.txt
>
>
> A lot of RPCs just respond with RespondSuccess(), which returns the exact same payload every time. This takes the same path as any other response, ultimately calling Connection::QueueResponseForCall(), which makes a few small allocations. These small allocations (and their corresponding deallocations) happen quite frequently (once for every IncomingCall) and end up taking quite some time in the kernel (traversing the free list, spin locks, etc.).
> This was found when [~mmokhtar] ran some profiles of Impala over KRPC on a 20-node cluster and observed the following:
> The exact % of time spent is hard to quantify from the profiles, but these were among the top 5 slowest stacks:
> {code:java}
> impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
> impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
> impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
> impalad ! operator delete + 0x329 - [unknown source file]
> impalad ! __gnu_cxx::new_allocator<kudu::Slice>::deallocate + 0x4 - new_allocator.h:110
> impalad ! std::_Vector_base<kudu::Slice, std::allocator<kudu::Slice>>::_M_deallocate + 0x5 - stl_vector.h:178
> impalad ! ~_Vector_base + 0x4 - stl_vector.h:160
> impalad ! ~vector - stl_vector.h:425                           <----Deleting 'slices' vector
> impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac - connection.cc:433
> impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
> impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
> ..
> {code}
> {code:java}
> impalad ! tcmalloc::CentralFreeList::FetchFromOneSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::RemoveRange + 0xc0 - [unknown source file]
> impalad ! tcmalloc::ThreadCache::FetchFromCentralCache + 0x62 - [unknown source file]
> impalad ! operator new + 0x297 - [unknown source file]        <--- Creating new 'OutboundTransferTask' object.
> impalad ! kudu::rpc::Connection::QueueResponseForCall + 0x76 - connection.cc:432
> impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
> impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
> ...
> {code}
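> Roughly, the first two stacks correspond to the per-response pattern sketched below (stand-in types and simplified logic, not Kudu's actual code): a temporary vector of slices plus a heap-allocated transfer task for every single response.
> {code:c++}
> // Simplified stand-ins; the real types are kudu::Slice and
> // kudu::rpc::OutboundTransferTask.
> #include <memory>
> #include <vector>
>
> struct Slice {};
> struct OutboundTransferTask {
>   std::vector<Slice> payload;   // slices to write to the socket
> };
>
> // Rough shape of the current slow path: every response builds a fresh
> // slice vector and a fresh heap-allocated transfer task, and both are
> // torn down again once the write completes -- even when the payload is
> // just a canned "success".
> void QueueResponseForCall() {
>   std::vector<Slice> slices;                                 // small heap allocation
>   slices.push_back(Slice{});
>   auto transfer = std::make_unique<OutboundTransferTask>();  // another small allocation
>   transfer->payload = slices;
>   // In the real code, ownership of the transfer task passes to the reactor
>   // thread, which deletes it after the socket write completes.
> }  // destroying 'slices' here is the ~vector visible in the first stack
> {code}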
> Even creating and deleting the 'RpcContext' takes a lot of time:
> {code:java}
> impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
> impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
> impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
> impalad ! operator delete + 0x329 - [unknown source file]
> impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x16 - impala_internal_service.pb.cc:1221
> impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x8 - impala_internal_service.pb.cc:1222
> impalad ! kudu::DefaultDeleter<google::protobuf::Message>::operator() + 0x5 - gscoped_ptr.h:145
> impalad ! ~gscoped_ptr_impl + 0x9 - gscoped_ptr.h:228
> impalad ! ~gscoped_ptr - gscoped_ptr.h:318
> impalad ! kudu::rpc::RpcContext::~RpcContext + 0x1e - rpc_context.cc:53   <-----
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1ff - rpc_context.cc:67
> {code}
> The stacks above show that allocating and freeing these small objects under moderately heavy load results in heavy contention in the kernel. We would benefit a lot from a fast path for 'RespondSuccess'.
> My suggestion is to allocate all of these small objects up front, in a 'RespondSuccess' structure created together with the 'InboundCall' object, and to use that structure whenever we want to send 'success' back to the sender. It would already contain the 'OutboundTransferTask', a 'Slice' holding 'success', etc. We expect that most RPCs respond with 'success' a majority of the time.
> The benefit is that we no longer go back and forth to allocate and deallocate memory for these small objects on every response; instead, we do it all at once while creating the 'InboundCall' object.
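> A minimal sketch of that idea, again with stand-in types and hypothetical member names (none of this is existing Kudu code): the 'InboundCall' owns a pre-built success response, so the hot path performs no heap allocation at all.
> {code:c++}
> #include <vector>
>
> struct Slice {};                  // stand-in for kudu::Slice
> struct OutboundTransferTask {     // stand-in for kudu::rpc::OutboundTransferTask
>   std::vector<Slice> payload;
> };
>
> // Hypothetical structure: everything needed to answer "success" is
> // allocated once, up front, when the InboundCall is created.
> struct PreallocatedSuccessResponse {
>   OutboundTransferTask transfer;  // reused instead of 'new'-ed per response
>   Slice success_payload;          // canned "success" bytes
> };
>
> class InboundCall {               // heavily simplified
>  public:
>   InboundCall() {
>     // One small allocation burst at call-construction time; the
>     // RespondSuccess() fast path below never touches the heap.
>     success_.transfer.payload.reserve(1);
>   }
>
>   void RespondSuccess() {
>     // Fast path: reuse the pre-built transfer task and canned payload.
>     success_.transfer.payload.clear();
>     success_.transfer.payload.push_back(success_.success_payload);
>     // ... hand success_.transfer to the Connection's outbound queue ...
>   }
>
>  private:
>   PreallocatedSuccessResponse success_;
> };
> {code}
> The obvious trade-off is a slightly larger 'InboundCall' even for calls that end up failing, which seems acceptable if successes dominate.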
> I just wanted to start a discussion about this, so even if what I suggested seems a little off, hopefully we can move forward with this on some level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)