Posted to issues@kudu.apache.org by "Michael Ho (JIRA)" <ji...@apache.org> on 2017/12/19 22:05:00 UTC

[jira] [Comment Edited] (KUDU-1865) Create fast path for RespondSuccess() in KRPC

    [ https://issues.apache.org/jira/browse/KUDU-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294467#comment-16294467 ] 

Michael Ho edited comment on KUDU-1865 at 12/19/17 10:04 PM:
-------------------------------------------------------------

While working on IMPALA-5528, I noticed that some of the overhead has to do with the way Protobuf handles fields which aren't primitive types. Normally, we'd use the {{mutable_<field_name>()}} interface to get a handle to a field in a protobuf message in order to set it to some value. That implicitly creates an object on the heap. In the example above, {{TransmitDataResponsePb}} will always allocate a {{StatusPB}} object in the fast path just to serialize the value {{Status::OK()}} into. This puts extra pressure on the thread caches in TCMalloc.
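
As a rough illustration of where that hidden allocation happens (the handler shape and helper below are assumptions, not the actual Impala code):

{code:cpp}
// Sketch only: type names follow the stacks below, everything else is assumed.
// The point is the implicit heap allocation inside mutable_status().
void TransmitData(const TransmitDataRequestPb* req,
                  TransmitDataResponsePb* resp,     // heap-allocated, owned by RpcContext
                  kudu::rpc::RpcContext* context) {
  // mutable_status() news a StatusPB on the heap the first time it is called,
  // even though all we want to record here is Status::OK().
  StatusPB* status = resp->mutable_status();
  status->Clear();                                  // stands in for serializing Status::OK()
  context->RespondSuccess();                        // later deletes resp, and the StatusPB with it
}
{code}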

Part of this is due to a limitation of the {{RpcContext}} interface. In particular, {{RpcContext}} always allocates and owns the response protobuf buffer. Upon calling {{RpcContext::RespondSuccess()}}, the {{RpcContext}} object is deleted and so is the response protobuf buffer, which makes it impossible to stack-allocate the {{StatusPB}}: protobuf internally assumes that it owns the {{StatusPB}} buffer.
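
Concretely, the obvious workaround runs into protobuf's ownership rules (sketch below; the handler shape and the {{status}} field name are assumptions):

{code:cpp}
// Why simply stack-allocating the StatusPB is unsafe today (sketch):
// set_allocated_*() transfers ownership of the submessage to the response, so
// when RpcContext deletes the response, the response's destructor would call
// delete on a stack address.
void TransmitData(const TransmitDataRequestPb* req,
                  TransmitDataResponsePb* resp,
                  kudu::rpc::RpcContext* context) {
  StatusPB ok_status;                        // stack-allocated
  resp->set_allocated_status(&ok_status);    // resp now believes it owns ok_status
  context->RespondSuccess();                 // deletes resp -> deletes &ok_status -> undefined behavior
}
{code}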

I noticed that KRPC in general induces quite a lot of pressure on the thread cache.
I tried the following changes and they seem to reduce the pressure on the TCMalloc thread cache:
- Introduce a new interface {{RpcContext::RespondSuccess(const MessageLite& message)}} to allow callers to stack-allocate the response protobuf buffer (see the sketch after this list)
- Introduce a virtual function {{ServiceIf::AllocResponsePB()}} to allow each KRPC-based service to override how the response protobuf buffer is allocated. This complements point 1: services which only use the new interface to stack-allocate the response don't need to allocate a response protobuf buffer on the heap.
- Recycle {{InboundCall}} objects with an object pool and embed {{RpcContext}} into {{InboundCall}}
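
A minimal sketch of what the first two points could look like (the signatures are suggestions, not existing KRPC code; the exact parameters of {{AllocResponsePB()}} would need more thought):

{code:cpp}
// Proposed additions, sketched. Only the zero-argument RespondSuccess() exists
// today; everything else here is a suggestion.
class RpcContext {
 public:
  void RespondSuccess();   // existing: serializes and frees the heap-allocated response

  // Proposed: serialize a caller-provided (possibly stack-allocated) response
  // directly, so no response protobuf has to live on the heap.
  void RespondSuccess(const google::protobuf::MessageLite& response);
};

class ServiceIf {
 public:
  // Proposed: let each service control how (or whether) the heap response is
  // allocated; a service that always uses the new overload could skip it.
  virtual google::protobuf::Message* AllocResponsePB();
};

// Hypothetical handler using the new overload:
void TransmitData(const TransmitDataRequestPb* req, kudu::rpc::RpcContext* context) {
  TransmitDataResponsePb resp;        // lives on the stack for the fast path
  context->RespondSuccess(resp);      // no heap-allocated response buffer for this call
}
{code}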

Other things which may help:
- Allow {{Trace}} objects to be recycled. This seems tricky with the way {{ObjectPool}} is currently implemented (a rough sketch of the recycling idea follows below).
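
For the recycling points, a hedged sketch of the general shape of an object pool (deliberately generic, not Kudu's {{ObjectPool}} API; {{Reset()}} is a hypothetical hook that would have to clear all per-call state, including the embedded {{RpcContext}} and, ideally, the {{Trace}}):

{code:cpp}
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Generic freelist sketch of the recycling idea. T stands in for InboundCall
// (with RpcContext embedded in it); T is assumed to be default-constructible
// and to have a Reset() that clears all per-call state.
template <typename T>
class RecyclingPool {
 public:
  std::unique_ptr<T> Take() {
    std::lock_guard<std::mutex> l(lock_);
    if (free_list_.empty()) return std::unique_ptr<T>(new T());
    std::unique_ptr<T> obj = std::move(free_list_.back());
    free_list_.pop_back();
    return obj;
  }

  void Return(std::unique_ptr<T> obj) {
    obj->Reset();   // hypothetical hook: drop request/response buffers, trace, etc.
    std::lock_guard<std::mutex> l(lock_);
    free_list_.push_back(std::move(obj));
  }

 private:
  std::mutex lock_;
  std::vector<std::unique_ptr<T>> free_list_;
};
{code}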

[~tlipcon], what do you think?



> Create fast path for RespondSuccess() in KRPC
> ---------------------------------------------
>
>                 Key: KUDU-1865
>                 URL: https://issues.apache.org/jira/browse/KUDU-1865
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>            Reporter: Sailesh Mukil
>              Labels: perfomance, rpc
>         Attachments: alloc-pattern.py, cross-thread.txt
>
>
> A lot of RPCs just respond with RespondSuccess(), which returns the exact same payload every time. This takes the same path as any other response by ultimately calling Connection::QueueResponseForCall(), which makes a few small allocations. These small allocations (and their corresponding deallocations) happen quite frequently (once for every IncomingCall) and end up taking quite some time in the kernel (traversing the free list, spin locks, etc.)
> This was found when [~mmokhtar] ran some profiles of Impala over KRPC on a 20-node cluster and found the following:
> The exact % of time spent is hard to quantify from the profiles, but these were among the top 5 slowest stacks:
> {code:java}
> impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
> impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
> impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
> impalad ! operator delete + 0x329 - [unknown source file]
> impalad ! __gnu_cxx::new_allocator<kudu::Slice>::deallocate + 0x4 - new_allocator.h:110
> impalad ! std::_Vector_base<kudu::Slice, std::allocator<kudu::Slice>>::_M_deallocate + 0x5 - stl_vector.h:178
> impalad ! ~_Vector_base + 0x4 - stl_vector.h:160
> impalad ! ~vector - stl_vector.h:425                           <----Deleting 'slices' vector
> impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac - connection.cc:433
> impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
> impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
> ..
> {code}
> {code:java}
> impalad ! tcmalloc::CentralFreeList::FetchFromOneSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::RemoveRange + 0xc0 - [unknown source file]
> impalad ! tcmalloc::ThreadCache::FetchFromCentralCache + 0x62 - [unknown source file]
> impalad ! operator new + 0x297 - [unknown source file]        <--- Creating new 'OutboundTransferTask' object.
> impalad ! kudu::rpc::Connection::QueueResponseForCall + 0x76 - connection.cc:432
> impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
> impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
> ...
> {code}
> Even creating and deleting the 'RpcContext' takes a lot of time:
> {code:java}
> impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
> impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
> impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
> impalad ! operator delete + 0x329 - [unknown source file]
> impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x16 - impala_internal_service.pb.cc:1221
> impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x8 - impala_internal_service.pb.cc:1222
> impalad ! kudu::DefaultDeleter<google::protobuf::Message>::operator() + 0x5 - gscoped_ptr.h:145
> impalad ! ~gscoped_ptr_impl + 0x9 - gscoped_ptr.h:228
> impalad ! ~gscoped_ptr - gscoped_ptr.h:318
> impalad ! kudu::rpc::RpcContext::~RpcContext + 0x1e - rpc_context.cc:53   <-----
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1ff - rpc_context.cc:67
> {code}
> The stacks above show that creating these small objects under moderately heavy load results in heavy contention in the kernel. We would benefit a lot from a fast path for 'RespondSuccess'.
> My suggestion is to create all these small objects at once, along with the 'InboundCall' object while it is being created, in a 'RespondSuccess' structure, and just use that structure when we want to send 'success' back to the sender. This structure would already contain the 'OutboundTransferTask', a 'Slice' with 'success', etc. We would expect that most RPCs respond with 'success' a majority of the time.
> How this would benefit us is that we don't go back and forth every time to allocate and deallocate memory for these small objects; instead, we do it all at once while creating the 'InboundCall' object.
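> A very rough sketch of such a pre-built structure (names and members are illustrative only, not a concrete design):
> {code:cpp}
> #include <cstdint>
> #include <string>
> #include <utility>
> #include <vector>
>
> // Illustrative only: types are simplified to std:: equivalents. The idea is
> // that all the small pieces needed to answer 'success' are built once, when
> // the InboundCall is created, and reused for every successful response.
> struct PrebuiltSuccessResponse {
>   std::string serialized_header;       // pre-serialized response header
>   std::string serialized_ok_body;      // pre-serialized response protobuf for Status::OK()
>   std::vector<std::pair<const uint8_t*, size_t>> slices;  // stands in for the kudu::Slice vector
>   // In the real code this would also hold the reusable 'OutboundTransferTask'.
> };
> {code}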
> I just wanted to start a discussion about this, so even if what I suggested seems a little off, hopefully we can move forward with this on some level.


