Posted to dev@arrow.apache.org by Weston Pace <we...@gmail.com> on 2020/10/26 15:48:45 UTC

[c++] Futures API review & help understanding benchmark result

Hi all,

I've completed the initial composable futures API and iterator work.
The CSV reader portion is still WIP.

First, I'm interested in getting any feedback on the futures API.  In
particular, Future<T>.Then in future.h (and the type-erased
Composable.Compose).  The implementation itself can probably be
cleaned up with regard to DRY (the 10 specializations of the Continue
function), which I plan to do at the end.
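
For concreteness, here is roughly what chaining looks like (a
simplified sketch; ReadBlockAsync, ParseBlock, and ParsedBlock are
made-up placeholders, and the exact signatures on the branch may
differ):

  // Attach a continuation to a future; the callback receives the
  // completed value and Then() yields a new future for its result.
  Future<std::shared_ptr<Buffer>> read_fut = ReadBlockAsync(source);
  Future<ParsedBlock> parsed_fut =
      read_fut.Then([](const std::shared_ptr<Buffer>& buffer) {
        return ParseBlock(buffer);  // runs once read_fut completes
      });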

This approach is a little different from my earlier prototype.  In the
prototype, continuations were always submitted to the thread pool as
new tasks.  Instead, I've changed it so that continuations run
synchronously when the future is marked complete.  If you want to
move a continuation into a thread pool task, you can do so with
Executor.Transfer.  As an example, this is done in AsyncForEachHelper
so that the applied for-each function is not run on the reader thread.
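
For example (a sketch with made-up names, not code from the branch):

  // By default the continuation runs synchronously on whichever
  // thread marks read_fut complete (here, the reader thread).
  read_fut.Then([](const Block& block) { Process(block); });

  // Transfer moves the future onto the executor, so the continuation
  // runs as a thread pool task instead of on the completing thread.
  executor->Transfer(read_fut)
      .Then([](const Block& block) { Process(block); });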

Second, and perhaps of more interest to me, I've switched
ThreadedTaskGroup to using futures (e.g. using Compose to add a
callback that calls OneTaskDone, instead of making a wrapper lambda
function).  In theory this should be more or less the same work as
the previous task group implementation.  However, I am seeing
noticeable overhead in arrow-thread-pool-benchmark for small tasks.
The benchmark runs on my system at ~950k items/s with no task group,
~890k items/s with the old task group implementation, and ~450k
items/s with the futures-based implementation.  The change is
isolated to one method in task_group.cc, so if you replace the method
at line 102 with the commented-out version at line 127, the original
performance returns.  I've verified that the task is not getting
copied.  There are a few extra moves and function calls, and futures
have to be created and copied around, so that may be the cause, but
I'm curious whether a second pair of eyes can spot some other cause
of the degradation that I'm missing.  I'll also see if I can get
gprof running later, in the hope that it can provide some insight.
However, I probably won't spend much more time on it before finishing
up the CSV reader work and checking the CSV reader's performance.
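
The shape of the change is roughly this (a sketch from memory, not
the literal code from the branch; the Compose signature in particular
may not match):

  // Old: wrap the task in a lambda that reports completion inline.
  executor_->Spawn([this, task]() { OneTaskDone(task()); });

  // New: submit the task to get a future, then compose the
  // completion callback onto that future.
  Future<Status> task_fut = executor_->Submit(std::move(task));
  task_fut.Compose([this](const Status& st) { OneTaskDone(st); });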

If I can't figure out the cause of the performance loss, I can always
let the task group keep its current implementation for Append(task)
while using futures for Append(future).  I suspect the CSV reader
tasks are long enough that the overhead won't be an issue.

Code: https://github.com/apache/arrow/compare/master...westonpace:feature/arrow-10183?expand=1

-Weston

Re: [c++] Futures API review & help understanding benchmark result

Posted by Weston Pace <we...@gmail.com>.
Please enjoy your vacation!

I was able to track it down to the allocations performed when creating
futures.  The switch from Executor.Spawn to Executor.Submit alone
accounts for nearly half of the performance difference.  Perhaps
someday it would be worth investigating pooled allocators.  At the
very least, it seems we should be able to easily reuse instances of
FutureImpl, FutureStorage<void>, and FutureStorage<Status>.
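
To illustrate the difference (sketch only):

  // Spawn is fire-and-forget: the task is queued on the pool and
  // nothing else needs to be allocated.
  executor->Spawn(std::move(task));

  // Submit has to heap-allocate the shared future state (FutureImpl
  // plus FutureStorage) so it can hand back something to compose on.
  Future<Status> task_fut = executor->Submit(std::move(task));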

However, for the moment, there is no need for that degree of
performance.  I'll leave futures out of the existing TaskGroup methods
and move ahead with the reader work.

Re: [c++] Futures API review & help understanding benchmark result

Posted by Antoine Pitrou <an...@python.org>.

On 27/10/2020 at 15:47, Antoine Pitrou wrote:
> For information, there's a micro-benchmark of thread pools and task
> groups in src/arrow/util/thread_pool_benchmark.cc.  It should allow
> you to isolate performance concerns a bit better.

Oops, sorry, just saw that you're already using it.  That's what I get
for reading too quickly.

Regards

Antoine.

Re: [c++] Futures API review & help understanding benchmark result

Posted by Antoine Pitrou <an...@python.org>.
Hi Weston,

Note: I'm on vacation, so I won't be able to look at this for ~2 weeks.

For information, there's a micro-benchmark of thread pools and task
groups in src/arrow/util/thread_pool_benchmark.cc.  It should allow
you to isolate performance concerns a bit better.

Regards

Antoine.