Posted to jira@arrow.apache.org by "Tobias Zagorni (Jira)" <ji...@apache.org> on 2022/04/27 23:13:00 UTC

[jira] [Comment Edited] (ARROW-16161) [C++] Overhead of std::shared_ptr copies is causing thread contention

    [ https://issues.apache.org/jira/browse/ARROW-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529119#comment-17529119 ] 

Tobias Zagorni edited comment on ARROW-16161 at 4/27/22 11:12 PM:
------------------------------------------------------------------

I created a hacky prototype of an ExecArrayData class that does not hold a shared_ptr to DataType. It is used in Datum as an alternative to ArrayData. I can currently run all the ExecuteScalarExpressionOverhead benchmarks with it.
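(Editor's note: the prototype is only linked below; as a rough illustration of the idea, here is a minimal sketch, with stand-in types, of an ArrayData-like struct whose type field is a borrowed raw pointer instead of a shared_ptr. The names `ArrayData`, `ExecArrayData`, `DataType`, and `Buffer` here are simplified stand-ins, not the real Arrow definitions.)

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Stand-ins for the real Arrow classes; illustrative only.
struct DataType { int id; };
struct Buffer {};

// Regular ArrayData: every copy bumps the DataType refcount atomically.
struct ArrayData {
  std::shared_ptr<DataType> type;
  int64_t length;
  std::vector<std::shared_ptr<Buffer>> buffers;
};

// ExecArrayData-style variant: borrows the type for the duration of the
// kernel call, so copying it performs no atomic refcount update for the type.
// The caller must guarantee the DataType outlives every ExecArrayData copy.
struct ExecArrayData {
  const DataType* type;  // non-owning
  int64_t length;
  std::vector<std::shared_ptr<Buffer>> buffers;
};
```

The point of the variant is that within ExecuteScalarExpression the type's lifetime is already guaranteed, so the ownership bookkeeping is pure overhead.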

There are noticeable performance improvements in multi-threaded runs with small (e.g. 1000) batch sizes. The best is in zero_copy_expression:

{{old: ExecuteScalarExpressionOverhead/zero_copy_expression/rows_per_batch:1000/threads:16        1613271 ns     25730049 ns           32 rows_per_second=388.651M/s}}
{{new: ExecuteScalarExpressionOverhead/zero_copy_expression/rows_per_batch:1000/threads:16        1121647 ns     17913402 ns           32 rows_per_second=558.241M/s}}

(full results are attached: [^ExecArrayData-difference.txt])

In other places I see a small performance reduction, and the simple_expression/100 improvement is smaller than with the first copyable-DataType experiment. This is probably due to the additional code paths in this implementation, and to my way of getting a shared_ptr<DataType> back from ExecArrayData (cloning the DataType object), which is far from optimal. I'm still looking into this.
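(Editor's note: one alternative sometimes used to hand back a shared_ptr for a borrowed pointer, without cloning, is a non-owning shared_ptr with a no-op deleter. This is a hypothetical sketch, not what the linked prototype does; `BorrowType` is an invented name, and the pattern is only safe while the caller guarantees the pointee outlives every copy.)

```cpp
#include <memory>

struct DataType { int id; };  // stand-in for arrow::DataType

// Wrap a borrowed pointer in a shared_ptr that never deletes it.
// Avoids the clone allocation, but copies of the returned shared_ptr
// still do atomic refcounting on its (freshly allocated) control block.
inline std::shared_ptr<DataType> BorrowType(const DataType* type) {
  return std::shared_ptr<DataType>(const_cast<DataType*>(type),
                                   [](DataType*) { /* no-op deleter */ });
}
```

Note the trade-off: this removes the deep copy but not the refcount traffic, so it would help allocation overhead more than contention.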

Most of the remaining reference-counting overhead seems to come from three places:
 * ValueDescr / resolving output types using it
 * Scalar datums
 * pointers to the CPUDevice instance, because CPUMemoryManager instances are recreated a lot. I'll create a different subtask for this
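(Editor's note: for readers unfamiliar with why shared_ptr copies show up as thread contention at all, here is a minimal, self-contained illustration, unrelated to the Arrow benchmark itself: every copy and destruction of a shared_ptr is an atomic read-modify-write on one shared control block, so many threads copying the same pointer serialize on that cache line.)

```cpp
#include <memory>
#include <thread>
#include <vector>

// Each iteration copies the shared_ptr (atomic increment on the shared
// control block) and drops the copy (atomic decrement). With several
// threads hammering the same control block, these RMWs contend.
long CopyLoop(const std::shared_ptr<int>& shared, int iters) {
  long sink = 0;
  for (int i = 0; i < iters; ++i) {
    std::shared_ptr<int> copy = shared;  // refcount++
    sink += *copy;                       // refcount-- at end of iteration
  }
  return sink;
}
```

This is exactly the traffic that replacing `shared_ptr<DataType>` with a borrowed `DataType*` removes from the hot path.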

[Code|https://github.com/zagto/arrow/compare/datatype-performance-execarraydata-baseline...zagto:datatype-performance?expand=1] - (ignore the weird mix of template and subclassing for ExecArrayData)


> [C++] Overhead of std::shared_ptr<DataType> copies is causing thread contention
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-16161
>                 URL: https://issues.apache.org/jira/browse/ARROW-16161
>             Project: Apache Arrow
>          Issue Type: Sub-task
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Tobias Zagorni
>            Priority: Major
>         Attachments: ExecArrayData-difference.txt
>
>
> We created a benchmark to measure ExecuteScalarExpression performance in ARROW-16014. We noticed significant thread contention (even though there shouldn't be much, if any, for this task). As part of ARROW-16138 we have been investigating possible causes.
> One cause seems to be contention from copying shared_ptr<DataType> objects.
> Two possible solutions jump to mind and I'm sure there are many more.
> ExecBatch is an internal type and used inside of ExecuteScalarExpression as well as inside of the execution engine.  In the former we can safely assume the data types will exist for the duration of the call.  In the latter we can safely assume the data types will exist for the duration of the execution plan.  Thus we can probably take a more targeted fix and migrate only ExecBatch to using DataType* (or const DataType&).
> On the other hand, we might consider a more global approach.  All of our "stock" data types are assumed to have static storage duration.  However, we must use std::shared_ptr<DataType> because users could create their own extension types.  We could invent an "extension type registration" system where extension types must first be registered with the C++ lib before being used.  Then we could have long-lived DataType instances and we could replace std::shared_ptr<DataType> with DataType* (or const DataType&) throughout most of the entire code base.
> But, as I mentioned, I'm sure there are many approaches to take.  CC [~lidavidm] and [~apitrou] and [~yibocai] for thoughts but this might be interesting for just about any C++ dev.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)