You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/29 22:53:25 UTC

[GitHub] [arrow] wesm opened a new pull request, #13753: ARROW-17259: [C++] Do not shared_ptr with kernel function signatures, do less copying of shared_ptrs

wesm opened a new pull request, #13753:
URL: https://github.com/apache/arrow/pull/13753

   This is something that I've hacked on a little bit while on airplanes and in idle moments for my own curiosity and just cleaned up to turn into a PR. This PR changes `arrow::compute::InputType` to only store `const DataType*` instead of `shared_ptr<DataType>`, which means that we don't have to copy the smart pointers for every kernel and function that's registered. Amazingly, this trims the size of libarrow.so by about 1% (~300KB) because of not having a bunch of inlined shared_ptr code. 
   
   This change does have consequences -- developers need to be careful about creating function signatures with data types that may have temporary storage duration (all of the APIs like `arrow::int64()` return static instances, and I changed `arrow::duration` and some other type factories to return static instances, too). But I think the benefits outweigh the costs. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
wesm commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1204049003

   @pitrou yes, I'll fix them today


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
wesm commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1205307758

   @pitrou this isn’t a micro-optimization imho — when 1% of your binary size is concerned with inline shared_ptr code in this narrow part of the library that speaks to a design problem. The kernel registry population is still very bloated and should be able to be streamlined further to yield smaller binary size. The use of shared pointers also causes extra overhead in kernel dispatching and input type checking, but we would need to write some benchmarks to quantify that better 
   
   it might be a useful exercise at some point to analyze the disassembly of libarrow.so to understand the total contribution of inline shared_ptr code to the binary size. There are a lot of places where we inline constructors for example where there is probably no need 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
lidavidm commented on code in PR #13753:
URL: https://github.com/apache/arrow/pull/13753#discussion_r937778580


##########
cpp/src/arrow/compute/kernels/aggregate_var_std.cc:
##########
@@ -287,12 +285,15 @@ const FunctionDoc variance_doc{
     {"array"},
     "VarianceOptions"};
 
+namespace {}  // namespace
+

Review Comment:
   nit: empty namespace?



##########
cpp/src/arrow/compute/kernels/hash_aggregate.cc:
##########
@@ -2775,6 +2773,39 @@ const FunctionDoc hash_one_doc{"Get one value from each group",
 const FunctionDoc hash_list_doc{"List all values in each group",
                                 ("Null values are also returned."),
                                 {"array", "group_id_array"}};
+
+void PopulateHashGenericKernels(Result<HashAggregateKernel> make_kernel(const DataType*),
+                                std::shared_ptr<HashAggregateFunction> func,
+                                FunctionRegistry* registry) {
+  DCHECK_OK(AddHashAggKernels(NumericTypes(), make_kernel, func.get()));
+  DCHECK_OK(AddHashAggKernels(TemporalTypes(), make_kernel, func.get()));
+  DCHECK_OK(AddHashAggKernels(BaseBinaryTypes(), make_kernel, func.get()));
+  DCHECK_OK(AddHashAggKernels(
+      {null().get(), boolean().get(), decimal128(1, 1).get(), decimal256(1, 1).get(),
+       month_interval().get(), fixed_size_binary(1).get()},
+      make_kernel, func.get()));
+  DCHECK_OK(registry->AddFunction(std::move(func)));
+}
+
+void PopulateNumericHashKernels(Result<HashAggregateKernel> make_kernel(const DataType*),
+                                bool with_null_boolean,
+                                std::shared_ptr<HashAggregateFunction> func,
+                                FunctionRegistry* registry) {
+  static std::shared_ptr<DataType> decimal128_example = decimal128(1, 1);
+  static std::shared_ptr<DataType> decimal256_example = decimal256(1, 1);

Review Comment:
   extreme nit: there's some inconsistency here and above, maybe `PopulateHashGenericKernels` should just call `PopulateNumericHashKernels`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1205179702

   `InputType` and `OutputType` instances are only created at kernel registration, if I'm not mistaken?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1204242802

   Hmm... I assume it's because of the `serialized_size` difference? It's weird that it would differ by one byte, but perhaps that's due to a different Thrift version. We should probably simply relax this test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1200010827

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1200010816

   https://issues.apache.org/jira/browse/ARROW-17259


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1206891452

   If there's no sizable performance improvement then I'm against merging this. Robustness should be a guiding principle and using replacing shared pointers with raw pointers goes against it.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1205352932

   > The use of shared pointers also causes extra overhead in kernel dispatching and input type checking, but we would need to write some benchmarks to quantify that better
   
   I would like to see benchmarks indeed, because as long as you don't copy them the overhead should be minimal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
wesm commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1206847368

   Any objections to merging this? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
wesm commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1204382202

   I patched this in https://github.com/apache/arrow/pull/13790


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
wesm commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1204244852

   I think the reason is the change from 9.0.0 to 10.0.0 which added one byte, here is the doctest
   
   ```
       >>> pq.read_metadata('example.parquet')
       <pyarrow._parquet.FileMetaData object at ...>
         created_by: parquet-cpp-arrow version ...
         num_columns: 2
         num_rows: 3
         num_row_groups: 1
         format_version: 2.6
         serialized_size: 561
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
wesm commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1204238475

   @pitrou there's a doctest failure here unrelated to this PR:
   
   ```
   2022-08-03T17:03:42.8027721Z =================================== FAILURES ===================================
   2022-08-03T17:03:42.8028142Z ___________________ [doctest] pyarrow.parquet.read_metadata ____________________
   2022-08-03T17:03:42.8028491Z 3406 
   2022-08-03T17:03:42.8028721Z 3407     Examples
   2022-08-03T17:03:42.8029041Z 3408     --------
   2022-08-03T17:03:42.8029336Z 3409     >>> import pyarrow as pa
   2022-08-03T17:03:42.8029679Z 3410     >>> import pyarrow.parquet as pq
   2022-08-03T17:03:42.8030399Z 3411     >>> table = pa.table({'n_legs': [4, 5, 100],
   2022-08-03T17:03:42.8030920Z 3412     ...                   'animal': ["Dog", "Brittle stars", "Centipede"]})
   2022-08-03T17:03:42.8031437Z 3413     >>> pq.write_table(table, 'example.parquet')
   2022-08-03T17:03:42.8031751Z 3414 
   2022-08-03T17:03:42.8032144Z 3415     >>> pq.read_metadata('example.parquet')
   2022-08-03T17:03:42.8032625Z Differences (unified diff with -expected +actual):
   2022-08-03T17:03:42.8033004Z     @@ -1,7 +1,7 @@
   2022-08-03T17:03:42.8033412Z     -<pyarrow._parquet.FileMetaData object at ...>
   2022-08-03T17:03:42.8033907Z     -  created_by: parquet-cpp-arrow version ...
   2022-08-03T17:03:42.8034339Z     +<pyarrow._parquet.FileMetaData object at 0x7fd2089b3830>
   2022-08-03T17:03:42.8034877Z     +  created_by: parquet-cpp-arrow version 10.0.0-SNAPSHOT
   2022-08-03T17:03:42.8035225Z        num_columns: 2
   2022-08-03T17:03:42.8035477Z        num_rows: 3
   2022-08-03T17:03:42.8035752Z        num_row_groups: 1
   2022-08-03T17:03:42.8036047Z        format_version: 2.6
   2022-08-03T17:03:42.8036407Z     -  serialized_size: 561
   2022-08-03T17:03:42.8036708Z     +  serialized_size: 562
   2022-08-03T17:03:42.8037076Z 
   2022-08-03T17:03:42.8037502Z /opt/conda/envs/arrow/lib/python3.9/site-packages/pyarrow/parquet/__init__.py:3415: DocTestFailure
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
wesm commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1204489971

   I think this can be merged once the build is green if the changes look okay. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on a diff in pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
wesm commented on code in PR #13753:
URL: https://github.com/apache/arrow/pull/13753#discussion_r933668310


##########
cpp/src/arrow/type.h:
##########
@@ -2120,35 +2108,4 @@ ARROW_EXPORT
 std::string ToString(TimeUnit::type unit);
 
 }  // namespace internal
-
-// Helpers to get instances of data types based on general categories
-
-ARROW_EXPORT
-const std::vector<std::shared_ptr<DataType>>& SignedIntTypes();
-ARROW_EXPORT
-const std::vector<std::shared_ptr<DataType>>& UnsignedIntTypes();
-ARROW_EXPORT
-const std::vector<std::shared_ptr<DataType>>& IntTypes();
-ARROW_EXPORT
-const std::vector<std::shared_ptr<DataType>>& FloatingPointTypes();
-// Number types without boolean
-ARROW_EXPORT
-const std::vector<std::shared_ptr<DataType>>& NumericTypes();
-// Binary and string-like types (except fixed-size binary)
-ARROW_EXPORT
-const std::vector<std::shared_ptr<DataType>>& BaseBinaryTypes();
-ARROW_EXPORT
-const std::vector<std::shared_ptr<DataType>>& BinaryTypes();
-ARROW_EXPORT
-const std::vector<std::shared_ptr<DataType>>& StringTypes();
-// Temporal types including time and timestamps for each unit
-ARROW_EXPORT
-const std::vector<std::shared_ptr<DataType>>& TemporalTypes();
-// Interval types
-ARROW_EXPORT
-const std::vector<std::shared_ptr<DataType>>& IntervalTypes();
-// Integer, floating point, base binary, and temporal
-ARROW_EXPORT
-const std::vector<std::shared_ptr<DataType>>& PrimitiveTypes();
-

Review Comment:
   There wasn't really a good reason for these functions to be here -- they probably even should be in an internal namespace in arrow::compute



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1205267594

   > But those Input/OutputType instances are stored in the KernelSignature, so those references need to stay valid.
   
   My question was more along the lines of: what is the point of micro-optimizing a number of shared_ptr copies that only happen at startup?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1204042803

   @wesm Would you like to take a look at the compile failures?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1205352298

   > @pitrou this isn’t a micro-optimization imho — when 1% of your binary size is concerned with inline shared_ptr code in this narrow part of the library that speaks to a design problem.
   
   Is this looking at the text segment, or the entire binary size? If the latter, that might mean debug symbols or something... (those can be stripped for distribution purposes)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1490652151

   Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- closed pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- closed pull request #13753: ARROW-17259: [C++] Do not use shared_ptr<DataType> with kernel function signatures, do less copying of shared_ptrs
URL: https://github.com/apache/arrow/pull/13753


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1205369067

   I took the PR and rebased it internally to then compare the sizes:
   
   * master:
   ```
   $ size build-release/release/libarrow.so
      text	   data	    bss	    dec	    hex	filename
   23471557	 317416	2568225	26357198	1922dce	build-release/release/libarrow.so
   $ ls -la build-release/release/libarrow.so.1000.0.0 
   -rwxrwxr-x 1 antoine antoine 37289616 août   4 16:48 build-release/release/libarrow.so.1000.0.0
   $ strip build-release/release/libarrow.so.1000.0.0 
   $ ls -la build-release/release/libarrow.so.1000.0.0 
   -rwxrwxr-x 1 antoine antoine 23794552 août   4 16:48 build-release/release/libarrow.so.1000.0.0
   ```
   
   * PR:
   ```
   $ size build-release/release/libarrow.so
      text	   data	    bss	    dec	    hex	filename
   23314548	 314736	2568801	26198085	18fc045	build-release/release/libarrow.so
   $ ls -la build-release/release/libarrow.so.1000.0.0 
   -rwxrwxr-x 1 antoine antoine 37097032 août   4 16:49 build-release/release/libarrow.so.1000.0.0
   $ strip build-release/release/libarrow.so.1000.0.0 
   $ ls -la build-release/release/libarrow.so.1000.0.0 
   -rwxrwxr-x 1 antoine antoine 23634808 août   4 16:50 build-release/release/libarrow.so.1000.0.0
   ```
   
   So it's a 0.7% saving here in both text segment size and stripped library size.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
wesm commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1206908421

   I do not think the implementation is less robust — the only change is that ephemeral `DataType` instances cannot be used to instantiate `InputType` without creating a TypeMatcher that wraps them. If the developer does create a KernelSignature / InputType with an ephemeral `shared_ptr<DataType>`, the signature would be unusable so any unit test which attempts to use the kernel would segfault. The probability of an Arrow user ever facing a bug related to this essentially zero — it would amount to us merging untested kernels code (because tests would immediately reveal the programming error). 
   
   It also comes down to a cost philosophy in this code:
   
   * Pay the cost of using shared_ptr always — both in generated code (because most shared_ptr operations are inlined by the compiler) and in performance
   * Pay the cost only when it's needed
   
   It turns out that in this code, that retaining a `shared_ptr<DataType>` for kernel function signatures and type validation is rarely needed. Why pay the cost for > 95% of cases (or whatever the number is?) when the shared_ptr is only actually needed in < 5% of cases? I recall that we've already seen shared_ptr contention show up unfavorably in performance profiles. 
   
   I will take the time when I can to write some benchmarks (for me, the present-and-future savings in generated code is evidence enough) that will hopefully provide some additional evidence to support this. In the meantime I hope that the rebase conflicts will not be too annoying to keep up with. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
lidavidm commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1205239183

   But those Input/OutputType instances are stored in the KernelSignature, so those references need to stay valid.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
wesm commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1208362155

   I've done a little exploration locally and concluded that `shared_ptr<DataType>` versus `const DataType*` has no impact on the performance of `Function::DispatchExact` (at least on recent clang), which confirms the general guidance that dereferencing shared pointers is no different than a raw pointer — it's the creation, copying, and destruction of shared pointers that's expensive, especially in a multithreaded setting. 
   
   The last thing I'm interested in is the "bootup time" of the function registry, so I'm going to compare the before and after performance of that and report it here. If there is no significant difference, I will refactor this PR to restore shared pointers to `InputType` and `OutputType` but rather avoid allowing shared pointer copies to get inlined in these functions, so hopefully that will yield the same kind of reduction in binary bloat. There are some other beneficial changes here so I'd like to get those in — I'll report back when I have that extra performance data and have done the refactoring. 
   
   There's probably other places where we're either passing a shared pointer where there is no need or where we inline non-performance-sensitive-constructors which copy shared pointers where there is similarly little need, so let's be an eye out for htat. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on pull request #13753: ARROW-17259: [C++] Do not use shared_ptr with kernel function signatures, do less copying of shared_ptrs

Posted by GitBox <gi...@apache.org>.
wesm commented on PR #13753:
URL: https://github.com/apache/arrow/pull/13753#issuecomment-1241071227

   I had some personal stuff come up in August that got in the way of completing this work -- I will pick this up hopefully sometime this month (September) and hopefully reach a point that is satisfactory to all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org