You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/26 17:35:12 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays

jorgecarleitao opened a new pull request #9329:
URL: https://github.com/apache/arrow/pull/9329


   This PR is another of those experiments to gather feedback and share results.
   
   # Rational
   
   Currently, all our arrays use an `Arc<ArrayData>`, which they expose via `Array::data` and `Array::data_ref`. This adds a level of indirection. Now, it happens that, afaik, in the current code base `Arc<>` is not needed.
   
   On #9271, where we are observing some perf issues with small arrays, and one of the ideas that came up was to get rid of `Arc` and see what happens.
   
   # This PR
   
   Well, this PR replaces all `Arc<ArrayData>` by `ArrayData`. On the one hand, this means that cloning an array is a tad more expensive (`Arc` vs `ArrayData`), even though we seldom clone an `Arc<ArrayData>`. On the other hand, it means that often the compiler can optimize out, as many operations will never leave the stack.
   
   The gist of the benchmarks below is:
   * ~10%-20% improvement over basically everything
   * ~20%-100% improvement in `take`
   
   There is some noise, as there are benches that are not expected to be affected and are being affected, which I am trying to reduce by running the benches at night.
   
   Also, the arrow tests pass, but I did not port this to the other crates nor the SIMD branch of the code.
   
   Personally, I like this PR because it makes working with `ArrayData` and arrays so much simpler: no need for `Arc::new` or `as_ref` and company (besides the speed).
   
   # questions
   
   * does anyone knows why we are using `Arc<ArrayData>` in all arrays?
   * Do you envision an issue with removing the `Arc`?
   * Would someone be so kind and run the benches independently, just to be sure.
   
   # Benchmarks
   
   ```bash
   # modify cargo.toml by adding `bench = false` to the section [lib]
   
   git checkout master
   cargo bench --benches -- --save-baseline `git branch --show-current`-`git rev-parse --short HEAD`
   
   git checkout arcless
   cargo bench --benches -- --save-baseline `git branch --show-current`-`git rev-parse --short HEAD`
   ```
   
   ```bash
   critcmp arcless-3dbcaca49 master-437c8c944 -t 10
   ```
   
   ```
   group                                         arcless-3dbcaca49                       master-437c8c944
   -----                                         -----------------                       ----------------
   add_nulls_512                                 1.00   509.1±98.69ns        ? B/sec     1.14   581.9±98.83ns        ? B/sec
   array_from_vec 128                            1.00  1180.2±304.50ns        ? B/sec    1.20  1411.2±475.00ns        ? B/sec
   array_slice 128                               1.00  383.5±430.02ns        ? B/sec     1.23   471.3±94.42ns        ? B/sec
   array_slice 2048                              1.00   393.3±94.60ns        ? B/sec     1.22  478.2±293.70ns        ? B/sec
   array_slice 512                               1.00   393.0±77.38ns        ? B/sec     1.26  496.1±145.06ns        ? B/sec
   array_string_from_vec 128                     1.00      3.5±0.39µs        ? B/sec     1.22      4.3±1.68µs        ? B/sec
   array_string_from_vec 256                     1.00      4.4±0.46µs        ? B/sec     1.11      4.9±0.94µs        ? B/sec
   buffer_bit_ops or                             1.11  706.1±102.45ns        ? B/sec     1.00   633.3±56.65ns        ? B/sec
   cast date32 to date64 512                     1.00      7.2±0.85µs        ? B/sec     1.11      8.0±1.25µs        ? B/sec
   cast int32 to float64 512                     1.00      3.0±0.35µs        ? B/sec     1.16      3.4±0.77µs        ? B/sec
   cast int32 to int64 512                       1.00      3.3±0.41µs        ? B/sec     1.11      3.6±0.67µs        ? B/sec
   cast time32s to time32ms 512                  1.13  1864.9±585.48ns        ? B/sec    1.00  1648.3±145.69ns        ? B/sec
   cast timestamp_ms to i64 512                  1.00   350.6±67.17ns        ? B/sec     1.35   474.9±72.34ns        ? B/sec
   cast timestamp_ms to timestamp_ns 512         1.00      2.3±0.28µs        ? B/sec     1.22      2.8±4.01µs        ? B/sec
   cast utf8 to date64 512                       1.22    96.5±22.75µs        ? B/sec     1.00     79.1±6.32µs        ? B/sec
   concat str 1024                               1.00      9.9±0.41µs        ? B/sec     1.34     13.2±5.26µs        ? B/sec
   eq scalar Float32                             1.25    82.9±33.34µs        ? B/sec     1.00     66.4±6.96µs        ? B/sec
   equal_512                                     1.00    28.1±18.85ns        ? B/sec     1.76    49.5±12.56ns        ? B/sec
   equal_bool_512                                1.00     21.0±4.26ns        ? B/sec     1.97     41.5±2.29ns        ? B/sec
   equal_bool_513                                1.00    28.3±25.49ns        ? B/sec     1.60     45.4±3.84ns        ? B/sec
   equal_nulls_512                               1.00      2.4±0.19µs        ? B/sec     1.10      2.7±0.37µs        ? B/sec
   equal_string_512                              1.00    98.3±21.85ns        ? B/sec     1.14    111.9±5.30ns        ? B/sec
   equal_string_nulls_512                        1.00      3.7±0.46µs        ? B/sec     1.10      4.0±1.29µs        ? B/sec
   filter context f32                            1.00   503.9±14.15µs        ? B/sec     1.11   561.8±68.59µs        ? B/sec
   filter context f32 high selectivity           1.00    309.3±8.26µs        ? B/sec     1.11   343.2±57.50µs        ? B/sec
   filter context string high selectivity        1.00  1111.6±36.80µs        ? B/sec     1.11  1238.6±174.82µs        ? B/sec
   filter context u8                             1.00   238.9±12.44µs        ? B/sec     1.18  281.0±112.36µs        ? B/sec
   filter context u8 low selectivity             1.17      2.4±1.25µs        ? B/sec     1.00      2.1±0.26µs        ? B/sec
   filter context u8 w NULLs                     1.00   526.7±98.56µs        ? B/sec     1.11  586.5±155.02µs        ? B/sec
   filter context u8 w NULLs high selectivity    1.00    298.7±6.11µs        ? B/sec     1.15   344.2±63.42µs        ? B/sec
   filter context u8 w NULLs low selectivity     1.00      2.7±0.21µs        ? B/sec     1.31      3.5±5.02µs        ? B/sec
   filter f32                                    1.00   802.9±44.38µs        ? B/sec     1.21  971.4±135.07µs        ? B/sec
   filter u8 low selectivity                     1.00      7.8±0.23µs        ? B/sec     1.14      8.9±1.47µs        ? B/sec
   from_slice                                    1.13  1665.5±215.62µs        ? B/sec    1.00  1475.4±324.38µs        ? B/sec
   from_slice prepared                           1.31  1092.8±125.08µs        ? B/sec    1.00   834.2±59.41µs        ? B/sec
   gt Float32                                    1.00    71.8±30.69µs        ? B/sec     1.18     84.6±9.31µs        ? B/sec
   gt scalar Float32                             1.82    74.1±25.74µs        ? B/sec     1.00     40.8±4.06µs        ? B/sec
   gt_eq scalar Float32                          1.14    71.8±24.75µs        ? B/sec     1.00     63.0±7.38µs        ? B/sec
   json_list_primitive_to_record_batch           1.00     64.6±4.38µs        ? B/sec     1.14    73.4±11.01µs        ? B/sec
   length                                        1.00      2.9±0.07µs        ? B/sec     1.29      3.7±0.77µs        ? B/sec
   like_utf8 scalar ends with                    1.00   243.8±15.05µs        ? B/sec     1.21   294.5±45.77µs        ? B/sec
   like_utf8 scalar equals                       1.21   101.6±13.08µs        ? B/sec     1.00    83.7±10.13µs        ? B/sec
   limit 512, 512                                1.00   327.3±60.96ns        ? B/sec     1.38  451.1±130.11ns        ? B/sec
   lt Float32                                    1.00     69.4±7.77µs        ? B/sec     1.27    88.3±12.27µs        ? B/sec
   lt scalar Float32                             1.00     66.5±4.71µs        ? B/sec     1.31    87.3±25.97µs        ? B/sec
   lt_eq Float32                                 1.00   118.9±19.95µs        ? B/sec     1.26   149.9±56.97µs        ? B/sec
   max nulls 512                                 1.00  1567.2±204.81ns        ? B/sec    1.16  1820.6±260.46ns        ? B/sec
   min nulls 512                                 1.00  1706.9±802.12ns        ? B/sec    1.16  1980.7±259.55ns        ? B/sec
   min nulls string 512                          1.00      7.5±0.44µs        ? B/sec     1.18      8.9±1.66µs        ? B/sec
   min string 512                                1.00      5.6±0.30µs        ? B/sec     1.10      6.2±1.42µs        ? B/sec
   multiply 512                                  1.00  541.6±224.47ns        ? B/sec     1.11  601.0±102.46ns        ? B/sec
   mutable                                       1.11  630.0±349.07µs        ? B/sec     1.00   566.7±77.31µs        ? B/sec
   mutable str 1024                              1.00  1515.5±59.36µs        ? B/sec     1.17  1776.6±329.88µs        ? B/sec
   neq Float32                                   1.00    77.8±16.48µs        ? B/sec     1.18    91.5±20.30µs        ? B/sec
   neq scalar Float32                            1.52   101.2±52.55µs        ? B/sec     1.00     66.4±6.21µs        ? B/sec
   sort 2^10                                     1.00    147.2±4.97µs        ? B/sec     1.19   175.6±25.84µs        ? B/sec
   sort 2^12                                     1.00   726.9±50.41µs        ? B/sec     1.17  849.8±109.51µs        ? B/sec
   sort nulls 2^12                               1.00   631.5±29.67µs        ? B/sec     1.18   746.5±87.64µs        ? B/sec
   struct_array_from_vec 1024                    1.20     17.3±4.32µs        ? B/sec     1.00     14.4±0.93µs        ? B/sec
   subtract 512                                  1.00  487.5±173.36ns        ? B/sec     1.18   574.3±37.34ns        ? B/sec
   sum nulls 512                                 1.11   363.1±73.55ns        ? B/sec     1.00   327.8±97.10ns        ? B/sec
   take bool 512                                 1.00      2.2±0.28µs        ? B/sec     1.22      2.6±0.32µs        ? B/sec
   take bool nulls 1024                          1.00      4.0±0.25µs        ? B/sec     2.04      8.1±2.08µs        ? B/sec
   take bool nulls 512                           1.00      2.2±0.13µs        ? B/sec     1.80      3.9±0.77µs        ? B/sec
   take i32 1024                                 1.00  1507.3±81.31ns        ? B/sec     1.57      2.4±0.58µs        ? B/sec
   take i32 512                                  1.00  1030.9±39.48ns        ? B/sec     1.32  1364.4±97.02ns        ? B/sec
   take i32 nulls 1024                           1.00  1538.9±51.37ns        ? B/sec     1.81      2.8±1.00µs        ? B/sec
   take i32 nulls 512                            1.00  1023.8±51.95ns        ? B/sec     2.48      2.5±1.64µs        ? B/sec
   take str 512                                  1.00      4.1±0.51µs        ? B/sec     1.14      4.6±0.96µs        ? B/sec
   take str null indices 1024                    1.00      6.1±0.40µs        ? B/sec     1.18      7.2±1.63µs        ? B/sec
   take str null indices 512                     1.00      3.8±0.15µs        ? B/sec     1.21      4.6±1.32µs        ? B/sec
   take str null values 1024                     1.00      6.0±0.40µs        ? B/sec     1.18      7.1±1.02µs        ? B/sec
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-791898261


   Benchmarks on an ARM machine:
   
   ```
   nevilledipale@Nevilles-MacBook-Air rust % critcmp master-bfa99d98e arcless-ef2250b9d -t 10
   group                           arcless-ef2250b9d                       master-bfa99d98e
   -----                           -----------------                       ----------------
   add_nulls_512                   1.00   351.5±22.53ns        ? B/sec     1.12   394.1±19.39ns        ? B/sec
   array_from_vec 128              1.00   555.6±26.43ns        ? B/sec     1.11   619.1±17.89ns        ? B/sec
   array_slice 128                 1.00     98.3±8.62ns        ? B/sec     1.42    139.4±7.01ns        ? B/sec
   array_slice 2048                1.00   105.8±13.50ns        ? B/sec     1.37    145.1±8.93ns        ? B/sec
   array_slice 512                 1.00    101.9±6.16ns        ? B/sec     1.41    143.3±9.44ns        ? B/sec
   cast date32 to date64 512       1.00   246.7±17.81ns        ? B/sec     1.23   303.7±14.81ns        ? B/sec
   cast time32s to time32ms 512    1.00   223.1±13.25ns        ? B/sec     1.18   262.3±16.68ns        ? B/sec
   cast timestamp_ms to i64 512    1.00     98.6±9.32ns        ? B/sec     1.56    153.7±7.79ns        ? B/sec
   divide_scalar 512               1.00   189.9±12.11ns        ? B/sec     1.16    220.4±8.05ns        ? B/sec
   divide_scalar_nulls_512         1.00   186.6±14.60ns        ? B/sec     1.21   225.1±11.87ns        ? B/sec
   eq Float32                      1.15     40.0±1.54µs        ? B/sec     1.00     34.7±1.30µs        ? B/sec
   equal_bool_512                  1.00     25.8±1.34ns        ? B/sec     1.12     28.9±1.15ns        ? B/sec
   equal_bool_513                  1.00     27.7±1.36ns        ? B/sec     1.14     31.5±1.42ns        ? B/sec
   gt Float32                      1.15     40.6±2.00µs        ? B/sec     1.00     35.2±1.75µs        ? B/sec
   gt_eq Float32                   1.20     42.1±2.70µs        ? B/sec     1.00     35.2±1.60µs        ? B/sec
   limit 512, 512                  1.00    106.5±6.65ns        ? B/sec     1.31    139.0±7.04ns        ? B/sec
   lt Float32                      1.15     40.5±1.97µs        ? B/sec     1.00     35.3±1.73µs        ? B/sec
   lt_eq Float32                   1.14     40.3±2.40µs        ? B/sec     1.00     35.2±1.50µs        ? B/sec
   max nulls 512                   1.00   889.7±59.93ns        ? B/sec     3.26      2.9±0.11µs        ? B/sec
   min nulls 512                   1.00   916.4±70.85ns        ? B/sec     2.64      2.4±0.10µs        ? B/sec
   multiply 512                    1.00   361.3±28.15ns        ? B/sec     1.12   403.5±21.82ns        ? B/sec
   neq Float32                     1.16     40.6±1.72µs        ? B/sec     1.00     34.9±1.68µs        ? B/sec
   take bool 1024                  1.00      4.2±0.30µs        ? B/sec     1.29      5.4±0.29µs        ? B/sec
   take bool 512                   1.00  1750.4±100.33ns        ? B/sec    1.38      2.4±0.19µs        ? B/sec
   take bool nulls 1024            1.00      3.3±0.16µs        ? B/sec     1.30      4.3±0.28µs        ? B/sec
   take bool nulls 512             1.00  1635.3±119.87ns        ? B/sec    1.20  1956.5±83.47ns        ? B/sec
   take i32 512                    1.00   421.9±11.67ns        ? B/sec     1.14   480.0±27.45ns        ? B/sec
   take str null values 1024       1.00      6.8±0.42µs        ? B/sec     1.25      8.5±1.49µs        ? B/sec
   ```
   
   Only equality benches are worse, but not by that much.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r597200305



##########
File path: rust/arrow/src/array/array.rs
##########
@@ -56,11 +55,13 @@ pub trait Array: fmt::Debug + Send + Sync + JsonEqual {
     /// ```
     fn as_any(&self) -> &Any;
 
-    /// Returns a reference-counted pointer to the underlying data of this array.
-    fn data(&self) -> ArrayDataRef;
+    /// Returns a reference to the underlying data of this array.
+    fn data(&self) -> &ArrayData;

Review comment:
       @yordan-pavlov I think it's safer to deprecate `data()`, then remove it later. We still use it a lot in the codebase, but it's often more performant to use `array_ref()`, so returning `ArrayData` doesn't guide users on a faster path.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-802722483


   @nevi-me  I think we should merge this PR:
   1. It cleans the code up
   2. It shows decent performance improvements
   3. It sets us up for future work


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774423086


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=h1) Report
   > Merging [#9329](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=desc) (967e13e) into [master](https://codecov.io/gh/apache/arrow/commit/356c300c5ee1e2b23a83652514af11e3a731d596?el=desc) (356c300) will **decrease** coverage by `0.30%`.
   > The diff coverage is `95.02%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9329/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9329      +/-   ##
   ==========================================
   - Coverage   82.31%   82.01%   -0.31%     
   ==========================================
     Files         233      233              
     Lines       54412    53958     -454     
   ==========================================
   - Hits        44791    44253     -538     
   - Misses       9621     9705      +84     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `88.03% <ø> (+0.21%)` | :arrow_up: |
   | [...t/datafusion/src/physical\_plan/math\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21hdGhfZXhwcmVzc2lvbnMucnM=) | `30.00% <ø> (ø)` | |
   | [rust/integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9saWIucnM=) | `0.00% <0.00%> (ø)` | |
   | [rust/arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZGF0YS5ycw==) | `76.87% <55.55%> (-2.00%)` | :arrow_down: |
   | [rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=) | `77.63% <66.66%> (ø)` | |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `88.70% <71.42%> (+0.18%)` | :arrow_up: |
   | [rust/arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3JlYWRlci5ycw==) | `84.36% <83.33%> (ø)` | |
   | [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `82.76% <83.33%> (-0.07%)` | :arrow_down: |
   | [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `84.79% <85.71%> (-6.96%)` | :arrow_down: |
   | [rust/arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbW9kLnJz) | `93.24% <98.48%> (+0.04%)` | :arrow_up: |
   | ... and [50 more](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=footer). Last update [356c300...967e13e](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r574834375



##########
File path: rust/arrow/src/array/data.rs
##########
@@ -229,7 +229,7 @@ pub struct ArrayData {
 
     /// The child(ren) of this array. Only non-empty for nested types, currently
     /// `ListArray` and `StructArray`.
-    child_data: Vec<ArrayDataRef>,
+    child_data: Vec<ArrayData>,

Review comment:
       Would the impact of the change be that the overall size of a nested `ArrayData` becomes slightly larger; as we're now only using `Arc` on `Buffer`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-767737664


   > I wouldn't have expected > 20% changes for bigger arrays (512/1024 elements) in the kernels only based on removing / not creating or cloning the Arc. Any idea why that could be the case?
   
   I remember that we had a similar issue some time ago where we were using `data` instead of `data_ref`, and that was causing a significant performance difference.
   
   My general approach here is to write simpler code in the hope that it helps the compiler.
   
   In practice, I humbly make these offerings to the gods of LLVM and pray for a sign.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
yordan-pavlov commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r598286407



##########
File path: rust/arrow/src/array/array.rs
##########
@@ -56,11 +55,13 @@ pub trait Array: fmt::Debug + Send + Sync + JsonEqual {
     /// ```
     fn as_any(&self) -> &Any;
 
-    /// Returns a reference-counted pointer to the underlying data of this array.
-    fn data(&self) -> ArrayDataRef;
+    /// Returns a reference to the underlying data of this array.
+    fn data(&self) -> &ArrayData;

Review comment:
       yes, deprecate `data()` is what I meant, immediate removal is not great for backwards compatibility




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774287446


   https://issues.apache.org/jira/browse/ARROW-11511


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r571429799



##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -221,17 +221,36 @@ fn write_leaf(
             )?
         }
         ColumnWriter::BoolColumnWriter(ref mut typed) => {
-            let array = arrow_array::BooleanArray::from(column.data());
+            let array = column

Review comment:
       this is a nice find / cleanup I think

##########
File path: rust/arrow/src/array/array_union.rs
##########
@@ -403,16 +397,6 @@ mod tests {
             let value = slot.value(0);
             assert_eq!(expected_value, &value);
         }
-
-        assert_eq!(

Review comment:
       What was the reasoning for removing this size test rather than updating it for the new values ?

##########
File path: rust/arrow/src/array/null.rs
##########
@@ -134,13 +130,6 @@ mod tests {
         assert_eq!(null_arr.len(), 32);
         assert_eq!(null_arr.null_count(), 32);
         assert_eq!(null_arr.is_valid(0), false);
-
-        assert_eq!(0, null_arr.get_buffer_memory_size());

Review comment:
       same question here about why does the PR remove the test coverage rather than updating it?

##########
File path: rust/arrow/src/array/equal/mod.rs
##########
@@ -877,17 +903,17 @@ mod tests {
                 builder.append(false).unwrap()
             }
         }
-        builder.finish().data()
+        builder.finish().data().clone()

Review comment:
       it probably doesn't matter as this is in a test, but is this `clone()` necessary? Or maybe there is just no good way to destructure the Array to give back the `ArrayData`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-802247250


   > does anyone knows why we are using Arc<ArrayData> in all arrays?
   
   I think it comes down just to memory size, as cloning `Arc<T>` would give us a pointer-sized variable, compared to cloning `T` including its `DataType`.
   
   > Do you envision an issue with removing the Arc?
   
   I personally don't. The underlying bytes are still backed by `Arc`, so we shouldn't use a lot more memory for copies of arrays.
   The perf benchmarks are also positive.
   
   > Would someone be so kind and run the benches independently, just to be sure.
   
   I attached my set of benchmark results, but I think the datafusion benchmarks would be more meaningful.
   @Dandandan would you be able to run datafusion benchmarks from this branch vs master, and share results?
   
   Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r582556952



##########
File path: rust/arrow/src/array/array.rs
##########
@@ -56,11 +55,13 @@ pub trait Array: fmt::Debug + Send + Sync + JsonEqual {
     /// ```
     fn as_any(&self) -> &Any;
 
-    /// Returns a reference-counted pointer to the underlying data of this array.
-    fn data(&self) -> ArrayDataRef;
+    /// Returns a reference to the underlying data of this array.
+    fn data(&self) -> &ArrayData;

Review comment:
       I understand why we wanted to avoid `Array::data().clone()` to get `ArrayData`, but with these 2 functions returning the same reference, I'd opt to make `Array::data() -> ArrayData`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774423086


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=h1) Report
   > Merging [#9329](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=desc) (7a10fe7) into [master](https://codecov.io/gh/apache/arrow/commit/77ae93d6ecaac8fb5f4a18ca5287b7456cd88784?el=desc) (77ae93d) will **decrease** coverage by `0.27%`.
   > The diff coverage is `86.32%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9329/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9329      +/-   ##
   ==========================================
   - Coverage   82.00%   81.72%   -0.28%     
   ==========================================
     Files         230      230              
     Lines       53487    53584      +97     
   ==========================================
   - Hits        43864    43794      -70     
   - Misses       9623     9790     +167     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS91dGlsLnJz) | `98.92% <ø> (ø)` | |
   | [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `87.82% <ø> (ø)` | |
   | [rust/arrow/src/util/bench\_util.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC9iZW5jaF91dGlsLnJz) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/examples/simple\_udaf.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL2V4YW1wbGVzL3NpbXBsZV91ZGFmLnJz) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `80.00% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `94.33% <ø> (-0.24%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/extension.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXh0ZW5zaW9uLnJz) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/physical\_plan/group\_scalar.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2dyb3VwX3NjYWxhci5ycw==) | `67.10% <0.00%> (-0.90%)` | :arrow_down: |
   | [...t/datafusion/src/physical\_plan/math\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21hdGhfZXhwcmVzc2lvbnMucnM=) | `30.00% <ø> (ø)` | |
   | [rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=) | `88.10% <0.00%> (-0.14%)` | :arrow_down: |
   | ... and [97 more](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=footer). Last update [5e5c2b4...abcdabe](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-802733041


   > I fixed the clippy lint already in master, so we need not worry about it.
   
   Thanks -- you beat me to it! (#9754) 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-781292564


   I think this one needs a rebase (Jorge fixed the real error in https://github.com/apache/arrow/pull/9511)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] yordan-pavlov commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
yordan-pavlov commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-767877784


   @jorgecarleitao that's an interesting idea; simplification makes sense, especially if cloning `ArrayData` is cheap and there is no obvious reason to use `Arc<ArrayData>`.
   
   I ran the filter benchmarks with SIMD enabled and got the following results (in comparison to master).
   Possibly some improvement, but the results are not very clear - often the differences are small.
   
   filter u8               time:   [556.89 us 562.29 us 568.54 us]
                           change: [-3.0769% +0.2211% +3.2221%] (p = 0.89 > 0.05)
                           
   
   filter u8 high selectivity
                           time:   [13.859 us 13.959 us 14.081 us]
                           change: [-2.3832% +0.3285% +3.0131%] (p = 0.81 > 0.05)
   
   filter u8 low selectivity
                           time:   [7.7940 us 7.8819 us 7.9793 us]
                           change: [+8.3043% +11.999% +16.830%] (p = 0.00 < 0.05)
                           
   filter context u8       time:   [205.24 us 209.06 us 213.16 us]
                           change: [+0.4175% +5.7203% +11.914%] (p = 0.04 < 0.05)
                           
   filter context u8 high selectivity
                           time:   [3.7003 us 3.7456 us 3.8010 us]
                           change: [-3.2048% -0.7748% +1.5407%] (p = 0.55 > 0.05)
                           
   filter context u8 low selectivity
                           time:   [1.3066 us 1.3243 us 1.3443 us]
                           change: [-3.4133% -1.5255% +0.1897%] (p = 0.11 > 0.05)
                           
   filter context u8 w NULLs
                           time:   [496.69 us 501.24 us 506.60 us]
                           change: [-8.8715% -4.9662% -1.0448%] (p = 0.02 < 0.05)
                           
   filter context u8 w NULLs high selectivity
                           time:   [254.92 us 258.17 us 262.51 us]
                           change: [+8.6837% +13.067% +17.583%] (p = 0.00 < 0.05)
                           
   filter context u8 w NULLs low selectivity
                           time:   [1.9758 us 2.0159 us 2.0623 us]
                           change: [-10.590% -7.4822% -4.2891%] (p = 0.00 < 0.05)
                           
   filter f32              time:   [779.40 us 789.32 us 801.26 us]
                           change: [-17.189% -13.338% -9.2795%] (p = 0.00 < 0.05)
                           
   filter context f32      time:   [485.95 us 495.16 us 506.07 us]
                           change: [-7.7798% -3.5929% +0.9816%] (p = 0.12 > 0.05)
                           
   filter context f32 high selectivity
                           time:   [258.28 us 259.96 us 261.83 us]
                           change: [+2.0252% +5.7103% +9.4974%] (p = 0.00 < 0.05)
                           
   filter context f32 low selectivity
                           time:   [1.9878 us 2.0332 us 2.0857 us]
                           change: [-7.9039% -4.8554% -1.8226%] (p = 0.00 < 0.05)
                           
   filter context string   time:   [639.51 us 646.56 us 655.15 us]
                           change: [-9.9164% -5.9396% -2.2137%] (p = 0.00 < 0.05)
                           
   filter context string high selectivity
                           time:   [1.0009 ms 1.0168 ms 1.0344 ms]
                           change: [-5.8412% -1.1777% +4.1292%] (p = 0.65 > 0.05)
                           
   filter context string low selectivity
                           time:   [2.9339 us 2.9668 us 3.0060 us]
                           change: [-6.4880% -2.3484% +1.7663%] (p = 0.29 > 0.05)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-770440474


   > How cheap cloning of the ArrayData struct is probably depends a lot on the datatype. The removed indirection seems to be beneficial, but instead we now need to clone all fields, which include
   > 
   >     * the datatype enum, which can include boxes/vecs of nested types and the field metadata
   > 
   >     * the vec of buffers, and cloning a buffer would include cloning the `Arc<Bytes>`
   > 
   > 
   > So for primitive types, removing the arc should be a benefit because of less indirection when accessing it, it would still need to allocates vectors and clone the Arc which is inside the Buffer.
   > 
   > For more complex types, cloning the DataType enum itself and multiple buffers could be slower than the speedup gained by less indirection.
   
   I am sorry, I am not sure I follow the arguments:
   
   * `DataType`: all our operations already clone `DataType`, since they create a new `ArrayData`. Even something as simple as slicing an array requires cloning a `DataType`, as we need to increase the `offset`, which requires a new `ArrayData`, which requires a new `DataType`.
   * `Arc<Bytes>`: I do not see how using `Arc<Bytes>` is relevant to why we should use `Arc<ArrayData>`. Regardless, AFAI understand, its usage is really important: `Bytes` is an immutable memory region, and sharing it between arrays imo makes a lot of sense: the most used case is when the null buffer does not change due to an operation. So, the two natural options are `Rc` and an `Arc`, and allowing Arrays to be sharable across thread boundaries seems like a good justification for `Arc`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays (1.2-2x speedup)

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-767706849


   cc @Dandandan ^_^


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774423086


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=h1) Report
   > Merging [#9329](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=desc) (00dd9b2) into [master](https://codecov.io/gh/apache/arrow/commit/4086409e1b4cf4feac3b5c84060c69e6c7de898d?el=desc) (4086409) will **decrease** coverage by `0.31%`.
   > The diff coverage is `94.30%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9329/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9329      +/-   ##
   ==========================================
   - Coverage   82.33%   82.01%   -0.32%     
   ==========================================
     Files         233      233              
     Lines       54443    53991     -452     
   ==========================================
   - Hits        44823    44283     -540     
   - Misses       9620     9708      +88     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `88.03% <ø> (+0.21%)` | :arrow_up: |
   | [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `38.15% <ø> (ø)` | |
   | [...t/datafusion/src/physical\_plan/math\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21hdGhfZXhwcmVzc2lvbnMucnM=) | `30.00% <ø> (ø)` | |
   | [rust/integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9saWIucnM=) | `0.00% <0.00%> (ø)` | |
   | [rust/arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZGF0YS5ycw==) | `76.87% <55.55%> (-2.00%)` | :arrow_down: |
   | [rust/datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGxhbm5lci5ycw==) | `83.15% <57.14%> (-0.24%)` | :arrow_down: |
   | [rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=) | `77.63% <66.66%> (ø)` | |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `88.70% <71.42%> (+0.18%)` | :arrow_up: |
   | [rust/arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3JlYWRlci5ycw==) | `84.36% <83.33%> (ø)` | |
   | [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `82.76% <83.33%> (-0.07%)` | :arrow_down: |
   | ... and [53 more](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=footer). Last update [3a98a68...00dd9b2](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays (1.2-2x speedup)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-767729858


   Nice @jorgecarleitao !! Looking forward to see the impact in DataFusion on the slicing of smaller arrays and maybe other parts as well. I wouldn't have expected > 20% changes for bigger arrays (512/1024 elements) in the kernels only based on removing / not creating or cloning the Arc. Any idea why that could be the case?
   
   Ergonomically it is definitely an improvement, the ArrayData struct is small and should be cheap to clone, this avoids quite some boilerplate.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-803576080


   @nevi-me performance on some queries in here https://github.com/h2oai/db-benchmark/pull/182 improve by ~5%.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-780076943


   @jorgecarleitao  shall we merge this PR? Or is it still an "another of those experiments to gather feedback and share results." in the description?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774423086


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=h1) Report
   > Merging [#9329](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=desc) (6cdec30) into [master](https://codecov.io/gh/apache/arrow/commit/a3e151307143d020ab9a0af609b2d185e1006785?el=desc) (a3e1513) will **decrease** coverage by `0.28%`.
   > The diff coverage is `92.57%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9329/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9329      +/-   ##
   ==========================================
   - Coverage   82.32%   82.03%   -0.29%     
   ==========================================
     Files         245      245              
     Lines       56277    55927     -350     
   ==========================================
   - Hits        46328    45879     -449     
   - Misses       9949    10048      +99     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `87.44% <ø> (+0.21%)` | :arrow_up: |
   | [rust/arrow/src/json/writer.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi93cml0ZXIucnM=) | `87.91% <ø> (+1.74%)` | :arrow_up: |
   | [...t/datafusion/src/physical\_plan/math\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21hdGhfZXhwcmVzc2lvbnMucnM=) | `27.27% <ø> (ø)` | |
   | [rust/integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9saWIucnM=) | `0.00% <0.00%> (ø)` | |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `76.80% <50.00%> (+0.49%)` | :arrow_up: |
   | [rust/arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZGF0YS5ycw==) | `76.87% <55.55%> (-2.00%)` | :arrow_down: |
   | [rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=) | `77.61% <66.66%> (ø)` | |
   | [rust/arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3JlYWRlci5ycw==) | `84.36% <83.33%> (ø)` | |
   | [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `83.45% <83.33%> (-0.06%)` | :arrow_down: |
   | [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `84.53% <85.71%> (-7.09%)` | :arrow_down: |
   | ... and [58 more](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=footer). Last update [a3e1513...6cdec30](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-802731633


   I concur @alamb, I think the PR's been open long enough for those interested to comment.
   
   I fixed the clippy lint already in master, so we need not worry about it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r574634613



##########
File path: rust/arrow/src/array/equal/mod.rs
##########
@@ -877,17 +903,17 @@ mod tests {
                 builder.append(false).unwrap()
             }
         }
-        builder.finish().data()
+        builder.finish().data().clone()

Review comment:
       Exactly, there is no way atm. This is solved with the broader proposal, for now I just cloned it. xD




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-770892740


   > all our operations already clone `DataType`, since they create a new `ArrayData`
   
   You are right, I wasn't looking closely enough at the code or thought only about `ArrayData::clone` vs `Arc<ArrayData>::clone`. I still think there might be some overhead there and it might be worth experimenting with an `Arc<DataType>` or replacing some of the `Box` in the DataType enum with `Arc` in a separate PR. For nested types like `List<Dictionary<Int32, Utf8>>` I think cloning the type currently involves more allocations than cloning the vec of buffers/child data.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
yordan-pavlov commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r564891587



##########
File path: rust/arrow/src/array/array.rs
##########
@@ -56,11 +55,13 @@ pub trait Array: fmt::Debug + Send + Sync + JsonEqual {
     /// ```
     fn as_any(&self) -> &Any;
 
-    /// Returns a reference-counted pointer to the underlying data of this array.
-    fn data(&self) -> ArrayDataRef;
+    /// Returns a reference to the underlying data of this array.
+    fn data(&self) -> &ArrayData;

Review comment:
       if both `data()` and `data_ref()` return `&ArrayData` may be just `data_ref()` is enough and `data()` should be removed; or possibly change `data()` to return `ArrayData`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jhorstmann commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-768208269


   How cheap cloning of the ArrayData struct is probably depends a lot on the datatype. The removed indirection seems to be beneficial, but instead we now need to clone all fields, which include
   
    - the datatype enum, which can include boxes/vecs of nested types and the field metadata
    - the vec of buffers, and cloning a buffer would include cloning the `Arc<Bytes>`
   
   So for primitive types, removing the arc should be a benefit because of less indirection when accessing it, it would still need to allocates vectors and clone the Arc which is inside the Buffer.
   
   For more complex types, cloning the DataType enum itself and multiple buffers could be slower than the speedup gained by less indirection.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me closed pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #9329:
URL: https://github.com/apache/arrow/pull/9329


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-787443066


   Thanks @nevi-me . Let me know if there is anything I can do to help


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774289662


   @nevi-me @sunchao , I would appreciate your feedback here: I can't find a rational to keep `Arc<ArrayData>`, but I do not have all the background, and you are probably the most experienced persons for a qualified answer here. I marked you as reviewers as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-804405923


   Thanks @nevi-me  for pushing the over the line. 🎉 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774423086


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=h1) Report
   > Merging [#9329](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=desc) (fbcd2d3) into [master](https://codecov.io/gh/apache/arrow/commit/ab307365198d5724a0b3331a05704d0fe4bcd710?el=desc) (ab30736) will **decrease** coverage by `0.31%`.
   > The diff coverage is `92.57%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9329/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9329      +/-   ##
   ==========================================
   - Coverage   82.12%   81.80%   -0.32%     
   ==========================================
     Files         243      243              
     Lines       54737    54288     -449     
   ==========================================
   - Hits        44951    44411     -540     
   - Misses       9786     9877      +91     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `88.03% <ø> (+0.21%)` | :arrow_up: |
   | [...t/datafusion/src/physical\_plan/math\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21hdGhfZXhwcmVzc2lvbnMucnM=) | `27.27% <ø> (ø)` | |
   | [rust/integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9saWIucnM=) | `0.00% <0.00%> (ø)` | |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `76.40% <50.00%> (+0.49%)` | :arrow_up: |
   | [rust/arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZGF0YS5ycw==) | `76.87% <55.55%> (-2.00%)` | :arrow_down: |
   | [rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=) | `77.63% <66.66%> (ø)` | |
   | [rust/arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3JlYWRlci5ycw==) | `84.36% <83.33%> (ø)` | |
   | [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `82.76% <83.33%> (-0.07%)` | :arrow_down: |
   | [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `84.79% <85.71%> (-6.96%)` | :arrow_down: |
   | [rust/arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbW9kLnJz) | `93.24% <98.48%> (+0.04%)` | :arrow_up: |
   | ... and [50 more](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=footer). Last update [ab30736...fbcd2d3](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org