You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/26 17:35:12 UTC
[GitHub] [arrow] jorgecarleitao opened a new pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays
jorgecarleitao opened a new pull request #9329:
URL: https://github.com/apache/arrow/pull/9329
This PR is another of those experiments to gather feedback and share results.
# Rational
Currently, all our arrays use an `Arc<ArrayData>`, which they expose via `Array::data` and `Array::data_ref`. This adds a level of indirection. Now, it happens that, afaik, in the current code base `Arc<>` is not needed.
On #9271, where we are observing some perf issues with small arrays, and one of the ideas that came up was to get rid of `Arc` and see what happens.
# This PR
Well, this PR replaces all `Arc<ArrayData>` by `ArrayData`. On the one hand, this means that cloning an array is a tad more expensive (`Arc` vs `ArrayData`), even though we seldom clone an `Arc<ArrayData>`. On the other hand, it means that often the compiler can optimize out, as many operations will never leave the stack.
The gist of the benchmarks below is:
* ~10%-20% improvement over basically everything
* ~20%-100% improvement in `take`
There is some noise, as there are benches that are not expected to be affected and are being affected, which I am trying to reduce by running the benches at night.
Also, the arrow tests pass, but I did not port this to the other crates nor the SIMD branch of the code.
Personally, I like this PR because it makes working with `ArrayData` and arrays so much simpler: no need for `Arc::new` or `as_ref` and company (besides the speed).
# questions
* does anyone knows why we are using `Arc<ArrayData>` in all arrays?
* Do you envision an issue with removing the `Arc`?
* Would someone be so kind and run the benches independently, just to be sure.
# Benchmarks
```bash
# modify cargo.toml by adding `bench = false` to the section [lib]
git checkout master
cargo bench --benches -- --save-baseline `git branch --show-current`-`git rev-parse --short HEAD`
git checkout arcless
cargo bench --benches -- --save-baseline `git branch --show-current`-`git rev-parse --short HEAD`
```
```bash
critcmp arcless-3dbcaca49 master-437c8c944 -t 10
```
```
group arcless-3dbcaca49 master-437c8c944
----- ----------------- ----------------
add_nulls_512 1.00 509.1±98.69ns ? B/sec 1.14 581.9±98.83ns ? B/sec
array_from_vec 128 1.00 1180.2±304.50ns ? B/sec 1.20 1411.2±475.00ns ? B/sec
array_slice 128 1.00 383.5±430.02ns ? B/sec 1.23 471.3±94.42ns ? B/sec
array_slice 2048 1.00 393.3±94.60ns ? B/sec 1.22 478.2±293.70ns ? B/sec
array_slice 512 1.00 393.0±77.38ns ? B/sec 1.26 496.1±145.06ns ? B/sec
array_string_from_vec 128 1.00 3.5±0.39µs ? B/sec 1.22 4.3±1.68µs ? B/sec
array_string_from_vec 256 1.00 4.4±0.46µs ? B/sec 1.11 4.9±0.94µs ? B/sec
buffer_bit_ops or 1.11 706.1±102.45ns ? B/sec 1.00 633.3±56.65ns ? B/sec
cast date32 to date64 512 1.00 7.2±0.85µs ? B/sec 1.11 8.0±1.25µs ? B/sec
cast int32 to float64 512 1.00 3.0±0.35µs ? B/sec 1.16 3.4±0.77µs ? B/sec
cast int32 to int64 512 1.00 3.3±0.41µs ? B/sec 1.11 3.6±0.67µs ? B/sec
cast time32s to time32ms 512 1.13 1864.9±585.48ns ? B/sec 1.00 1648.3±145.69ns ? B/sec
cast timestamp_ms to i64 512 1.00 350.6±67.17ns ? B/sec 1.35 474.9±72.34ns ? B/sec
cast timestamp_ms to timestamp_ns 512 1.00 2.3±0.28µs ? B/sec 1.22 2.8±4.01µs ? B/sec
cast utf8 to date64 512 1.22 96.5±22.75µs ? B/sec 1.00 79.1±6.32µs ? B/sec
concat str 1024 1.00 9.9±0.41µs ? B/sec 1.34 13.2±5.26µs ? B/sec
eq scalar Float32 1.25 82.9±33.34µs ? B/sec 1.00 66.4±6.96µs ? B/sec
equal_512 1.00 28.1±18.85ns ? B/sec 1.76 49.5±12.56ns ? B/sec
equal_bool_512 1.00 21.0±4.26ns ? B/sec 1.97 41.5±2.29ns ? B/sec
equal_bool_513 1.00 28.3±25.49ns ? B/sec 1.60 45.4±3.84ns ? B/sec
equal_nulls_512 1.00 2.4±0.19µs ? B/sec 1.10 2.7±0.37µs ? B/sec
equal_string_512 1.00 98.3±21.85ns ? B/sec 1.14 111.9±5.30ns ? B/sec
equal_string_nulls_512 1.00 3.7±0.46µs ? B/sec 1.10 4.0±1.29µs ? B/sec
filter context f32 1.00 503.9±14.15µs ? B/sec 1.11 561.8±68.59µs ? B/sec
filter context f32 high selectivity 1.00 309.3±8.26µs ? B/sec 1.11 343.2±57.50µs ? B/sec
filter context string high selectivity 1.00 1111.6±36.80µs ? B/sec 1.11 1238.6±174.82µs ? B/sec
filter context u8 1.00 238.9±12.44µs ? B/sec 1.18 281.0±112.36µs ? B/sec
filter context u8 low selectivity 1.17 2.4±1.25µs ? B/sec 1.00 2.1±0.26µs ? B/sec
filter context u8 w NULLs 1.00 526.7±98.56µs ? B/sec 1.11 586.5±155.02µs ? B/sec
filter context u8 w NULLs high selectivity 1.00 298.7±6.11µs ? B/sec 1.15 344.2±63.42µs ? B/sec
filter context u8 w NULLs low selectivity 1.00 2.7±0.21µs ? B/sec 1.31 3.5±5.02µs ? B/sec
filter f32 1.00 802.9±44.38µs ? B/sec 1.21 971.4±135.07µs ? B/sec
filter u8 low selectivity 1.00 7.8±0.23µs ? B/sec 1.14 8.9±1.47µs ? B/sec
from_slice 1.13 1665.5±215.62µs ? B/sec 1.00 1475.4±324.38µs ? B/sec
from_slice prepared 1.31 1092.8±125.08µs ? B/sec 1.00 834.2±59.41µs ? B/sec
gt Float32 1.00 71.8±30.69µs ? B/sec 1.18 84.6±9.31µs ? B/sec
gt scalar Float32 1.82 74.1±25.74µs ? B/sec 1.00 40.8±4.06µs ? B/sec
gt_eq scalar Float32 1.14 71.8±24.75µs ? B/sec 1.00 63.0±7.38µs ? B/sec
json_list_primitive_to_record_batch 1.00 64.6±4.38µs ? B/sec 1.14 73.4±11.01µs ? B/sec
length 1.00 2.9±0.07µs ? B/sec 1.29 3.7±0.77µs ? B/sec
like_utf8 scalar ends with 1.00 243.8±15.05µs ? B/sec 1.21 294.5±45.77µs ? B/sec
like_utf8 scalar equals 1.21 101.6±13.08µs ? B/sec 1.00 83.7±10.13µs ? B/sec
limit 512, 512 1.00 327.3±60.96ns ? B/sec 1.38 451.1±130.11ns ? B/sec
lt Float32 1.00 69.4±7.77µs ? B/sec 1.27 88.3±12.27µs ? B/sec
lt scalar Float32 1.00 66.5±4.71µs ? B/sec 1.31 87.3±25.97µs ? B/sec
lt_eq Float32 1.00 118.9±19.95µs ? B/sec 1.26 149.9±56.97µs ? B/sec
max nulls 512 1.00 1567.2±204.81ns ? B/sec 1.16 1820.6±260.46ns ? B/sec
min nulls 512 1.00 1706.9±802.12ns ? B/sec 1.16 1980.7±259.55ns ? B/sec
min nulls string 512 1.00 7.5±0.44µs ? B/sec 1.18 8.9±1.66µs ? B/sec
min string 512 1.00 5.6±0.30µs ? B/sec 1.10 6.2±1.42µs ? B/sec
multiply 512 1.00 541.6±224.47ns ? B/sec 1.11 601.0±102.46ns ? B/sec
mutable 1.11 630.0±349.07µs ? B/sec 1.00 566.7±77.31µs ? B/sec
mutable str 1024 1.00 1515.5±59.36µs ? B/sec 1.17 1776.6±329.88µs ? B/sec
neq Float32 1.00 77.8±16.48µs ? B/sec 1.18 91.5±20.30µs ? B/sec
neq scalar Float32 1.52 101.2±52.55µs ? B/sec 1.00 66.4±6.21µs ? B/sec
sort 2^10 1.00 147.2±4.97µs ? B/sec 1.19 175.6±25.84µs ? B/sec
sort 2^12 1.00 726.9±50.41µs ? B/sec 1.17 849.8±109.51µs ? B/sec
sort nulls 2^12 1.00 631.5±29.67µs ? B/sec 1.18 746.5±87.64µs ? B/sec
struct_array_from_vec 1024 1.20 17.3±4.32µs ? B/sec 1.00 14.4±0.93µs ? B/sec
subtract 512 1.00 487.5±173.36ns ? B/sec 1.18 574.3±37.34ns ? B/sec
sum nulls 512 1.11 363.1±73.55ns ? B/sec 1.00 327.8±97.10ns ? B/sec
take bool 512 1.00 2.2±0.28µs ? B/sec 1.22 2.6±0.32µs ? B/sec
take bool nulls 1024 1.00 4.0±0.25µs ? B/sec 2.04 8.1±2.08µs ? B/sec
take bool nulls 512 1.00 2.2±0.13µs ? B/sec 1.80 3.9±0.77µs ? B/sec
take i32 1024 1.00 1507.3±81.31ns ? B/sec 1.57 2.4±0.58µs ? B/sec
take i32 512 1.00 1030.9±39.48ns ? B/sec 1.32 1364.4±97.02ns ? B/sec
take i32 nulls 1024 1.00 1538.9±51.37ns ? B/sec 1.81 2.8±1.00µs ? B/sec
take i32 nulls 512 1.00 1023.8±51.95ns ? B/sec 2.48 2.5±1.64µs ? B/sec
take str 512 1.00 4.1±0.51µs ? B/sec 1.14 4.6±0.96µs ? B/sec
take str null indices 1024 1.00 6.1±0.40µs ? B/sec 1.18 7.2±1.63µs ? B/sec
take str null indices 512 1.00 3.8±0.15µs ? B/sec 1.21 4.6±1.32µs ? B/sec
take str null values 1024 1.00 6.0±0.40µs ? B/sec 1.18 7.1±1.02µs ? B/sec
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-791898261
Benchmarks on an ARM machine:
```
nevilledipale@Nevilles-MacBook-Air rust % critcmp master-bfa99d98e arcless-ef2250b9d -t 10
group arcless-ef2250b9d master-bfa99d98e
----- ----------------- ----------------
add_nulls_512 1.00 351.5±22.53ns ? B/sec 1.12 394.1±19.39ns ? B/sec
array_from_vec 128 1.00 555.6±26.43ns ? B/sec 1.11 619.1±17.89ns ? B/sec
array_slice 128 1.00 98.3±8.62ns ? B/sec 1.42 139.4±7.01ns ? B/sec
array_slice 2048 1.00 105.8±13.50ns ? B/sec 1.37 145.1±8.93ns ? B/sec
array_slice 512 1.00 101.9±6.16ns ? B/sec 1.41 143.3±9.44ns ? B/sec
cast date32 to date64 512 1.00 246.7±17.81ns ? B/sec 1.23 303.7±14.81ns ? B/sec
cast time32s to time32ms 512 1.00 223.1±13.25ns ? B/sec 1.18 262.3±16.68ns ? B/sec
cast timestamp_ms to i64 512 1.00 98.6±9.32ns ? B/sec 1.56 153.7±7.79ns ? B/sec
divide_scalar 512 1.00 189.9±12.11ns ? B/sec 1.16 220.4±8.05ns ? B/sec
divide_scalar_nulls_512 1.00 186.6±14.60ns ? B/sec 1.21 225.1±11.87ns ? B/sec
eq Float32 1.15 40.0±1.54µs ? B/sec 1.00 34.7±1.30µs ? B/sec
equal_bool_512 1.00 25.8±1.34ns ? B/sec 1.12 28.9±1.15ns ? B/sec
equal_bool_513 1.00 27.7±1.36ns ? B/sec 1.14 31.5±1.42ns ? B/sec
gt Float32 1.15 40.6±2.00µs ? B/sec 1.00 35.2±1.75µs ? B/sec
gt_eq Float32 1.20 42.1±2.70µs ? B/sec 1.00 35.2±1.60µs ? B/sec
limit 512, 512 1.00 106.5±6.65ns ? B/sec 1.31 139.0±7.04ns ? B/sec
lt Float32 1.15 40.5±1.97µs ? B/sec 1.00 35.3±1.73µs ? B/sec
lt_eq Float32 1.14 40.3±2.40µs ? B/sec 1.00 35.2±1.50µs ? B/sec
max nulls 512 1.00 889.7±59.93ns ? B/sec 3.26 2.9±0.11µs ? B/sec
min nulls 512 1.00 916.4±70.85ns ? B/sec 2.64 2.4±0.10µs ? B/sec
multiply 512 1.00 361.3±28.15ns ? B/sec 1.12 403.5±21.82ns ? B/sec
neq Float32 1.16 40.6±1.72µs ? B/sec 1.00 34.9±1.68µs ? B/sec
take bool 1024 1.00 4.2±0.30µs ? B/sec 1.29 5.4±0.29µs ? B/sec
take bool 512 1.00 1750.4±100.33ns ? B/sec 1.38 2.4±0.19µs ? B/sec
take bool nulls 1024 1.00 3.3±0.16µs ? B/sec 1.30 4.3±0.28µs ? B/sec
take bool nulls 512 1.00 1635.3±119.87ns ? B/sec 1.20 1956.5±83.47ns ? B/sec
take i32 512 1.00 421.9±11.67ns ? B/sec 1.14 480.0±27.45ns ? B/sec
take str null values 1024 1.00 6.8±0.42µs ? B/sec 1.25 8.5±1.49µs ? B/sec
```
Only equality benches are worse, but not by that much.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r597200305
##########
File path: rust/arrow/src/array/array.rs
##########
@@ -56,11 +55,13 @@ pub trait Array: fmt::Debug + Send + Sync + JsonEqual {
/// ```
fn as_any(&self) -> &Any;
- /// Returns a reference-counted pointer to the underlying data of this array.
- fn data(&self) -> ArrayDataRef;
+ /// Returns a reference to the underlying data of this array.
+ fn data(&self) -> &ArrayData;
Review comment:
@yordan-pavlov I think it's safer to deprecate `data()`, then remove it later. We still use it a lot in the codebase, but it's often more performant to use `array_ref()`, so returning `ArrayData` doesn't guide users on a faster path.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-802722483
@nevi-me I think we should merge this PR:
1. It cleans the code up
2. It shows decent performance improvements
3. It sets us up for future work
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774423086
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=h1) Report
> Merging [#9329](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=desc) (967e13e) into [master](https://codecov.io/gh/apache/arrow/commit/356c300c5ee1e2b23a83652514af11e3a731d596?el=desc) (356c300) will **decrease** coverage by `0.30%`.
> The diff coverage is `95.02%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9329/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9329 +/- ##
==========================================
- Coverage 82.31% 82.01% -0.31%
==========================================
Files 233 233
Lines 54412 53958 -454
==========================================
- Hits 44791 44253 -538
- Misses 9621 9705 +84
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `88.03% <ø> (+0.21%)` | :arrow_up: |
| [...t/datafusion/src/physical\_plan/math\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21hdGhfZXhwcmVzc2lvbnMucnM=) | `30.00% <ø> (ø)` | |
| [rust/integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9saWIucnM=) | `0.00% <0.00%> (ø)` | |
| [rust/arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZGF0YS5ycw==) | `76.87% <55.55%> (-2.00%)` | :arrow_down: |
| [rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=) | `77.63% <66.66%> (ø)` | |
| [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `88.70% <71.42%> (+0.18%)` | :arrow_up: |
| [rust/arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3JlYWRlci5ycw==) | `84.36% <83.33%> (ø)` | |
| [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `82.76% <83.33%> (-0.07%)` | :arrow_down: |
| [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `84.79% <85.71%> (-6.96%)` | :arrow_down: |
| [rust/arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbW9kLnJz) | `93.24% <98.48%> (+0.04%)` | :arrow_up: |
| ... and [50 more](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=footer). Last update [356c300...967e13e](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r574834375
##########
File path: rust/arrow/src/array/data.rs
##########
@@ -229,7 +229,7 @@ pub struct ArrayData {
/// The child(ren) of this array. Only non-empty for nested types, currently
/// `ListArray` and `StructArray`.
- child_data: Vec<ArrayDataRef>,
+ child_data: Vec<ArrayData>,
Review comment:
Would the impact of the change be that the overall size of a nested `ArrayData` becomes slightly larger; as we're now only using `Arc` on `Buffer`?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-767737664
> I wouldn't have expected > 20% changes for bigger arrays (512/1024 elements) in the kernels only based on removing / not creating or cloning the Arc. Any idea why that could be the case?
I remember that we had a similar issue some time ago where we were using `data` instead of `data_ref`, and that was causing a significant performance difference.
My general approach here is to write simpler code in the hope that it helps the compiler.
In practice, I humbly make these offerings to the gods of LLVM and pray for a sign.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
yordan-pavlov commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r598286407
##########
File path: rust/arrow/src/array/array.rs
##########
@@ -56,11 +55,13 @@ pub trait Array: fmt::Debug + Send + Sync + JsonEqual {
/// ```
fn as_any(&self) -> &Any;
- /// Returns a reference-counted pointer to the underlying data of this array.
- fn data(&self) -> ArrayDataRef;
+ /// Returns a reference to the underlying data of this array.
+ fn data(&self) -> &ArrayData;
Review comment:
yes, deprecate `data()` is what I meant, immediate removal is not great for backwards compatibility
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774287446
https://issues.apache.org/jira/browse/ARROW-11511
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on a change in pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r571429799
##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -221,17 +221,36 @@ fn write_leaf(
)?
}
ColumnWriter::BoolColumnWriter(ref mut typed) => {
- let array = arrow_array::BooleanArray::from(column.data());
+ let array = column
Review comment:
this is a nice find / cleanup I think
##########
File path: rust/arrow/src/array/array_union.rs
##########
@@ -403,16 +397,6 @@ mod tests {
let value = slot.value(0);
assert_eq!(expected_value, &value);
}
-
- assert_eq!(
Review comment:
What was the reasoning for removing this size test rather than updating it for the new values ?
##########
File path: rust/arrow/src/array/null.rs
##########
@@ -134,13 +130,6 @@ mod tests {
assert_eq!(null_arr.len(), 32);
assert_eq!(null_arr.null_count(), 32);
assert_eq!(null_arr.is_valid(0), false);
-
- assert_eq!(0, null_arr.get_buffer_memory_size());
Review comment:
same question here about why does the PR remove the test coverage rather than updating it?
##########
File path: rust/arrow/src/array/equal/mod.rs
##########
@@ -877,17 +903,17 @@ mod tests {
builder.append(false).unwrap()
}
}
- builder.finish().data()
+ builder.finish().data().clone()
Review comment:
it probably doesn't matter as this is in a test, but is this `clone()` necessary? Or maybe there is just no good way to destructure the Array to give back the `ArrayData`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-802247250
> does anyone knows why we are using Arc<ArrayData> in all arrays?
I think it comes down just to memory size, as cloning `Arc<T>` would give us a pointer-sized variable, compared to cloning `T` including its `DataType`.
> Do you envision an issue with removing the Arc?
I personally don't. The underlying bytes are still backed by `Arc`, so we shouldn't use a lot more memory for copies of arrays.
The perf benchmarks are also positive.
> Would someone be so kind and run the benches independently, just to be sure.
I attached my set of benchmark results, but I think the datafusion benchmarks would be more meaningful.
@Dandandan would you be able to run datafusion benchmarks from this branch vs master, and share results?
Thanks
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me commented on a change in pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r582556952
##########
File path: rust/arrow/src/array/array.rs
##########
@@ -56,11 +55,13 @@ pub trait Array: fmt::Debug + Send + Sync + JsonEqual {
/// ```
fn as_any(&self) -> &Any;
- /// Returns a reference-counted pointer to the underlying data of this array.
- fn data(&self) -> ArrayDataRef;
+ /// Returns a reference to the underlying data of this array.
+ fn data(&self) -> &ArrayData;
Review comment:
I understand why we wanted to avoid `Array::data().clone()` to get `ArrayData`, but with these 2 functions returning the same reference, I'd opt to make `Array::data() -> ArrayData`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774423086
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=h1) Report
> Merging [#9329](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=desc) (7a10fe7) into [master](https://codecov.io/gh/apache/arrow/commit/77ae93d6ecaac8fb5f4a18ca5287b7456cd88784?el=desc) (77ae93d) will **decrease** coverage by `0.27%`.
> The diff coverage is `86.32%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9329/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9329 +/- ##
==========================================
- Coverage 82.00% 81.72% -0.28%
==========================================
Files 230 230
Lines 53487 53584 +97
==========================================
- Hits 43864 43794 -70
- Misses 9623 9790 +167
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS91dGlsLnJz) | `98.92% <ø> (ø)` | |
| [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `87.82% <ø> (ø)` | |
| [rust/arrow/src/util/bench\_util.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC9iZW5jaF91dGlsLnJz) | `0.00% <0.00%> (ø)` | |
| [rust/datafusion/examples/simple\_udaf.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL2V4YW1wbGVzL3NpbXBsZV91ZGFmLnJz) | `0.00% <0.00%> (ø)` | |
| [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `80.00% <ø> (ø)` | |
| [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `94.33% <ø> (-0.24%)` | :arrow_down: |
| [rust/datafusion/src/logical\_plan/extension.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXh0ZW5zaW9uLnJz) | `0.00% <ø> (ø)` | |
| [rust/datafusion/src/physical\_plan/group\_scalar.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2dyb3VwX3NjYWxhci5ycw==) | `67.10% <0.00%> (-0.90%)` | :arrow_down: |
| [...t/datafusion/src/physical\_plan/math\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21hdGhfZXhwcmVzc2lvbnMucnM=) | `30.00% <ø> (ø)` | |
| [rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=) | `88.10% <0.00%> (-0.14%)` | :arrow_down: |
| ... and [97 more](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=footer). Last update [5e5c2b4...abcdabe](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-802733041
> I fixed the clippy lint already in master, so we need not worry about it.
Thanks -- you beat me to it! (#9754)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-781292564
I think this one needs a rebase (Jorge fixed the real error in https://github.com/apache/arrow/pull/9511)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
yordan-pavlov commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-767877784
@jorgecarleitao that's an interesting idea; simplification makes sense, especially if cloning `ArrayData` is cheap and there is no obvious reason to use `Arc<ArrayData>`.
I ran the filter benchmarks with SIMD enabled and got the following results (in comparison to master).
Possibly some improvement, but the results are not very clear - often the differences are small.
filter u8 time: [556.89 us 562.29 us 568.54 us]
change: [-3.0769% +0.2211% +3.2221%] (p = 0.89 > 0.05)
filter u8 high selectivity
time: [13.859 us 13.959 us 14.081 us]
change: [-2.3832% +0.3285% +3.0131%] (p = 0.81 > 0.05)
filter u8 low selectivity
time: [7.7940 us 7.8819 us 7.9793 us]
change: [+8.3043% +11.999% +16.830%] (p = 0.00 < 0.05)
filter context u8 time: [205.24 us 209.06 us 213.16 us]
change: [+0.4175% +5.7203% +11.914%] (p = 0.04 < 0.05)
filter context u8 high selectivity
time: [3.7003 us 3.7456 us 3.8010 us]
change: [-3.2048% -0.7748% +1.5407%] (p = 0.55 > 0.05)
filter context u8 low selectivity
time: [1.3066 us 1.3243 us 1.3443 us]
change: [-3.4133% -1.5255% +0.1897%] (p = 0.11 > 0.05)
filter context u8 w NULLs
time: [496.69 us 501.24 us 506.60 us]
change: [-8.8715% -4.9662% -1.0448%] (p = 0.02 < 0.05)
filter context u8 w NULLs high selectivity
time: [254.92 us 258.17 us 262.51 us]
change: [+8.6837% +13.067% +17.583%] (p = 0.00 < 0.05)
filter context u8 w NULLs low selectivity
time: [1.9758 us 2.0159 us 2.0623 us]
change: [-10.590% -7.4822% -4.2891%] (p = 0.00 < 0.05)
filter f32 time: [779.40 us 789.32 us 801.26 us]
change: [-17.189% -13.338% -9.2795%] (p = 0.00 < 0.05)
filter context f32 time: [485.95 us 495.16 us 506.07 us]
change: [-7.7798% -3.5929% +0.9816%] (p = 0.12 > 0.05)
filter context f32 high selectivity
time: [258.28 us 259.96 us 261.83 us]
change: [+2.0252% +5.7103% +9.4974%] (p = 0.00 < 0.05)
filter context f32 low selectivity
time: [1.9878 us 2.0332 us 2.0857 us]
change: [-7.9039% -4.8554% -1.8226%] (p = 0.00 < 0.05)
filter context string time: [639.51 us 646.56 us 655.15 us]
change: [-9.9164% -5.9396% -2.2137%] (p = 0.00 < 0.05)
filter context string high selectivity
time: [1.0009 ms 1.0168 ms 1.0344 ms]
change: [-5.8412% -1.1777% +4.1292%] (p = 0.65 > 0.05)
filter context string low selectivity
time: [2.9339 us 2.9668 us 3.0060 us]
change: [-6.4880% -2.3484% +1.7663%] (p = 0.29 > 0.05)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-770440474
> How cheap cloning of the ArrayData struct is probably depends a lot on the datatype. The removed indirection seems to be beneficial, but instead we now need to clone all fields, which include
>
> * the datatype enum, which can include boxes/vecs of nested types and the field metadata
>
> * the vec of buffers, and cloning a buffer would include cloning the `Arc<Bytes>`
>
>
> So for primitive types, removing the arc should be a benefit because of less indirection when accessing it, it would still need to allocates vectors and clone the Arc which is inside the Buffer.
>
> For more complex types, cloning the DataType enum itself and multiple buffers could be slower than the speedup gained by less indirection.
I am sorry, I am not sure I follow the arguments:
* `DataType`: all our operations already clone `DataType`, since they create a new `ArrayData`. Even something as simple as slicing an array requires cloning a `DataType`, as we need to increase the `offset`, which requires a new `ArrayData`, which requires a new `DataType`.
* `Arc<Bytes>`: I do not see how using `Arc<Bytes>` is relevant to why we should use `Arc<ArrayData>`. Regardless, AFAI understand, its usage is really important: `Bytes` is an immutable memory region, and sharing it between arrays imo makes a lot of sense: the most used case is when the null buffer does not change due to an operation. So, the two natural options are `Rc` and an `Arc`, and allowing Arrays to be sharable across thread boundaries seems like a good justification for `Arc`.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays (1.2-2x speedup)
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-767706849
cc @Dandandan ^_^
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774423086
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=h1) Report
> Merging [#9329](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=desc) (00dd9b2) into [master](https://codecov.io/gh/apache/arrow/commit/4086409e1b4cf4feac3b5c84060c69e6c7de898d?el=desc) (4086409) will **decrease** coverage by `0.31%`.
> The diff coverage is `94.30%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9329/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9329 +/- ##
==========================================
- Coverage 82.33% 82.01% -0.32%
==========================================
Files 233 233
Lines 54443 53991 -452
==========================================
- Hits 44823 44283 -540
- Misses 9620 9708 +88
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `88.03% <ø> (+0.21%)` | :arrow_up: |
| [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `38.15% <ø> (ø)` | |
| [...t/datafusion/src/physical\_plan/math\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21hdGhfZXhwcmVzc2lvbnMucnM=) | `30.00% <ø> (ø)` | |
| [rust/integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9saWIucnM=) | `0.00% <0.00%> (ø)` | |
| [rust/arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZGF0YS5ycw==) | `76.87% <55.55%> (-2.00%)` | :arrow_down: |
| [rust/datafusion/src/sql/planner.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zcWwvcGxhbm5lci5ycw==) | `83.15% <57.14%> (-0.24%)` | :arrow_down: |
| [rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=) | `77.63% <66.66%> (ø)` | |
| [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `88.70% <71.42%> (+0.18%)` | :arrow_up: |
| [rust/arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3JlYWRlci5ycw==) | `84.36% <83.33%> (ø)` | |
| [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `82.76% <83.33%> (-0.07%)` | :arrow_down: |
| ... and [53 more](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=footer). Last update [3a98a68...00dd9b2](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays (1.2-2x speedup)
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-767729858
Nice @jorgecarleitao !! Looking forward to see the impact in DataFusion on the slicing of smaller arrays and maybe other parts as well. I wouldn't have expected > 20% changes for bigger arrays (512/1024 elements) in the kernels only based on removing / not creating or cloning the Arc. Any idea why that could be the case?
Ergonomically it is definitely an improvement, the ArrayData struct is small and should be cheap to clone, this avoids quite some boilerplate.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-803576080
@nevi-me performance on some queries in here https://github.com/h2oai/db-benchmark/pull/182 improve by ~5%.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-780076943
@jorgecarleitao shall we merge this PR? Or is it still an "another of those experiments to gather feedback and share results." in the description?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774423086
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=h1) Report
> Merging [#9329](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=desc) (6cdec30) into [master](https://codecov.io/gh/apache/arrow/commit/a3e151307143d020ab9a0af609b2d185e1006785?el=desc) (a3e1513) will **decrease** coverage by `0.28%`.
> The diff coverage is `92.57%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9329/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9329 +/- ##
==========================================
- Coverage 82.32% 82.03% -0.29%
==========================================
Files 245 245
Lines 56277 55927 -350
==========================================
- Hits 46328 45879 -449
- Misses 9949 10048 +99
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `87.44% <ø> (+0.21%)` | :arrow_up: |
| [rust/arrow/src/json/writer.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi93cml0ZXIucnM=) | `87.91% <ø> (+1.74%)` | :arrow_up: |
| [...t/datafusion/src/physical\_plan/math\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21hdGhfZXhwcmVzc2lvbnMucnM=) | `27.27% <ø> (ø)` | |
| [rust/integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9saWIucnM=) | `0.00% <0.00%> (ø)` | |
| [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `76.80% <50.00%> (+0.49%)` | :arrow_up: |
| [rust/arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZGF0YS5ycw==) | `76.87% <55.55%> (-2.00%)` | :arrow_down: |
| [rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=) | `77.61% <66.66%> (ø)` | |
| [rust/arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3JlYWRlci5ycw==) | `84.36% <83.33%> (ø)` | |
| [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `83.45% <83.33%> (-0.06%)` | :arrow_down: |
| [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `84.53% <85.71%> (-7.09%)` | :arrow_down: |
| ... and [58 more](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=footer). Last update [a3e1513...6cdec30](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-802731633
I concur @alamb, I think the PR's been open long enough for those interested to comment.
I fixed the clippy lint already in master, so we need not worry about it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r574634613
##########
File path: rust/arrow/src/array/equal/mod.rs
##########
@@ -877,17 +903,17 @@ mod tests {
builder.append(false).unwrap()
}
}
- builder.finish().data()
+ builder.finish().data().clone()
Review comment:
Exactly, there is no way atm. This is solved with the broader proposal, for now I just cloned it. xD
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jhorstmann commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-770892740
> all our operations already clone `DataType`, since they create a new `ArrayData`
You are right, I wasn't looking closely enough at the code or thought only about `ArrayData::clone` vs `Arc<ArrayData>::clone`. I still think there might be some overhead there and it might be worth experimenting with an `Arc<DataType>` or replacing some of the `Box` in the DataType enum with `Arc` in a separate PR. For nested types like `List<Dictionary<Int32, Utf8>>` I think cloning the type currently involves more allocations than cloning the vec of buffers/child data.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] yordan-pavlov commented on a change in pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
yordan-pavlov commented on a change in pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#discussion_r564891587
##########
File path: rust/arrow/src/array/array.rs
##########
@@ -56,11 +55,13 @@ pub trait Array: fmt::Debug + Send + Sync + JsonEqual {
/// ```
fn as_any(&self) -> &Any;
- /// Returns a reference-counted pointer to the underlying data of this array.
- fn data(&self) -> ArrayDataRef;
+ /// Returns a reference to the underlying data of this array.
+ fn data(&self) -> &ArrayData;
Review comment:
if both `data()` and `data_ref()` return `&ArrayData` may be just `data_ref()` is enough and `data()` should be removed; or possibly change `data()` to return `ArrayData`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jhorstmann commented on pull request #9329: [Rust] [Experiment] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
jhorstmann commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-768208269
How cheap cloning of the ArrayData struct is probably depends a lot on the datatype. The removed indirection seems to be beneficial, but instead we now need to clone all fields, which include
- the datatype enum, which can include boxes/vecs of nested types and the field metadata
- the vec of buffers, and cloning a buffer would include cloning the `Arc<Bytes>`
So for primitive types, removing the arc should be a benefit because of less indirection when accessing it, it would still need to allocates vectors and clone the Arc which is inside the Buffer.
For more complex types, cloning the DataType enum itself and multiple buffers could be slower than the speedup gained by less indirection.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] nevi-me closed pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #9329:
URL: https://github.com/apache/arrow/pull/9329
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-787443066
Thanks @nevi-me . Let me know if there is anything I can do to help
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774289662
@nevi-me @sunchao , I would appreciate your feedback here: I can't find a rational to keep `Arc<ArrayData>`, but I do not have all the background, and you are probably the most experienced persons for a qualified answer here. I marked you as reviewers as well.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-804405923
Thanks @nevi-me for pushing the over the line. 🎉
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9329: ARROW-11511: [Rust] Replace `Arc` by `ArrayData` in all arrays
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9329:
URL: https://github.com/apache/arrow/pull/9329#issuecomment-774423086
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=h1) Report
> Merging [#9329](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=desc) (fbcd2d3) into [master](https://codecov.io/gh/apache/arrow/commit/ab307365198d5724a0b3331a05704d0fe4bcd710?el=desc) (ab30736) will **decrease** coverage by `0.31%`.
> The diff coverage is `92.57%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9329/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9329 +/- ##
==========================================
- Coverage 82.12% 81.80% -0.32%
==========================================
Files 243 243
Lines 54737 54288 -449
==========================================
- Hits 44951 44411 -540
- Misses 9786 9877 +91
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/ipc/writer.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3dyaXRlci5ycw==) | `88.03% <ø> (+0.21%)` | :arrow_up: |
| [...t/datafusion/src/physical\_plan/math\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21hdGhfZXhwcmVzc2lvbnMucnM=) | `27.27% <ø> (ø)` | |
| [rust/integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9pbnRlZ3JhdGlvbi10ZXN0aW5nL3NyYy9saWIucnM=) | `0.00% <0.00%> (ø)` | |
| [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `76.40% <50.00%> (+0.49%)` | :arrow_up: |
| [rust/arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZGF0YS5ycw==) | `76.87% <55.55%> (-2.00%)` | :arrow_down: |
| [rust/parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9hcnJheV9yZWFkZXIucnM=) | `77.63% <66.66%> (ø)` | |
| [rust/arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvaXBjL3JlYWRlci5ycw==) | `84.36% <83.33%> (ø)` | |
| [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `82.76% <83.33%> (-0.07%)` | :arrow_down: |
| [rust/arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfYmluYXJ5LnJz) | `84.79% <85.71%> (-6.96%)` | :arrow_down: |
| [rust/arrow/src/array/equal/mod.rs](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvbW9kLnJz) | `93.24% <98.48%> (+0.04%)` | :arrow_up: |
| ... and [50 more](https://codecov.io/gh/apache/arrow/pull/9329/diff?src=pr&el=tree-more) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=footer). Last update [ab30736...fbcd2d3](https://codecov.io/gh/apache/arrow/pull/9329?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org