You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/26 06:05:52 UTC
[GitHub] [arrow] jorgecarleitao opened a new pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
jorgecarleitao opened a new pull request #9016:
URL: https://github.com/apache/arrow/pull/9016
Avoids a memcopy from 2 `Vec<T>` to `Buffer`, by building the buffers on the fly.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751324030
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=h1) Report
> Merging [#9016](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=desc) (55b07b3) into [master](https://codecov.io/gh/apache/arrow/commit/cd22be6efedbf9832b5ea875ca59bb42de7b6c28?el=desc) (cd22be6) will **decrease** coverage by `0.00%`.
> The diff coverage is `87.50%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9016/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9016 +/- ##
==========================================
- Coverage 82.55% 82.55% -0.01%
==========================================
Files 203 203
Lines 50043 50044 +1
==========================================
- Hits 41315 41314 -1
- Misses 8728 8730 +2
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9016/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `89.24% <87.50%> (-1.03%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=footer). Last update [cd22be6...55b07b3](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-752956691
The full set of Rust CI tests did not run on this PR :(
Can you please rebase this PR against [apache/master](https://github.com/apache/arrow) to pick up the changes in https://github.com/apache/arrow/pull/9056 so that they do?
I apologize for the inconvenience.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r549110164
##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
}
pub(crate) fn from_vec(v: Vec<&str>) -> Self {
- let mut offsets = Vec::with_capacity(v.len() + 1);
- let mut values = Vec::new();
+ let mut offsets =
+ MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+ let mut values = MutableBuffer::new(0);
Review comment:
Probably the offset buffer is also faster after converting it to `typed_data_mut` and iterating it with `iter_mut`, this way it can skip the capacity checks, and also current conversion to a byte slice may add some overhead?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751323053
https://issues.apache.org/jira/browse/ARROW-11037
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r549110164
##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
}
pub(crate) fn from_vec(v: Vec<&str>) -> Self {
- let mut offsets = Vec::with_capacity(v.len() + 1);
- let mut values = Vec::new();
+ let mut offsets =
+ MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+ let mut values = MutableBuffer::new(0);
Review comment:
Probably the offset buffer is also faster after converting it to `typed_data_mut` and iterating it with `iter_mut`, this way it can skip the capacity checks.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r550416288
##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
}
pub(crate) fn from_vec(v: Vec<&str>) -> Self {
- let mut offsets = Vec::with_capacity(v.len() + 1);
- let mut values = Vec::new();
+ let mut offsets =
+ MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+ let mut values = MutableBuffer::new(0);
Review comment:
I.e. the general direction that I am pushing towards is to stop using `Vec` and replace it by `MutableBuffer` to avoid extra allocations. Once that is in place, we can replace these by `iter_mut`.
> also current conversion to a byte slice may add some overhead?
I believe so, as `to_byte_slice` returns a slice of unknown size, both for `&T` and `&[T]`. Either the compiler optimizes it, or there is an extra cost. I benched a 10% diff in building buffers when introduced a method that was not doing bound checks on these. IMO we should be using `to_le_bytes`, and have a method `MutableBuffer::push<ToByteSlice>`. I think we need the crate `byteorder` because AFAIK std's `to_le_bytes` does not have a trait (which we need).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io edited a comment on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751324030
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=h1) Report
> Merging [#9016](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=desc) (6e8d2ff) into [master](https://codecov.io/gh/apache/arrow/commit/86cf246d161512923253baa8fe62e00de88db73a?el=desc) (86cf246) will **decrease** coverage by `0.00%`.
> The diff coverage is `87.50%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9016/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9016 +/- ##
==========================================
- Coverage 82.60% 82.60% -0.01%
==========================================
Files 204 204
Lines 50175 50176 +1
==========================================
- Hits 41447 41446 -1
- Misses 8728 8730 +2
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9016/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `88.88% <87.50%> (-1.06%)` | :arrow_down: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=footer). Last update [86cf246...efb7883](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-752862936
I have re-opened this PR as #9032 shows that creating a buffer from a mutable buffer is 2x faster.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r550412011
##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
}
pub(crate) fn from_vec(v: Vec<&str>) -> Self {
- let mut offsets = Vec::with_capacity(v.len() + 1);
- let mut values = Vec::new();
+ let mut offsets =
+ MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+ let mut values = MutableBuffer::new(0);
Review comment:
Both great ideas. If it is ok for you, I will leave them to future work, as I am focusing my work on the `MutableBuffer` atm.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r549110164
##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
}
pub(crate) fn from_vec(v: Vec<&str>) -> Self {
- let mut offsets = Vec::with_capacity(v.len() + 1);
- let mut values = Vec::new();
+ let mut offsets =
+ MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+ let mut values = MutableBuffer::new(0);
Review comment:
Probably the offset buffer is also faster after converting it to a slice using `typed_data_mut::<OffsetSize>()` and iterating it with `iter_mut`, this way it can skip the capacity checks, and also current conversion to a byte slice may add some overhead?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751463081
I am moving this to draft until the benches are in place. @Dandandan is right that this is not necessarily an improvement. I am investigating this.\
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] jorgecarleitao commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751378346
That is a good point. I assumed that such a memcopy would always be worse for sufficiently large arrays, but I have not validated this.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751356842
I think the previous time I tried this in a kernel, I got worse perf results than copying them from the `Vec`s, but maybe the buffer implementation is faster now? Does this code have a benchmark?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] codecov-io commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751324030
# [Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=h1) Report
> Merging [#9016](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=desc) (654eb32) into [master](https://codecov.io/gh/apache/arrow/commit/a4f7c4a2acda874b3d6eb2eb4c986e7c3267c755?el=desc) (a4f7c4a) will **decrease** coverage by `0.00%`.
> The diff coverage is `87.50%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9016/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=tree)
```diff
@@ Coverage Diff @@
## master #9016 +/- ##
==========================================
- Coverage 82.87% 82.87% -0.01%
==========================================
Files 201 201
Lines 49739 49740 +1
==========================================
Hits 41220 41220
- Misses 8519 8520 +1
```
| [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=tree) | Coverage Δ | |
|---|---|---|
| [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9016/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `89.36% <87.50%> (-1.02%)` | :arrow_down: |
| [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9016/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=continue).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=footer). Last update [a4f7c4a...654eb32](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r549110164
##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
}
pub(crate) fn from_vec(v: Vec<&str>) -> Self {
- let mut offsets = Vec::with_capacity(v.len() + 1);
- let mut values = Vec::new();
+ let mut offsets =
+ MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+ let mut values = MutableBuffer::new(0);
Review comment:
Probably the offset buffer is also faster after converting it to `typed_data_mut::<OffsetSize>` and iterating it with `iter_mut`, this way it can skip the capacity checks, and also current conversion to a byte slice may add some overhead?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] alamb closed pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9016:
URL: https://github.com/apache/arrow/pull/9016
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r549109082
##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
}
pub(crate) fn from_vec(v: Vec<&str>) -> Self {
- let mut offsets = Vec::with_capacity(v.len() + 1);
- let mut values = Vec::new();
+ let mut offsets =
+ MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+ let mut values = MutableBuffer::new(0);
Review comment:
Might be interesting to test here whether it is cheaper to calculate offsets + total size first for `values` buffer based on final `length_so_far` so that it doesn't require extra allocations.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] Dandandan commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.
Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r549109977
##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
}
pub(crate) fn from_vec(v: Vec<&str>) -> Self {
- let mut offsets = Vec::with_capacity(v.len() + 1);
- let mut values = Vec::new();
+ let mut offsets =
+ MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+ let mut values = MutableBuffer::new(0);
Review comment:
So something like
```rust
for s in &v {
length_so_far = length_so_far + OffsetSize::from_usize(s.len()).unwrap();
offsets.extend_from_slice(length_so_far.to_byte_slice());
}
let mut values = MutableBuffer::new(length_so_far);
for s in &v {
values.extend_from_slice(s.as_bytes());
}
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org