You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/26 06:05:52 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

jorgecarleitao opened a new pull request #9016:
URL: https://github.com/apache/arrow/pull/9016


   Avoids a memcopy from 2 `Vec<T>` to `Buffer`, by building the buffers on the fly.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751324030


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=h1) Report
   > Merging [#9016](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=desc) (55b07b3) into [master](https://codecov.io/gh/apache/arrow/commit/cd22be6efedbf9832b5ea875ca59bb42de7b6c28?el=desc) (cd22be6) will **decrease** coverage by `0.00%`.
   > The diff coverage is `87.50%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9016/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9016      +/-   ##
   ==========================================
   - Coverage   82.55%   82.55%   -0.01%     
   ==========================================
     Files         203      203              
     Lines       50043    50044       +1     
   ==========================================
   - Hits        41315    41314       -1     
   - Misses       8728     8730       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9016/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `89.24% <87.50%> (-1.03%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=footer). Last update [cd22be6...55b07b3](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-752956691


   The full set of Rust CI tests did not run on this PR :(
   
   Can you please rebase this PR against [apache/master](https://github.com/apache/arrow) to pick up the changes in https://github.com/apache/arrow/pull/9056 so that they do? 
   
   I apologize for the inconvenience. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r549110164



##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
     }
 
     pub(crate) fn from_vec(v: Vec<&str>) -> Self {
-        let mut offsets = Vec::with_capacity(v.len() + 1);
-        let mut values = Vec::new();
+        let mut offsets =
+            MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+        let mut values = MutableBuffer::new(0);

Review comment:
       Probably the offset buffer is also faster after converting it to `typed_data_mut` and iterating it with `iter_mut`, this way it can skip the capacity checks, and also current conversion to a byte slice may add some overhead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751323053


   https://issues.apache.org/jira/browse/ARROW-11037


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r549110164



##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
     }
 
     pub(crate) fn from_vec(v: Vec<&str>) -> Self {
-        let mut offsets = Vec::with_capacity(v.len() + 1);
-        let mut values = Vec::new();
+        let mut offsets =
+            MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+        let mut values = MutableBuffer::new(0);

Review comment:
       Probably the offset buffer is also faster after converting it to `typed_data_mut` and iterating it with `iter_mut`, this way it can skip the capacity checks. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r550416288



##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
     }
 
     pub(crate) fn from_vec(v: Vec<&str>) -> Self {
-        let mut offsets = Vec::with_capacity(v.len() + 1);
-        let mut values = Vec::new();
+        let mut offsets =
+            MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+        let mut values = MutableBuffer::new(0);

Review comment:
       I.e. the general direction that I am pushing towards is to stop using `Vec` and replace it by `MutableBuffer` to avoid extra allocations. Once that is in place, we can replace these by `iter_mut`.
   
   > also current conversion to a byte slice may add some overhead?
   
   I believe so, as `to_byte_slice` returns a slice of unknown size, both for `&T` and `&[T]`. Either the compiler optimizes it, or there is an extra cost. I benched a 10% diff in building buffers when introduced a method that was not doing bound checks on these. IMO we should be using `to_le_bytes`, and have a method `MutableBuffer::push<ToByteSlice>`. I think we need the crate `byteorder` because AFAIK std's `to_le_bytes` does not have a trait (which we need).
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751324030


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=h1) Report
   > Merging [#9016](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=desc) (6e8d2ff) into [master](https://codecov.io/gh/apache/arrow/commit/86cf246d161512923253baa8fe62e00de88db73a?el=desc) (86cf246) will **decrease** coverage by `0.00%`.
   > The diff coverage is `87.50%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9016/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9016      +/-   ##
   ==========================================
   - Coverage   82.60%   82.60%   -0.01%     
   ==========================================
     Files         204      204              
     Lines       50175    50176       +1     
   ==========================================
   - Hits        41447    41446       -1     
   - Misses       8728     8730       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9016/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `88.88% <87.50%> (-1.06%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=footer). Last update [86cf246...efb7883](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-752862936


   I have re-opened this PR as #9032 shows that creating a buffer from a mutable buffer is 2x faster.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r550412011



##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
     }
 
     pub(crate) fn from_vec(v: Vec<&str>) -> Self {
-        let mut offsets = Vec::with_capacity(v.len() + 1);
-        let mut values = Vec::new();
+        let mut offsets =
+            MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+        let mut values = MutableBuffer::new(0);

Review comment:
       Both great ideas. If it is ok for you, I will leave them to future work, as I am focusing my work on the `MutableBuffer` atm.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r549110164



##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
     }
 
     pub(crate) fn from_vec(v: Vec<&str>) -> Self {
-        let mut offsets = Vec::with_capacity(v.len() + 1);
-        let mut values = Vec::new();
+        let mut offsets =
+            MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+        let mut values = MutableBuffer::new(0);

Review comment:
       Probably the offset buffer is also faster after converting it to a slice using `typed_data_mut::<OffsetSize>()` and iterating it with `iter_mut`, this way it can skip the capacity checks, and also current conversion to a byte slice may add some overhead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751463081


   I am moving this to draft until the benches are in place. @Dandandan is right that this is not necessarily an improvement. I am investigating this.\


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751378346


   That is a good point. I assumed that such a memcopy would always be worse for sufficiently large arrays, but I have not validated this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751356842


   I think the previous time I tried this in a kernel, I got worse perf results than copying them from the `Vec`s, but maybe the buffer implementation is faster now? Does this code have a benchmark?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#issuecomment-751324030


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=h1) Report
   > Merging [#9016](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=desc) (654eb32) into [master](https://codecov.io/gh/apache/arrow/commit/a4f7c4a2acda874b3d6eb2eb4c986e7c3267c755?el=desc) (a4f7c4a) will **decrease** coverage by `0.00%`.
   > The diff coverage is `87.50%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9016/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9016      +/-   ##
   ==========================================
   - Coverage   82.87%   82.87%   -0.01%     
   ==========================================
     Files         201      201              
     Lines       49739    49740       +1     
   ==========================================
     Hits        41220    41220              
   - Misses       8519     8520       +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow/pull/9016/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfc3RyaW5nLnJz) | `89.36% <87.50%> (-1.02%)` | :arrow_down: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9016/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.43% <0.00%> (+0.19%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=footer). Last update [a4f7c4a...654eb32](https://codecov.io/gh/apache/arrow/pull/9016?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r549110164



##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
     }
 
     pub(crate) fn from_vec(v: Vec<&str>) -> Self {
-        let mut offsets = Vec::with_capacity(v.len() + 1);
-        let mut values = Vec::new();
+        let mut offsets =
+            MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+        let mut values = MutableBuffer::new(0);

Review comment:
       Probably the offset buffer is also faster after converting it to `typed_data_mut::<OffsetSize>` and iterating it with `iter_mut`, this way it can skip the capacity checks, and also current conversion to a byte slice may add some overhead?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9016:
URL: https://github.com/apache/arrow/pull/9016


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r549109082



##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
     }
 
     pub(crate) fn from_vec(v: Vec<&str>) -> Self {
-        let mut offsets = Vec::with_capacity(v.len() + 1);
-        let mut values = Vec::new();
+        let mut offsets =
+            MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+        let mut values = MutableBuffer::new(0);

Review comment:
       Might be interesting to test here whether it is cheaper to calculate offsets + total size first for `values` buffer based on final `length_so_far` so that it doesn't require extra allocations.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9016: ARROW-11037: [Rust] Optimized creation of string array from iterator.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9016:
URL: https://github.com/apache/arrow/pull/9016#discussion_r549109977



##########
File path: rust/arrow/src/array/array_string.rs
##########
@@ -126,25 +126,28 @@ impl<OffsetSize: StringOffsetSizeTrait> GenericStringArray<OffsetSize> {
     }
 
     pub(crate) fn from_vec(v: Vec<&str>) -> Self {
-        let mut offsets = Vec::with_capacity(v.len() + 1);
-        let mut values = Vec::new();
+        let mut offsets =
+            MutableBuffer::new((v.len() + 1) * std::mem::size_of::<OffsetSize>());
+        let mut values = MutableBuffer::new(0);

Review comment:
       So something like
   
   ```rust
   for s in &v {
               length_so_far = length_so_far + OffsetSize::from_usize(s.len()).unwrap();
               offsets.extend_from_slice(length_so_far.to_byte_slice());
           }
           
   let mut values = MutableBuffer::new(length_so_far);
   for s in &v {
       values.extend_from_slice(s.as_bytes());
   }
      ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org