You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/20 21:03:14 UTC

[GitHub] [arrow] Dandandan opened a new pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec

Dandandan opened a new pull request #9279:
URL: https://github.com/apache/arrow/pull/9279


   This PR changes take string to use `MutableBuffer` to create a byte array directly instead of converting it from a `Vec<u8>`.
   There used to be some overhead compared to using a `Vec` and converting it to a buffer afterwards, but the overhead seems to be gone now.
   
   The change seems to be neutral according to benchmarks, giving results within a few %. If there is any remaining overhead in `MutableBufffer` I think we should fix that rather than having some workarounds and inconsistencies with other kernels.
   
   ```
   take str 512            time:   [2.3304 us 2.3358 us 2.3419 us]
                           change: [-4.4130% -4.0693% -3.7241%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     2 (2.00%) high mild
     1 (1.00%) high severe
   
   Benchmarking take str 1024: Collecting 100 samples in estimated 5.0198 s (1.1M i                                                                                take str 1024           time:   [4.3583 us 4.3633 us 4.3694 us]
                           change: [-0.5853% +1.1186% +2.9951%] (p = 0.29 > 0.05)
                           No change in performance detected.
   Found 16 outliers among 100 measurements (16.00%)
     3 (3.00%) low severe
     6 (6.00%) high mild
     7 (7.00%) high severe
   
   Benchmarking take str null indices 512: Collecting 100 samples in estimated 5.00                                                                                take str null indices 512                        
                           time:   [2.4779 us 2.4813 us 2.4844 us]
                           change: [-2.4765% -2.2000% -1.9437%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     3 (3.00%) low mild
     2 (2.00%) high mild
   
   Benchmarking take str null indices 1024: Collecting 100 samples in estimated 5.0                                                                                take str null indices 1024                        
                           time:   [4.4823 us 4.4910 us 4.5053 us]
                           change: [-4.8482% -4.5426% -4.2894%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     1 (1.00%) low mild
     3 (3.00%) high mild
     1 (1.00%) high severe
   
   Benchmarking take str null values 1024: Collecting 100 samples in estimated 5.00                                                                                take str null values 1024                        
                           time:   [4.4856 us 4.4889 us 4.4920 us]
                           change: [-2.2093% -2.0471% -1.8925%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     2 (2.00%) low severe
     3 (3.00%) low mild
     1 (1.00%) high mild
     1 (1.00%) high severe
   
   Benchmarking take str null values null indices 1024: Collecting 100 samples in e                                                                                take str null values null indices 1024                        
                           time:   [9.6438 us 9.6514 us 9.6592 us]
                           change: [-2.8600% -2.7478% -2.6338%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 2 outliers among 100 measurements (2.00%)
     1 (1.00%) high mild
     1 (1.00%) high severe
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9279:
URL: https://github.com/apache/arrow/pull/9279#discussion_r561307105



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -423,7 +423,7 @@ where
     let mut offsets_buffer = MutableBuffer::from_len_zeroed(bytes_offset);
 
     let offsets = offsets_buffer.typed_data_mut();
-    let mut values = Vec::with_capacity(bytes_offset);
+    let mut values = MutableBuffer::new(0);

Review comment:
       Used to be here `bytes_offset` but that doesn't make a lot of sense.
   The values array could be empty as well (e.g. for empty strings), or have a different capacity altogether.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9279:
URL: https://github.com/apache/arrow/pull/9279


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9279:
URL: https://github.com/apache/arrow/pull/9279#issuecomment-763944867


   https://issues.apache.org/jira/browse/ARROW-11332


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9279:
URL: https://github.com/apache/arrow/pull/9279#issuecomment-763944867


   https://issues.apache.org/jira/browse/ARROW-11332


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9279:
URL: https://github.com/apache/arrow/pull/9279#issuecomment-763951458


   I think a nice followup for `take_string` would maybe be looking into using the offsets data to calculate the total length of the buffer to allocate, which should be pretty fast, and then allocating based on that. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9279:
URL: https://github.com/apache/arrow/pull/9279#issuecomment-763951458


   I think a nice followup for `take_string` would maybe be looking into using the offsets data to calculate the total length of the buffer to allocate, which should be pretty fast, and then allocating based on that. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9279:
URL: https://github.com/apache/arrow/pull/9279#discussion_r561307105



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -423,7 +423,7 @@ where
     let mut offsets_buffer = MutableBuffer::from_len_zeroed(bytes_offset);
 
     let offsets = offsets_buffer.typed_data_mut();
-    let mut values = Vec::with_capacity(bytes_offset);
+    let mut values = MutableBuffer::new(0);

Review comment:
       Used to be here `bytes_offset` but that doesn't make a lot of sense.
   The values array could be empty as well (e.g. for empty strings), or have a different capacity altogether.

##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -423,7 +423,7 @@ where
     let mut offsets_buffer = MutableBuffer::from_len_zeroed(bytes_offset);
 
     let offsets = offsets_buffer.typed_data_mut();
-    let mut values = Vec::with_capacity(bytes_offset);
+    let mut values = MutableBuffer::new(0);

Review comment:
       Used to be here `bytes_offset` but that doesn't make a lot of sense.
   The values array could be empty as well (e.g. for empty strings), or have a totally different capacity.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9279: ARROW-11332: [Rust] Use MutableBuffer in take_string instead of Vec

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9279:
URL: https://github.com/apache/arrow/pull/9279#discussion_r561307105



##########
File path: rust/arrow/src/compute/kernels/take.rs
##########
@@ -423,7 +423,7 @@ where
     let mut offsets_buffer = MutableBuffer::from_len_zeroed(bytes_offset);
 
     let offsets = offsets_buffer.typed_data_mut();
-    let mut values = Vec::with_capacity(bytes_offset);
+    let mut values = MutableBuffer::new(0);

Review comment:
       Used to be here `bytes_offset` but that doesn't make a lot of sense.
   The values array could be empty as well (e.g. for empty strings), or have a totally different capacity.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org