You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/30 09:33:46 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #9370: [Rust] Improved from_iter for primitive arrays (-30% for cast)

jorgecarleitao opened a new pull request #9370:
URL: https://github.com/apache/arrow/pull/9370


   This PR refactors `PrimitiveArray::from_iter` to support non-sized iterators, as that is the expectation from `Rust::FromIter`. This makes `from_iter` slower.
   
   To compensate, this PR introduces a new method, `unsafe from_trusted_len_iter` to create an array from an iterator of `Option<T>` that has a trusted len. This is 20-30% faster than using the `from_iter` implemented in master.
   
   This PR uses `from_trusted_len_iter` to speed up casting of `primitive -> primitive` and `bool -> primitive` by 20-30%.
   
   Note that the added complexity (of having new functions and being unable to rely on `collect`) arises from two `unstable` features that we can't use:
   
   1. `unsafe trait TrustedLen`
   2. specialization
   
   if we had both features, we could implement `TrustedLen` for all our iterators, and then use specialization to offer specialized `FromIter` from them. Once these features are stabilized, we can simplify our API by allowing users to call `collect` on an iterator and let the compiler conclude that the iterator is `TrustedLen` and therefore should use the specialized implementation. :)
   
   ```
   Switched to branch 'iter_p'
      Compiling arrow v4.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
       Finished bench [optimized] target(s) in 1m 11s
        Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/cast_kernels-9a3b6d213a9f7a9a
   Gnuplot not found, using plotters backend
   cast int32 to int32 512 time:   [25.384 ns 25.415 ns 25.449 ns]                                     
                           change: [-2.1332% -1.4846% -0.7930%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 11 outliers among 100 measurements (11.00%)
     3 (3.00%) high mild
     8 (8.00%) high severe
   
   cast int32 to uint32 512                                                                             
                           time:   [2.1526 us 2.1576 us 2.1629 us]
                           change: [-24.823% -24.221% -23.610%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 9 outliers among 100 measurements (9.00%)
     1 (1.00%) low mild
     4 (4.00%) high mild
     4 (4.00%) high severe
   
   cast int32 to float32 512                                                                             
                           time:   [2.4012 us 2.4083 us 2.4170 us]
                           change: [-20.381% -19.079% -17.860%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     1 (1.00%) low mild
     3 (3.00%) high mild
     3 (3.00%) high severe
   
   cast int32 to float64 512                                                                             
                           time:   [2.4544 us 2.4608 us 2.4680 us]
                           change: [-24.689% -23.471% -22.441%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 8 outliers among 100 measurements (8.00%)
     4 (4.00%) high mild
     4 (4.00%) high severe
   
   cast int32 to int64 512 time:   [2.2424 us 2.2532 us 2.2663 us]                                     
                           change: [-28.692% -28.008% -27.316%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 7 outliers among 100 measurements (7.00%)
     3 (3.00%) high mild
     4 (4.00%) high severe
   
   cast float32 to int32 512                                                                             
                           time:   [2.4966 us 2.5063 us 2.5176 us]
                           change: [-26.820% -26.281% -25.755%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     1 (1.00%) low mild
     2 (2.00%) high mild
     3 (3.00%) high severe
   
   cast float64 to float32 512                                                                             
                           time:   [2.4857 us 2.4954 us 2.5070 us]
                           change: [-22.439% -21.818% -21.222%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     1 (1.00%) low mild
     3 (3.00%) high mild
     2 (2.00%) high severe
   
   cast float64 to uint64 512                                                                             
                           time:   [2.8313 us 2.8369 us 2.8427 us]
                           change: [-32.996% -32.605% -32.263%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 9 outliers among 100 measurements (9.00%)
     6 (6.00%) high mild
     3 (3.00%) high severe
   
   cast int64 to int32 512 time:   [2.2000 us 2.2073 us 2.2154 us]                                     
                           change: [-32.106% -31.271% -30.265%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 11 outliers among 100 measurements (11.00%)
     1 (1.00%) low mild
     5 (5.00%) high mild
     5 (5.00%) high severe
   
   cast date64 to date32 512                                                                             
                           time:   [1.0772 us 1.0815 us 1.0866 us]
                           change: [+1.3054% +3.6485% +6.8389%] (p = 0.01 < 0.05)
                           Performance has regressed.
   Found 14 outliers among 100 measurements (14.00%)
     2 (2.00%) high mild
     12 (12.00%) high severe
   
   cast date32 to date64 512                                                                             
                           time:   [838.20 ns 840.35 ns 842.74 ns]
                           change: [-1.2203% -0.3511% +0.5828%] (p = 0.47 > 0.05)
                           No change in performance detected.
   Found 10 outliers among 100 measurements (10.00%)
     3 (3.00%) low mild
     2 (2.00%) high mild
     5 (5.00%) high severe
   
   cast time32s to time32ms 512                                                                             
                           time:   [741.99 ns 745.21 ns 748.67 ns]
                           change: [-0.1386% +1.9748% +5.4043%] (p = 0.18 > 0.05)
                           No change in performance detected.
   Found 12 outliers among 100 measurements (12.00%)
     4 (4.00%) low mild
     2 (2.00%) high mild
     6 (6.00%) high severe
   
   cast time32s to time64us 512                                                                             
                           time:   [4.2476 us 4.2596 us 4.2747 us]
                           change: [-19.580% -18.601% -17.667%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 11 outliers among 100 measurements (11.00%)
     2 (2.00%) low mild
     6 (6.00%) high mild
     3 (3.00%) high severe
   
   cast time64ns to time32s 512                                                                             
                           time:   [4.9276 us 4.9371 us 4.9489 us]
                           change: [-0.2071% +0.6046% +1.5380%] (p = 0.17 > 0.05)
                           No change in performance detected.
   Found 8 outliers among 100 measurements (8.00%)
     1 (1.00%) low mild
     2 (2.00%) high mild
     5 (5.00%) high severe
   
   cast timestamp_ns to timestamp_s 512                                                                             
                           time:   [25.938 ns 26.005 ns 26.079 ns]
                           change: [-2.0321% -1.2763% -0.4590%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 9 outliers among 100 measurements (9.00%)
     1 (1.00%) low mild
     3 (3.00%) high mild
     5 (5.00%) high severe
   
   cast timestamp_ms to timestamp_ns 512                                                                             
                           time:   [2.0140 us 2.0187 us 2.0234 us]
                           change: [+0.4914% +1.5932% +2.6619%] (p = 0.00 < 0.05)
                           Change within noise threshold.
   Found 6 outliers among 100 measurements (6.00%)
     3 (3.00%) high mild
     3 (3.00%) high severe
   
   cast utf8 to f32        time:   [28.568 us 28.651 us 28.749 us]                              
                           change: [-5.4918% -4.8116% -4.1925%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 9 outliers among 100 measurements (9.00%)
     1 (1.00%) low mild
     5 (5.00%) high mild
     3 (3.00%) high severe
   
   cast i64 to string 512  time:   [50.182 us 50.270 us 50.366 us]                                    
                           change: [-2.7854% -1.4471% +0.2627%] (p = 0.05 > 0.05)
                           No change in performance detected.
   Found 13 outliers among 100 measurements (13.00%)
     7 (7.00%) high mild
     6 (6.00%) high severe
   
   cast f32 to string 512  time:   [54.687 us 54.833 us 54.983 us]                                   
                           change: [+2.6122% +3.3716% +4.2074%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 5 outliers among 100 measurements (5.00%)
     1 (1.00%) high mild
     4 (4.00%) high severe
   
   cast timestamp_ms to i64 512                                                                            
                           time:   [424.64 ns 426.27 ns 428.09 ns]
                           change: [+1.3146% +2.0623% +2.7867%] (p = 0.00 < 0.05)
                           Performance has regressed.
   Found 7 outliers among 100 measurements (7.00%)
     2 (2.00%) low mild
     2 (2.00%) high mild
     3 (3.00%) high severe
   
   cast utf8 to date32 512 time:   [45.104 us 45.202 us 45.312 us]                                     
                           change: [-0.3430% +0.3110% +1.0109%] (p = 0.40 > 0.05)
                           No change in performance detected.
   Found 10 outliers among 100 measurements (10.00%)
     5 (5.00%) high mild
     5 (5.00%) high severe
   
   cast utf8 to date64 512 time:   [75.844 us 76.028 us 76.239 us]                                    
                           change: [-10.720% -10.018% -9.3075%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 5 outliers among 100 measurements (5.00%)
     3 (3.00%) high mild
     2 (2.00%) high severe
   
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb edited a comment on pull request #9370: ARROW-11436: [Rust] Improved from_iter for primitive arrays (-20-30% for cast)

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #9370:
URL: https://github.com/apache/arrow/pull/9370#issuecomment-774457518


   I also merged this PR with master and re-ran the tests to make sure everything is passing and the tests still did


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9370: ARROW-11436: [Rust] Improved from_iter for primitive arrays (-20-30% for cast)

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9370:
URL: https://github.com/apache/arrow/pull/9370


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9370: ARROW-11436: [Rust] Improved from_iter for primitive arrays (-20-30% for cast)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9370:
URL: https://github.com/apache/arrow/pull/9370#discussion_r569780553



##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -905,15 +908,18 @@ where
     T: ArrowNumericType,
     <T as ArrowPrimitiveType>::Native: lexical_core::FromLexical,
 {
-    (0..from.len())
-        .map(|i| {
-            if from.is_null(i) {
-                None
-            } else {
-                lexical_core::parse(from.value(i).as_bytes()).ok()
-            }
-        })
-        .collect()
+    let iter = (0..from.len()).map(|i| {
+        if from.is_null(i) {
+            None
+        } else {
+            lexical_core::parse(from.value(i).as_bytes()).ok()
+        }
+    });
+    // Benefit:
+    //     20% performance improvement
+    // Soundness:
+    //     The iterator is trustedLen because it comes from an `StringArray`.

Review comment:
       ```suggestion
       //     The iterator is trustedLen because it comes from an range
   ```

##########
File path: rust/arrow/src/array/array_primitive.rs
##########
@@ -267,41 +270,61 @@ impl<T: ArrowPrimitiveType, Ptr: Borrow<Option<<T as ArrowPrimitiveType>::Native
 {
     fn from_iter<I: IntoIterator<Item = Ptr>>(iter: I) -> Self {
         let iter = iter.into_iter();
-        let (_, data_len) = iter.size_hint();
-        let data_len = data_len.expect("Iterator must be sized"); // panic if no upper bound.
+        let (lower, _) = iter.size_hint();
 
-        let num_bytes = bit_util::ceil(data_len, 8);
-        let mut null_buf = MutableBuffer::from_len_zeroed(num_bytes);
-        let mut val_buf = MutableBuffer::new(
-            data_len * mem::size_of::<<T as ArrowPrimitiveType>::Native>(),
-        );
+        let mut null_buf = BooleanBufferBuilder::new(lower);
 
-        let null_slice = null_buf.as_slice_mut();
-        iter.enumerate().for_each(|(i, item)| {
-            if let Some(a) = item.borrow() {
-                bit_util::set_bit(null_slice, i);
-                val_buf.push(*a);
-            } else {
-                // this ensures that null items on the buffer are not arbitrary.
-                // This is important because falible operations can use null values (e.g. a vectorized "add")
-                // which may panic (e.g. overflow if the number on the slots happen to be very large).
-                val_buf.push(T::Native::default());
-            }
-        });
+        let buffer: Buffer = iter
+            .map(|item| {
+                if let Some(a) = item.borrow() {
+                    null_buf.append(true);
+                    *a
+                } else {
+                    null_buf.append(false);
+                    // this ensures that null items on the buffer are not arbitrary.
+                    // This is important because falible operations can use null values (e.g. a vectorized "add")
+                    // which may panic (e.g. overflow if the number on the slots happen to be very large).
+                    T::Native::default()
+                }
+            })
+            .collect();
 
         let data = ArrayData::new(
             T::DATA_TYPE,
-            data_len,
+            null_buf.len(),
             None,
             Some(null_buf.into()),
             0,
-            vec![val_buf.into()],
+            vec![buffer],
             vec![],
         );
         PrimitiveArray::from(Arc::new(data))
     }
 }
 
+impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
+    /// Creates a [`PrimitiveArray`] from an iterator of trusted length.
+    /// # Safety
+    /// The iterator must be [`TrustedLen`](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html).
+    /// I.e. that `size_hint().1` correctly reports its length.
+    #[inline]
+    pub unsafe fn from_trusted_len_iter<I, P>(iter: I) -> Self
+    where
+        P: std::borrow::Borrow<Option<<T as ArrowPrimitiveType>::Native>>,
+        I: IntoIterator<Item = P>,
+    {
+        let iterator = iter.into_iter();
+        let (_, upper) = iterator.size_hint();
+        let len = upper.expect("trusted_len_unzip requires an upper limit");
+
+        let (null, buffer) = trusted_len_unzip(iterator);
+
+        let data =
+            ArrayData::new(T::DATA_TYPE, len, None, Some(null), 0, vec![buffer], vec![]);
+        PrimitiveArray::from(Arc::new(data))

Review comment:
       ```suggestion
           let iterator = iter.into_iter();
           let (null, buffer) = trusted_len_unzip(iterator);
   
           let data =
               ArrayData::new(T::DATA_TYPE, buffer.len(), None, Some(null), 0, vec![buffer], vec![]);
           PrimitiveArray::from(Arc::new(data))
   ```
   
   It seems like the size_hint is also used / checked in `trusted_len_unzip` and set to buffer.len(), so I wonder if you could  avoid some redundancy

##########
File path: rust/arrow/src/util/trusted_len.rs
##########
@@ -0,0 +1,82 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use super::bit_util;
+use crate::{
+    buffer::{Buffer, MutableBuffer},
+    datatypes::ArrowNativeType,
+};
+
+/// Creates two [`Buffer`]s from an iterator of `Option`.
+/// The first buffer corresponds to a bitmap buffer, the second one
+/// corresponds to a values buffer.
+/// # Safety
+/// The caller must ensure that `iterator` is `TrustedLen`.
+#[inline]
+pub(crate) unsafe fn trusted_len_unzip<I, P, T>(iterator: I) -> (Buffer, Buffer)
+where
+    T: ArrowNativeType,
+    P: std::borrow::Borrow<Option<T>>,
+    I: Iterator<Item = P>,
+{
+    let (_, upper) = iterator.size_hint();
+    let upper = upper.expect("trusted_len_unzip requires an upper limit");
+    let len = upper * std::mem::size_of::<T>();
+
+    let mut null = MutableBuffer::from_len_zeroed(upper.saturating_add(7) / 8);
+    let mut buffer = MutableBuffer::new(len);
+
+    let dst_null = null.as_mut_ptr();
+    let mut dst = buffer.as_mut_ptr() as *mut T;
+    for (i, item) in iterator.enumerate() {
+        let item = item.borrow();
+        if let Some(item) = item {
+            std::ptr::write(dst, *item);
+            bit_util::set_bit_raw(dst_null, i);
+        } else {
+            std::ptr::write(dst, T::default());
+        }
+        dst = dst.add(1);
+    }
+    assert_eq!(

Review comment:
       πŸ‘ 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9370: ARROW-11436: [Rust] Improved from_iter for primitive arrays (-20-30% for cast)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9370:
URL: https://github.com/apache/arrow/pull/9370#issuecomment-774457518


   I also merged this PR with master and re-ran the tests to make sure everything is passing and it was


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9370: ARROW-11436: [Rust] Improved from_iter for primitive arrays (-20-30% for cast)

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9370:
URL: https://github.com/apache/arrow/pull/9370#issuecomment-770186102


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=h1) Report
   > Merging [#9370](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=desc) (85f9164) into [master](https://codecov.io/gh/apache/arrow/commit/6b8aa09cdc1dee230ddf2fe1c5f3e5182bb18657?el=desc) (6b8aa09) will **increase** coverage by `0.01%`.
   > The diff coverage is `95.52%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9370/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9370      +/-   ##
   ==========================================
   + Coverage   81.94%   81.95%   +0.01%     
   ==========================================
     Files         216      232      +16     
     Lines       53358    53414      +56     
   ==========================================
   + Hits        43722    43777      +55     
   - Misses       9636     9637       +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=tree) | Coverage Ξ” | |
   |---|---|---|
   | [rust/arrow/src/array/iterator.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvaXRlcmF0b3IucnM=) | `93.57% <ΓΈ> (ΓΈ)` | |
   | [rust/arrow/src/compute/kernels/cast.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2Nhc3QucnM=) | `97.06% <93.33%> (+<0.01%)` | :arrow_up: |
   | [rust/arrow/src/array/array\_primitive.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfcHJpbWl0aXZlLnJz) | `95.02% <94.73%> (+0.06%)` | :arrow_up: |
   | [rust/arrow/src/util/trusted\_len.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC90cnVzdGVkX2xlbi5ycw==) | `96.42% <96.42%> (ΓΈ)` | |
   | [rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==) | `85.12% <100.00%> (+0.02%)` | :arrow_up: |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `95.43% <100.00%> (+0.02%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | | |
   | [...datafusion/src/physical\_plan/expressions/binary.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL2JpbmFyeS5ycw==) | `85.22% <0.00%> (ΓΈ)` | |
   | [...t/datafusion/src/physical\_plan/expressions/case.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL2Nhc2UucnM=) | `73.29% <0.00%> (ΓΈ)` | |
   | [...tafusion/src/physical\_plan/expressions/negative.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL25lZ2F0aXZlLnJz) | `42.85% <0.00%> (ΓΈ)` | |
   | ... and [14 more](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Ξ” = absolute <relative> (impact)`, `ΓΈ = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=footer). Last update [6b8aa09...85f9164](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9370: [Rust] Improved from_iter for primitive arrays (-30% for cast)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9370:
URL: https://github.com/apache/arrow/pull/9370#issuecomment-770184350


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9370: ARROW-11436: [Rust] Improved from_iter for primitive arrays (-20-30% for cast)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9370:
URL: https://github.com/apache/arrow/pull/9370#issuecomment-772783611


   I plan to review this PR later this afternoon


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9370: ARROW-11436: [Rust] Improved from_iter for primitive arrays (-30% for cast)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9370:
URL: https://github.com/apache/arrow/pull/9370#issuecomment-770184507


   https://issues.apache.org/jira/browse/ARROW-11436


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9370: ARROW-11436: [Rust] Improved from_iter for primitive arrays (-20-30% for cast)

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9370:
URL: https://github.com/apache/arrow/pull/9370#issuecomment-770186102


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=h1) Report
   > Merging [#9370](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=desc) (e8d1b41) into [master](https://codecov.io/gh/apache/arrow/commit/77ae93d6ecaac8fb5f4a18ca5287b7456cd88784?el=desc) (77ae93d) will **increase** coverage by `0.02%`.
   > The diff coverage is `59.13%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9370/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9370      +/-   ##
   ==========================================
   + Coverage   82.00%   82.03%   +0.02%     
   ==========================================
     Files         230      228       -2     
     Lines       53487    53516      +29     
   ==========================================
   + Hits        43864    43903      +39     
   + Misses       9623     9613      -10     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=tree) | Coverage Ξ” | |
   |---|---|---|
   | [rust/arrow/src/array/iterator.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvaXRlcmF0b3IucnM=) | `93.57% <ΓΈ> (ΓΈ)` | |
   | [rust/parquet/src/record/api.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9yZWNvcmQvYXBpLnJz) | `92.63% <8.33%> (-5.39%)` | :arrow_down: |
   | [rust/arrow/src/compute/kernels/cast.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2Nhc3QucnM=) | `97.06% <93.33%> (+<0.01%)` | :arrow_up: |
   | [rust/arrow/src/array/array\_primitive.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfcHJpbWl0aXZlLnJz) | `95.02% <94.73%> (+0.06%)` | :arrow_up: |
   | [rust/arrow/src/util/trusted\_len.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC90cnVzdGVkX2xlbi5ycw==) | `96.42% <96.42%> (ΓΈ)` | |
   | [rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==) | `85.12% <100.00%> (+0.02%)` | :arrow_up: |
   | [rust/arrow/src/buffer.rs](https://codecov.io/gh/apache/arrow/pull/9370/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYnVmZmVyLnJz) | `95.43% <100.00%> (+0.02%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Ξ” = absolute <relative> (impact)`, `ΓΈ = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=footer). Last update [2cb03e8...e8d1b41](https://codecov.io/gh/apache/arrow/pull/9370?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9370: ARROW-11436: [Rust] Improved from_iter for primitive arrays (-20-30% for cast)

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9370:
URL: https://github.com/apache/arrow/pull/9370#discussion_r571416818



##########
File path: rust/arrow/src/array/array_primitive.rs
##########
@@ -267,41 +270,61 @@ impl<T: ArrowPrimitiveType, Ptr: Borrow<Option<<T as ArrowPrimitiveType>::Native
 {
     fn from_iter<I: IntoIterator<Item = Ptr>>(iter: I) -> Self {
         let iter = iter.into_iter();
-        let (_, data_len) = iter.size_hint();
-        let data_len = data_len.expect("Iterator must be sized"); // panic if no upper bound.
+        let (lower, _) = iter.size_hint();
 
-        let num_bytes = bit_util::ceil(data_len, 8);
-        let mut null_buf = MutableBuffer::from_len_zeroed(num_bytes);
-        let mut val_buf = MutableBuffer::new(
-            data_len * mem::size_of::<<T as ArrowPrimitiveType>::Native>(),
-        );
+        let mut null_buf = BooleanBufferBuilder::new(lower);
 
-        let null_slice = null_buf.as_slice_mut();
-        iter.enumerate().for_each(|(i, item)| {
-            if let Some(a) = item.borrow() {
-                bit_util::set_bit(null_slice, i);
-                val_buf.push(*a);
-            } else {
-                // this ensures that null items on the buffer are not arbitrary.
-                // This is important because falible operations can use null values (e.g. a vectorized "add")
-                // which may panic (e.g. overflow if the number on the slots happen to be very large).
-                val_buf.push(T::Native::default());
-            }
-        });
+        let buffer: Buffer = iter
+            .map(|item| {
+                if let Some(a) = item.borrow() {
+                    null_buf.append(true);
+                    *a
+                } else {
+                    null_buf.append(false);
+                    // this ensures that null items on the buffer are not arbitrary.
+                    // This is important because falible operations can use null values (e.g. a vectorized "add")
+                    // which may panic (e.g. overflow if the number on the slots happen to be very large).
+                    T::Native::default()
+                }
+            })
+            .collect();
 
         let data = ArrayData::new(
             T::DATA_TYPE,
-            data_len,
+            null_buf.len(),
             None,
             Some(null_buf.into()),
             0,
-            vec![val_buf.into()],
+            vec![buffer],
             vec![],
         );
         PrimitiveArray::from(Arc::new(data))
     }
 }
 
+impl<T: ArrowPrimitiveType> PrimitiveArray<T> {
+    /// Creates a [`PrimitiveArray`] from an iterator of trusted length.
+    /// # Safety
+    /// The iterator must be [`TrustedLen`](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html).
+    /// I.e. that `size_hint().1` correctly reports its length.
+    #[inline]
+    pub unsafe fn from_trusted_len_iter<I, P>(iter: I) -> Self
+    where
+        P: std::borrow::Borrow<Option<<T as ArrowPrimitiveType>::Native>>,
+        I: IntoIterator<Item = P>,
+    {
+        let iterator = iter.into_iter();
+        let (_, upper) = iterator.size_hint();
+        let len = upper.expect("trusted_len_unzip requires an upper limit");
+
+        let (null, buffer) = trusted_len_unzip(iterator);
+
+        let data =
+            ArrayData::new(T::DATA_TYPE, len, None, Some(null), 0, vec![buffer], vec![]);
+        PrimitiveArray::from(Arc::new(data))

Review comment:
       we can handle this as a follow on, if needed. We have quite a backlog of PRs so merging this one in to work that down




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org