You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/05/31 04:34:22 UTC

[GitHub] [arrow] houqp commented on a change in pull request #7306: ARROW-3688: [Rust] Add append_values for primitive builders

houqp commented on a change in pull request #7306:
URL: https://github.com/apache/arrow/pull/7306#discussion_r432907904



##########
File path: rust/arrow/src/array/builder.rs
##########
@@ -500,6 +500,49 @@ impl<T: ArrowPrimitiveType> PrimitiveBuilder<T> {
         Ok(())
     }
 
+    /// Appends values from a slice of type `T` and a validity byte slice
+    pub fn append_values(
+        &mut self,
+        values: &[T::Native],
+        is_valid: &[u8],
+        offset: usize,
+    ) -> Result<()> {
+        let ceil = bit_util::round_upto_power_of_2(values.len(), 8);
+        if offset == 0 {
+            // if there is no offset, the value slice should align with validity
+            assert_eq!(
+                ceil / 8,
+                is_valid.len(),
+                "value slice not aligned with validity slice"
+            );
+        } else {
+            // when there is an offset, the validity slots should be > values
+            assert!(
+                ceil / 8 <= is_valid.len(),
+                "value slots greater than validity slots"
+            );
+        }
+        let mut offset = offset;
+        let mut bools = Vec::with_capacity(values.len() / 8 + 1);
+        is_valid.iter().for_each(|v| {
+            if offset >= 8 {
+                offset -= 8;
+                return;
+            }
+            let bin = format!("{:b}", v);
+            // TODO: reversing to correct the endianness, find a better solution
+            // if we have `11001010` it means first value is null, and 8th value is valid
+            // however if the offset = 3, we should have `11001`, so we skip offset
+            bin.as_str().chars().rev().skip(offset).for_each(|c| {
+                bools.push(c == '1');
+            });
+            // offset should be covered by available bytes
+            offset = 0;
+        });
+        self.bitmap_builder.append_slice(&bools[0..values.len()])?;

Review comment:
       it might be faster to manually call `bimap_builder.reserve(values.len())` ahead of time, then call `bitmap_builder.append()` in the loop to avoid having to create bools vector. `append_slice` is just `append` running in a loop, so if you are already looping manually before calling it, then I feel like using `append` would be better.

##########
File path: rust/arrow/src/array/builder.rs
##########
@@ -500,6 +500,26 @@ impl<T: ArrowPrimitiveType> PrimitiveBuilder<T> {
         Ok(())
     }
 
+    /// Appends values from a slice of type `T` and a validity byte slice
+    pub fn append_values(&mut self, values: &[T::Native], is_valid: &[u8]) -> Result<()> {
+        let ceil = bit_util::round_upto_power_of_2(values.len(), 8);
+        assert_eq!(
+            ceil / 8,
+            is_valid.len(),
+            "value slice not aligned with validity slice"
+        );
+        let mut bools = Vec::with_capacity(values.len() / 8 + 1);
+        is_valid.iter().for_each(|v| {
+            let bin = format!("{:b}", v);
+            // TODO: reversing to correct the endianness, find a better solution
+            bin.as_str().chars().rev().for_each(|c| {
+                bools.push(c == '1');

Review comment:
       `{:b}` ignores leading zeros, which might not be want we want here. How about using bitmaks?
   
   ```rust
   fn main() {
       // 0x38: 0011 1000
       // 0x1F: 0001 1111
       let bitmap = [0x38, 0x1F];
       bitmap.iter().enumerate().for_each(|(idx, byte)| {
           let mut bitmask = 0x01;
           println!("byte {}: {:b}", idx, byte);
           let offset = 0;
           for _ in offset..8 {
               println!("\t{}", (byte & bitmask) != 0);
               bitmask = bitmask << 1;
           }
       });
   }
   ```
   output:
   
   ```
   byte 0: 111000
   	false
   	false
   	false
   	true
   	true
   	true
   	false
   	false
   byte 1: 11111
   	true
   	true
   	true
   	true
   	true
   	false
   	false
   	false
   ```

##########
File path: rust/arrow/src/array/builder.rs
##########
@@ -500,6 +500,49 @@ impl<T: ArrowPrimitiveType> PrimitiveBuilder<T> {
         Ok(())
     }
 
+    /// Appends values from a slice of type `T` and a validity byte slice
+    pub fn append_values(
+        &mut self,
+        values: &[T::Native],
+        is_valid: &[u8],
+        offset: usize,
+    ) -> Result<()> {
+        let ceil = bit_util::round_upto_power_of_2(values.len(), 8);
+        if offset == 0 {
+            // if there is no offset, the value slice should align with validity
+            assert_eq!(
+                ceil / 8,
+                is_valid.len(),
+                "value slice not aligned with validity slice"
+            );

Review comment:
       just my 2 cents, but i feel like assert should be used for internal errors. In this case, it's checking for invalid arguments, which is user/caller error, so perhaps returning an Err instead of crashing is better.

##########
File path: rust/arrow/src/array/builder.rs
##########
@@ -500,6 +500,49 @@ impl<T: ArrowPrimitiveType> PrimitiveBuilder<T> {
         Ok(())
     }
 
+    /// Appends values from a slice of type `T` and a validity byte slice
+    pub fn append_values(
+        &mut self,
+        values: &[T::Native],
+        is_valid: &[u8],
+        offset: usize,
+    ) -> Result<()> {

Review comment:
       Do you expect users of this function to manually construct `values`, `is_valid` and `offset` separately? If not, it might be better to pass in `ArrayDataRef` as argument so these 3 configs will always be consistent.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org