Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/06 12:30:00 UTC

[GitHub] [arrow-rs] HaoYang670 opened a new issue, #1800: Speed up `substring_by_char`

HaoYang670 opened a new issue, #1800:
URL: https://github.com/apache/arrow-rs/issues/1800

   Allocating a temporary string only to copy out of it is likely a significant part of the slow-down, together with the null handling.
   
   This could definitely be handled in a separate PR, but you might want to consider doing something like the following (not tested).
   
   ```
   // Align the null bitmap to offset 0; `bit_slice` copies only if the
   // offset is not byte-aligned.
   let nulls = array
       .data()
       .null_buffer()
       .map(|b| b.bit_slice(array.offset(), array.len()));
   let mut vals = BufferBuilder::<u8>::new(array.value_data().len());
   let mut indexes = BufferBuilder::<OffsetSize>::new(array.len() + 1);
   indexes.append(OffsetSize::zero());
   
   for val in array.iter() {
       // Null slots contribute no bytes but still need an offset entry.
       if let Some(val) = val {
           let char_count = val.chars().count();
           let start = if start >= 0 {
               start.to_usize().unwrap().min(char_count)
           } else {
               char_count - (-start).to_usize().unwrap().min(char_count)
           };
           let length = length.map_or(char_count - start, |length| {
               length.to_usize().unwrap().min(char_count - start)
           });
   
           // Default both bounds to the end of the string, so a range that
           // reaches (or starts at) the end needs no special case.
           let mut start_byte = val.len();
           let mut end_byte = val.len();
           for (idx, (byte_idx, _)) in val.char_indices().enumerate() {
               if idx == start {
                   start_byte = byte_idx;
               }
               if idx == start + length {
                   end_byte = byte_idx;
                   break;
               }
           }
           // Could even be unchecked
           vals.append_slice(&val.as_bytes()[start_byte..end_byte]);
       }
       indexes.append(OffsetSize::from_usize(vals.len()).unwrap());
   }
   
   // Offsets buffer first, then the value bytes.
   // Assumes an arrow version where `null_bit_buffer` takes `Option<Buffer>`.
   let data = ArrayDataBuilder::new(array.data_type().clone())
       .len(array.len())
       .null_bit_buffer(nulls)
       .add_buffer(indexes.finish())
       .add_buffer(vals.finish());
   
   Ok(GenericStringArray::<OffsetSize>::from(unsafe { data.build_unchecked() }))
   ```
   
   _Originally posted by @tustvold in https://github.com/apache/arrow-rs/pull/1784#discussion_r890023586_
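
The core of the suggestion is mapping a character-based `start`/`length` onto a byte range with a single pass over `char_indices()`, so no temporary `String` is allocated. A minimal standalone sketch of that step, using only the standard library, might look like the following. The helper name `char_range_to_byte_range` is hypothetical and not part of arrow-rs, and the plain `as usize` casts assume a 64-bit target.

```rust
/// Hypothetical helper (not part of arrow-rs): maps a character-based
/// `start` / `length` request onto a byte range of `val`.
fn char_range_to_byte_range(val: &str, start: i64, length: Option<u64>) -> (usize, usize) {
    let char_count = val.chars().count();
    // A negative `start` counts back from the end; clamp it to the string.
    let start = if start >= 0 {
        (start as usize).min(char_count)
    } else {
        char_count - (start.unsigned_abs() as usize).min(char_count)
    };
    let length = length.map_or(char_count - start, |l| (l as usize).min(char_count - start));

    // Default both bounds to the end of the string, so a range that reaches
    // (or starts at) the end needs no special case.
    let mut start_byte = val.len();
    let mut end_byte = val.len();
    for (idx, (byte_idx, _)) in val.char_indices().enumerate() {
        if idx == start {
            start_byte = byte_idx;
        }
        if idx == start + length {
            end_byte = byte_idx;
            break;
        }
    }
    (start_byte, end_byte)
}

fn main() {
    let s = "héllo";
    let (a, b) = char_range_to_byte_range(s, 1, Some(3));
    println!("{}", &s[a..b]); // prints "éll"
}
```

Checking the two bounds with separate `if`s rather than `if`/`else if` matters when `length` is 0: both bounds then fall on the same character index, and an `else if` would never set the end bound.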


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #1800: Speed up `substring_by_char`

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #1800: Speed up `substring_by_char`
URL: https://github.com/apache/arrow-rs/issues/1800

