You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/21 22:44:47 UTC

[GitHub] [arrow-rs] Ismail-Maj opened a new pull request, #1720: Implementation string concat

Ismail-Maj opened a new pull request, #1720:
URL: https://github.com/apache/arrow-rs/pull/1720

   # Which issue does this PR close?
   <!---
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1540 .
   
   # Rationale for this change
    
    <!---
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   Add an implementation for string concat and a single test.
   <!---
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!---
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r880499706


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(

Review Comment:
   I also agree that a new string (or `utf8` module) would be good here. 
   
   There are a bunch of other potential candidates in datafusion to move into arrow-rs (kudos to @ovr  for implementing many of them initially): https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/string_expressions.rs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#issuecomment-1134425708

   Looking good :smile: I think the major thing this now needs to handle is sliced inputs, and tests. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r881991590


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(

Review Comment:
   @HaoYang670  would you like to file the follow on issues, or shall I?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ismail-Maj commented on pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
Ismail-Maj commented on PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#issuecomment-1135510100

   Only tests left to do (and move the code from `concat.rs`) I believe.
   @tustvold do you think this code can be ported to handle other `DataType`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#issuecomment-1133879185

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1720?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1720](https://codecov.io/gh/apache/arrow-rs/pull/1720?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9e6a0f2) into [master](https://codecov.io/gh/apache/arrow-rs/commit/7b164a07b43bc579bb5259fe0f95bf4332661922?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (7b164a0) will **increase** coverage by `0.01%`.
   > The diff coverage is `100.00%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1720      +/-   ##
   ==========================================
   + Coverage   83.32%   83.33%   +0.01%     
   ==========================================
     Files         195      195              
     Lines       56044    56059      +15     
   ==========================================
   + Hits        46698    46718      +20     
   + Misses       9346     9341       -5     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1720?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/compute/kernels/concat.rs](https://codecov.io/gh/apache/arrow-rs/pull/1720/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9jb25jYXQucnM=) | `100.00% <100.00%> (ø)` | |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1720/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.65% <0.00%> (+0.19%)` | :arrow_up: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1720/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `87.02% <0.00%> (+0.22%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1720/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (+0.22%)` | :arrow_up: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1720/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `65.79% <0.00%> (+0.37%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1720?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1720?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [7b164a0...9e6a0f2](https://codecov.io/gh/apache/arrow-rs/pull/1720?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r878851945


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(

Review Comment:
   Perhaps @alamb has some thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r879216220


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,72 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    // TODO: Handle non-zero offset in source ArrayData
+
+    if left.len() != right.len() {
+        return Err(ArrowError::ComputeError(
+            "StringArrays must have the same length".to_string(),
+        ));
+    }
+
+    let output_bitmap = match (left.data().null_bitmap(), right.data().null_bitmap()) {
+        (Some(left_bitmap), Some(right_bitmap)) => Some((left_bitmap & right_bitmap)?),
+        (Some(left_bitmap), None) => Some(left_bitmap.clone()),
+        (None, Some(right_bitmap)) => Some(right_bitmap.clone()),
+        (None, None) => None,
+    };
+
+    let left_offsets = left.value_offsets();
+    let right_offsets = right.value_offsets();
+
+    let left_buffer = left.value_data();
+    let right_buffer = right.value_data();
+    let left_values = left_buffer.as_slice();
+    let right_values = right_buffer.as_slice();
+
+    let mut output_offsets = BufferBuilder::<Offset>::new(left_offsets.len());
+    let mut output_values =
+        BufferBuilder::<u8>::new(left_values.len() + right_values.len());
+
+    output_offsets.append(Offset::zero());
+    for (idx, (l, r)) in left_offsets
+        .windows(2)
+        .zip(right_offsets.windows(2))
+        .enumerate()
+    {
+        match &output_bitmap {
+            Some(bitmap) if { bitmap.is_set(idx) } => {

Review Comment:
   Perhaps I'm missing something, but I think we can ignore the null mask in the loop body. Yes we might copy values only to mark them as null in the output bitmask, but this doesn't actually matter. There is no requirement for null values to have zero slice length.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r879217806


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,72 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    // TODO: Handle non-zero offset in source ArrayData
+
+    if left.len() != right.len() {
+        return Err(ArrowError::ComputeError(
+            "StringArrays must have the same length".to_string(),
+        ));
+    }
+
+    let output_bitmap = match (left.data().null_bitmap(), right.data().null_bitmap()) {
+        (Some(left_bitmap), Some(right_bitmap)) => Some((left_bitmap & right_bitmap)?),
+        (Some(left_bitmap), None) => Some(left_bitmap.clone()),
+        (None, Some(right_bitmap)) => Some(right_bitmap.clone()),
+        (None, None) => None,
+    };
+
+    let left_offsets = left.value_offsets();
+    let right_offsets = right.value_offsets();
+
+    let left_buffer = left.value_data();
+    let right_buffer = right.value_data();
+    let left_values = left_buffer.as_slice();
+    let right_values = right_buffer.as_slice();
+
+    let mut output_offsets = BufferBuilder::<Offset>::new(left_offsets.len());
+    let mut output_values =
+        BufferBuilder::<u8>::new(left_values.len() + right_values.len());

Review Comment:
   Depends on the logic of the values copy loop, if you just blindly ignore the null mask and copy everything, which will likely be faster, this is the exact size of the output values. Even if you do filter in the values copy loop, its a reasonable over-estimate
   
   Edit: It is actually incorrect in the presence of non-zero offsets or slices, it needs to be
   
   ```
   left_offsets[left.len()] - left_offsets[0] + right_offsets[right.len()] - right_offsets[0]
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r878850863


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    let left_bitmap = left.data().null_bitmap().unwrap();
+    let right_bitmap = right.data().null_bitmap().unwrap();
+    let concat_bitmap = (left_bitmap & right_bitmap).unwrap();
+    Ok((0..left.len().max(right.len()))
+        .map(|i| {
+            if concat_bitmap.is_set(i) {
+                Some(left.value(i).to_owned() + right.value(i))
+            } else {
+                None
+            }
+        })
+        .collect::<GenericStringArray<Offset>>())

Review Comment:
   I think it would be **significantly** faster to do something like (not at all tested)
   
   ```
   // TODO: Handle non-zero offset in source ArrayData
   
   if left.len() != right.len() {
       return Err(...)
   }
   
   let nulls = match (left.data().null_bitmap(), right.data.null_bitmap()) {
     (Some(left), Some(right)) = Some((left & right)?)
     (Some(left), None) => Some(left),
     (None, Some(right)) => Some(right),
     (None, None) => None,
   };
   let left_offsets = left.value_offsets();
   let right_offsets = right.value_offsets();
   
   let left_values = left.value_data().as_slice();
   let right_values = right.value_data().as_slice();
   
   let left_iter = left_offsets.windows(2);
   let right_iter = right_offsets.windows(2);
   
   let mut output_offsets = BufferBuilder::<Offset>::new(left_offsets.len());
   let mut output_values = BufferBuilder::<u8>::new(left_data.len() + right_data.len());
   let mut cur_offset = 0;
   output_offsets.append(0);
   
   for (left, right) in left_iter.zip(right_iter) {
     let left_len = left[1] - left[0];
     let right_len = right[1] - right[0];
     cur_offset += left_len + right_len;
     output_offsets.append(cur_offset); // With checked case
     output_values.append_slice(&left_values[left[0]..left[1]]);
     output_values.append_slice(&right_values[right[0]..right[1]]);)
   }
   
   let mut builder = ArrayDataBuilder::new(Offset::get_data_type)
     .add_buffer(output_offsets.finish())
     .add_buffer(output_values.finish());
   
   if let Some(nulls) = nulls {
     builder = builder.null_bit_buffer(nulls);
   }
   
   unsafe {builder.build_unchecked()}
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r879217806


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,72 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    // TODO: Handle non-zero offset in source ArrayData
+
+    if left.len() != right.len() {
+        return Err(ArrowError::ComputeError(
+            "StringArrays must have the same length".to_string(),
+        ));
+    }
+
+    let output_bitmap = match (left.data().null_bitmap(), right.data().null_bitmap()) {
+        (Some(left_bitmap), Some(right_bitmap)) => Some((left_bitmap & right_bitmap)?),
+        (Some(left_bitmap), None) => Some(left_bitmap.clone()),
+        (None, Some(right_bitmap)) => Some(right_bitmap.clone()),
+        (None, None) => None,
+    };
+
+    let left_offsets = left.value_offsets();
+    let right_offsets = right.value_offsets();
+
+    let left_buffer = left.value_data();
+    let right_buffer = right.value_data();
+    let left_values = left_buffer.as_slice();
+    let right_values = right_buffer.as_slice();
+
+    let mut output_offsets = BufferBuilder::<Offset>::new(left_offsets.len());
+    let mut output_values =
+        BufferBuilder::<u8>::new(left_values.len() + right_values.len());

Review Comment:
   Depends on the logic of the values copy loop, if you just blindly ignore the null mask and copy everything, which will likely be faster, this is the exact size of the output values. Even if you do filter in the values copy loop, its a reasonable over-estimate



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#issuecomment-1134013185

   @Ismail-Maj Really a good start! Thank you for contributing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r878848263


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    let left_bitmap = left.data().null_bitmap().unwrap();

Review Comment:
   We should definitely handle the case where one or both lack a null bitmap



##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    let left_bitmap = left.data().null_bitmap().unwrap();
+    let right_bitmap = right.data().null_bitmap().unwrap();
+    let concat_bitmap = (left_bitmap & right_bitmap).unwrap();
+    Ok((0..left.len().max(right.len()))

Review Comment:
   I think it should be an error if the lengths differ?



##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -569,4 +588,23 @@ mod tests {
         assert!(!copy.data().child_data()[0].ptr_eq(&combined.data().child_data()[0]));
         assert!(!new.data().child_data()[0].ptr_eq(&combined.data().child_data()[0]));
     }
+
+    #[cfg(feature = "test_utils")]
+    #[test]

Review Comment:
   Some things to check
   
   * Arrays with no nulls
   * Arrays with non-zero offset (e.g. those produced by Array::slice)
   * Arrays with empty strings
   * Arrays with different lengths (should error)



##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    let left_bitmap = left.data().null_bitmap().unwrap();
+    let right_bitmap = right.data().null_bitmap().unwrap();
+    let concat_bitmap = (left_bitmap & right_bitmap).unwrap();
+    Ok((0..left.len().max(right.len()))
+        .map(|i| {
+            if concat_bitmap.is_set(i) {
+                Some(left.value(i).to_owned() + right.value(i))
+            } else {
+                None
+            }
+        })
+        .collect::<GenericStringArray<Offset>>())

Review Comment:
   I think it would be **significantly** faster to do something like (not at all tested)
   
   ```
   // TODO: Handle non-zero offset in source ArrayData
   
   if left.len() != right.len() {
       return Err(...)
   }
   
   let bitmap = match (left.data().null_bitmap(), right.data.null_bitmap()) {
     (Some(left), Some(right)) = Some((left & right)?)
     (Some(left), None) => Some(left),
     (None, Some(right)) => Some(right),
     (None, None) => None,
   };
   let left_offsets = left.value_offsets();
   let right_offsets = right.value_offsets();
   
   let left_values = left.value_data().as_slice();
   let right_values = right.value_data().as_slice();
   
   let left_iter = left_offsets.windows(2);
   let right_iter = right_offsets.windows(2);
   
   let mut output_offsets = BufferBuilder::<Offset>::new(left_offsets.len());
   let mut output_values = BufferBuilder::<u8>::new(left_data.len() + right_data.len());
   let mut cur_offset = 0;
   output_offsets.append(0);
   
   for (left, right) in left_iter.zip(right_iter) {
     let left_len = left[1] - left[0];
     let right_len = right[1] - right[0];
     cur_offset += left_len + right_len;
     output_offsets.append(cur_offset); // With checked case
     output_values.append_slice(&left_values[left[0]..left[1]]);
     output_values.append_slice(&right_values[right[0]..right[1]]);)
   }
   
   let mut builder = ArrayDataBuilder::new(Offset::get_data_type)
     .add_buffer(output_offsets.finish())
     .add_buffer(output_values.finish());
   
   if let Some(buffer) = bitmap {
     builder = builder.null_bit_buffer(bitmap);
   }
   
   unsafe {builder.build_unchecked()}
   ```
   



##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(

Review Comment:
   I wonder if this should be in a new string module, as opposed to concat, as I could see it causing confusion that this performs element-wise concatenation, when the concat kernel above concatenates arrays?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r879054723


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,72 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays

Review Comment:
   Would be nicer to add user document.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] martin-g commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
martin-g commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r879054673


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,72 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    // TODO: Handle non-zero offset in source ArrayData
+
+    if left.len() != right.len() {
+        return Err(ArrowError::ComputeError(
+            "StringArrays must have the same length".to_string(),

Review Comment:
   IMO it would be useful if the error message contains the lengths



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r880501204


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    let left_bitmap = left.data().null_bitmap().unwrap();
+    let right_bitmap = right.data().null_bitmap().unwrap();
+    let concat_bitmap = (left_bitmap & right_bitmap).unwrap();
+    Ok((0..left.len().max(right.len()))
+        .map(|i| {
+            if concat_bitmap.is_set(i) {
+                Some(left.value(i).to_owned() + right.value(i))
+            } else {
+                None
+            }
+        })
+        .collect::<GenericStringArray<Offset>>())

Review Comment:
   > "valid_str" + None => None
   
   This is the standard SQL semantic and it is how most other arrow kernels I know of behave, thus I think it would be best for consistency



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ismail-Maj commented on pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
Ismail-Maj commented on PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#issuecomment-1136984423

   @tustvold it should be good now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#issuecomment-1136918975

   @Ismail-Maj could you possibly merge the latest master to bring in #1743 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r881991106


##########
arrow/src/compute/kernels/string.rs:
##########
@@ -0,0 +1,196 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::*;
+use crate::compute::util::combine_option_bitmap;
+use crate::error::{ArrowError, Result};
+
+/// Returns the elementwise concatenation of `StringArray`.
+///
+/// An index of the resulting `StringArray` is null if any of `StringArray` are null at that location.
+///
+/// ```text
+/// e.g:
+///
+///   ["Hello"] + ["World"] = ["HelloWorld"]
+///
+///   ["a", "b"] + [None, "c"] = [None, "bc"]
+/// ```
+///
+/// Attention: `left` and `right` must have the same length.
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    if left.len() != right.len() {
+        return Err(ArrowError::ComputeError(format!(
+            "Arrays must have the same length: {} != {}",
+            left.len(),
+            right.len()
+        )));
+    }
+
+    let output_bitmap = combine_option_bitmap(left.data(), right.data(), left.len())?;
+
+    let left_offsets = left.value_offsets();
+    let right_offsets = right.value_offsets();
+
+    let left_buffer = left.value_data();
+    let right_buffer = right.value_data();
+    let left_values = left_buffer.as_slice();
+    let right_values = right_buffer.as_slice();
+
+    let mut output_values = BufferBuilder::<u8>::new(
+        left_values.len() + right_values.len()
+            - left_offsets[0].to_usize().unwrap()
+            - right_offsets[0].to_usize().unwrap(),
+    );
+
+    let mut output_offsets = BufferBuilder::<Offset>::new(left_offsets.len());
+    output_offsets.append(Offset::zero());
+    for (left_idx, right_idx) in left_offsets.windows(2).zip(right_offsets.windows(2)) {
+        output_values.append_slice(
+            &left_values
+                [left_idx[0].to_usize().unwrap()..left_idx[1].to_usize().unwrap()],
+        );
+        output_values.append_slice(
+            &right_values
+                [right_idx[0].to_usize().unwrap()..right_idx[1].to_usize().unwrap()],
+        );
+        output_offsets.append(Offset::from_usize(output_values.len()).unwrap());
+    }
+
+    let mut builder =
+        ArrayDataBuilder::new(GenericStringArray::<Offset>::get_data_type())
+            .len(left.len())
+            .add_buffer(output_offsets.finish())
+            .add_buffer(output_values.finish());
+
+    if let Some(null_bitmap) = output_bitmap {
+        builder = builder.null_bit_buffer(null_bitmap);
+    }
+
+    // SAFETY - offsets valid by construction
+    Ok(unsafe { builder.build_unchecked() }.into())
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    #[test]
+    fn test_string_concat() {
+        let left = [Some("foo"), Some("bar"), None]
+            .into_iter()
+            .collect::<StringArray>();
+        let right = [None, Some("yyy"), Some("zzz")]
+            .into_iter()
+            .collect::<StringArray>();
+
+        let output = string_concat(&left, &right).unwrap();
+
+        let expected = [None, Some("baryyy"), None]
+            .into_iter()
+            .collect::<StringArray>();
+
+        assert_eq!(output, expected);
+    }
+
+    #[test]
+    fn test_string_concat_empty_string() {
+        let left = [Some("foo"), Some(""), Some("bar")]
+            .into_iter()
+            .collect::<StringArray>();
+        let right = [Some("baz"), Some(""), Some("")]
+            .into_iter()
+            .collect::<StringArray>();
+
+        let output = string_concat(&left, &right).unwrap();
+
+        let expected = [Some("foobaz"), Some(""), Some("bar")]
+            .into_iter()
+            .collect::<StringArray>();
+
+        assert_eq!(output, expected);
+    }
+
+    #[test]
+    fn test_string_concat_no_null() {
+        let left = StringArray::from(vec!["foo", "bar"]);
+        let right = StringArray::from(vec!["bar", "baz"]);
+
+        let output = string_concat(&left, &right).unwrap();
+
+        let expected = StringArray::from(vec!["foobar", "barbaz"]);
+
+        assert_eq!(output, expected);
+    }
+
+    #[test]
+    fn test_string_concat_error() {
+        let left = StringArray::from(vec!["foo", "bar"]);
+        let right = StringArray::from(vec!["baz"]);
+
+        let output = string_concat(&left, &right);
+
+        assert!(output.is_err());
+    }
+
+    #[test]
+    fn test_string_concat_slice() {
+        let left = &StringArray::from(vec![None, Some("foo"), Some("bar"), Some("baz")]);
+        let right = &StringArray::from(vec![Some("boo"), None, Some("far"), Some("faz")]);
+
+        let left_slice = left.slice(0, 3);
+        let right_slice = right.slice(1, 3);
+        let output = string_concat(
+            left_slice
+                .as_any()
+                .downcast_ref::<GenericStringArray<i32>>()
+                .unwrap(),
+            right_slice
+                .as_any()
+                .downcast_ref::<GenericStringArray<i32>>()
+                .unwrap(),
+        )
+        .unwrap();
+
+        let expected = [None, Some("foofar"), Some("barfaz")]

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#issuecomment-1135921699

   >> Only tests left to do (and move the code from concat.rs) I believe.
   > @tustvold do you think this code can be ported to handle other DataTypes?
   
   I personally suggest completing this PR and getting it merged first, and then add other DataTypes as follow on tickets (you might find others interested in working on them as well, if you don't have time)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r878852329


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(

Review Comment:
   I agree with @tustvold to create a new compute kernel. Also, list, binary, fixed size list and fix size binary can all be supported.



##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(

Review Comment:
   Maybe we should support concatenate arbitrary number of arrays. We could file a follow-on issue to track it.



##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    let left_bitmap = left.data().null_bitmap().unwrap();
+    let right_bitmap = right.data().null_bitmap().unwrap();
+    let concat_bitmap = (left_bitmap & right_bitmap).unwrap();
+    Ok((0..left.len().max(right.len()))
+        .map(|i| {
+            if concat_bitmap.is_set(i) {
+                Some(left.value(i).to_owned() + right.value(i))
+            } else {
+                None
+            }
+        })
+        .collect::<GenericStringArray<Offset>>())

Review Comment:
   We could file a follow-on issue to add benchmarks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tfeda commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
tfeda commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r878891034


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    let left_bitmap = left.data().null_bitmap().unwrap();
+    let right_bitmap = right.data().null_bitmap().unwrap();
+    let concat_bitmap = (left_bitmap & right_bitmap).unwrap();
+    Ok((0..left.len().max(right.len()))
+        .map(|i| {
+            if concat_bitmap.is_set(i) {
+                Some(left.value(i).to_owned() + right.value(i))
+            } else {
+                None
+            }
+        })
+        .collect::<GenericStringArray<Offset>>())

Review Comment:
   A couple thoughts, because I spent some time digging into this:
   1. @alamb's example in #1720 suggests that "valid_str" + `None` => `None`, where both implementations in this PR go for "valid_str" + `None` => "valid_str". As a user, I'd prefer the later, but I thought I'd make a note of it. If the former is chosen, then `compute::util` has [`combine_option_bitmap()`](https://github.com/apache/arrow-rs/blob/4de689598df6ea284452e687d69c7654b5a71762/arrow/src/compute/util.rs#L31) which works nicely for the null handling 
   3. How do we handle cases when concatenating two arrays results in the offsets overflowing? There should probably be a test case for that. 
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ismail-Maj commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
Ismail-Maj commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r879213096


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,72 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    // TODO: Handle non-zero offset in source ArrayData
+
+    if left.len() != right.len() {
+        return Err(ArrowError::ComputeError(
+            "StringArrays must have the same length".to_string(),
+        ));
+    }
+
+    let output_bitmap = match (left.data().null_bitmap(), right.data().null_bitmap()) {
+        (Some(left_bitmap), Some(right_bitmap)) => Some((left_bitmap & right_bitmap)?),
+        (Some(left_bitmap), None) => Some(left_bitmap.clone()),
+        (None, Some(right_bitmap)) => Some(right_bitmap.clone()),
+        (None, None) => None,
+    };
+
+    let left_offsets = left.value_offsets();
+    let right_offsets = right.value_offsets();
+
+    let left_buffer = left.value_data();
+    let right_buffer = right.value_data();
+    let left_values = left_buffer.as_slice();
+    let right_values = right_buffer.as_slice();
+
+    let mut output_offsets = BufferBuilder::<Offset>::new(left_offsets.len());
+    let mut output_values =
+        BufferBuilder::<u8>::new(left_values.len() + right_values.len());
+
+    output_offsets.append(Offset::zero());
+    for (idx, (l, r)) in left_offsets
+        .windows(2)
+        .zip(right_offsets.windows(2))
+        .enumerate()
+    {
+        match &output_bitmap {
+            Some(bitmap) if { bitmap.is_set(idx) } => {

Review Comment:
   bug here, handle `None`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r882232506


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(

Review Comment:
   I will.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r878850863


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    let left_bitmap = left.data().null_bitmap().unwrap();
+    let right_bitmap = right.data().null_bitmap().unwrap();
+    let concat_bitmap = (left_bitmap & right_bitmap).unwrap();
+    Ok((0..left.len().max(right.len()))
+        .map(|i| {
+            if concat_bitmap.is_set(i) {
+                Some(left.value(i).to_owned() + right.value(i))
+            } else {
+                None
+            }
+        })
+        .collect::<GenericStringArray<Offset>>())

Review Comment:
   I think it would be **significantly** faster to do something like (not at all tested)
   
   ```
   // TODO: Handle non-zero offset in source ArrayData
   
   if left.len() != right.len() {
       return Err(...)
   }
   
   let nulls = match (left.data().null_bitmap(), right.data.null_bitmap()) {
     (Some(left), Some(right)) = Some((left & right)?)
     (Some(left), None) => Some(left),
     (None, Some(right)) => Some(right),
     (None, None) => None,
   };
   let left_offsets = left.value_offsets();
   let right_offsets = right.value_offsets();
   
   let left_values = left.value_data().as_slice();
   let right_values = right.value_data().as_slice();
   
   let left_iter = left_offsets.windows(2);
   let right_iter = right_offsets.windows(2);
   
   let mut output_offsets = BufferBuilder::<Offset>::new(left_offsets.len());
   let mut output_values = BufferBuilder::<u8>::new(left_data.len() + right_data.len());
   output_offsets.append(0);
   
   for (left, right) in left_iter.zip(right_iter) {
     output_values.append_slice(&left_values[left[0]..left[1]]);
     output_values.append_slice(&right_values[right[0]..right[1]]);)
     
     output_offsets.append(output_values.len()); // With checked cast
   }
   
   let mut builder = ArrayDataBuilder::new(Offset::get_data_type)
     .len(left.len())
     .add_buffer(output_offsets.finish())
     .add_buffer(output_values.finish());
   
   if let Some(nulls) = nulls {
     builder = builder.null_bit_buffer(nulls);
   }
   
   // SAFETY - offsets valid by construction
   unsafe {builder.build_unchecked()}
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ismail-Maj commented on pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
Ismail-Maj commented on PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#issuecomment-1133927976

   @tustvold Thank you for the review and guidance!
   I've pushed something that is faster and close to your implementation.
   I'm still unfamiliar with the codebase, so I'm sure there are still some mistakes.
   
   I'll add more test cases and support for non-zero offset later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r878850863


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,25 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    let left_bitmap = left.data().null_bitmap().unwrap();
+    let right_bitmap = right.data().null_bitmap().unwrap();
+    let concat_bitmap = (left_bitmap & right_bitmap).unwrap();
+    Ok((0..left.len().max(right.len()))
+        .map(|i| {
+            if concat_bitmap.is_set(i) {
+                Some(left.value(i).to_owned() + right.value(i))
+            } else {
+                None
+            }
+        })
+        .collect::<GenericStringArray<Offset>>())

Review Comment:
   I think it would be **significantly** faster to do something like (not at all tested)
   
   ```
   // TODO: Handle non-zero offset in source ArrayData
   
   if left.len() != right.len() {
       return Err(...)
   }
   
   let nulls = match (left.data().null_bitmap(), right.data.null_bitmap()) {
     (Some(left), Some(right)) = Some((left & right)?)
     (Some(left), None) => Some(left),
     (None, Some(right)) => Some(right),
     (None, None) => None,
   };
   let left_offsets = left.value_offsets();
   let right_offsets = right.value_offsets();
   
   let left_values = left.value_data().as_slice();
   let right_values = right.value_data().as_slice();
   
   let left_iter = left_offsets.windows(2);
   let right_iter = right_offsets.windows(2);
   
   let mut output_offsets = BufferBuilder::<Offset>::new(left_offsets.len());
   let mut output_values = BufferBuilder::<u8>::new(left_data.len() + right_data.len());
   let mut cur_offset = 0;
   output_offsets.append(0);
   
   for (left, right) in left_iter.zip(right_iter) {
     let left_len = left[1] - left[0];
     let right_len = right[1] - right[0];
     cur_offset += left_len + right_len;
     output_offsets.append(cur_offset); // With checked case
     output_values.append_slice(&left_values[left[0]..left[1]]);
     output_values.append_slice(&right_values[right[0]..right[1]]);)
   }
   
   let mut builder = ArrayDataBuilder::new(Offset::get_data_type)
     .len(left.len())
     .add_buffer(output_offsets.finish())
     .add_buffer(output_values.finish());
   
   if let Some(nulls) = nulls {
     builder = builder.null_bit_buffer(nulls);
   }
   
   unsafe {builder.build_unchecked()}
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r879059139


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,72 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    // TODO: Handle non-zero offset in source ArrayData
+
+    if left.len() != right.len() {
+        return Err(ArrowError::ComputeError(
+            "StringArrays must have the same length".to_string(),
+        ));
+    }
+
+    let output_bitmap = match (left.data().null_bitmap(), right.data().null_bitmap()) {
+        (Some(left_bitmap), Some(right_bitmap)) => Some((left_bitmap & right_bitmap)?),
+        (Some(left_bitmap), None) => Some(left_bitmap.clone()),
+        (None, Some(right_bitmap)) => Some(right_bitmap.clone()),
+        (None, None) => None,
+    };
+
+    let left_offsets = left.value_offsets();
+    let right_offsets = right.value_offsets();
+
+    let left_buffer = left.value_data();
+    let right_buffer = right.value_data();
+    let left_values = left_buffer.as_slice();
+    let right_values = right_buffer.as_slice();
+
+    let mut output_offsets = BufferBuilder::<Offset>::new(left_offsets.len());
+    let mut output_values =
+        BufferBuilder::<u8>::new(left_values.len() + right_values.len());

Review Comment:
   Is this capacity too much? Consider we only concat elements from two StringArrays if both are available.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb merged pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
alamb merged PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1720: Implementation string concat

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1720:
URL: https://github.com/apache/arrow-rs/pull/1720#discussion_r879218896


##########
arrow/src/compute/kernels/concat.rs:
##########
@@ -102,6 +102,72 @@ pub fn concat(arrays: &[&dyn Array]) -> Result<ArrayRef> {
     Ok(make_array(mutable.freeze()))
 }
 
+// Elementwise concatenation of StringArrays
+pub fn string_concat<Offset: OffsetSizeTrait>(
+    left: &GenericStringArray<Offset>,
+    right: &GenericStringArray<Offset>,
+) -> Result<GenericStringArray<Offset>> {
+    // TODO: Handle non-zero offset in source ArrayData
+
+    if left.len() != right.len() {
+        return Err(ArrowError::ComputeError(
+            "StringArrays must have the same length".to_string(),
+        ));
+    }
+
+    let output_bitmap = match (left.data().null_bitmap(), right.data().null_bitmap()) {

Review Comment:
   Using combine_option_bitmap as suggested by @tfeda is probably a good idea - this currently won't handle offsets.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org