Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/11 22:13:28 UTC

[GitHub] [arrow-datafusion] isidentical opened a new pull request, #3804: Optimize `regexp_replace` when the input is a sparse array

isidentical opened a new pull request, #3804:
URL: https://github.com/apache/arrow-datafusion/pull/3804

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #3803.
   
   # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   This is something arrow-rs already does heavily, although there it is combined with the unsafe array-building APIs, which this PR deliberately avoids (the reasoning is at the end).
   
   
   It doesn't provide as much speed-up as the other `regexp_replace` optimizations, but it is a relatively straightforward change for a decent gain (see the benchmarks section). Unless I am missing something, it is also the final optimization candidate localized to `regexp_replace` itself.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   Instead of rebuilding the string array from an iterator, we now build the underlying array data ourselves, which lets us reuse the input's existing null buffer. In the generic version this couldn't be done without recombining the underlying bitmaps of all the inputs, but since this is the specialized case we know for a fact that the only array input is `strings` (the first argument).
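   To illustrate the idea, here is a minimal, std-only sketch (not the actual arrow-rs API) of the layout being built by hand: a string array is a values buffer, an offsets buffer, and a validity bitmap, and because nulls contribute no bytes to the values buffer, the input's validity information can be reused untouched.

   ```rust
   // Std-only sketch of the sparse-array optimization: apply `transform`
   // only to non-null slots, while every slot (null or not) still gets an
   // offset entry. The validity information is carried over from the input
   // as-is (in arrow, the actual null buffer would be reused directly).
   fn replace_sparse(
       input: &[Option<&str>],
       transform: impl Fn(&str) -> String,
   ) -> (Vec<u8>, Vec<i32>, Vec<bool>) {
       let mut values: Vec<u8> = Vec::new();
       let mut offsets: Vec<i32> = vec![0];
       // Modeled as Vec<bool> here for clarity; arrow stores this as a bitmap.
       let validity: Vec<bool> = input.iter().map(|v| v.is_some()).collect();

       for val in input {
           if let Some(val) = val {
               // Only non-null slots pay the cost of the replacement.
               values.extend_from_slice(transform(val).as_bytes());
           }
           // Null slots produce a zero-length span (offset does not advance).
           offsets.push(values.len() as i32);
       }
       (values, offsets, validity)
   }

   fn main() {
       let input = [Some("a"), None, Some("b")];
       let (values, offsets, validity) = replace_sparse(&input, |s| s.replace('a', "foo"));
       assert_eq!(values, b"foob".to_vec());
       assert_eq!(offsets, vec![0, 3, 3, 4]);
       assert_eq!(validity, vec![true, false, true]);
   }
   ```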
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   No. This is purely an internal optimization.
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   
   # Benchmarks
   
   As expected, there is no speed-up (and, notably, no slowdown either) in cases where the input array is very dense with data (a low number of NULLs). But depending on the input's null density, we see an average speed-up of about 20%.
   
   |                            | master | this PR | this PR (unsafe) | speed-up                                             |
   |----------------------------|--------|---------|------------------|------------------------------------------------------|
   | 100% null & 0% data        | 0.121  | 0.088   | 0.071            | 1.38x with this PR (1.71x with the unsafe version)   |
   | 80% null & 20% data        | 0.489  | 0.396   | 0.347            | 1.23x with this PR (1.40x with the unsafe version)   |
   | 50% null & 50% data        | 0.741  | 0.640   | 0.619            | 1.15x with this PR (1.20x with the unsafe version)   |
   | data > null (80-20, 100-0) | 1.104  | 1.101   | 1.067            | No change (<=3%, within noise)                       |
   
   (Built in release mode, for a variation of the query/dataset in my [example PR](https://github.com/isidentical/arrow-datafusion/pull/3).)
   
   I have also included a column showing the speed-ups with the unsafe version, but I am not sure it makes sense to increase the number of `unsafe` spots in DataFusion for a modest speed-up (even though it is sound). I'd be happy to change this PR to use the unsafe version ([which looks like this](https://github.com/isidentical/arrow-datafusion/pull/3)) if that is preferred.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #3804: Optimize `regexp_replace` when the input is a sparse array

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #3804:
URL: https://github.com/apache/arrow-datafusion/pull/3804#issuecomment-1276413621

   Thanks @isidentical 




[GitHub] [arrow-datafusion] isidentical commented on pull request #3804: Optimize `regexp_replace` when the input is a sparse array

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3804:
URL: https://github.com/apache/arrow-datafusion/pull/3804#issuecomment-1276611596

   @alamb thanks again for the review, I've added it as a separate test and it seems to work perfectly!




[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3804: Optimize `regexp_replace` when the input is a sparse array

Posted by GitBox <gi...@apache.org>.
isidentical commented on code in PR #3804:
URL: https://github.com/apache/arrow-datafusion/pull/3804#discussion_r993804566


##########
datafusion/physical-expr/src/regex_expressions.rs:
##########
@@ -513,4 +539,40 @@ mod tests {
             }
         }
     }
+
+    #[test]
+    fn test_static_pattern_regexp_replace_with_null_buffers() {
+        let values = StringArray::from(vec![
+            Some("a"),
+            None,
+            Some("b"),
+            None,
+            Some("a"),
+            None,
+            None,
+            Some("c"),
+        ]);
+        let patterns = StringArray::from(vec!["a"; 1]);
+        let replacements = StringArray::from(vec!["foo"; 1]);
+        let expected = StringArray::from(vec![
+            Some("foo"),
+            None,
+            Some("b"),
+            None,
+            Some("foo"),
+            None,
+            None,
+            Some("c"),
+        ]);
+
+        let re = _regexp_replace_static_pattern_replace::<i32>(&[
+            Arc::new(values),
+            Arc::new(patterns),
+            Arc::new(replacements),
+        ])
+        .unwrap();
+
+        assert_eq!(re.as_ref(), &expected);
+        assert_eq!(re.null_count(), 4);

Review Comment:
   Great catch, and thanks for the suggestion @alamb! Should be added now.





[GitHub] [arrow-datafusion] ursabot commented on pull request #3804: Optimize `regexp_replace` when the input is a sparse array

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #3804:
URL: https://github.com/apache/arrow-datafusion/pull/3804#issuecomment-1276693358

   Benchmark runs are scheduled for baseline = 37fe938261636bb7710b26cf26e2bbd010e1dbf0 and contender = d5c361b63c31b57d0052c501a94c1b1f7847b402. d5c361b63c31b57d0052c501a94c1b1f7847b402 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/771d3e27bfb749268f1ca404ac8381ef...a204bf4ce3de43d49eb2b14f804efd54/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/4d8b58f8432547f2ac9fe4008dc97baa...0dd61ae4d4c949f187b02aed7c2bf2df/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/7ae284c19742484a9a026ba1e92a28b5...899b8263e4f54567844667f07c715d09/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/1a8613467c4c4295865eead62a080b16...c1d3de67884f4861aef1b799d7b81db9/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3804: Optimize `regexp_replace` when the input is a sparse array

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3804:
URL: https://github.com/apache/arrow-datafusion/pull/3804#discussion_r993650161


##########
datafusion/physical-expr/src/regex_expressions.rs:
##########
@@ -254,13 +255,38 @@ fn _regexp_replace_static_pattern_replace<T: OffsetSizeTrait>(
     // with rust ones.
     let replacement = regex_replace_posix_groups(replacement);
 
-    let result = string_array
-        .iter()
-        .map(|string| {
-            string.map(|string| re.replacen(string, limit, replacement.as_str()))
-        })
-        .collect::<GenericStringArray<T>>();
-    Ok(Arc::new(result) as ArrayRef)
+    // We are going to create the underlying string buffer from its parts
+    // to be able to re-use the existing null buffer for sparse arrays.
+    let mut vals = BufferBuilder::<u8>::new({
+        let offsets = string_array.value_offsets();
+        (offsets[string_array.len()] - offsets[0])
+            .to_usize()
+            .unwrap()
+    });
+    let mut new_offsets = BufferBuilder::<T>::new(string_array.len() + 1);
+    new_offsets.append(T::zero());
+
+    string_array.iter().for_each(|val| {
+        if let Some(val) = val {
+            let result = re.replacen(val, limit, replacement.as_str());
+            vals.append_slice(result.as_bytes());
+        }
+        new_offsets.append(T::from_usize(vals.len()).unwrap());
+    });
+
+    let data = ArrayData::try_new(
+        GenericStringArray::<T>::DATA_TYPE,
+        string_array.len(),
+        string_array
+            .data_ref()
+            .null_buffer()
+            .map(|b| b.bit_slice(string_array.offset(), string_array.len())),

Review Comment:
   👍  for handling the offset



##########
datafusion/physical-expr/src/regex_expressions.rs:
##########
@@ -513,4 +539,40 @@ mod tests {
             }
         }
     }
+
+    #[test]
+    fn test_static_pattern_regexp_replace_with_null_buffers() {
+        let values = StringArray::from(vec![
+            Some("a"),
+            None,
+            Some("b"),
+            None,
+            Some("a"),
+            None,
+            None,
+            Some("c"),
+        ]);
+        let patterns = StringArray::from(vec!["a"; 1]);
+        let replacements = StringArray::from(vec!["foo"; 1]);
+        let expected = StringArray::from(vec![
+            Some("foo"),
+            None,
+            Some("b"),
+            None,
+            Some("foo"),
+            None,
+            None,
+            Some("c"),
+        ]);
+
+        let re = _regexp_replace_static_pattern_replace::<i32>(&[
+            Arc::new(values),
+            Arc::new(patterns),
+            Arc::new(replacements),
+        ])
+        .unwrap();
+
+        assert_eq!(re.as_ref(), &expected);
+        assert_eq!(re.null_count(), 4);

Review Comment:
   To test the offset slicing, I also recommend you run the same test with sliced inputs. Something like
   
   ```rust
   let values = values.slice(2, 5);
   let expected = StringArray::from(vec![
       Some("b"),
       None,
       Some("foo"),
       None,
       None,
   ]);

   let re = _regexp_replace_static_pattern_replace::<i32>(&[
       Arc::new(values),
       Arc::new(patterns),
       Arc::new(replacements),
   ])
   .unwrap();

   assert_eq!(re.as_ref(), &expected);
   assert_eq!(re.null_count(), 3);
   ```
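   For context, slicing an Arrow array does not copy its buffers: the array only records a logical offset, so the kernel must re-slice the validity bitmap itself. A std-only sketch of that bit-slicing step (hypothetical helpers, not the arrow-rs API):

   ```rust
   // Read one validity bit (Arrow bitmaps are LSB-first within each byte).
   fn get_bit(bitmap: &[u8], i: usize) -> bool {
       bitmap[i / 8] & (1 << (i % 8)) != 0
   }

   // Re-pack `len` validity bits starting at `offset` into a fresh bitmap,
   // analogous to what `Buffer::bit_slice(offset, len)` does in arrow.
   fn bit_slice(bitmap: &[u8], offset: usize, len: usize) -> Vec<u8> {
       let mut out = vec![0u8; (len + 7) / 8];
       for i in 0..len {
           if get_bit(bitmap, offset + i) {
               out[i / 8] |= 1 << (i % 8);
           }
       }
       out
   }

   fn main() {
       // Validity for the 8 test values above:
       // [valid, null, valid, null, valid, null, null, valid]
       let bitmap = vec![0b1001_0101u8];
       // Logical slice of length 5 starting at index 2 (like `values.slice(2, 5)`).
       let sliced = bit_slice(&bitmap, 2, 5);
       let bits: Vec<bool> = (0..5).map(|i| get_bit(&sliced, i)).collect();
       assert_eq!(bits, vec![true, false, true, false, false]);
   }
   ```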





[GitHub] [arrow-datafusion] alamb merged pull request #3804: Optimize `regexp_replace` when the input is a sparse array

Posted by GitBox <gi...@apache.org>.
alamb merged PR #3804:
URL: https://github.com/apache/arrow-datafusion/pull/3804

