You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/24 12:21:49 UTC

[GitHub] [arrow-rs] alamb commented on a change in pull request #706: Implement `regexp_matches_utf8`

alamb commented on a change in pull request #706:
URL: https://github.com/apache/arrow-rs/pull/706#discussion_r694791485



##########
File path: arrow/src/compute/kernels/comparison.rs
##########
@@ -450,6 +450,250 @@ pub fn nlike_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
     Ok(BooleanArray::from(data))
 }
 
+/// Perform SQL `array ~ regex_array` operation on [`StringArray`] / [`LargeStringArray`].

Review comment:
       What is the supported format of the `flags_arrays`? I think we should either document that explicitly here or provide a link to the documentation
   
   It also might make sense to document the fact that an empty value in `regex_array` will return true

##########
File path: arrow/src/compute/kernels/comparison.rs
##########
@@ -450,6 +450,250 @@ pub fn nlike_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
     Ok(BooleanArray::from(data))
 }
 
+/// Perform SQL `array ~ regex_array` operation on [`StringArray`] / [`LargeStringArray`].
+pub fn regexp_matches_utf8<OffsetSize: StringOffsetSizeTrait>(
+    array: &GenericStringArray<OffsetSize>,
+    regex_array: &GenericStringArray<OffsetSize>,
+    flags_array: Option<&GenericStringArray<OffsetSize>>,
+) -> Result<BooleanArray> {
+    if array.len() != regex_array.len() {
+        return Err(ArrowError::ComputeError(
+            "Cannot perform comparison operation on arrays of different length"
+                .to_string(),
+        ));
+    }
+    let null_bit_buffer =
+        combine_option_bitmap(array.data_ref(), regex_array.data_ref(), array.len())?;
+
+    let mut patterns: HashMap<String, Regex> = HashMap::new();
+    let mut result = BooleanBufferBuilder::new(array.len());
+
+    let complete_pattern = match flags_array {

Review comment:
       It is unfortunate to do so much `String` allocations here, though I see the reason it is needed (so the iterator you are zipping always has the same type). It is also the same pattern in the `regexp_matches` code.
   
   So no action needed here, but probably worth keeping in mind if we ever see this as a performance bottleneck in a query
   

##########
File path: arrow/src/compute/kernels/comparison.rs
##########
@@ -450,6 +450,250 @@ pub fn nlike_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
     Ok(BooleanArray::from(data))
 }
 
+/// Perform SQL `array ~ regex_array` operation on [`StringArray`] / [`LargeStringArray`].
+pub fn regexp_matches_utf8<OffsetSize: StringOffsetSizeTrait>(

Review comment:
       As mentioned in https://github.com/apache/arrow-rs/issues/697#issue-973700692, the name `regexp_matches_utf8` is pretty similar to [`regexp_match`](https://github.com/b41sh/arrow-rs/blob/regexp_matches_utf8/arrow/src/compute/kernels/regexp.rs#L33) which I think might cause confusion
   
   It looks like the rust regular expression library uses `is_match` so I wonder if following that model might be clearer?
   
   So instead of `regexp_matches_utf8` maybe calling this `regexp_is_match_utf8` (and similarly below).
   
   @seddonm1 / @nevi-me / @jorgecarleitao  do you have any thoughts regarding the naming of a function that checks if values match regular expressions?

##########
File path: arrow/src/compute/kernels/comparison.rs
##########
@@ -1438,6 +1682,82 @@ mod tests {
         };
     }
 
+    macro_rules! test_flag_utf8 {
+        ($test_name:ident, $left:expr, $right:expr, $op:expr, $expected:expr) => {
+            #[test]
+            fn $test_name() {
+                let left = StringArray::from($left);
+                let right = StringArray::from($right);
+                let res = $op(&left, &right, None).unwrap();
+                let expected = $expected;
+                assert_eq!(expected.len(), res.len());
+                for i in 0..res.len() {
+                    let v = res.value(i);
+                    assert_eq!(v, expected[i]);
+                }
+            }
+        };
+        ($test_name:ident, $left:expr, $right:expr, $flag:expr, $op:expr, $expected:expr) => {
+            #[test]
+            fn $test_name() {
+                let left = StringArray::from($left);
+                let right = StringArray::from($right);
+                let flag = Some(StringArray::from($flag));
+                let res = $op(&left, &right, flag.as_ref()).unwrap();
+                let expected = $expected;
+                assert_eq!(expected.len(), res.len());
+                for i in 0..res.len() {
+                    let v = res.value(i);
+                    assert_eq!(v, expected[i]);
+                }
+            }
+        };
+    }
+
+    macro_rules! test_flag_utf8_scalar {
+        ($test_name:ident, $left:expr, $right:expr, $op:expr, $expected:expr) => {
+            #[test]
+            fn $test_name() {
+                let left = StringArray::from($left);
+                let res = $op(&left, $right, None).unwrap();
+                let expected = $expected;
+                assert_eq!(expected.len(), res.len());
+                for i in 0..res.len() {
+                    let v = res.value(i);
+                    assert_eq!(
+                        v,
+                        expected[i],
+                        "unexpected result when comparing {} at position {} to {} ",
+                        left.value(i),
+                        i,
+                        $right
+                    );
+                }
+            }
+        };
+        ($test_name:ident, $left:expr, $right:expr, $flag:expr, $op:expr, $expected:expr) => {

Review comment:
       this is cool (to have a version of the macro with and without matches)

##########
File path: arrow/src/compute/kernels/comparison.rs
##########
@@ -450,6 +450,250 @@ pub fn nlike_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
     Ok(BooleanArray::from(data))
 }
 
+/// Perform SQL `array ~ regex_array` operation on [`StringArray`] / [`LargeStringArray`].
+pub fn regexp_matches_utf8<OffsetSize: StringOffsetSizeTrait>(
+    array: &GenericStringArray<OffsetSize>,
+    regex_array: &GenericStringArray<OffsetSize>,
+    flags_array: Option<&GenericStringArray<OffsetSize>>,
+) -> Result<BooleanArray> {
+    if array.len() != regex_array.len() {
+        return Err(ArrowError::ComputeError(
+            "Cannot perform comparison operation on arrays of different length"
+                .to_string(),
+        ));
+    }
+    let null_bit_buffer =
+        combine_option_bitmap(array.data_ref(), regex_array.data_ref(), array.len())?;
+
+    let mut patterns: HashMap<String, Regex> = HashMap::new();
+    let mut result = BooleanBufferBuilder::new(array.len());
+
+    let complete_pattern = match flags_array {
+        Some(flags) => Box::new(regex_array.iter().zip(flags.iter()).map(
+            |(pattern, flags)| {
+                pattern.map(|pattern| match flags {
+                    Some(flag) => format!("(?{}){}", flag, pattern),
+                    None => pattern.to_string(),
+                })
+            },
+        )) as Box<dyn Iterator<Item = Option<String>>>,
+        None => Box::new(
+            regex_array
+                .iter()
+                .map(|pattern| pattern.map(|pattern| pattern.to_string())),
+        ),
+    };
+
+    array
+        .iter()
+        .zip(complete_pattern)
+        .map(|(value, pattern)| {
+            match (value, pattern) {
+                // Required for Postgres compatibility:
+                // SELECT 'foobarbequebaz' ~ ''); = true
+                (Some(_), Some(pattern)) if pattern == *"" => {
+                    result.append(true);
+                }
+                (Some(value), Some(pattern)) => {
+                    let existing_pattern = patterns.get(&pattern);
+                    let re = match existing_pattern {
+                        Some(re) => re.clone(),
+                        None => {
+                            let re = Regex::new(pattern.as_str()).map_err(|e| {
+                                ArrowError::ComputeError(format!(
+                                    "Regular expression did not compile: {:?}",
+                                    e
+                                ))
+                            })?;
+                            patterns.insert(pattern, re.clone());
+                            re
+                        }
+                    };
+                    result.append(re.is_match(value));
+                }
+                _ => result.append(false),
+            }
+            Ok(())
+        })
+        .collect::<Result<Vec<()>>>()?;
+
+    let data = ArrayData::new(
+        DataType::Boolean,
+        array.len(),
+        None,
+        null_bit_buffer,
+        0,
+        vec![result.finish()],
+        vec![],
+    );
+    Ok(BooleanArray::from(data))
+}
+
+/// Perform SQL `array ~ regex_array` operation on [`StringArray`] /
+/// [`LargeStringArray`] and a scalar.
+pub fn regexp_matches_utf8_scalar<OffsetSize: StringOffsetSizeTrait>(
+    array: &GenericStringArray<OffsetSize>,
+    regex: &str,
+    flag: Option<&str>,
+) -> Result<BooleanArray> {
+    let null_bit_buffer = array.data().null_buffer().cloned();
+    let mut result = BooleanBufferBuilder::new(array.len());
+
+    let pattern = match flag {
+        Some(flag) => format!("(?{}){}", flag, regex),
+        None => regex.to_string(),
+    };
+    if pattern == *"" {
+        for _i in 0..array.len() {
+            result.append(true);
+        }
+    } else {
+        let re = Regex::new(pattern.as_str()).map_err(|e| {
+            ArrowError::ComputeError(format!(
+                "Regular expression did not compile: {:?}",
+                e
+            ))
+        })?;
+        for i in 0..array.len() {
+            let value = array.value(i);
+            result.append(re.is_match(value));
+        }
+    }
+
+    let data = ArrayData::new(
+        DataType::Boolean,
+        array.len(),
+        None,
+        null_bit_buffer,
+        0,
+        vec![result.finish()],
+        vec![],
+    );
+    Ok(BooleanArray::from(data))
+}
+
+/// Perform SQL `array !~ regex_array` operation on [`StringArray`] / [`LargeStringArray`].
+pub fn regexp_not_matches_utf8<OffsetSize: StringOffsetSizeTrait>(

Review comment:
       I wonder if we need the negative variants (e.g. do we need `regexp_not_matches_utf8`) kernel? Perhaps we could simply use `not(regexp_matches_utf8(..))` 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org