You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/19 13:26:50 UTC

[GitHub] [arrow] lidavidm opened a new pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

lidavidm opened a new pull request #10356:
URL: https://github.com/apache/arrow/pull/10356


   Implements a simple SQL LIKE pattern match kernel by translating it to a regex (or substring) match as appropriate.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#issuecomment-845138005


   I filed ARROW-12835. I think this PR should be otherwise good-to-go then, unless we need to also add R bindings.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#issuecomment-844548176


   What to do about the options class might be awkward. You could subclass `MatchSubstringOptions`, but I think ultimately we might incorporate an option for case insensitivity into `MatchSubstringOptions`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#discussion_r635598720



##########
File path: docs/source/cpp/compute.rst
##########
@@ -529,28 +529,36 @@ Containment tests
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
 | Function name             | Arity      | Input types                        | Output type   | Options class                          |
 +===========================+============+====================================+===============+========================================+
-| match_substring           | Unary      | String-like                        | Boolean (1)   | :struct:`MatchSubstringOptions`        |
+| match_like                | Unary      | String-like                        | Boolean (1)   | :struct:`MatchSubstringOptions`        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| match_substring_regex     | Unary      | String-like                        | Boolean (2)   | :struct:`MatchSubstringOptions`        |
+| match_substring           | Unary      | String-like                        | Boolean (2)   | :struct:`MatchSubstringOptions`        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| index_in                  | Unary      | Boolean, Null, Numeric, Temporal,  | Int32 (3)     | :struct:`SetLookupOptions`             |
+| match_substring_regex     | Unary      | String-like                        | Boolean (3)   | :struct:`MatchSubstringOptions`        |
++---------------------------+------------+------------------------------------+---------------+----------------------------------------+
+| index_in                  | Unary      | Boolean, Null, Numeric, Temporal,  | Int32 (4)     | :struct:`SetLookupOptions`             |
 |                           |            | Binary- and String-like            |               |                                        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| is_in                     | Unary      | Boolean, Null, Numeric, Temporal,  | Boolean (4)   | :struct:`SetLookupOptions`             |
+| is_in                     | Unary      | Boolean, Null, Numeric, Temporal,  | Boolean (5)   | :struct:`SetLookupOptions`             |
 |                           |            | Binary- and String-like            |               |                                        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
 
-* \(1) Output is true iff :member:`MatchSubstringOptions::pattern`
-  is a substring of the corresponding input element.
+* \(1) Output is true iff the SQL-style LIKE pattern
+  :member:`MatchSubstringOptions::pattern` fully matches the
+  corresponding input element. That is, ``%`` will match any number of
+  characters, ``_`` will match exactly one character, and any other
+  character matches itself.

Review comment:
       Done (I also updated the docstrings in C++/Python). I noticed Impala has a configurable escape character but opted not to implement that (that said it should be reasonably simple if we want to support that).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#discussion_r635582443



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -494,6 +494,93 @@ const FunctionDoc match_substring_regex_doc(
      "position.\n"
      "Null inputs emit null.  The pattern must be given in MatchSubstringOptions."),
     {"strings"}, "MatchSubstringOptions");
+
+// SQL LIKE match
+
+/// Convert a SQL-style LIKE pattern (using '%' and '_') into a regex pattern
+std::string MakeLikeRegex(const MatchSubstringOptions& options) {
+  // Allow . to match \n
+  std::string like_pattern = "(?s:^";
+  like_pattern.reserve(options.pattern.size() + 7);
+  bool escaped = false;
+  for (const char c : options.pattern) {
+    if (!escaped && c == '%') {
+      like_pattern.append(".*");
+    } else if (!escaped && c == '_') {
+      like_pattern.append(".");
+    } else if (!escaped && c == '\\') {
+      escaped = true;
+    } else {
+      switch (c) {
+        case '.':
+        case '?':
+        case '+':
+        case '*':
+        case '^':
+        case '$':
+        case '\\':
+        case '[':
+        case '{':
+        case '(':
+        case ')':
+        case '|': {

Review comment:
       regex101.com lets you play around with RE2 (listed as 'Golang'). It seems escaping isn't necessary but doesn't hurt. Impala's authors may have wanted to be extra-safe.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#issuecomment-844110771


   https://issues.apache.org/jira/browse/ARROW-12715


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#issuecomment-844540228


   We could easily add a flag to toggle insensitivity in RE2 as well. (Would be better than allocating a new string and potentially dealing with Unicode.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#issuecomment-844539780


   P.S. case insensitivity can be achieved by using one of the `_upper` or `_lower` kernels, which I think is good enough


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#discussion_r635594889



##########
File path: docs/source/cpp/compute.rst
##########
@@ -529,28 +529,36 @@ Containment tests
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
 | Function name             | Arity      | Input types                        | Output type   | Options class                          |
 +===========================+============+====================================+===============+========================================+
-| match_substring           | Unary      | String-like                        | Boolean (1)   | :struct:`MatchSubstringOptions`        |
+| match_like                | Unary      | String-like                        | Boolean (1)   | :struct:`MatchSubstringOptions`        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| match_substring_regex     | Unary      | String-like                        | Boolean (2)   | :struct:`MatchSubstringOptions`        |
+| match_substring           | Unary      | String-like                        | Boolean (2)   | :struct:`MatchSubstringOptions`        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| index_in                  | Unary      | Boolean, Null, Numeric, Temporal,  | Int32 (3)     | :struct:`SetLookupOptions`             |
+| match_substring_regex     | Unary      | String-like                        | Boolean (3)   | :struct:`MatchSubstringOptions`        |
++---------------------------+------------+------------------------------------+---------------+----------------------------------------+
+| index_in                  | Unary      | Boolean, Null, Numeric, Temporal,  | Int32 (4)     | :struct:`SetLookupOptions`             |
 |                           |            | Binary- and String-like            |               |                                        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| is_in                     | Unary      | Boolean, Null, Numeric, Temporal,  | Boolean (4)   | :struct:`SetLookupOptions`             |
+| is_in                     | Unary      | Boolean, Null, Numeric, Temporal,  | Boolean (5)   | :struct:`SetLookupOptions`             |
 |                           |            | Binary- and String-like            |               |                                        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
 
-* \(1) Output is true iff :member:`MatchSubstringOptions::pattern`
-  is a substring of the corresponding input element.
+* \(1) Output is true iff the SQL-style LIKE pattern
+  :member:`MatchSubstringOptions::pattern` fully matches the
+  corresponding input element. That is, ``%`` will match any number of
+  characters, ``_`` will match exactly one character, and any other
+  character matches itself.

Review comment:
       Could you document here and in the PyArrow docs that a backslash can be used to escape a `%` or `_`? Maybe something like this:
   >To match a literal percent sign or underscore, precede the character with a backslash.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#discussion_r635613255



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -494,6 +494,95 @@ const FunctionDoc match_substring_regex_doc(
      "position.\n"
      "Null inputs emit null.  The pattern must be given in MatchSubstringOptions."),
     {"strings"}, "MatchSubstringOptions");
+
+// SQL LIKE match
+
+/// Convert a SQL-style LIKE pattern (using '%' and '_') into a regex pattern
+std::string MakeLikeRegex(const MatchSubstringOptions& options) {
+  // Allow . to match \n
+  std::string like_pattern = "(?s:^";
+  like_pattern.reserve(options.pattern.size() + 7);
+  bool escaped = false;
+  for (const char c : options.pattern) {
+    if (!escaped && c == '%') {
+      like_pattern.append(".*");
+    } else if (!escaped && c == '_') {
+      like_pattern.append(".");
+    } else if (!escaped && c == '\\') {
+      escaped = true;
+    } else {
+      switch (c) {
+        case '.':
+        case '?':
+        case '+':
+        case '*':
+        case '^':
+        case '$':
+        case '\\':
+        case '[':
+        case '{':
+        case '(':
+        case ')':
+        case '|': {
+          like_pattern.push_back('\\');
+          like_pattern.push_back(c);
+          escaped = false;
+          break;
+        }
+        default: {
+          like_pattern.push_back(c);
+          escaped = false;
+          break;
+        }
+      }
+    }
+  }
+  like_pattern.append("$)");
+  return like_pattern;
+}
+
+// A LIKE pattern matching this regex can be translated into a substring search.
+static RE2 kLikePatternIsSubstringMatch("%+([^%_])*%+");

Review comment:
       D'oh, good catch. I've fixed this and adjusted the tests to catch this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#discussion_r635580566



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -494,6 +494,93 @@ const FunctionDoc match_substring_regex_doc(
      "position.\n"
      "Null inputs emit null.  The pattern must be given in MatchSubstringOptions."),
     {"strings"}, "MatchSubstringOptions");
+
+// SQL LIKE match
+
+/// Convert a SQL-style LIKE pattern (using '%' and '_') into a regex pattern
+std::string MakeLikeRegex(const MatchSubstringOptions& options) {
+  // Allow . to match \n
+  std::string like_pattern = "(?s:^";
+  like_pattern.reserve(options.pattern.size() + 7);
+  bool escaped = false;
+  for (const char c : options.pattern) {
+    if (!escaped && c == '%') {
+      like_pattern.append(".*");
+    } else if (!escaped && c == '_') {
+      like_pattern.append(".");
+    } else if (!escaped && c == '\\') {

Review comment:
       Can you confirm that this won't catch `\n`, `\t`, and the like?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#issuecomment-845167554


   I think it's good to go. Nothing is needed in the R bindings.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#issuecomment-845101882


   We could add it to `MatchSubstringOptions`, if we plan to support for all kernels, and leave it unimplemented for now in the non-regex case (it's pretty easy to support in any regex case). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#discussion_r635582104



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -494,6 +494,93 @@ const FunctionDoc match_substring_regex_doc(
      "position.\n"
      "Null inputs emit null.  The pattern must be given in MatchSubstringOptions."),
     {"strings"}, "MatchSubstringOptions");
+
+// SQL LIKE match
+
+/// Convert a SQL-style LIKE pattern (using '%' and '_') into a regex pattern
+std::string MakeLikeRegex(const MatchSubstringOptions& options) {
+  // Allow . to match \n
+  std::string like_pattern = "(?s:^";
+  like_pattern.reserve(options.pattern.size() + 7);
+  bool escaped = false;
+  for (const char c : options.pattern) {
+    if (!escaped && c == '%') {
+      like_pattern.append(".*");
+    } else if (!escaped && c == '_') {
+      like_pattern.append(".");
+    } else if (!escaped && c == '\\') {

Review comment:
       I added a test for it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#discussion_r635602560



##########
File path: docs/source/cpp/compute.rst
##########
@@ -529,28 +529,36 @@ Containment tests
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
 | Function name             | Arity      | Input types                        | Output type   | Options class                          |
 +===========================+============+====================================+===============+========================================+
-| match_substring           | Unary      | String-like                        | Boolean (1)   | :struct:`MatchSubstringOptions`        |
+| match_like                | Unary      | String-like                        | Boolean (1)   | :struct:`MatchSubstringOptions`        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| match_substring_regex     | Unary      | String-like                        | Boolean (2)   | :struct:`MatchSubstringOptions`        |
+| match_substring           | Unary      | String-like                        | Boolean (2)   | :struct:`MatchSubstringOptions`        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| index_in                  | Unary      | Boolean, Null, Numeric, Temporal,  | Int32 (3)     | :struct:`SetLookupOptions`             |
+| match_substring_regex     | Unary      | String-like                        | Boolean (3)   | :struct:`MatchSubstringOptions`        |
++---------------------------+------------+------------------------------------+---------------+----------------------------------------+
+| index_in                  | Unary      | Boolean, Null, Numeric, Temporal,  | Int32 (4)     | :struct:`SetLookupOptions`             |
 |                           |            | Binary- and String-like            |               |                                        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| is_in                     | Unary      | Boolean, Null, Numeric, Temporal,  | Boolean (4)   | :struct:`SetLookupOptions`             |
+| is_in                     | Unary      | Boolean, Null, Numeric, Temporal,  | Boolean (5)   | :struct:`SetLookupOptions`             |
 |                           |            | Binary- and String-like            |               |                                        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
 
-* \(1) Output is true iff :member:`MatchSubstringOptions::pattern`
-  is a substring of the corresponding input element.
+* \(1) Output is true iff the SQL-style LIKE pattern
+  :member:`MatchSubstringOptions::pattern` fully matches the
+  corresponding input element. That is, ``%`` will match any number of
+  characters, ``_`` will match exactly one character, and any other
+  character matches itself.

Review comment:
       Whoops, thanks. Now fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#discussion_r635609498



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -494,6 +494,95 @@ const FunctionDoc match_substring_regex_doc(
      "position.\n"
      "Null inputs emit null.  The pattern must be given in MatchSubstringOptions."),
     {"strings"}, "MatchSubstringOptions");
+
+// SQL LIKE match
+
+/// Convert a SQL-style LIKE pattern (using '%' and '_') into a regex pattern
+std::string MakeLikeRegex(const MatchSubstringOptions& options) {
+  // Allow . to match \n
+  std::string like_pattern = "(?s:^";
+  like_pattern.reserve(options.pattern.size() + 7);
+  bool escaped = false;
+  for (const char c : options.pattern) {
+    if (!escaped && c == '%') {
+      like_pattern.append(".*");
+    } else if (!escaped && c == '_') {
+      like_pattern.append(".");
+    } else if (!escaped && c == '\\') {
+      escaped = true;
+    } else {
+      switch (c) {
+        case '.':
+        case '?':
+        case '+':
+        case '*':
+        case '^':
+        case '$':
+        case '\\':
+        case '[':
+        case '{':
+        case '(':
+        case ')':
+        case '|': {
+          like_pattern.push_back('\\');
+          like_pattern.push_back(c);
+          escaped = false;
+          break;
+        }
+        default: {
+          like_pattern.push_back(c);
+          escaped = false;
+          break;
+        }
+      }
+    }
+  }
+  like_pattern.append("$)");
+  return like_pattern;
+}
+
+// A LIKE pattern matching this regex can be translated into a substring search.
+static RE2 kLikePatternIsSubstringMatch("%+([^%_])*%+");

Review comment:
       Shouldn't `*` be inside the parens here so `re2::RE2::FullMatch` will extract everything in between the `%` and `%`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#issuecomment-844542150


   But can `PlainSubstringMatcher` do case-insensitive matching?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#discussion_r635595821



##########
File path: docs/source/cpp/compute.rst
##########
@@ -529,28 +529,36 @@ Containment tests
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
 | Function name             | Arity      | Input types                        | Output type   | Options class                          |
 +===========================+============+====================================+===============+========================================+
-| match_substring           | Unary      | String-like                        | Boolean (1)   | :struct:`MatchSubstringOptions`        |
+| match_like                | Unary      | String-like                        | Boolean (1)   | :struct:`MatchSubstringOptions`        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| match_substring_regex     | Unary      | String-like                        | Boolean (2)   | :struct:`MatchSubstringOptions`        |
+| match_substring           | Unary      | String-like                        | Boolean (2)   | :struct:`MatchSubstringOptions`        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| index_in                  | Unary      | Boolean, Null, Numeric, Temporal,  | Int32 (3)     | :struct:`SetLookupOptions`             |
+| match_substring_regex     | Unary      | String-like                        | Boolean (3)   | :struct:`MatchSubstringOptions`        |
++---------------------------+------------+------------------------------------+---------------+----------------------------------------+
+| index_in                  | Unary      | Boolean, Null, Numeric, Temporal,  | Int32 (4)     | :struct:`SetLookupOptions`             |
 |                           |            | Binary- and String-like            |               |                                        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| is_in                     | Unary      | Boolean, Null, Numeric, Temporal,  | Boolean (4)   | :struct:`SetLookupOptions`             |
+| is_in                     | Unary      | Boolean, Null, Numeric, Temporal,  | Boolean (5)   | :struct:`SetLookupOptions`             |
 |                           |            | Binary- and String-like            |               |                                        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
 
-* \(1) Output is true iff :member:`MatchSubstringOptions::pattern`
-  is a substring of the corresponding input element.
+* \(1) Output is true iff the SQL-style LIKE pattern
+  :member:`MatchSubstringOptions::pattern` fully matches the
+  corresponding input element. That is, ``%`` will match any number of
+  characters, ``_`` will match exactly one character, and any other
+  character matches itself.

Review comment:
       IMO this is worth mentioning because some SQL engines (e.g. PostgreSQL) use backslashes to escape `%` or `_` whereas others (e.g. Oracle) don't by default.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#discussion_r635601185



##########
File path: docs/source/cpp/compute.rst
##########
@@ -529,28 +529,36 @@ Containment tests
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
 | Function name             | Arity      | Input types                        | Output type   | Options class                          |
 +===========================+============+====================================+===============+========================================+
-| match_substring           | Unary      | String-like                        | Boolean (1)   | :struct:`MatchSubstringOptions`        |
+| match_like                | Unary      | String-like                        | Boolean (1)   | :struct:`MatchSubstringOptions`        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| match_substring_regex     | Unary      | String-like                        | Boolean (2)   | :struct:`MatchSubstringOptions`        |
+| match_substring           | Unary      | String-like                        | Boolean (2)   | :struct:`MatchSubstringOptions`        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| index_in                  | Unary      | Boolean, Null, Numeric, Temporal,  | Int32 (3)     | :struct:`SetLookupOptions`             |
+| match_substring_regex     | Unary      | String-like                        | Boolean (3)   | :struct:`MatchSubstringOptions`        |
++---------------------------+------------+------------------------------------+---------------+----------------------------------------+
+| index_in                  | Unary      | Boolean, Null, Numeric, Temporal,  | Int32 (4)     | :struct:`SetLookupOptions`             |
 |                           |            | Binary- and String-like            |               |                                        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
-| is_in                     | Unary      | Boolean, Null, Numeric, Temporal,  | Boolean (4)   | :struct:`SetLookupOptions`             |
+| is_in                     | Unary      | Boolean, Null, Numeric, Temporal,  | Boolean (5)   | :struct:`SetLookupOptions`             |
 |                           |            | Binary- and String-like            |               |                                        |
 +---------------------------+------------+------------------------------------+---------------+----------------------------------------+
 
-* \(1) Output is true iff :member:`MatchSubstringOptions::pattern`
-  is a substring of the corresponding input element.
+* \(1) Output is true iff the SQL-style LIKE pattern
+  :member:`MatchSubstringOptions::pattern` fully matches the
+  corresponding input element. That is, ``%`` will match any number of
+  characters, ``_`` will match exactly one character, and any other
+  character matches itself.

Review comment:
       Thanks. I agree it seems unimportant to implement a configurable escape character.
   
   Looks like there's an errant `"To match a literal"` at the end of `FunctionDoc`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm closed pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
lidavidm closed pull request #10356:
URL: https://github.com/apache/arrow/pull/10356


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#discussion_r635579350



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -494,6 +494,93 @@ const FunctionDoc match_substring_regex_doc(
      "position.\n"
      "Null inputs emit null.  The pattern must be given in MatchSubstringOptions."),
     {"strings"}, "MatchSubstringOptions");
+
+// SQL LIKE match
+
+/// Convert a SQL-style LIKE pattern (using '%' and '_') into a regex pattern
+std::string MakeLikeRegex(const MatchSubstringOptions& options) {
+  // Allow . to match \n
+  std::string like_pattern = "(?s:^";
+  like_pattern.reserve(options.pattern.size() + 7);
+  bool escaped = false;
+  for (const char c : options.pattern) {
+    if (!escaped && c == '%') {
+      like_pattern.append(".*");
+    } else if (!escaped && c == '_') {
+      like_pattern.append(".");
+    } else if (!escaped && c == '\\') {
+      escaped = true;
+    } else {
+      switch (c) {
+        case '.':
+        case '?':
+        case '+':
+        case '*':
+        case '^':
+        case '$':
+        case '\\':
+        case '[':
+        case '{':
+        case '(':
+        case ')':
+        case '|': {

Review comment:
       I'm pretty sure we don't need to escape `]` and `}` but I noticed Impala does:
   https://github.com/apache/impala/blob/9c38568657d62b6f6d7b10aa1c721ba843374dd8/be/src/exprs/like-predicate.cc#L384-L397
   Can you think of any reason we would need to escape those?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on a change in pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on a change in pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#discussion_r635580566



##########
File path: cpp/src/arrow/compute/kernels/scalar_string.cc
##########
@@ -494,6 +494,93 @@ const FunctionDoc match_substring_regex_doc(
      "position.\n"
      "Null inputs emit null.  The pattern must be given in MatchSubstringOptions."),
     {"strings"}, "MatchSubstringOptions");
+
+// SQL LIKE match
+
+/// Convert a SQL-style LIKE pattern (using '%' and '_') into a regex pattern
+std::string MakeLikeRegex(const MatchSubstringOptions& options) {
+  // Allow . to match \n
+  std::string like_pattern = "(?s:^";
+  like_pattern.reserve(options.pattern.size() + 7);
+  bool escaped = false;
+  for (const char c : options.pattern) {
+    if (!escaped && c == '%') {
+      like_pattern.append(".*");
+    } else if (!escaped && c == '_') {
+      like_pattern.append(".");
+    } else if (!escaped && c == '\\') {

Review comment:
       Can you confirm that this will properly handle `\n`, `\t`, and the like?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#issuecomment-845185507


   Merged, thanks for the review!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#issuecomment-844542476


   No, but in that case we'd just not dispatch to that path.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ianmcook commented on pull request #10356: ARROW-12715: [C++][Python] Add SQL LIKE match kernel

Posted by GitBox <gi...@apache.org>.
ianmcook commented on pull request #10356:
URL: https://github.com/apache/arrow/pull/10356#issuecomment-845135525


   Sounds good. The case-(in)sensitive option should be optional and should default to case-sensitive. We should error in cases where case-insensitive match is specified in the options but is ignored by a kernel or by any path in a kernel. This sounds to me like a separate PR. Should I open a Jira for that?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org