You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/06 04:11:23 UTC

[GitHub] [arrow] wjones127 opened a new pull request #8844: ARROW-7205 : [C++][Gandiva] Implemente regexp_like in Gandiva

wjones127 opened a new pull request #8844:
URL: https://github.com/apache/arrow/pull/8844


   Was looking for "rlike" support in Gandiva and found we had nearly implementing it. This PR revives apache/arrow#5860.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] projjal commented on pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
projjal commented on pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#issuecomment-810671294


   Hi @wjones127. Looks like creating a base holder class and two derived class for just these two functions seems overkill to me (I think thats why I closed the original pr temporarily). The only difference in logic between the two functions is that in case of "like" we convert the sql pattern to regex pattern before storing the compiled pattern. Would it make sense if we pass a flag to the holder that says if this is sql pattern or regex patttern?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#issuecomment-739453294


   https://issues.apache.org/jira/browse/ARROW-7205


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 closed pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
wjones127 closed pull request #8844:
URL: https://github.com/apache/arrow/pull/8844


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#discussion_r606018511



##########
File path: cpp/src/gandiva/like_holder.cc
##########
@@ -62,39 +56,95 @@ const FunctionNode LikeHolder::TryOptimize(const FunctionNode& node) {
   return node;
 }
 
+const FunctionNode SQLLikeHolder::TryOptimize(const FunctionNode& node) {

Review comment:
       `like` will now always optimize into `regexp_matches` or one of the optimizations above. I found this to be the easiest way to reuse the optimization code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#issuecomment-886242695


   @projjal This is now ready for review.
   
   Since we've added `ilike` and special escape_char options for `like`, I made `SQLLikeHolder` inherit from `RegexMatchHolder` instead of sharing the same holder class. I did make sure the `TryOptimize` logic wasn't duplicated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#issuecomment-810735907


   Projjal, that's a good point. The SQL like function is basically a wrapper around `regexp_like`. I don't know why I didn't see the optimizations should be the same.
   
   I will consolidate them into a single `LikeHolder`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#discussion_r606019594



##########
File path: cpp/src/gandiva/regex_util.cc
##########
@@ -30,8 +30,14 @@ Status RegexUtil::SqlLikePatternToPcre(const std::string& sql_pattern, char esca
   for (size_t idx = 0; idx < sql_pattern.size(); ++idx) {
     auto cur = sql_pattern.at(idx);
 
+    if (idx == 0 && cur != '%') {
+      pcre_pattern += '^';
+    } else if (idx == 0 && cur == '%') {
+      continue;
+    }
+
     // Escape any char that is special for pcre regex
-    if (pcre_regex_specials_.find(cur) != pcre_regex_specials_.end()) {
+    if (pcre_regex_specials_.find(cur) != pcre_regex_specials_.end() && cur != escape_char) {

Review comment:
       This change enables using `\` as an escape character in LIKE statements. For example, `col like '\_%'` can match anything starting with an underscore.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#discussion_r606019221



##########
File path: cpp/src/gandiva/regex_util.h
##########
@@ -29,7 +29,7 @@ namespace gandiva {
 /// \brief Utility class for converting sql patterns to pcre patterns.
 class GANDIVA_EXPORT RegexUtil {
  public:
-  // Convert an sql pattern to a pcre pattern
+  // Convert an sql pattern to a pcre pattern for use with PartialMatch

Review comment:
       Previously, `like` was using `RE2::FullMatch`. However, `regexp_matches` needs to use `RE2::PartialMatch`, so I've modified this method to make partial match statements. I have added a new set of tests to validate it.
   
   I didn't see any use of this utility method outside of the `like` implementation, so I thought this change would be okay.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#issuecomment-885691165


   > I think we should merge #9700 before this one, since my refactor of `LikeHolder` would cause merge conflicts on that PR. So I'll wait for that merge and then finish this.
   
   @wjones127 #9700 has merged, are you planning to pick this back up?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#issuecomment-812350433


   I think we should merge #9700 before this one, since my refactor of `LikeHolder` would cause merge conflicts on that PR. So I'll wait for that merge and then finish this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#discussion_r606018159



##########
File path: cpp/src/gandiva/like_holder.cc
##########
@@ -62,39 +56,95 @@ const FunctionNode LikeHolder::TryOptimize(const FunctionNode& node) {
   return node;
 }
 
+const FunctionNode SQLLikeHolder::TryOptimize(const FunctionNode& node) {
+  Result<std::string> pattern_result = GetPattern(node);
+  if (!pattern_result.ok()) {
+    return node;
+  }
+  std::string sql_pattern = pattern_result.ValueOrDie();
+
+  std::string pcre_pattern;
+  auto conversion_status = RegexUtil::SqlLikePatternToPcre(sql_pattern, pcre_pattern);
+  if (!conversion_status.ok()) {
+    return node;
+  }
+
+  auto literal_type = node.children().at(1)->return_type();
+  auto pcre_node =
+    std::make_shared<LiteralNode>(literal_type, LiteralHolder(pcre_pattern), false);
+  auto new_node = FunctionNode("regexp_matches", {node.children().at(0), pcre_node},
+                               node.return_type());
+
+  return RegexpMatchesHolder::TryOptimize(new_node);
+}
+
 static bool IsArrowStringLiteral(arrow::Type::type type) {
   return type == arrow::Type::STRING || type == arrow::Type::BINARY;
 }
 
-Status LikeHolder::Make(const FunctionNode& node, std::shared_ptr<LikeHolder>* holder) {
+Status RegexpMatchesHolder::ValidateArguments(const FunctionNode& node) {
   ARROW_RETURN_IF(node.children().size() != 2,
-                  Status::Invalid("'like' function requires two parameters"));
+                  Status::Invalid("'" + node.descriptor()->name() +
+                                  "' function requires two parameters"));
 
   auto literal = dynamic_cast<LiteralNode*>(node.children().at(1).get());
   ARROW_RETURN_IF(
       literal == nullptr,
-      Status::Invalid("'like' function requires a literal as the second parameter"));
+      Status::Invalid("'" + node.descriptor()->name() +
+                      "' function requires a literal as the second parameter"));
 
   auto literal_type = literal->return_type()->id();
   ARROW_RETURN_IF(
       !IsArrowStringLiteral(literal_type),
-      Status::Invalid(
-          "'like' function requires a string literal as the second parameter"));
+      Status::Invalid("'" + node.descriptor()->name() +
+                      "' function requires a string literal as the second parameter"));
 
-  return Make(arrow::util::get<std::string>(literal->holder()), holder);
+  return Status::OK();
 }
 
-Status LikeHolder::Make(const std::string& sql_pattern,
-                        std::shared_ptr<LikeHolder>* holder) {
-  std::string pcre_pattern;
-  ARROW_RETURN_NOT_OK(RegexUtil::SqlLikePatternToPcre(sql_pattern, pcre_pattern));
+Result<std::string> RegexpMatchesHolder::GetPattern(const FunctionNode& node) {

Review comment:
       FYI This looks to be the first use of `Arrow::Result` in the Gandiva codebase, but this seems to be what the main arrow C++ codebase is using recently. Let me know if you object.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#issuecomment-885760199


   > 
   > 
   > > I think we should merge #9700 before this one, since my refactor of `LikeHolder` would cause merge conflicts on that PR. So I'll wait for that merge and then finish this.
   > 
   > @wjones127 #9700 has merged, are you planning to pick this back up?
   
   Yes, I'll work on it this weekend.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#discussion_r606018511



##########
File path: cpp/src/gandiva/like_holder.cc
##########
@@ -62,39 +56,95 @@ const FunctionNode LikeHolder::TryOptimize(const FunctionNode& node) {
   return node;
 }
 
+const FunctionNode SQLLikeHolder::TryOptimize(const FunctionNode& node) {

Review comment:
       `like` will now always optimize into `regexp_matches` or one of the optimizations above. I found this to be the easiest way to reuse the optimization code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on pull request #8844: ARROW-7205 : [C++][Gandiva] Implement regexp_like in Gandiva

Posted by GitBox <gi...@apache.org>.
wjones127 commented on pull request #8844:
URL: https://github.com/apache/arrow/pull/8844#issuecomment-753382338


   @pravindra @Praveen2112 @projjal Got some time for a review? 😁


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org