You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/14 12:10:28 UTC

[GitHub] [arrow] maartenbreddels opened a new pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

maartenbreddels opened a new pull request #8459:
URL: https://github.com/apache/arrow/pull/8459


   The second commit adds re2 to the linked libraries. @xhochy how should this be done, should I open a separate issue for this?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#discussion_r506729025



##########
File path: cpp/CMakeLists.txt
##########
@@ -735,6 +735,7 @@ endif()
 
 if(ARROW_WITH_UTF8PROC)
   list(APPEND ARROW_LINK_LIBS utf8proc::utf8proc)
+  list(APPEND ARROW_LINK_LIBS re2)

Review comment:
       Rather than opting out of dependencies, would it make more sense to have an `ARROW_COMPUTE_STRINGS` (feel free to pick a better name) feature flag that, if enabled, requires utf8proc and re2? Unlike the compression libraries that have the `ARROW_WITH_X` flags, these dependencies are tied to specific functionality that can't really exist without them, right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on a change in pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on a change in pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#discussion_r504719502



##########
File path: cpp/CMakeLists.txt
##########
@@ -735,6 +735,7 @@ endif()
 
 if(ARROW_WITH_UTF8PROC)
   list(APPEND ARROW_LINK_LIBS utf8proc::utf8proc)
+  list(APPEND ARROW_LINK_LIBS re2)

Review comment:
       So you say re2 should be a hard dependency now? I'm ok with that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#discussion_r506691218



##########
File path: cpp/CMakeLists.txt
##########
@@ -735,6 +735,7 @@ endif()
 
 if(ARROW_WITH_UTF8PROC)
   list(APPEND ARROW_LINK_LIBS utf8proc::utf8proc)
+  list(APPEND ARROW_LINK_LIBS re2)

Review comment:
       We might need an `ARROW_WITH_RE2` CMake option in keeping with our minimal build configuration mantra




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou edited a comment on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
pitrou edited a comment on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-823514795


   Naming nit @jorisvandenbossche @jonkeane @ianmcook  .
   Do you think "extract_regex" is the right term for this function? Something else?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-823545997


   > When having an optional group like the above, I would maybe expect a null instead of an empty string in the first child?
   
   Agreed, but unfortunately re2 doesn't give us that information...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-808084039


   Rebased, but there are two issues left:
    * null values comparison fail, no idea what the issue is there, quite difficult for me to debug
    * the compute kernel does not get called on empty input, so i skip that test for now. But I think kernels should have the option to opt-in at least, so we can return an empty struct (all its arrays of length 0), but with field names.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-824089365


   Thanks for finishing this Antoine!
   
   > This was solved by defining a `Resolver` function so that the FunctionOptions instance can participate in output type resolution (instead of specifying an empty struct type when declaring the kernel).
   
   Good to know, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] xhochy commented on a change in pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
xhochy commented on a change in pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#discussion_r504670998



##########
File path: cpp/CMakeLists.txt
##########
@@ -735,6 +735,7 @@ endif()
 
 if(ARROW_WITH_UTF8PROC)
   list(APPEND ARROW_LINK_LIBS utf8proc::utf8proc)
+  list(APPEND ARROW_LINK_LIBS re2)

Review comment:
       That's the correct approach but maybe outside of the `ARROW_WITH_UTF8PROC` scope as we would always link `re2` independently of the `utf8proc`-based kernels?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-823536092


   @maartenbreddels For the record:
   
   > null values comparison fail, no idea what the issue is there, quite difficult for me to debug
   
   I am not sure what was causing this, but my small rewrite of the kernel fixed it.
   
   > I think kernels should have the option to opt-in at least, so we can return an empty struct (all its arrays of length 0), but with field names.
   
   This was solved by defining a `Resolver` function so that the FunctionOptions instance can participate in output type resolution (instead of specifying an empty struct type when declaring the kernel).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-708378089


   https://issues.apache.org/jira/browse/ARROW-10195


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on a change in pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on a change in pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#discussion_r517867055



##########
File path: cpp/CMakeLists.txt
##########
@@ -735,6 +735,7 @@ endif()
 
 if(ARROW_WITH_UTF8PROC)
   list(APPEND ARROW_LINK_LIBS utf8proc::utf8proc)
+  list(APPEND ARROW_LINK_LIBS re2)

Review comment:
       Not all of them need the libraries, all the ASCII kernels run without. But for simplicity, I don't think it's a bad idea. 
   
   Could it be an option that enabling ARROW_COMPUTE_STRINGS by default enables ARROW_WITH_RE2 and ARROW_WITH_UTF8PROC, but still allow you to override ARROW_WITH_RE2/ARROW_WITH_UTF8PROC. This would allow someone to build with just the ASCII kernels, or just ASCII and ARROW_WITH_UTF8PROC based kernels.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] xhochy commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
xhochy commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-732961247


   @maartenbreddels Can you rebase this? That would be helpful for me setting up `re2` in the toolchain.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-823338133


   Ok, I'm looking at this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-823545315


   One thing I noted while quickly trying it out:
   
   ```
   In [6]: pc.extract_regex(pa.array(['a1', 'b2', 'c3']), regex=r'(?P<letter>[ab])?(?P<digit>\d)')
   Out[6]: 
   <pyarrow.lib.StructArray object at 0x7f8c1bdf9760>
   -- is_valid: all not null
   -- child 0 type: string
     [
       "a",
       "b",
       ""
     ]
   -- child 1 type: string
     [
       "1",
       "2",
       "3"
     ]
   ```
   
   When having an optional group like the above, I would maybe expect a null instead of an empty string in the first child?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-820611035


   @maartenbreddels Will you need help here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-820629220


   Yes, that would be appriciated, the null values I find difficult to debug, and the second issue I'm not sure that is good solution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-733739638


   There are two issues with this PR:
   ## empty slices
   In #8728 https://github.com/apache/arrow/pull/8728/files#diff-7771ecc138c4ecc2dc8498affe04f5b7f182c4b77b18a512c3dd07a82d45aa3dR116 was added, which adds an empty slice to the test, which breaks my test (which is good). However, my kernel will not be called, meaning i cannot set the output's type (which depends on the input regex). Shall I open a JIRA for that?
   
   The error is:
   ```
   # Array types differed: struct<letter: string, digit: string> vs struct<>
   Null counts differ. Expected -1 but was 0
   Expected:
     -- is_valid: all not null
     -- child 0 type: string
       []
     -- child 1 type: string
       []
   Actual:
     -- is_valid: all not null
   ```
   (showing the struct mismatch)
   
   ##  ChunkedArray fails
   
   Commenting out the empty slice test, the next failure is with
   https://github.com/apache/arrow/blob/db20c7a611adac7be5cdd9350792852345f5b6b4/cpp/src/arrow/compute/kernels/test_util.cc#L133
   
   ```
   Failed
   Got: 
     [
       -- is_valid: all not null
       -- child 0 type: string
         [
           "a"
         ]
       -- child 1 type: string
         [
           "1"
         ],
       -- is_valid:
             [
           true,
           false,
           false
         ]
       -- child 0 type: string
         [
           "b",
           "",
           ""
         ]
       -- child 1 type: string
         [
           "2",
           "",
           ""
         ]
     ]
   Expected: 
     [
       -- is_valid: all not null
       -- child 0 type: string
         [
           "a"
         ]
       -- child 1 type: string
         [
           "1"
         ],
       -- is_valid:
             [
           true,
           false,
           false
         ]
       -- child 0 type: string
         [
           "b",
           "",
           ""
         ]
       -- child 1 type: string
         [
           "2",
           "",
           ""
         ]
     ]
   ```
   I fail to see a difference in this output (maybe a null value in a child?), and this is difficult to debug in gdb. Maybe an opportunity to improve the printing of differences (since I fail to see any).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-708561781


   should re2 be available on all platform? appveyor for instance gives: `'re2/re2.h': No such file or directory`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #8459:
URL: https://github.com/apache/arrow/pull/8459


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-824096987


   @maartenbreddels You're welcome :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-807884456


   This needs rebase now that #8468 has merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on a change in pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on a change in pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#discussion_r506696199



##########
File path: cpp/CMakeLists.txt
##########
@@ -735,6 +735,7 @@ endif()
 
 if(ARROW_WITH_UTF8PROC)
   list(APPEND ARROW_LINK_LIBS utf8proc::utf8proc)
+  list(APPEND ARROW_LINK_LIBS re2)

Review comment:
       I guess that also means it is better we have kernels that refer to RE2 being used as I did in #8468 (replace_substring and replace_substring_re2) instead of having a kernel that will bail out at runtime because re2 was not enabled.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-806719137


   @maartenbreddels what's the status of this PR? Ready for review, or are there still open issues? (you listed some above at https://github.com/apache/arrow/pull/8459#issuecomment-733739638)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-823514795


   Naming nit @jorisvandenbossche @jonkeane  .
   Do you think "extract_regex" is the right term for this function? Something else?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maartenbreddels commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
maartenbreddels commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-733013800


   I cleaned it up a bit, but it might fail running the test, hope this is enough for you for now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#issuecomment-823544553


   >  Do you think "extract_regex" is the right term for this function? Something else?
   
   In pandas, this is called "extract" (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html), so from that point of view this is a good name. 
   In `re`, I think one would use `re.search`, but that's basically both used for "match" as for "extract" purposes, so I think "extract" is a better name for the kernel in this PR. 
   
   The `_regex` might be a bit superfluous, as there is no variant without regex (but on the other hand it explicitly signals it is using regular expressions).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8459: ARROW-10195: [C++] Add string struct extract kernel using re2

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8459:
URL: https://github.com/apache/arrow/pull/8459#discussion_r616977780



##########
File path: cpp/src/arrow/compute/api_scalar.h
##########
@@ -81,6 +83,13 @@ struct ARROW_EXPORT ReplaceSubstringOptions : public FunctionOptions {
   int64_t max_replacements;
 };
 
+struct ARROW_EXPORT ExtractRegexOptions : public FunctionOptions {
+  explicit ExtractRegexOptions(std::string regex) : regex(std::move(regex)) {}
+
+  /// Regular expression with named capture fields
+  std::string regex;

Review comment:
       
   
   Naming nit: I would maybe call this "pattern" instead of "regex"
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org