You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@nifi.apache.org by ottobackwards <gi...@git.apache.org> on 2018/05/30 21:36:03 UTC

[GitHub] nifi pull request #2748: NIFI-4272 support multiple captures when el is pres...

GitHub user ottobackwards opened a pull request:

    https://github.com/apache/nifi/pull/2748

    NIFI-4272 support multiple captures when el is present

    From the Jira Statement:
    ```
    I am using the ReplaceText processor to take a string input (example:  
    {"name":"Smith","middle":"nifi","firstname":"John"}
    ) and change all the filed names to all uppercase.
    Using above input as an example, I expect output like
    {"NAME":"Smith","MIDDLE":"nifi","FIRSTNAME":"John"}
    I expect I should be able to do this with ReplaceText processor; however, I see some unexpected behavior:
    -------
    Test 1: (uses EL in the replacement value property)
    Search value:  \"([a-z]?)\":\"(.?)\"
    Replacement Value: \"${'$1':toUpper()}":\"$2\"
    Result:
    {"NAME":"Smith","NAME":"nifi","NAME":"John"}
    -------
    Test 2: (Does not use EL in the replacement Value property)
    Search value:  \"([a-z]?)\":\"(.?)\"
    Replacement Value: \"new$1":\"$2\"
    Result:
    {"newname":"Smith","newmiddle":"nifi","newfirstname":"John"}
    ```
    
    The issue is that the processor evaluates the expression language before executing the regex and capture replacement.  The expression replaces the capture with the first value, and that is why the user was seeing that value repeated.
    
    This pr changes the Regex evaluation of the processor to evaluate the regex first, and then run the expression on the result.
    
    Some changes where required for escaping values.
    
    I added tests for the reported issue and for escaping newlines etc that would break EL even if '' in a literal.
    
    ### For all changes:
    - [x] Is there a JIRA ticket associated with this PR? Is it referenced 
         in the commit message?
    
    - [x] Does your PR title start with NIFI-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
    
    - [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
    
    - [x] Is your initial contribution a single, squashed commit?
    
    ### For code changes:
    - [x] Have you ensured that the full suite of tests is executed via mvn -Pcontrib-check clean install at the root nifi folder?
    - [x] Have you written or updated unit tests to verify your changes?
    - [-] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)? 
    - [-] If applicable, have you updated the LICENSE file, including the main LICENSE file under nifi-assembly?
    - [-] If applicable, have you updated the NOTICE file, including the main NOTICE file found under nifi-assembly?
    - [ ] If adding new Properties, have you added .displayName in addition to .name (programmatic access) for each of the new properties?
    
    ### For documentation related changes:
    - [-] Have you ensured that format looks appropriate for the output in which it is rendered?
    
    ### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ottobackwards/nifi replace-text-el

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/2748.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2748
    
----
commit 75638bc201d7a492fa09cdb8358201398cebe586
Author: Otto Fowler <ot...@...>
Date:   2018-05-30T20:53:55Z

    NIFI-4272 support multiple captures when el is present

----


---

[GitHub] nifi pull request #2748: NIFI-4272 ReplaceText support multiple captures whe...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2748#discussion_r193423188
  
    --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/java/org/apache/nifi/processors/standard/TestReplaceText.java ---
    @@ -1074,12 +1090,11 @@ public void testRegexWithBadCaptureGroup() throws IOException {
             runner.setProperty(ReplaceText.REPLACEMENT_STRATEGY, ReplaceText.REGEX_REPLACE);
             runner.setProperty(ReplaceText.EVALUATION_MODE, ReplaceText.ENTIRE_TEXT);
     
    +        exception.expect(AssertionError.class);
    +        exception.expectMessage("java.lang.IndexOutOfBoundsException: No group 1");
    --- End diff --
    
    so the issue is this:
    ```java
     // If we find a back reference that is not valid, then we will treat it as a literal string. For example, if we have 3 capturing
        // groups and the Replacement Value has the value is "I owe $8 to him", then we want to treat the $8 as a literal "$8", rather
        // than attempting to use it as a back reference.
        private static String escapeLiteralBackReferences(final String unescaped, final int numCapturingGroups) {
            if (numCapturingGroups == 0) {
                return unescaped;
            }
    ```
    
    If there are no capture groups, we don't escape all the findings


---

[GitHub] nifi issue #2748: NIFI-4272 ReplaceText support multiple captures when el is...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on the issue:

    https://github.com/apache/nifi/pull/2748
  
    thanks for the review @mosermw 


---

[GitHub] nifi pull request #2748: NIFI-4272 ReplaceText support multiple captures whe...

Posted by mosermw <gi...@git.apache.org>.
Github user mosermw commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2748#discussion_r193230207
  
    --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/java/org/apache/nifi/processors/standard/TestReplaceText.java ---
    @@ -81,15 +81,16 @@ public void testSimple() throws IOException {
         @Test
         public void testWithEscaped$InReplacement() throws IOException {
             final TestRunner runner = getRunner();
    -        runner.setProperty(ReplaceText.SEARCH_VALUE, "(?s:^.*$)");
    +        //runner.setProperty(ReplaceText.SEARCH_VALUE, "(?s:^.*$)");
    +        runner.setProperty(ReplaceText.SEARCH_VALUE, "(?s)(^.*$)");
             runner.setProperty(ReplaceText.REPLACEMENT_VALUE, "a\\$b");
     
             runner.enqueue("a$a,b,c,d");
             runner.run();
     
             runner.assertAllFlowFilesTransferred(ReplaceText.REL_SUCCESS, 1);
             final MockFlowFile out = runner.getFlowFilesForRelationship(ReplaceText.REL_SUCCESS).get(0);
    -        out.assertContentEquals("a\\$b".getBytes("UTF-8"));
    +        out.assertContentEquals("a$b".getBytes("UTF-8"));
    --- End diff --
    
    Was this is a change in the way that ReplaceText behaves or was this unit test incorrect before?  Is there a way to obtain t he output "a\$b" from ReplaceText now?


---

[GitHub] nifi pull request #2748: NIFI-4272 ReplaceText support multiple captures whe...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2748#discussion_r193428577
  
    --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/java/org/apache/nifi/processors/standard/TestReplaceText.java ---
    @@ -1074,12 +1090,11 @@ public void testRegexWithBadCaptureGroup() throws IOException {
             runner.setProperty(ReplaceText.REPLACEMENT_STRATEGY, ReplaceText.REGEX_REPLACE);
             runner.setProperty(ReplaceText.EVALUATION_MODE, ReplaceText.ENTIRE_TEXT);
     
    +        exception.expect(AssertionError.class);
    +        exception.expectMessage("java.lang.IndexOutOfBoundsException: No group 1");
    --- End diff --
    
    OK, I made it escape even if there are no named captures, reverted the changed test.


---

[GitHub] nifi pull request #2748: NIFI-4272 ReplaceText support multiple captures whe...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/nifi/pull/2748


---

[GitHub] nifi pull request #2748: NIFI-4272 ReplaceText support multiple captures whe...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2748#discussion_r193422077
  
    --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/java/org/apache/nifi/processors/standard/TestReplaceText.java ---
    @@ -1074,12 +1090,11 @@ public void testRegexWithBadCaptureGroup() throws IOException {
             runner.setProperty(ReplaceText.REPLACEMENT_STRATEGY, ReplaceText.REGEX_REPLACE);
             runner.setProperty(ReplaceText.EVALUATION_MODE, ReplaceText.ENTIRE_TEXT);
     
    +        exception.expect(AssertionError.class);
    +        exception.expectMessage("java.lang.IndexOutOfBoundsException: No group 1");
    --- End diff --
    
    I'm looking into it, but the processor already has handing for this at runtime not at validation time, where it escapes back references that are over the actual capture count


---

[GitHub] nifi pull request #2748: NIFI-4272 ReplaceText support multiple captures whe...

Posted by mosermw <gi...@git.apache.org>.
Github user mosermw commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2748#discussion_r193223710
  
    --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ReplaceText.java ---
    @@ -79,7 +80,8 @@
     @SystemResourceConsideration(resource = SystemResource.MEMORY)
     public class ReplaceText extends AbstractProcessor {
     
    -    private static Pattern REPLACEMENT_NORMALIZATION_PATTERN = Pattern.compile("(\\$\\D)");
    +    private static Pattern QUOTED_GROUP_REF_PATTERN = Pattern.compile("('\\$\\d+')");
    +    private static Pattern LITERAL_QUOTED_PATTERN = Pattern.compile("literal\\(('.*?')\\)",Pattern.DOTALL);
    --- End diff --
    
    Expression Language allows single quotes and double quotes to be used interchangeably. Can this pattern be updated to handle both?


---

[GitHub] nifi pull request #2748: NIFI-4272 ReplaceText support multiple captures whe...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2748#discussion_r193368464
  
    --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ReplaceText.java ---
    @@ -79,7 +80,8 @@
     @SystemResourceConsideration(resource = SystemResource.MEMORY)
     public class ReplaceText extends AbstractProcessor {
     
    -    private static Pattern REPLACEMENT_NORMALIZATION_PATTERN = Pattern.compile("(\\$\\D)");
    +    private static Pattern QUOTED_GROUP_REF_PATTERN = Pattern.compile("('\\$\\d+')");
    +    private static Pattern LITERAL_QUOTED_PATTERN = Pattern.compile("literal\\(('.*?')\\)",Pattern.DOTALL);
    --- End diff --
    
    done


---

[GitHub] nifi pull request #2748: NIFI-4272 ReplaceText support multiple captures whe...

Posted by mosermw <gi...@git.apache.org>.
Github user mosermw commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2748#discussion_r193232920
  
    --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/java/org/apache/nifi/processors/standard/TestReplaceText.java ---
    @@ -1074,12 +1090,11 @@ public void testRegexWithBadCaptureGroup() throws IOException {
             runner.setProperty(ReplaceText.REPLACEMENT_STRATEGY, ReplaceText.REGEX_REPLACE);
             runner.setProperty(ReplaceText.EVALUATION_MODE, ReplaceText.ENTIRE_TEXT);
     
    +        exception.expect(AssertionError.class);
    +        exception.expectMessage("java.lang.IndexOutOfBoundsException: No group 1");
    --- End diff --
    
    At first I was really concerned about this change to IndexOutOfBoundsException, because it means the flowfile will rollback and admin yield the ReplaceText.  But if this is simply verifying behavior when the SEARCH_VALUE is incorrectly configured, then I think this is OK.  Can you confirm?  Normally we would prefer a customValidate() method catch this scenario (REPLACEMENT_VALUE references a capture group that does exist in the SEARCH_VALUE) but perhaps that isn't viable at validation time.


---

[GitHub] nifi pull request #2748: NIFI-4272 ReplaceText support multiple captures whe...

Posted by ottobackwards <gi...@git.apache.org>.
Github user ottobackwards commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/2748#discussion_r193371373
  
    --- Diff: nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/java/org/apache/nifi/processors/standard/TestReplaceText.java ---
    @@ -81,15 +81,16 @@ public void testSimple() throws IOException {
         @Test
         public void testWithEscaped$InReplacement() throws IOException {
             final TestRunner runner = getRunner();
    -        runner.setProperty(ReplaceText.SEARCH_VALUE, "(?s:^.*$)");
    +        //runner.setProperty(ReplaceText.SEARCH_VALUE, "(?s:^.*$)");
    +        runner.setProperty(ReplaceText.SEARCH_VALUE, "(?s)(^.*$)");
             runner.setProperty(ReplaceText.REPLACEMENT_VALUE, "a\\$b");
     
             runner.enqueue("a$a,b,c,d");
             runner.run();
     
             runner.assertAllFlowFilesTransferred(ReplaceText.REL_SUCCESS, 1);
             final MockFlowFile out = runner.getFlowFilesForRelationship(ReplaceText.REL_SUCCESS).get(0);
    -        out.assertContentEquals("a\\$b".getBytes("UTF-8"));
    +        out.assertContentEquals("a$b".getBytes("UTF-8"));
    --- End diff --
    
    So, I think the behavior was incorrect before.
    
    Since we do the Regex first now, and then the expression, what we end up with is
    a .replaceAll((?s)(^.*$)),"a\$b")
    passing \$ to the regex evaluates as a literal '$', so the results it correctly a$b.
    In other words, it is working as the java regex works.
    
    To get what you are looking for, further escaping is required:
    ```java
    runner.setProperty(ReplaceText.REPLACEMENT_VALUE, "a\\\\\\$b");
    ```
    
    I'll add the test
    



---

[GitHub] nifi issue #2748: NIFI-4272 ReplaceText support multiple captures when el is...

Posted by mosermw <gi...@git.apache.org>.
Github user mosermw commented on the issue:

    https://github.com/apache/nifi/pull/2748
  
    +1 this looks good to me.  I will squash and merge.  Thanks @ottobackwards!


---