Posted to users@nifi.apache.org by Sumanth Chinthagunta <xm...@gmail.com> on 2015/10/03 02:21:56 UTC

ExtractText regex help

I am building a flow:
GetFile -> SplitText -> ExtractText -> LogAttribute
The input is a standard Apache log file. My ultimate goal is to create JSON for each line of the log.

I used to have code like this; now I am trying to build a regex for ExtractText to extract ip, time, method, etc. as attributes.
Somehow I am not able to get it working. Any help will be greatly appreciated.


// Sample line from the Apache access log
def line = '127.0.0.1 - - [07/Mar/2012:23:21:47 +0100] "GET / HTTP/1.0" 200 454 "-" "ApacheBench/2.3"'

// Groups: 1 = ip, 2 = time, 3 = method, 4 = path, 5 = result, 6 = size
def regex = ~/^([^ ]*).+\[(.+)\] "(\w+) ([^ ]+) .*" (\w+) ([^ ]*)/

def matcher = regex.matcher(line)
if (matcher.find()) {
    def data = [ip: matcher.group(1), time: matcher.group(2), method: matcher.group(3), path: matcher.group(4), result: matcher.group(5), size: matcher.group(6)]
} else {
    println "no match: " + line
}


https://github.com/xmlking/vertx-stalgia/blob/master/grails-app/conf/BootStrap.groovy

Thanks
Sumo

Re: ExtractText regex help

Posted by Sumanth Chinthagunta <xm...@gmail.com>.
Hi Joe,
Thanks, that really helped. It was not clear to me from ExtractText's documentation
where I should set the regex and how I get the matching results back.
Now I got it working.
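
For anyone picking up from here toward the stated goal of one JSON object per log line, here is a minimal sketch in plain Java. It is only an illustration of the last step, not a NiFi processor or anything from the flow discussed in this thread. The class name LogLineToJson and its helpers are hypothetical; the field names (ip, time, method, path, result, size) come from the Groovy map in the original post, and every value is emitted as a JSON string.

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of NiFi.
public class LogLineToJson {

    // Same regex as in the thread, written as a Java string literal.
    static final Pattern LOG_LINE = Pattern.compile(
            "^([^ ]*).+\\[(.+)\\] \"(\\w+) ([^ ]+) .*\" (\\w+) ([^ ]*)");

    // Field names from the original post, in capture-group order (groups 1..6).
    static final String[] FIELDS = {"ip", "time", "method", "path", "result", "size"};

    /** Escapes backslashes, double quotes and control characters for use in a JSON string. */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Returns a JSON object for the line, or null when the regex does not match. */
    static String toJson(String line) {
        Matcher matcher = LOG_LINE.matcher(line);
        if (!matcher.find()) {
            return null;
        }
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (int i = 0; i < FIELDS.length; i++) {
            json.add("\"" + FIELDS[i] + "\":\"" + escape(matcher.group(i + 1)) + "\"");
        }
        return json.toString();
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [07/Mar/2012:23:21:47 +0100] "
                + "\"GET / HTTP/1.0\" 200 454 \"-\" \"ApacheBench/2.3\"";
        System.out.println(toJson(line));
    }
}
```

Returning null for a non-matching line mirrors the "no match" branch of the Groovy snippet; a caller could log or route such lines separately.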

On Fri, Oct 2, 2015 at 5:52 PM, Joe Percivall <jo...@yahoo.com>
wrote:




-- 
Thanks,
Sumanth

Re: ExtractText regex help

Posted by Joe Percivall <jo...@yahoo.com>.
Hey Sumo,

To explain how the ExtractText processor works in this case, I created a unit test based on your regex and example string:



    final TestRunner testRunner = TestRunners.newTestRunner(new ExtractText());
    testRunner.setProperty("regex.result", "^([^ ]*).+\\[(.+)\\] \"(\\w+) ([^ ]+) .*\" (\\w+) ([^ ]*)");
    
    testRunner.enqueue("127.0.0.1 - - [07/Mar/2012:23:21:47 +0100] \"GET / HTTP/1.0\" 200 454 \"-\" \"ApacheBench/2.3\"".getBytes("UTF-8"));
    testRunner.run();
    
    testRunner.assertAllFlowFilesTransferred(ExtractText.REL_MATCH, 1);
    final MockFlowFile out = testRunner.getFlowFilesForRelationship(ExtractText.REL_MATCH).get(0);
    out.assertAttributeEquals("regex.result.0", "127.0.0.1 - - [07/Mar/2012:23:21:47 +0100] \"GET / HTTP/1.0\" 200 454");
    out.assertAttributeEquals("regex.result.1", "127.0.0.1");
    out.assertAttributeEquals("regex.result.2", "07/Mar/2012:23:21:47 +0100");
    out.assertAttributeEquals("regex.result.3", "GET");
    out.assertAttributeEquals("regex.result.4", "/");
    out.assertAttributeEquals("regex.result.5", "200");
    out.assertAttributeEquals("regex.result.6", "454");


You can see that each capture group of your regex ends up in its own flowfile attribute, named after the property ("regex.result") plus the group index. To get the IP address, reference the attribute "regex.result.1".
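
The group-to-attribute naming shown by the test can also be reproduced without NiFi on the classpath. The following is a minimal sketch using only java.util.regex; it is not ExtractText's actual implementation, and the class CaptureGroupAttributes and its extract method are hypothetical names made up for this illustration. It builds one map entry per capture group, keyed by a base name plus the group index, which is the pattern the assertions above rely on.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of NiFi.
public class CaptureGroupAttributes {

    // Same regex as in the unit test above.
    static final Pattern LOG_LINE = Pattern.compile(
            "^([^ ]*).+\\[(.+)\\] \"(\\w+) ([^ ]+) .*\" (\\w+) ([^ ]*)");

    /**
     * Returns one entry per capture group, keyed "<baseName>.<groupIndex>"
     * (group 0 is the whole match), or an empty map when nothing matches.
     */
    static Map<String, String> extract(String baseName, String line) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher matcher = LOG_LINE.matcher(line);
        if (matcher.find()) {
            for (int i = 0; i <= matcher.groupCount(); i++) {
                attributes.put(baseName + "." + i, matcher.group(i));
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - - [07/Mar/2012:23:21:47 +0100] "
                + "\"GET / HTTP/1.0\" 200 454 \"-\" \"ApacheBench/2.3\"";
        // Prints one "name = value" line per capture group.
        extract("regex.result", line)
                .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Running it against a few real log lines is a quick way to check a regex before pasting it into the processor property.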

Hope this helps,
Joe 
- - - - - - 
Joseph Percivall
linkedin.com/in/Percivall
e: joepercivall@yahoo.com



