Posted to users@nifi.apache.org by John Burns <jz...@gmail.com> on 2016/02/21 21:04:09 UTC

ExtractText Processor

Hi,

I'm using the ExtractText processor to monitor a website for specific content
terms and log matches to a database. However, the documentation for
ExtractText says: "...If the Regular Expression matches more than once, only
the first match will be used".

Do I understand this correctly as meaning that only the first regex match
of a given term will be captured (as opposed to how grep works, for
example)? I want to capture all occurrences of the match, not just the first.

Any help would be appreciated.

Many thanks

John

Re: ExtractText Processor

Posted by Conrad Crampton <co...@SecData.com>.

Hi,
I don’t think you can do what you want using the ExtractText processor.
The relevant section of the code is

if (matcher.find())

at line 320 of ExtractText.java (v0.4.1). (I would have included more of it for context, but it got blocked by email filtering.)

Because matcher.find() is called only once, in an if statement, only the first match is captured. To get each match of the repeated group, the call would have to be in a loop, while (matcher.find()) ..., with each match retrieved by a call to matcher.group().

Unless someone else can suggest something different, I would say you would have to write your own custom processor for this, or extend the ExtractText processor with another property for repeating groups so that, when it is set, a different part of the code runs and uses while (matcher.find()).
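
The difference between the two call patterns can be sketched in plain java.util.regex, outside NiFi. This is only an illustration, not the ExtractText source: the class name, method names, and sample input are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllMatches {

    // Mirrors a single "if (matcher.find())" call: at most one match is returned.
    static List<String> firstMatchOnly(Pattern pattern, String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        if (matcher.find()) {
            results.add(matcher.group());
        }
        return results;
    }

    // Loops until find() returns false, so every non-overlapping match is collected.
    static List<String> allMatches(Pattern pattern, String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            results.add(matcher.group());
        }
        return results;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\d+");
        String input = "order 12 shipped, order 345 pending";
        System.out.println(firstMatchOnly(pattern, input)); // [12]
        System.out.println(allMatches(pattern, input));     // [12, 345]
    }
}
```

Each call to find() resumes scanning where the previous match ended, which is why the loop version sees the later matches and the single call does not.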

HTH,
Conrad


From: John Burns <jz...@gmail.com>
Reply-To: "users@nifi.apache.org" <us...@nifi.apache.org>
Date: Thursday, 25 February 2016 at 09:44
To: "users@nifi.apache.org" <us...@nifi.apache.org>
Subject: Re: ExtractText Processor

Hi,

Thank you for the reply. I am trying to solve something I thought would be fairly simple, but I am not having much success:

Consider the string "my friend and I went for a long walk. It was raining and it was very cold". Tested against the single Java regex (.{9}and.{9})+, it yields two matches: "y friend and I went f" and "raining and it was v".

In NiFi I wish to do something similar, i.e., capture all the matching strings for a given regex (similar to grep). When I run the above regex in NiFi I see only the first match, not the second.

Could you advise how I can access all matches for the regex? The use case here is to monitor websites for a specific word and extract (say) 10 characters either side of the matching word, for all matches on the site.

Thanks again

John



Re: ExtractText Processor

Posted by John Burns <jz...@gmail.com>.

Hi,

Thank you for the reply. I am trying to solve something I thought would be
fairly simple, but I am not having much success:

Consider the string "my friend and I went for a long walk. It was raining
and it was very cold". Tested against the single Java regex
(.{9}and.{9})+, it yields two matches: "y friend and I went f" and "raining
and it was v".

In NiFi I wish to do something similar, i.e., capture all the matching
strings for a given regex (similar to grep). When I run the above regex in
NiFi I see only the first match, not the second.

Could you advise how I can access all matches for the regex? The use case
here is to monitor websites for a specific word and extract (say) 10
characters either side of the matching word, for all matches on the site.
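
For reference, the use case described above can be sketched in standalone Java, outside NiFi. The class and method names here are hypothetical, and the pattern uses {0,width} rather than a fixed {9} as an assumption, so that a keyword close to the start or end of the text still matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordContext {

    // Returns each occurrence of `word` with up to `width` characters of context on either side.
    static List<String> contexts(String text, String word, int width) {
        // Pattern.quote treats the word literally, so regex metacharacters in it are harmless.
        Pattern pattern = Pattern.compile(
                ".{0," + width + "}" + Pattern.quote(word) + ".{0," + width + "}");
        List<String> results = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) { // keep going until no further match is found
            results.add(matcher.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "my friend and I went for a long walk. "
                + "It was raining and it was very cold";
        for (String hit : contexts(text, "and", 9)) {
            System.out.println("[" + hit + "]");
        }
    }
}
```

The brackets in the printed output make leading and trailing spaces visible. Matches found this way do not overlap, so two keywords closer together than the context width would share one result.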

Thanks again

John


On Mon, Feb 22, 2016 at 7:05 AM, Conrad Crampton <
conrad.crampton@secdata.com> wrote:

> Hi John,
> If you use a property for your regexp, called matches for example, that has
> multiple capture groups in it, e.g.
> matches (?:^(.+) (\d+)$)
> If this matches the incoming flow file, then you will end up after
> processing with 3 attributes:
> matches
> matches.1
> matches.2
>
> The matches and matches.1 attributes have the same value (that of the first
> capture group). If you set ‘Include Capture Group 0’ to true, you get an
> additional attribute, matches.0, that is the whole match (as with the Java
> regex classes).
>
> HTH,
> Conrad
>

Re: ExtractText Processor

Posted by Conrad Crampton <co...@SecData.com>.

Hi John,
If you use a property for your regexp, called matches for example, that has multiple capture groups in it, e.g.
matches (?:^(.+) (\d+)$)
If this matches the incoming flow file, then you will end up after processing with 3 attributes:
matches
matches.1
matches.2

The matches and matches.1 attributes have the same value (that of the first capture group). If you set ‘Include Capture Group 0’ to true, you get an additional attribute, matches.0, that is the whole match (as with the Java regex classes).
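
The naming scheme described above can be mimicked in plain Java to see which values end up under which names. This is a hypothetical helper written only to illustrate the description, not NiFi's actual code; the class name, method name, and the sample content "widget 42" are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupAttributes {

    // Hypothetical helper: mimics the attribute naming described above, not NiFi's implementation.
    static Map<String, String> toAttributes(String name, String regex, String content,
                                            boolean includeGroupZero) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher matcher = Pattern.compile(regex).matcher(content);
        if (matcher.find()) { // only the first match is used
            for (int i = includeGroupZero ? 0 : 1; i <= matcher.groupCount(); i++) {
                attributes.put(name + "." + i, matcher.group(i));
                if (i == 1) {
                    attributes.put(name, matcher.group(i)); // bare name repeats group 1
                }
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        Map<String, String> attrs =
                toAttributes("matches", "(?:^(.+) (\\d+)$)", "widget 42", false);
        // Prints matches.1 and matches as "widget", and matches.2 as "42".
        attrs.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Passing true for includeGroupZero adds a matches.0 entry holding the whole match, "widget 42" in this sample.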

HTH,
Conrad
