Posted to users@nifi.apache.org by Eric Secules <es...@gmail.com> on 2020/12/16 22:06:53 UTC

ExtractText Improvement

Hello everyone,

I was wondering whether ExtractText could be improved so that the entire
content of the flowfile is scanned for matches in chunks of MAX_BUFFER_SIZE
that overlap by MAX_CAPTURE_GROUP_LENGTH. That way we could do pattern
extraction over arbitrarily large files while keeping memory consumption
bounded.

Consider the use case where I am looking to extract a small pattern of
maybe 100 bytes from files that could be anywhere from 1 MB to 500 MB.
Looking at the ExtractText source code, it always allocates a byte array of
the maximum buffer size, so it probably wouldn't be appropriate to set that
parameter too high. It's essential to have the chunks overlap by the maximum
length of the capture group because a match may straddle two chunks. For the
same reason, it's not advisable to split the flowfile into chunks of
MAX_BUFFER_SIZE using existing processors.

Thanks,
Eric
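
The chunk-and-overlap scan proposed above could be sketched in Java roughly
as follows. This is not taken from the ExtractText source; the class
ChunkedRegexScan, the method findAll, and the parameters bufferSize and
overlap are hypothetical names that stand in for MAX_BUFFER_SIZE and
MAX_CAPTURE_GROUP_LENGTH. The sketch assumes a single-byte encoding, a
pattern without anchors or lookbehind, and an overlap at least as long as
the longest whole match.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of NiFi).
final class ChunkedRegexScan {

    /**
     * Scans the stream in windows of at most bufferSize bytes. The last
     * `overlap` bytes of each window are carried into the next one, so a
     * match no longer than `overlap` always fits entirely in some window.
     * Memory use stays at one buffer of bufferSize bytes plus one decoded window.
     */
    static List<String> findAll(InputStream in, Pattern pattern,
                                int bufferSize, int overlap) throws IOException {
        if (overlap < 0 || overlap >= bufferSize) {
            throw new IllegalArgumentException("need 0 <= overlap < bufferSize");
        }
        byte[] buffer = new byte[bufferSize];
        List<String> matches = new ArrayList<>();
        int filled = 0;         // number of valid bytes in buffer
        long windowStart = 0;   // absolute offset of buffer[0] in the stream
        long reportedUpTo = 0;  // absolute end offset of the last reported match
        boolean eof = false;

        while (!eof) {
            // Top up the buffer until it is full or the stream ends.
            while (filled < bufferSize) {
                int n = in.read(buffer, filled, bufferSize - filled);
                if (n == -1) {
                    eof = true;
                    break;
                }
                filled += n;
            }

            // ISO-8859-1 maps one byte to one char, so char offsets equal byte offsets.
            String window = new String(buffer, 0, filled, StandardCharsets.ISO_8859_1);
            Matcher m = pattern.matcher(window);
            while (m.find()) {
                long absStart = windowStart + m.start();
                // Skip matches that were already reported from the previous window.
                if (absStart >= reportedUpTo) {
                    matches.add(m.group());
                    reportedUpTo = windowStart + m.end();
                }
            }

            if (!eof) {
                // Carry the tail of this window over as the head of the next one.
                int keep = Math.min(overlap, filled);
                System.arraycopy(buffer, filled - keep, buffer, 0, keep);
                windowStart += filled - keep;
                filled = keep;
            }
        }
        return matches;
    }
}
```

For example, with bufferSize 16 and overlap 8, the pattern ID=\d{4}; is
still found when an occurrence straddles byte 16 of the input, because the
last 8 bytes of each window are scanned again at the start of the next one.
Variable-length patterns can still be cut short at a window boundary, so a
real implementation would need more care than this sketch takes.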

Re: ExtractText Improvement

Posted by Pierre Villard <pi...@gmail.com>.
Hi Eric,

I do think this would be interesting. Please submit a PR when you feel this
is ready for a review.

Thanks,
Pierre

On Mon, Dec 21, 2020 at 21:56, Eric Secules <es...@gmail.com> wrote:

> Would this improvement be worthwhile? I'd also like to apply it to all the
> regex search/replacement processors, RouteOnContent for instance. I have
> a working POC in my environment. I just need to clean things up, create a
> ticket, and get my contribution access approved.

Re: ExtractText Improvement

Posted by Eric Secules <es...@gmail.com>.
Would this improvement be worthwhile? I'd also like to apply it to all the
regex search/replacement processors, RouteOnContent for instance. I have
a working POC in my environment. I just need to clean things up, create a
ticket, and get my contribution access approved.
