Posted to users@nifi.apache.org by Adam Lamar <ad...@gmail.com> on 2017/09/25 22:46:10 UTC

Processing multiple lines per flowfile with ExtractGrok

Hi there,

I've been playing with the ExtractGrok processor and noticed I was missing
some data that I expected to be extracted. After some investigation, it
seems that ExtractGrok extracts only the first line of the flowfile
content, and ignores the rest.
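
For example (the pattern and log lines here are made up, but this is the
shape of it):

    Grok Expression: %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}

    Flowfile content:
    2017-09-25 14:00:01 ERROR disk full
    2017-09-25 14:00:02 WARN retrying write

Only the first line yields attributes (grok.timestamp, grok.level,
grok.message, if I'm reading the attribute naming right); the second line
is never matched.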

Is this expected behavior? I should be able to use SplitText to break up
the records, but it surprised me because other grok tools I've used have
been line-oriented by default (at least from the perspective of the user).

Cheers,
Adam

Re: Processing multiple lines per flowfile with ExtractGrok

Posted by Adam Lamar <ad...@gmail.com>.
Thanks Joe and Bryan. The setup is a little more involved, but I was able
to get ConvertRecord running with a GrokReader and a JSON writer, and I
can confirm that setup splits records by newline as expected. Nice touch
to have multiple records contained in the same flow file! Thanks for the
tip and the excuse to finally play with the record-oriented processors.
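
In case it helps anyone else, this is roughly the setup (property names
from memory, so double-check them against your NiFi version):

    ConvertRecord
      Record Reader: GrokReader
        Grok Expression: %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}
      Record Writer: JsonRecordSetWriter

    Input (one flow file, two lines):
      2017-09-25 14:00:01 ERROR disk full
      2017-09-25 14:00:02 WARN retrying write

    Output (still one flow file, roughly; exact fields depend on the
    reader's schema settings):
      [{"timestamp":"2017-09-25 14:00:01","level":"ERROR","message":"disk full"},
       {"timestamp":"2017-09-25 14:00:02","level":"WARN","message":"retrying write"}]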

Cheers,
Adam

Re: Processing multiple lines per flowfile with ExtractGrok

Posted by Bryan Bende <bb...@gmail.com>.
Adam,

I'm only a little bit familiar with Grok, but the ExtractGrok processor
reads the entire content of the flow file into memory and then performs
the grok match against that entire content, so it seems this processor
wasn't intended to match line-by-line.

The likely reason is that when you extract information into flow file
attributes, you typically do so in order to make some kind of routing
decision. For example, if you have a log message in the flow file
content, you might extract the log level (warn, error, etc.) and then
route all the logs of a given level somewhere. This works when there is
one log message per flow file, but it doesn't really work when there are
thousands of log messages per flow file, because then you would get
thousands of flow file attributes and it would be unclear what to route
on.
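
To make that concrete, here is a sketch of the one-message-per-flow-file
case (this assumes ExtractGrok's default "grok." attribute prefix and a
made-up pattern):

    ExtractGrok
      Grok Expression: %{LOGLEVEL:level} %{GREEDYDATA:message}
      Destination:     flowfile-attribute     (adds grok.level, grok.message)

    RouteOnAttribute
      error: ${grok.level:equals('ERROR')}    (dynamic property; matching
                                               flow files go to an 'error'
                                               relationship)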

What Joe pointed out with the record processors and the GrokReader is
a different approach where you should be able to avoid splitting up
your data. For example, in the above scenario you could use
PartitionRecord with a GrokReader to separate a flow file of log
messages into a flow file per log-level, without having to split into
thousands of individual flow files.
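
Roughly like this (the 'level' property is a dynamic property whose value
is a RecordPath, and the names here are just examples):

    PartitionRecord
      Record Reader: GrokReader               (pattern captures a 'level' field)
      Record Writer: JsonRecordSetWriter
      level:         /level

Each outgoing flow file holds all the records that share one level value
and carries a 'level' attribute you can route on.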

Hopefully that helps. Let us know if you have any other questions.

-Bryan

Re: Processing multiple lines per flowfile with ExtractGrok

Posted by Joe Witt <jo...@gmail.com>.
Adam,

I'm not very familiar with that specific processor, but I think you'll
find your case is far better handled using the record reader/writer
processors anyway. There is a GrokReader which you can use to parse
each line of a given input with a grok expression, extracting key
fields against your desired schema. Then there are writers for CSV,
JSON, Avro, etc. There are also processors to partition like records
together, validate that records match the expected structure, merge
records, convert formats, transfer to/from Kafka, split records, and
so on.
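
As a rough sketch of the reader side (property names from memory, so
verify against the docs):

    GrokReader (controller service)
      Grok Expression:        %{COMMONAPACHELOG}    <- or your own pattern
      Schema Access Strategy: Use String Fields From Grok Expression

Each matching line becomes one record with the fields named in the
pattern, and any of the writers can serialize the result.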

Thanks
Joe
