Posted to users@nifi.apache.org by Andre <an...@fucs.org> on 2016/02/07 14:35:55 UTC

MergeContent: Correlation Attribute Name syntax for matching syslog events

Hi there,

I was playing with the ListenSyslog processor and hit something I wanted to
confirm is the expected behavior:

ListenSyslog (parse = false)

connects success to :

ParseSyslog

connects success to:

MergeContent ("Correlation Attribute Name" set to ${syslog.hostname} )

connects merged to:

PutFile



It has been ages since I used MergeContent but I was wondering, wasn't
Correlation Attribute Name supposed to create the bins so dataflows
matching that attribute get bundled together?

If yes, is ${syslog.hostname} the value I want, or am I once again being
beaten by MergeContent and its black magic?


I ask because my dataflows are being bundled according to the size and age
of the bin, but not binned according to ${syslog.hostname}.



Cheers

Re: MergeContent: Correlation Attribute Name syntax for matching syslog events

Posted by Andre <an...@fucs.org>.
Bryan,

> There is an attribute called "syslog.sender" which is the host that the
> message was received from, the value is populated from the incoming
> connection in Java code, not from anything in the syslog message.
> This should essentially be the host of the syslog server/forwarder.

Correct. I saw that when I was using your code as the basis for
ListenLumberjack. :-)

> There is an attribute called "syslog.hostname" which is the hostname in
> the syslog message itself, which should be the host that produced that
> message and sent it to a syslog server.

Saw that as well.

This is particularly handy when fitting NiFi into "existing syslog server"
scenarios:

Syslog Producers --> existing rsyslog / syslog-ng server --> log shipping
mechanism (e.g. Flume, filebeat, heka, MiNiFi)   --> NiFi

(Same can be said of the RouteText suggestion together with batching as
suggested below)


> By default ListenSyslog has parse set to true and batch size set to 1. If
> you set parse to false and increase the batch size to say 100, it will try
> to grab a maximum of 100 messages in each execution of the processor
> (could be less depending on timing and what is available), and for those
> 100 messages it groups them by the "sender" (described above) and outputs
> a flow file per sender.

I saw those options and, while I am inclined to use them, I was wondering:
what happens to ordering in this case?

If I take the paper Rainer Gerhards (of rsyslog fame) wrote in 2010 (
http://www.gerhards.net/download/LinuxKongress2010rsyslog.pdf ), message
ordering in multi-threaded environments can be particularly hard
(rsyslog itself doesn't seem to provide hard ordering guarantees).

To the point where the author clearly states: "so it is safe to assume that
in almost all practical cases, the sequence in which messages are stored or
emitted is not a proper indication of the order of events."
(I wonder if anyone ever tried to use this statement in court in an attempt
to invalidate evidence :-) )

I still haven't tested it, but I would imagine that under multi-threaded
configurations, batching followed by RouteText would result in flowfiles
that are fairly out of order?

> Batching can definitely get much higher throughput on ListenSyslog, but
> if you have to parse them later in the flow with ParseSyslog then you
> still need to get each message into its own FlowFile, which most likely
> entails SplitText with a line count of 1 and then ParseSyslog.
> I don't know if this turns out much better than just letting ListenSyslog
> parse them in the first place. If you are letting ListenSyslog do the
> parsing then you can increase the concurrent tasks on the processor, which
> means more threads parsing syslog messages and outputting FlowFiles.

Correct. Another scenario is to process GetKafka, ListenHttp,
ListenLumberjack, etc. flowfiles containing syslog-formatted messages
(which I happen to be testing, hence the strange setup described
previously).


Cheers

Re: MergeContent: Correlation Attribute Name syntax for matching syslog events

Posted by Bryan Bende <bb...@gmail.com>.
I believe what Joe was referring to with RouteText was that it can take a
regular expression with a capture group, and output a FlowFile per unique
value of the capturing group. So if the incoming data is a FlowFile with a
bunch of syslog messages and you provide a regex that captures hostname, it
can produce a FlowFile per unique hostname with all the messages that go
with that hostname.
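
As a rough illustration of the grouping idea described above (not
RouteText's actual implementation), the following Python sketch groups the
lines of one multi-message payload by the value of a regex capture group.
The pattern, the sample messages and the "unmatched" key are assumptions
made up for the example; real syslog traffic may need a different regex.

```python
import re
from collections import defaultdict

# Hypothetical pattern for RFC 3164-style lines: "<PRI>Mmm dd hh:mm:ss HOST ..."
# The single capture group is the hostname field.
HOSTNAME_RE = re.compile(r"^<\d+>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\s+(\S+)\s")


def group_by_hostname(content):
    """Return one newline-joined payload per unique captured hostname."""
    groups = defaultdict(list)
    for line in content.splitlines():
        match = HOSTNAME_RE.match(line)
        # Lines that do not match go to a catch-all group (an assumption).
        key = match.group(1) if match else "unmatched"
        groups[key].append(line)
    return {host: "\n".join(lines) for host, lines in groups.items()}


sample = "\n".join([
    "<34>Feb  7 14:35:55 web01 sshd[1]: session opened",
    "<34>Feb  7 14:35:56 db01 postgres[2]: checkpoint complete",
    "<34>Feb  7 14:35:57 web01 sshd[1]: session closed",
])
grouped = group_by_hostname(sample)
for host, payload in grouped.items():
    print(host, "->", len(payload.splitlines()), "message(s)")
```

The point of grouping rather than splitting is that the payload is scanned
once and only one output per hostname is produced, instead of one output
per line that later has to be merged again.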

I don't want to sidetrack the conversation about how to use MergeContent
properly, but wanted to add a couple of things about how ListenSyslog
works...

There is an attribute called "syslog.sender" which is the host that the
message was received from; the value is populated from the incoming
connection in Java code, not from anything in the syslog message. This
should essentially be the host of the syslog server/forwarder.

There is an attribute called "syslog.hostname" which is the hostname in the
syslog message itself, which should be the host that produced that message
and sent it to a syslog server.

By default ListenSyslog has parse set to true and batch size set to 1. If
you set parse to false and increase the batch size to say 100, it will try
to grab a maximum of 100 messages in each execution of the processor (could
be less depending on timing and what is available), and for those 100
messages it groups them by the "sender" (described above) and outputs a
flow file per sender.
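
A toy model of the batching behaviour described above, assuming each
received message is available as a (sender, raw message) pair. The function
name, the sample addresses and the data layout are hypothetical; this is
not ListenSyslog's code.

```python
from collections import defaultdict


def batch_by_sender(received, batch_size=100):
    """Take up to batch_size messages and emit one payload per sender.

    received is a list of (sender, raw_message) tuples in arrival order.
    """
    batch = received[:batch_size]  # may hold fewer than batch_size items
    per_sender = defaultdict(list)
    for sender, raw in batch:
        per_sender[sender].append(raw)
    return {sender: "\n".join(msgs) for sender, msgs in per_sender.items()}


received = [
    ("10.0.0.5", "<34>Feb  7 14:35:55 web01 sshd[1]: session opened"),
    ("10.0.0.5", "<34>Feb  7 14:35:56 db01 postgres[2]: checkpoint complete"),
    ("10.0.0.9", "<34>Feb  7 14:35:57 app01 cron[3]: job started"),
]
payloads = batch_by_sender(received, batch_size=100)
for sender, payload in payloads.items():
    print(sender, "->", len(payload.splitlines()), "message(s)")
```

Note that the grouping key is the sender of the connection, so a forwarder
relaying messages for several hosts (10.0.0.5 in this made-up sample) ends
up with web01 and db01 in the same payload.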

Batching can definitely get much higher throughput on ListenSyslog, but if
you have to parse them later in the flow with ParseSyslog then you still
need to get each message into its own FlowFile, which most likely entails
SplitText with a line count of 1 and then ParseSyslog. I don't know if this
turns out much better than just letting ListenSyslog parse them in the
first place. If you are letting ListenSyslog do the parsing then you can
increase the concurrent tasks on the processor, which means more threads
parsing syslog messages and outputting FlowFiles.

I think the batching concept makes the most sense when you don't need to
parse the messages and just want to deliver the raw messages somewhere like
HDFS, or Kafka.

-Bryan



Re: MergeContent: Correlation Attribute Name syntax for matching syslog events

Posted by Andre <an...@fucs.org>.
> You can use RouteText to group (rather than split) on some shared pattern
> such as the hostname.  Will be far more efficient than splitting each line
> then grouping on that hostname.

Not sure I understand?

Re: MergeContent: Correlation Attribute Name syntax for matching syslog events

Posted by Joe Witt <jo...@gmail.com>.
Hmm...will do some digging.

You can use RouteText to group (rather than split) on some shared pattern
such as the hostname.  Will be far more efficient than splitting each line
then grouping on that hostname.

We still need to make sure merge is good to go though.  Aldrin was unable
to replicate what was being seen.

If you can share your dataflow config as a template that would be cool.

Thanks
Joe

Re: MergeContent: Correlation Attribute Name syntax for matching syslog events

Posted by Andre <an...@fucs.org>.
Joe,

It sure does.

However, I am using a 0.4.2 snapshot that in theory should be based on
0.5.0? (e.g. my current dev instance already has the ListenRELP processor
Bryan put together).

So far I tried:

syslog.hostname
'syslog.hostname'
${syslog.hostname}
${'syslog.hostname'}
${"syslog.hostname"}

All with the same result.

I wonder if this is linked to
https://issues.apache.org/jira/browse/NIFI-1438 ?






Re: MergeContent: Correlation Attribute Name syntax for matching syslog events

Posted by Joe Witt <jo...@gmail.com>.
Andre

I believe that until the next release, which is 0.5.0, MergeContent did not
allow expression language statements as the correlation attribute.  By
using an expression language statement there, it is basically matching
everything.

For now, just put 'syslog.hostname' there instead.
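
One way to picture the behaviour described above is a toy model in which
the property value is treated purely as an attribute name to look up on
each flowfile. This is only a sketch of that explanation, not MergeContent's
source; the function name and the dict-based flowfiles are hypothetical.

```python
from collections import defaultdict


def bin_by_correlation(flowfiles, correlation_attribute_name):
    """Group flowfile attribute maps by the value of one named attribute."""
    bins = defaultdict(list)
    for attributes in flowfiles:
        # The property is used as a literal attribute name. If no attribute
        # has that name, the lookup yields None for every flowfile.
        key = attributes.get(correlation_attribute_name)
        bins[key].append(attributes)
    return dict(bins)


flowfiles = [
    {"syslog.hostname": "web01"},
    {"syslog.hostname": "db01"},
    {"syslog.hostname": "web01"},
]

by_name = bin_by_correlation(flowfiles, "syslog.hostname")
by_expression = bin_by_correlation(flowfiles, "${syslog.hostname}")
print(sorted(by_name))     # one bin per hostname
print(len(by_expression))  # a single bin holding everything
```

In this model no flowfile carries an attribute literally named
"${syslog.hostname}", so every lookup returns the same empty key and all
flowfiles land in one bin, which would look like merging by size and age
only.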

Make sense?

Thanks
Joe

Re: MergeContent: Correlation Attribute Name syntax for matching syslog events

Posted by Andre <an...@fucs.org>.
> Are you doing anything special between Listen and Parse? Trying to
> understand the reasoning for why you split those.

No. But I will... hence the split. :-)

I also have the impression that splitting the two slightly increases the
number of events per second reaching NiFi.

> E.g. in default ListenSyslog mode I can see syslog.hostname correctly
> set. Could it be MergeContent settings? Maybe worth sharing its config.

Same thing here.

I can see the parsing occurring (via Provenance)  but no luck with the
merge.

Settings are:

Strategy - bin packing
Format - binary concat
Attribute strategy - keep only common
Correlation attribute - ${syslog.hostname}
Minimum number of entries - 100
Maximum number of entries - no value set
Minimum group size - 10MB
Maximum group size - no value set
Max bin age - 30s
Max number of bins - 100
Delimiter strategy - text
Header - no value set
Footer - no value set
Demarcator - no value set
Compression - 0
Keep path - false

Re: MergeContent: Correlation Attribute Name syntax for matching syslog events

Posted by Andrew Grande <ag...@hortonworks.com>.
Hi,

Are you doing anything special between Listen and Parse? Trying to understand the reasoning for why you split those.

E.g. in default ListenSyslog mode I can see syslog.hostname correctly set. Could it be MergeContent settings? Maybe worth sharing its config.

Andrew
