Posted to user@flume.apache.org by Emre Bastuz <in...@emre.de> on 2012/06/14 10:37:39 UTC

Syslog messages incomplete - date and host missing

Hello Flume Users!

I can see some odd behaviour on my Flume + HDFS setup:
In log entries from a mail server, the date, time, and host are missing; only the message body appears.

This is what the "broken" entries look like (Cyrus IMAP server via rsyslog with mail.* @<flume_host>):
cyrus/master[1120]: process 20155 exited, status 0

This is how the same message appears locally on the mail server (/var/log/mail.log):
Jun 13 17:11:01 mail cyrus/master[1120]: process 20155 exited, status 0

Log entries from a BSD packet filter, however, seem to be saved correctly via Flume:
Jun 13 17:11:21 packetfilter-host: 00:00:28.819270 rule 1/0(match): block in on em0: ...

In short: the mail logs contain only the message body, while the packet filter logs contain the whole message.

By doing a packet capture on the Flume box I could verify that the mail server is indeed sending the full information, but somewhere along the Flume path the
information is lost.

Any idea what's going wrong?

Cheers,

Emre

Flume version: 1.1.0 installed via Cloudera repository (1.1.0+120-1.cdh4.0.0.p0.14~squeeze-cdh4.0.0)

Flume config:
agent.sources = syslogSource
agent.channels = memoryChannel
agent.sinks = hadoopSink
agent.sources.syslogSource.type = syslogudp
agent.sources.syslogSource.port = 5514
agent.sources.syslogSource.host = 1.1.1.1
agent.sources.syslogSource.channels = memoryChannel
agent.sinks.hadoopSink.type = hdfs
agent.sinks.hadoopSink.hdfs.path = hdfs://hdfs-node1/tmp/flume/
agent.sinks.hadoopSink.hdfs.filePrefix = TESTDATA
agent.sinks.hadoopSink.hdfs.fileType = DataStream
agent.sinks.hadoopSink.hdfs.writeFormat = Text
agent.sinks.hadoopSink.channel = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100

Host system: Debian Squeeze amd64

Re: Syslog messages incomplete - date and host missing

Posted by Hari Shreedharan <hs...@cloudera.com>.
Hi Emre,  

Looking at the two log formats provided:

Jun 13 17:11:01 mail cyrus/master[1120]: process 20155 exited, status 0


Jun 13 17:11:21 packetfilter-host: 00:00:28.819270 rule 1/0(match): block in on em0: ...

I think the Syslog source is probably unable to parse a hostname from the second one, because it looks for a space after the hostname, and here the hostname is followed by a colon. If you check the headers, you should see an "Invalid" status header on the event. As Arvind mentioned, you should be able to use the HDFS sink's escaping capability or the multiplexing channel selector for routing.

An event that fails parsing should also cause the Syslog source to log it. Can you check whether it is logging the packetfilter-host events as invalid?

Thanks
Hari


--  
Hari Shreedharan


On Thursday, June 14, 2012 at 3:07 AM, Emre Bastuz wrote:

> Hello Hari!
>  
> Thanks for the input! I'm afraid I don't get it :)
>  
> > The Syslog source parses the date and host information into flume event headers and only the message itself is sent as the event body
>  
> This would explain why only the body appears for my mailserver logs. However, my packetfilter logs are being saved with **all** the data, including hostname and
> date-time.
>  
> > If you check the event headers, you will see headers which have the host information and the timestamp
>  
> Is there any way to "query" these event headers or are they used only with more "sophisticated" sources like Avro etc. (I am using syslogudp source)?
>  
> Cheers,
>  
> Emre
>  
> Am 14.06.12 11:21, schrieb Hari Shreedharan:
> > Hi Emre,
> >  
> > The Syslog source parses the date and host information into flume event headers and only the message itself is sent as the event body. Using the text serializer
> > (which is the default) in the HDFS Sink causes the body to be written out to HDFS. If you check the event headers, you will see headers which have the host
> > information and the timestamp. If you want this data to be written out to HDFS, you can write a serializer(you will need to
> > implement org.apache.flume.serialization.EventSerializer interface, and add the implementation to your classpath and mention that class's fully qualified class
> > name in the configuration hadoopSink.serializer) that dumps it out too - we do have an AvroSerializer that does this in Avro format.
> >  
> >  
> > Thanks
> > Hari
> >  
> > --  
> > Hari Shreedharan


Re: Syslog messages incomplete - date and host missing

Posted by Arvind Prabhakar <ar...@apache.org>.
Hi,

On Thu, Jun 14, 2012 at 3:07 AM, Emre Bastuz <in...@emre.de> wrote:

> > The Syslog source parses the date and host information into flume event
> headers and only the message itself is sent as the event body
>
> This would explain why only the body appears for my mailserver logs.
> However, my packetfilter logs are being saved with **all** the data,
> including hostname and
> date-time.
>

When the Syslog source is unable to parse an event correctly, it retains the
entire event and adds a special header named flume.syslog.status with the
value "invalid". You can use a multiplexing channel selector to route these
events to a different destination.

As for the parsing capability of the Syslog source: it supports the RFC 5424
and RFC 3164 formats. Unfortunately, these specifications do contain
ambiguities, which may be why your events are not being parsed correctly.
Look at flume.log for details.
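
The ambiguity can be illustrated with a small sketch. This is not Flume's actual parser, just a toy version of the "hostname must be followed by a space" heuristic that the replies in this thread describe:

```python
import re

# Simplified RFC 3164-style pattern: timestamp, then a hostname that must
# be terminated by a space, then the message body. NOT Flume's real parser.
PATTERN = re.compile(
    r'^(?P<timestamp>[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}) '
    r'(?P<hostname>[A-Za-z0-9._-]+) '   # hostname must end with a space
    r'(?P<body>.*)$'
)

def parse(msg):
    """Return extracted headers, or mark the whole event invalid."""
    m = PATTERN.match(msg)
    if not m:
        return {'flume.syslog.status': 'Invalid', 'body': msg}
    return {'host': m.group('hostname'),
            'timestamp': m.group('timestamp'),
            'body': m.group('body')}

mail = 'Jun 13 17:11:01 mail cyrus/master[1120]: process 20155 exited, status 0'
pf = 'Jun 13 17:11:21 packetfilter-host: 00:00:28.819270 rule 1/0(match): block in on em0: ...'

print(parse(mail))  # "mail" parses as hostname; only the body remains
print(parse(pf))    # hostname ends in ':' -> no match -> whole event kept
```

This mirrors the observed behaviour: the mail entry parses, so its date and host move into headers and only the body is written out, while the packet filter entry fails to parse and is kept whole.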


>
> > If you check the event headers, you will see headers which have the host
> information and the timestamp
>
> Is there any way to "query" these event headers or are they used only with
> more "sophisticated" sources like Avro etc. (I am using syslogudp source)?
>

Event headers are available for routing the events. If you are using an
HDFS sink, you could use the syntax %{header} in the destination path to
route them separately. You could also use your own serializer to do custom
formatting and extraction if the terminal sink supports it.
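
For example, extending Emre's sink configuration (assuming the source sets a "host" header on successfully parsed events; check your event headers to confirm the header name):

```properties
# Route each host's events into its own directory via header escaping.
agent.sinks.hadoopSink.hdfs.path = hdfs://hdfs-node1/tmp/flume/%{host}
```

Time-based escapes such as %Y-%m-%d work the same way, but rely on a timestamp header being present on the event.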

Regards,
Arvind Prabhakar


>
> Cheers,
>
> Emre
>
> Am 14.06.12 11:21, schrieb Hari Shreedharan:
> > Hi Emre,
> >
> > The Syslog source parses the date and host information into flume event
> headers and only the message itself is sent as the event body. Using the
> text serializer
> > (which is the default) in the HDFS Sink causes the body to be written
> out to HDFS. If you check the event headers, you will see headers which
> have the host
> > information and the timestamp. If you want this data to be written out
> to HDFS, you can write a serializer(you will need to
> > implement org.apache.flume.serialization.EventSerializer interface, and
> add the implementation to your classpath and mention that class's fully
> qualified class
> > name in the configuration hadoopSink.serializer) that dumps it out too -
> we do have an AvroSerializer that does this in Avro format.
> >
> >
> > Thanks
> > Hari
> >
> > --
> > Hari Shreedharan
>
>

Re: Syslog messages incomplete - date and host missing

Posted by Emre Bastuz <in...@emre.de>.
Hello Hari!

Thanks for the input! I'm afraid I don't get it :)

> The Syslog source parses the date and host information into flume event headers and only the message itself is sent as the event body

This would explain why only the body appears for my mailserver logs. However, my packetfilter logs are being saved with **all** the data, including hostname and
date-time.

> If you check the event headers, you will see headers which have the host information and the timestamp

Is there any way to "query" these event headers or are they used only with more "sophisticated" sources like Avro etc. (I am using syslogudp source)?

Cheers,

Emre

Am 14.06.12 11:21, schrieb Hari Shreedharan:
> Hi Emre,
> 
> The Syslog source parses the date and host information into flume event headers and only the message itself is sent as the event body. Using the text serializer
> (which is the default) in the HDFS Sink causes the body to be written out to HDFS. If you check the event headers, you will see headers which have the host
> information and the timestamp. If you want this data to be written out to HDFS, you can write a serializer(you will need to
> implement org.apache.flume.serialization.EventSerializer interface, and add the implementation to your classpath and mention that class's fully qualified class
> name in the configuration hadoopSink.serializer) that dumps it out too - we do have an AvroSerializer that does this in Avro format.
> 
> 
> Thanks
> Hari
> 
> -- 
> Hari Shreedharan
> 
> On Thursday, June 14, 2012 at 1:37 AM, Emre Bastuz wrote:
> 
>> Hello Flume Users!
>>
>> I can see some odd behaviour on my Flume + HDFS setup:
>> In log entries from a mailserver the entries for date, time and host are missing - only the message body appears.
>>
>> This is how the "broken" entries look like (Cyrus IMAP server via rsyslog with mail.* @<flume_host>):
>> cyrus/master[1120]: process 20155 exited, status 0
>>
>> This is how the same message appears localy on the mailserver (/var/log/mail.log):
>> Jun 13 17:11:01 mail cyrus/master[1120]: process 20155 exited, status 0
>>
>> Log entries from a BSD packetfilter however, seem to be saved correctly via Flume:
>> Jun 13 17:11:21 packetfilter-host: 00:00:28.819270 rule 1/0(match): block in on em0: ...
>>
>> So: mail only the message body, packetfilter with whole message.
>>
>> By doing a packet capture on the Flume box I could verify that the mailserver is indeed sending the information but somewhere along the Flume way the
>> information is lost.
>>
>> Any idea what's going wrong?
>>
>> Cheers,
>>
>> Emre
>>
>> Flume version: 1.1.0 installed via Cloudera repository (1.1.0+120-1.cdh4.0.0.p0.14~squeeze-cdh4.0.0)
>>
>> Flume config:
>> agent.sources = syslogSource
>> agent.channels = memoryChannel
>> agent.sinks = hadoopSink
>> agent.sources.syslogSource.type = syslogudp
>> agent.sources.syslogSource.port = 5514
>> agent.sources.syslogSource.host = 1.1.1.1
>> agent.sources.syslogSource.channels = memoryChannel
>> agent.sinks.hadoopSink.type = hdfs
>> agent.sinks.hadoopSink.hdfs.path = hdfs://hdfs-node1/tmp/flume/
>> agent.sinks.hadoopSink.hdfs.filePrefix = TESTDATA
>> agent.sinks.hadoopSink.hdfs.fileType = DataStream
>> agent.sinks.hadoopSink.hdfs.writeFormat = Text
>> agent.sinks.hadoopSink.channel = memoryChannel
>> agent.channels.memoryChannel.type = memory
>> agent.channels.memoryChannel.capacity = 100
>>
>> Host system: Debian Squeeze amd64
> 


Re: Syslog messages incomplete - date and host missing

Posted by Hari Shreedharan <hs...@cloudera.com>.
Hi Emre,  

The Syslog source parses the date and host information into Flume event headers, and only the message itself is sent as the event body. Using the text serializer (which is the default) in the HDFS sink causes only the body to be written out to HDFS. If you check the event headers, you will see headers containing the host information and the timestamp. If you want this data written out to HDFS as well, you can write a serializer that dumps it too: implement the org.apache.flume.serialization.EventSerializer interface, add the implementation to your classpath, and set that class's fully qualified name in the configuration property hadoopSink.serializer. We do have an AvroSerializer that does this in Avro format.
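
The wiring described above would look roughly like this (the class name is a placeholder for your own implementation; depending on the Flume version, the property may need to point at the serializer's Builder class rather than the serializer itself):

```properties
# Placeholder: your own EventSerializer implementation, on the Flume classpath.
agent.sinks.hadoopSink.serializer = com.example.HeaderAwareSerializer$Builder
```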


Thanks
Hari

--  
Hari Shreedharan


On Thursday, June 14, 2012 at 1:37 AM, Emre Bastuz wrote:

> Hello Flume Users!
>  
> I can see some odd behaviour on my Flume + HDFS setup:
> In log entries from a mailserver the entries for date, time and host are missing - only the message body appears.
>  
> This is how the "broken" entries look like (Cyrus IMAP server via rsyslog with mail.* @<flume_host>):
> cyrus/master[1120]: process 20155 exited, status 0
>  
> This is how the same message appears localy on the mailserver (/var/log/mail.log):
> Jun 13 17:11:01 mail cyrus/master[1120]: process 20155 exited, status 0
>  
> Log entries from a BSD packetfilter however, seem to be saved correctly via Flume:
> Jun 13 17:11:21 packetfilter-host: 00:00:28.819270 rule 1/0(match): block in on em0: ...
>  
> So: mail only the message body, packetfilter with whole message.
>  
> By doing a packet capture on the Flume box I could verify that the mailserver is indeed sending the information but somewhere along the Flume way the
> information is lost.
>  
> Any idea what's going wrong?
>  
> Cheers,
>  
> Emre
>  
> Flume version: 1.1.0 installed via Cloudera repository (1.1.0+120-1.cdh4.0.0.p0.14~squeeze-cdh4.0.0)
>  
> Flume config:
> agent.sources = syslogSource
> agent.channels = memoryChannel
> agent.sinks = hadoopSink
> agent.sources.syslogSource.type = syslogudp
> agent.sources.syslogSource.port = 5514
> agent.sources.syslogSource.host = 1.1.1.1
> agent.sources.syslogSource.channels = memoryChannel
> agent.sinks.hadoopSink.type = hdfs
> agent.sinks.hadoopSink.hdfs.path = hdfs://hdfs-node1/tmp/flume/
> agent.sinks.hadoopSink.hdfs.filePrefix = TESTDATA
> agent.sinks.hadoopSink.hdfs.fileType = DataStream
> agent.sinks.hadoopSink.hdfs.writeFormat = Text
> agent.sinks.hadoopSink.channel = memoryChannel
> agent.channels.memoryChannel.type = memory
> agent.channels.memoryChannel.capacity = 100
>  
> Host system: Debian Squeeze amd64