Posted to user@flume.apache.org by DSuiter RDX <ds...@rdx.com> on 2013/10/30 16:11:24 UTC

Preserving origin syslog information

Hi, just a general behavioral question.

We have a syslogTCP source catching remotely generated syslog events. They
go to an Avro sink, which delivers them to an Avro source, then into an
HDFS sink.

I currently have a test replicating channel delivering the events to HDFS
with the avro_event serializer, and also delivering the same events to HDFS
without it. The latter results in a text-encoded aggregate file, which
works well.
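For reference, the two-agent layout described above can be sketched as a Flume properties file. The agent, source, channel, and sink names here are illustrative, not from the thread, and ports/paths are placeholders:

```properties
# Agent 1 (collector): syslogTCP source -> Avro sink
a1.sources = syslog
a1.channels = c1
a1.sinks = avro-fwd

a1.sources.syslog.type = syslogtcp
a1.sources.syslog.port = 5140
a1.sources.syslog.channels = c1

a1.sinks.avro-fwd.type = avro
a1.sinks.avro-fwd.hostname = hdfs-agent.example.com
a1.sinks.avro-fwd.port = 4545
a1.sinks.avro-fwd.channel = c1

a1.channels.c1.type = memory

# Agent 2 (writer): Avro source replicated to two HDFS sinks,
# one with the avro_event serializer, one with the default text output
a2.sources = avro-in
a2.channels = c-avro c-text
a2.sinks = hdfs-avro hdfs-text

a2.sources.avro-in.type = avro
a2.sources.avro-in.bind = 0.0.0.0
a2.sources.avro-in.port = 4545
a2.sources.avro-in.selector.type = replicating
a2.sources.avro-in.channels = c-avro c-text

a2.sinks.hdfs-avro.type = hdfs
a2.sinks.hdfs-avro.hdfs.path = /flume/syslog/avro
a2.sinks.hdfs-avro.serializer = avro_event
a2.sinks.hdfs-avro.channel = c-avro

a2.sinks.hdfs-text.type = hdfs
a2.sinks.hdfs-text.hdfs.path = /flume/syslog/text
a2.sinks.hdfs-text.hdfs.fileType = DataStream
a2.sinks.hdfs-text.channel = c-text

a2.channels.c-avro.type = memory
a2.channels.c-text.type = memory
```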

The issue I would like clarification on is this:

When it is saved to HDFS as Avro, there is an epoch timestamp, the hostname,
and some severity and facility information saved along with the message
body. The Avro schema has a "headers" section and a "body" section; the
timestamp, etc., land in "headers," and the actual text is the "body."

However, when the file is saved to HDFS as text, the only thing we get is
the content of the "body" field, and there is no longer any host,
timestamp, etc., even though those are components of the original message.
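To make the headers/body split concrete, the schema used by the avro_event serializer is, as far as I can tell, a simple record of a string map plus raw bytes, something like:

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "headers", "type": {"type": "map", "values": "string"}},
    {"name": "body", "type": "bytes"}
  ]
}
```

So a text serializer that emits only the body would drop everything in the headers map, which matches the behavior described above.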

Where are the components from the generating server being stripped away? By
the syslogTCP source, or by the HDFS sink serializing into text?

Another way to summarize this: when the originating server writes events to
syslog, it includes timestamp and host fields. If we use Avro the whole
way, that information is kept as headers, but if we save as text, no
timestamp or host information is preserved. We would like it preserved so
we can programmatically parse the timestamp to sort by day. We would also
like to avoid Avro MapReduce for the time being, as that has proved
challenging. So, is there a way to get the WHOLE event as the "body" using
the syslogTCP source, or do we need to look at an exec source to tail
/var/log/messages on the generating server and send it that way?
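As an aside, if the main goal is sorting by day, the HDFS sink's path escape sequences may already be enough, since the syslog source parses the timestamp into an event header that the sink can use for partitioning. A hypothetical sink fragment (sink name is illustrative):

```properties
# Partition text output by day using the "timestamp" header
# that the syslog source sets -- no Avro required
a2.sinks.hdfs-text.type = hdfs
a2.sinks.hdfs-text.hdfs.path = /flume/syslog/%Y-%m-%d
a2.sinks.hdfs-text.hdfs.fileType = DataStream
```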

Thanks,
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

Re: Preserving origin syslog information

Posted by Jeff Lord <jl...@cloudera.com>.
Devin,

FLUME-1666 added a keepFields property that lets you preserve the timestamp
and hostname in the body of the generated Flume event.
That patch was committed to trunk a couple of weeks ago, so if you build
from trunk it should be available.
https://issues.apache.org/jira/browse/FLUME-1666
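With that patch in place, enabling it should be a one-line change on the source. A sketch, assuming a source named r1 on agent a1 (names are placeholders):

```properties
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.keepFields = true
a1.sources.r1.channels = c1
```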

Please note that this still does not preserve the priority.
I will be submitting another patch this evening which will do just that for
the syslogTCP, syslogUDP, and syslogMultiPort sources.

Best,

Jeff

