Posted to user@flume.apache.org by Josh West <js...@one.com> on 2012/10/26 16:05:08 UTC

Syslog Infrastructure with Flume

Hey folks,

I've been experimenting with Flume for a few weeks now, trying to 
determine an approach to designing a reliable, highly available, 
scalable system to store logs from various sources, including syslog.  
Ideally, this system will meet the following requirements:

 1. Logs from syslog across all servers make their way into HDFS.
 2. Logs are stored in HDFS in a manner that is available for
    post-processing:
      * Example:  HIVE partitions - with HDFS Flume Sink, can set
        hdfs.path to
        hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
      * Example:  Custom map reduce jobs...
 3. Logs are stored in HDFS in a manner that is available for "reading"
    by sysadmins:
      * During troubleshooting/firefighting, it is quite helpful to be
        able to login to a central logging system and tail -f / grep logs.
      * We need to be able to see the logs "live".

Some folks may be wondering why we are choosing Flume for syslog 
instead of something like Graylog2 or Logstash.  The answer is we will 
be using Flume + Hadoop for the transport and processing of other types 
of data in addition to syslog.  For example, webserver access logs for 
post processing and statistical analysis.  So, we would like to make the 
most use of the Hadoop cluster, keeping all logs of all types in one 
redundant/scalable solution. Additionally, by keeping both syslog and 
webserver access logs in Hadoop/HDFS, we can begin to correlate events.

I've run into some snags while attempting to implement Flume in a manner 
that satisfies the requirements listed at the top of this message:

 1. Logs to HDFS:
      * I can indeed use the Flume HDFS Sink to reliably write logs into
        HDFS.
      * Needed to write a custom serializer to add the Hostname and
        Timestamp fields back to the syslog messages.
      * See: https://issues.apache.org/jira/browse/FLUME-1666
 2. Logs to HDFS in a manner available for
    reading/firefighting/troubleshooting by sysadmins (a minimal config
    sketch follows this list):
      * Flume HDFS Sink uses the BucketWriter for recording flume events
        to HDFS.
      * Creates data files like:
        /flume/syslog/server=%{host}/facility=%{Facility}/FlumeData.1350997160213
      * Each file name is the FlumeData prefix (or a custom prefix),
        followed by a dot, followed by the Unix timestamp (in
        milliseconds) of when the file was created.
          o This is somewhat necessary... Since multiple Flume writers
            may write to the same HDFS path, and a file cannot be opened
            by more than one writer, each writer should write to its own
            file.
      * The latest file, currently being written to, is suffixed with
        ".tmp".
      * This approach is not very sysadmin-friendly:
          o You have to find the latest file (i.e. the .tmp file) and run
            hadoop fs -tail -f /path/to/file.tmp
          o Hadoop's fs -tail -f command first prints the entire file's
            contents, then begins tailing.
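
For reference, the setup described above boils down to roughly the 
following agent configuration.  This is only a minimal sketch; the agent, 
source, channel and sink names, the listen port, and the serializer class 
are placeholders rather than my exact config:

    # Sketch of a syslog -> HDFS agent (names and values are illustrative)
    agent1.sources  = syslog-src
    agent1.channels = mem-ch
    agent1.sinks    = hdfs-sink

    # Syslog TCP source (a syslogudp source would be configured similarly)
    agent1.sources.syslog-src.type     = syslogtcp
    agent1.sources.syslog-src.host     = 0.0.0.0
    agent1.sources.syslog-src.port     = 5140
    agent1.sources.syslog-src.channels = mem-ch

    agent1.channels.mem-ch.type     = memory
    agent1.channels.mem-ch.capacity = 10000

    # HDFS sink, bucketing by the host and Facility event headers
    agent1.sinks.hdfs-sink.channel         = mem-ch
    agent1.sinks.hdfs-sink.type            = hdfs
    agent1.sinks.hdfs-sink.hdfs.path       = hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
    agent1.sinks.hdfs-sink.hdfs.fileType   = DataStream
    agent1.sinks.hdfs-sink.hdfs.filePrefix = FlumeData
    # Hypothetical custom serializer that re-adds hostname/timestamp (see FLUME-1666)
    agent1.sinks.hdfs-sink.serializer      = com.example.flume.SyslogHeaderSerializer$Builder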

The sum of it all is that Flume is awesome for getting syslog (and other) 
data into HDFS for post-processing, but not the best at getting it into 
HDFS in a sysadmin troubleshooting/firefighting format.  In an ideal 
world, I would have syslog data coming into Flume via one transport (i.e. 
the SyslogTcp Source or SyslogUDP Source) and have it written into HDFS 
in a manner that is both post-processable and sysadmin-friendly, but it 
looks like this isn't going to happen.

I've thus investigated some alternative approaches to meet the 
requirements.  One of these approaches is to have all of my servers send 
their syslog messages to a central box running rsyslog.  Then, rsyslog 
would perform one of the following actions:

 1. Write logs to HDFS directly using 'omhdfs' module, in a format that
    is both post-processable and sysadmin-friendly :-)
 2. Write logs to HDFS directly using 'hadoop-fuse-dfs' utility, which
    has HDFS mounted as a filesystem.
 3. Write logs to a local filesystem and also replicate logs into a
    flume agent, configured with a SyslogSource and HDFS sink.

Option #1 sounds great.  But unfortunately the 'omhdfs' module for 
rsyslog isn't working very well.  I've gotten it to log in to 
Hadoop/HDFS, but it has issues creating/appending files.  Additionally, 
templating is somewhat suspect (i.e. making directories like 
/syslog/someserver/somefacility dynamically).

Option #2 sounds reasonable, but either the HDFS FUSE module doesn't 
support append mode (yet) or rsyslog is trying to create/open the files 
in a manner not compatible with HDFS.  No surprise, as we all know HDFS 
can be somewhat "special" at times ;-)  It doesn't really matter 
anyway... Trying to "tail -f" a file mounted via HDFS FUSE is rather 
useless: the data only reaches the tail command once a full block (64MB, 
or whatever block size you use) has been written to the file.  One would 
only be able to use "hadoop fs -tail -f /path/to/log", which has its own 
issues, mentioned previously.

Option #3 would definitely work.  However, now I'm storing my logs 
twice: once on some local filesystem and again in HDFS.  It works, but 
it's not ideal as it's a waste of space.  And as you've probably noticed 
from this email so far, I'd prefer the *ideal* solution :-)

*Note*:  Astute flumers would probably look at option #3 and recommend 
making use of the RollingFileSink in addition to the HDFSSink.  
Unfortunately, the RollingFileSink doesn't support templated/dynamic 
directory creation like the HDFSSink with its hdfs.path setting of 
"hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}".

So what exactly am I asking here?  Well, I'd like to know first how 
others are doing this.  A hybrid of rsyslog and Flume?  All and only 
Flume?  With custom serializers/interceptors/sinks?  Or perhaps... how 
would you recommend I handle this?

Thanks for any and all thoughts you can provide.


-- 
Josh West
Lead Systems Administrator
One.com, jsw@one.com


Re: Syslog Infrastructure with Flume

Posted by Josh West <js...@one.com>.
Hi Ron,

Yep -- looks like we'll be storing logs twice, for the time being. But 
we're *so* close to not having to!

On 10/26/2012 11:06 PM, Ron Thielen wrote:
>
> I am exactly where you are with this, except for the problem of my not 
> having had time to write a serializer to address the Hostname 
> Timestamp issue.  Questions about the use of Flume in this manner seem 
> to recur on a regular basis, so it seems a common use case.
>
> Sorry I cannot offer a solution since I am in your shoes at the 
> moment, unfortunately looking at storing logs twice.
>
> Ron Thielen
>
> Ronald J Thielen
>
> *From:*Josh West [mailto:jsw@one.com]
> *Sent:* Friday, October 26, 2012 9:05 AM
> *To:* user@flume.apache.org
> *Subject:* Syslog Infrastructure with Flume
>
> [...]

-- 
Josh West
Lead Systems Administrator
One.com, jsw@one.com


Re: Syslog Infrastructure with Flume

Posted by Roshan Naik <ro...@hortonworks.com>.
yes.. i agree.
-roshan


On Thu, Nov 1, 2012 at 3:52 PM, Hari Shreedharan
<hs...@cloudera.com> wrote:

>  Ok. I am not aware of what is happening in the Hive community. I'd just
> like to make sure that any effort that we put into getting stuff into Flume
> should be useful to as large a community as possible - in this case, that
> is the Hive community.
>
> Thanks,
> Hari
>
> --
> Hari Shreedharan
>
> On Thursday, November 1, 2012 at 11:19 AM, Roshan Naik wrote:
>
> Hari,
>    The way things are progressing... it appears that HCatalog would become
> part of Hive and may not have an independent existence.  The final decision
> is not yet made but thats the direction it seems to be heading. So what I
> am now referring to as a "HCatalog sink" could be just considered as the
> "Hive/HCatalog sink" or just "Hive" sink.
>
> -roshan
>
>
> On Wed, Oct 31, 2012 at 1:39 PM, Hari Shreedharan <
> hshreedharan@cloudera.com> wrote:
>
>  Thanks Roshan. I understand that it makes it easier for us to use
> HCatalog - but I am not sure what percentage of Hive users actually use
> HCat. If we simply use Hive directly, we would be able to address a larger
> community - which I would definitely like (thought I don't know how
> feasible it is). I think it might be better to use Hive directly at least
> to make it more useful to a larger community.
>
>
> Hari
>
>
>

Re: Syslog Infrastructure with Flume

Posted by Hari Shreedharan <hs...@cloudera.com>.
Ok. I am not aware of what is happening in the Hive community. I'd just like to make sure that any effort that we put into getting stuff into Flume should be useful to as large a community as possible - in this case, that is the Hive community. 

Thanks,
Hari

-- 
Hari Shreedharan


On Thursday, November 1, 2012 at 11:19 AM, Roshan Naik wrote:

> Hari,
>    The way things are progressing... it appears that HCatalog would become part of Hive and may not have an independent existence.  The final decision is not yet made but thats the direction it seems to be heading. So what I am now referring to as a "HCatalog sink" could be just considered as the "Hive/HCatalog sink" or just "Hive" sink.
> 
> -roshan
> 
> 
> On Wed, Oct 31, 2012 at 1:39 PM, Hari Shreedharan <hshreedharan@cloudera.com (mailto:hshreedharan@cloudera.com)> wrote:
> > Thanks Roshan. I understand that it makes it easier for us to use HCatalog - but I am not sure what percentage of Hive users actually use HCat. If we simply use Hive directly, we would be able to address a larger community - which I would definitely like (thought I don't know how feasible it is). I think it might be better to use Hive directly at least to make it more useful to a larger community.  
> > 
> > 
> > Hari 
> > 


Re: Syslog Infrastructure with Flume

Posted by Roshan Naik <ro...@hortonworks.com>.
Hari,
   The way things are progressing... it appears that HCatalog would become
part of Hive and may not have an independent existence.  The final decision
is not yet made, but that's the direction it seems to be heading. So what I
am now referring to as an "HCatalog sink" could just be considered the
"Hive/HCatalog sink" or just the "Hive" sink.

-roshan


On Wed, Oct 31, 2012 at 1:39 PM, Hari Shreedharan <hshreedharan@cloudera.com
> wrote:

>  Thanks Roshan. I understand that it makes it easier for us to use
> HCatalog - but I am not sure what percentage of Hive users actually use
> HCat. If we simply use Hive directly, we would be able to address a larger
> community - which I would definitely like (thought I don't know how
> feasible it is). I think it might be better to use Hive directly at least
> to make it more useful to a larger community.
>
>
> Hari
>
>

Re: Syslog Infrastructure with Flume

Posted by Hari Shreedharan <hs...@cloudera.com>.
Thanks Roshan. I understand that it makes it easier for us to use HCatalog - but I am not sure what percentage of Hive users actually use HCat. If we simply use Hive directly, we would be able to address a larger community - which I would definitely like (though I don't know how feasible it is). I think it might be better to use Hive directly, at least to make it more useful to a larger community.  


Hari 

-- 
Hari Shreedharan


On Wednesday, October 31, 2012 at 1:31 PM, Roshan Naik wrote:

> Hari,
>   Indeed from the end user point of view.. both would accomplish roughly the same. From the implementation standpoint, however,  the Hive sink would have to deal with HDFS (for data transfer) and Hive metastore separately to get the job done. The HCat sink implementation, on the other hand, would accomplish the same using just the HCat apis (for both aspects).  The HCat sink implementation would be much simpler and cleaner as it wont have to reinvent things (even if we reuse code form HDFS sink). The grunt work of of moving data and "transactionally committing" them into partitions is handled by HCat apis.
> -roshan
> 
> 
> 
> On Wed, Oct 31, 2012 at 12:22 PM, Hari Shreedharan <hshreedharan@cloudera.com (mailto:hshreedharan@cloudera.com)> wrote:
> > Roshan,
> > 
> > I am not a Hive/HCatalog pro, but I am just wondering why an HCatalog sink, rather than a Hive Sink? Hive is definitely very popular and would be well appreciated if we could get a Hive Sink written. Since HCatalog is supposed to be compatible with the Hive metastore (right?), why not just implement a Hive sink and make it available to a larger community? I'd definitely like to see a Hive Sink, and would definitely prioritize that and then if explicitly required, add HCatalog support to that - this way it is useful to people who use Hive as well. In fact, there already is a Hive Sink jira here:https://issues.apache.org/jira/browse/FLUME-1008.  
> > 
> > I am a +1 for a Hive Sink, so please take a look.
> > 
> > 
> > Thanks,
> > Hari
> > 
> > 
> > -- 
> > Hari Shreedharan
> > 
> > 
> > On Monday, October 29, 2012 at 4:37 PM, Roshan Naik wrote:
> > 
> > > I am in the process of investigating the possibility of creating  a HCatalog sink for Flume which should be able to handle such use cases. For your use case it could be thought of as a Hive sink. Goal is basically as follows... it would allow multiple flume agents to pump logs into a hive tables. That would make the data query-able without additional manual steps. Data will get added periodically in the form of new partitions to Hive. You would not have to deal with temporary files or manual addition of data into hive.  
> > > 
> > > -roshan
> > > 
> > > 
> > > 
> > > On Sun, Oct 28, 2012 at 5:45 PM, Ralph Goers <ralph.goers@dslextreme.com (mailto:ralph.goers@dslextreme.com)> wrote:
> > > > Since you ask...
> > > > 
> > > > In our environment our primary concern is audit logs - have have to audit banking transactions as well as changes administrators make. We have a legacy system that needed to be integrated that had records in a form different than what we want stored. We also need to allow administrators to view events as close to real time as possible. Plus we have to aggregate data across 2 data centers. Although we are currently not including web server access logs we plan to integrate them in over time.  We also have requirements from our security team to pass events for their use to ArcSight. 
> > > > 
> > > > 1. We have a "log extractor" that receives legacy events as they occur and converts them into our new format and passes them to Flume. All new applications use the Log4j 2 Flume Appender to get data to Flume. 
> > > > 2. Flume passes the data to ArcSight for our security team's use.
> > > > 3. We wrote a Flume to Cassandra Sink.
> > > > 4. We wrote our own REST query services to retrieve the data from Cassandra.
> > > > 5. Since we are using DataStax Enterprise version of Cassandra we have also set up "Analytic" nodes that run Hadoop on top of Cassandra. This allows the data to be accessed via normal Hadoop tools for data analytics.
> > > > 6. We have written our own reporting UI component in our Administrative Platform to allow administrators to view activities in real time or to schedule background data collection so users can post process the data on their own.
> > > > 
> > > > We do not have anything to allow an admin to "tail" the log but it wouldn't be hard at all to write an application to accept Flume events via Avro and display the last "n" events as they arrive. 
> > > > 
> > > > One thing I should point out. We format our events in accordance with RFC 5424 and store that in the Flume event body. We then store all our individual pieces of audit event data in Flume headers fields.  The RFC 5424 message is what we send to ArcSight. The event fields and the compressed body are all stored in individual columns in Cassandra. 
> > > > 
> > > > Ralph
> > > > 
> > > > 
> > > > On Oct 26, 2012, at 2:06 PM, Ron Thielen wrote:
> > > > > [...]
> > > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > 
> 


Re: Syslog Infrastructure with Flume

Posted by Roshan Naik <ro...@hortonworks.com>.
Hari,
  Indeed, from the end user point of view both would accomplish roughly
the same. From the implementation standpoint, however, the Hive sink would
have to deal with HDFS (for data transfer) and the Hive metastore separately
to get the job done. The HCat sink implementation, on the other hand, would
accomplish the same using just the HCat APIs (for both aspects).  The HCat
sink implementation would be much simpler and cleaner, as it won't have to
reinvent things (even if we reuse code from the HDFS sink). The grunt work
of moving data and "transactionally committing" it into partitions is
handled by the HCat APIs.
-roshan



On Wed, Oct 31, 2012 at 12:22 PM, Hari Shreedharan <
hshreedharan@cloudera.com> wrote:

> Roshan,
>
> I am not a Hive/HCatalog pro, but I am just wondering why an HCatalog
> sink, rather than a Hive Sink? Hive is definitely very popular and would be
> well appreciated if we could get a Hive Sink written. Since HCatalog is
> supposed to be compatible with the Hive metastore (right?), why not just
> implement a Hive sink and make it available to a larger community? I'd
> definitely like to see a Hive Sink, and would definitely prioritize that
> and then if explicitly required, add HCatalog support to that - this way it
> is useful to people who use Hive as well. In fact, there already is a Hive
> Sink jira here:https://issues.apache.org/jira/browse/FLUME-1008.
>
> I am a +1 for a Hive Sink, so please take a look.
>
>
> Thanks,
> Hari
>
> --
> Hari Shreedharan
>
> On Monday, October 29, 2012 at 4:37 PM, Roshan Naik wrote:
>
> I am in the process of investigating the possibility of creating  a
> HCatalog sink for Flume which should be able to handle such use cases. For
> your use case it could be thought of as a Hive sink. Goal is basically as
> follows... it would allow multiple flume agents to pump logs into a hive
> tables. That would make the data query-able without additional manual
> steps. Data will get added periodically in the form of new partitions to
> Hive. You would not have to deal with temporary files or manual addition of
> data into hive.
>
> -roshan
>
>
>
> On Sun, Oct 28, 2012 at 5:45 PM, Ralph Goers <ra...@dslextreme.com> wrote:
>
> Since you ask...
>
> In our environment our primary concern is audit logs - have have to audit
> banking transactions as well as changes administrators make. We have a
> legacy system that needed to be integrated that had records in a form
> different than what we want stored. We also need to allow administrators to
> view events as close to real time as possible. Plus we have to aggregate
> data across 2 data centers. Although we are currently not including web
> server access logs we plan to integrate them in over time.  We also have
> requirements from our security team to pass events for their use to
> ArcSight.
>
> 1. We have a "log extractor" that receives legacy events as they occur and
> converts them into our new format and passes them to Flume. All new
> applications use the Log4j 2 Flume Appender to get data to Flume.
> 2. Flume passes the data to ArcSight for our security team's use.
> 3. We wrote a Flume to Cassandra Sink.
> 4. We wrote our own REST query services to retrieve the data from
> Cassandra.
> 5. Since we are using DataStax Enterprise version of Cassandra we have
> also set up "Analytic" nodes that run Hadoop on top of Cassandra. This
> allows the data to be accessed via normal Hadoop tools for data analytics.
> 6. We have written our own reporting UI component in our Administrative
> Platform to allow administrators to view activities in real time or to
> schedule background data collection so users can post process the data on
> their own.
>
> We do not have anything to allow an admin to "tail" the log but it
> wouldn't be hard at all to write an application to accept Flume events via
> Avro and display the last "n" events as they arrive.
>
> One thing I should point out. We format our events in accordance with RFC
> 5424 and store that in the Flume event body. We then store all our
> individual pieces of audit event data in Flume headers fields.  The RFC
> 5424 message is what we send to ArcSight. The event fields and the
> compressed body are all stored in individual columns in Cassandra.
>
> Ralph
>
>
> On Oct 26, 2012, at 2:06 PM, Ron Thielen wrote:
>
> [...]
>
>
>
>
>

Re: Syslog Infrastructure with Flume

Posted by Hari Shreedharan <hs...@cloudera.com>.
Roshan,

I am not a Hive/HCatalog pro, but I am just wondering why an HCatalog sink rather than a Hive Sink? Hive is definitely very popular, and it would be well appreciated if we could get a Hive Sink written. Since HCatalog is supposed to be compatible with the Hive metastore (right?), why not just implement a Hive sink and make it available to a larger community? I'd definitely like to see a Hive Sink, and would definitely prioritize that, and then, if explicitly required, add HCatalog support to it - this way it is useful to people who use Hive as well. In fact, there already is a Hive Sink jira here: https://issues.apache.org/jira/browse/FLUME-1008. 

I am a +1 for a Hive Sink, so please take a look.


Thanks,
Hari


-- 
Hari Shreedharan


On Monday, October 29, 2012 at 4:37 PM, Roshan Naik wrote:

> I am in the process of investigating the possibility of creating  a HCatalog sink for Flume which should be able to handle such use cases. For your use case it could be thought of as a Hive sink. Goal is basically as follows... it would allow multiple flume agents to pump logs into a hive tables. That would make the data query-able without additional manual steps. Data will get added periodically in the form of new partitions to Hive. You would not have to deal with temporary files or manual addition of data into hive.  
> 
> -roshan
> 
> 
> 
> On Sun, Oct 28, 2012 at 5:45 PM, Ralph Goers <ralph.goers@dslextreme.com (mailto:ralph.goers@dslextreme.com)> wrote:
> > Since you ask...
> > 
> > In our environment our primary concern is audit logs - have have to audit banking transactions as well as changes administrators make. We have a legacy system that needed to be integrated that had records in a form different than what we want stored. We also need to allow administrators to view events as close to real time as possible. Plus we have to aggregate data across 2 data centers. Although we are currently not including web server access logs we plan to integrate them in over time.  We also have requirements from our security team to pass events for their use to ArcSight. 
> > 
> > 1. We have a "log extractor" that receives legacy events as they occur and converts them into our new format and passes them to Flume. All new applications use the Log4j 2 Flume Appender to get data to Flume. 
> > 2. Flume passes the data to ArcSight for our security team's use.
> > 3. We wrote a Flume to Cassandra Sink.
> > 4. We wrote our own REST query services to retrieve the data from Cassandra.
> > 5. Since we are using DataStax Enterprise version of Cassandra we have also set up "Analytic" nodes that run Hadoop on top of Cassandra. This allows the data to be accessed via normal Hadoop tools for data analytics.
> > 6. We have written our own reporting UI component in our Administrative Platform to allow administrators to view activities in real time or to schedule background data collection so users can post process the data on their own.
> > 
> > We do not have anything to allow an admin to "tail" the log but it wouldn't be hard at all to write an application to accept Flume events via Avro and display the last "n" events as they arrive. 
> > 
> > One thing I should point out. We format our events in accordance with RFC 5424 and store that in the Flume event body. We then store all our individual pieces of audit event data in Flume headers fields.  The RFC 5424 message is what we send to ArcSight. The event fields and the compressed body are all stored in individual columns in Cassandra. 
> > 
> > Ralph
> > 
> > 
> > On Oct 26, 2012, at 2:06 PM, Ron Thielen wrote:
> > > [...]
> > > 
> > 
> > 
> > 
> 


Re: Syslog Infrastructure with Flume

Posted by Josh West <js...@one.com>.
Very cool.  HCatalog seems like a nice idea, as otherwise a lot of 
thought and planning must go into how one stores the data ... ensuring 
it can be read by all the different Apache Hadoop-related projects.  
E.g. if you store Flume data in an HDFS path like 
/something=partition1/foo=bar, you'll need to use special Pig libraries 
to load the partitions.

Keep us updated please!

On 10/30/2012 12:37 AM, Roshan Naik wrote:
> I am in the process of investigating the possibility of creating  a 
> HCatalog sink for Flume which should be able to handle such use cases. 
> For your use case it could be thought of as a Hive sink. Goal is 
> basically as follows... it would allow multiple flume agents to pump 
> logs into a hive tables. That would make the data query-able without 
> additional manual steps. Data will get added periodically in the form 
> of new partitions to Hive. You would not have to deal with temporary 
> files or manual addition of data into hive.
>
> -roshan
>
>
>
> On Sun, Oct 28, 2012 at 5:45 PM, Ralph Goers 
> <ralph.goers@dslextreme.com <ma...@dslextreme.com>> wrote:
>
>     Since you ask...
>
>     In our environment our primary concern is audit logs - have have
>     to audit banking transactions as well as changes administrators
>     make. We have a legacy system that needed to be integrated that
>     had records in a form different than what we want stored. We also
>     need to allow administrators to view events as close to real time
>     as possible. Plus we have to aggregate data across 2 data centers.
>     Although we are currently not including web server access logs we
>     plan to integrate them in over time.  We also have requirements
>     from our security team to pass events for their use to ArcSight.
>
>     1. We have a "log extractor" that receives legacy events as they
>     occur and converts them into our new format and passes them to
>     Flume. All new applications use the Log4j 2 Flume Appender to get
>     data to Flume.
>     2. Flume passes the data to ArcSight for our security team's use.
>     3. We wrote a Flume to Cassandra Sink.
>     4. We wrote our own REST query services to retrieve the data from
>     Cassandra.
>     5. Since we are using DataStax Enterprise version of Cassandra we
>     have also set up "Analytic" nodes that run Hadoop on top of
>     Cassandra. This allows the data to be accessed via normal Hadoop
>     tools for data analytics.
>     6. We have written our own reporting UI component in our
>     Administrative Platform to allow administrators to view activities
>     in real time or to schedule background data collection so users
>     can post process the data on their own.
>
>     We do not have anything to allow an admin to "tail" the log but it
>     wouldn't be hard at all to write an application to accept Flume
>     events via Avro and display the last "n" events as they arrive.
>
>     One thing I should point out. We format our events in accordance
>     with RFC 5424 and store that in the Flume event body. We then
>     store all our individual pieces of audit event data in Flume
>     headers fields.  The RFC 5424 message is what we send to ArcSight.
>     The event fields and the compressed body are all stored in
>     individual columns in Cassandra.
>
>     Ralph
>
>
>     On Oct 26, 2012, at 2:06 PM, Ron Thielen wrote:
>
>>     I am exactly where you are with this, except for the problem of
>>     my not having had time to write a serializer to address the
>>     Hostname Timestamp issue.Questions about the use of Flume in this
>>     manner seem to recur on a regular basis, so it seems a common use
>>     case.
>>     Sorry I cannot offer a solution since I am in your shoes at the
>>     moment, unfortunately looking at storing logs twice.
>>     Ron Thielen
>
>

-- 
Josh West
Lead Systems Administrator
One.com, jsw@one.com


Re: Syslog Infrastructure with Flume

Posted by Roshan Naik <ro...@hortonworks.com>.
I am in the process of investigating the possibility of creating an
HCatalog sink for Flume, which should be able to handle such use cases.
For your use case it could be thought of as a Hive sink. The goal is
basically as follows... it would allow multiple Flume agents to pump logs
into Hive tables. That would make the data query-able without additional
manual steps. Data would get added periodically in the form of new
partitions to Hive, so you would not have to deal with temporary files or
manual addition of data into Hive.
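
To make the "manual addition" concrete: the step such a sink would take over is roughly the sketch below, which registers a directory the HDFS sink has been bucketing into as a partition of an external table via the Hive JDBC driver. The table name, partition columns, driver class and JDBC URL are illustrative assumptions, not anything Flume or Hive mandates.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/**
 * Sketch of the manual "register this Flume bucket as a Hive partition" step
 * that a Hive/HCatalog sink would make unnecessary. Table name, partition
 * columns, driver class and JDBC URL are assumptions for illustration.
 */
public class AddSyslogPartition {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");  // pre-HiveServer2 JDBC driver
    Connection conn =
        DriverManager.getConnection("jdbc:hive://hive-host:10000/default", "", "");
    Statement stmt = conn.createStatement();
    // Register the directory the HDFS sink has been writing into as a partition
    // of an external table partitioned by (server STRING, facility STRING).
    stmt.execute("ALTER TABLE syslog ADD IF NOT EXISTS"
        + " PARTITION (server = 'web01', facility = 'daemon')"
        + " LOCATION '/flume/syslog/server=web01/facility=daemon'");
    stmt.close();
    conn.close();
  }
}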

-roshan




Re: Syslog Infrastructure with Flume

Posted by Ralph Goers <ra...@dslextreme.com>.
Since you ask...

In our environment our primary concern is audit logs - we have to audit banking transactions as well as changes administrators make. We have a legacy system that needed to be integrated, whose records are in a different form than what we want stored. We also need to allow administrators to view events as close to real time as possible, and we have to aggregate data across two data centers. Although we are currently not including web server access logs, we plan to integrate them over time.  We also have requirements from our security team to pass events to ArcSight for their use.

1. We have a "log extractor" that receives legacy events as they occur and converts them into our new format and passes them to Flume. All new applications use the Log4j 2 Flume Appender to get data to Flume.
2. Flume passes the data to ArcSight for our security team's use.
3. We wrote a Flume to Cassandra Sink (a minimal sink skeleton is sketched after this list).
4. We wrote our own REST query services to retrieve the data from Cassandra.
5. Since we are using the DataStax Enterprise version of Cassandra, we have also set up "Analytic" nodes that run Hadoop on top of Cassandra. This allows the data to be accessed via normal Hadoop tools for data analytics.
6. We have written our own reporting UI component in our Administrative Platform to allow administrators to view activities in real time or to schedule background data collection so users can post process the data on their own.
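
A minimal skeleton of a custom Flume NG sink of the kind item 3 refers to is sketched below. The class name is hypothetical and the actual Cassandra write is left as a placeholder comment, so this shows only the Sink API plumbing such a sink sits on, not the real implementation.

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

/** Skeleton of a custom Flume NG sink; the Cassandra write is a placeholder. */
public class CassandraAuditSink extends AbstractSink implements Configurable {

  @Override
  public void configure(Context context) {
    // Read keyspace, column family and cluster hosts from the agent configuration here.
  }

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction txn = channel.getTransaction();
    txn.begin();
    try {
      Event event = channel.take();
      if (event == null) {
        txn.commit();
        return Status.BACKOFF;   // channel is empty, back off for a while
      }
      // Placeholder: write event.getHeaders() as individual columns and the
      // (compressed) event.getBody() as another column in Cassandra.
      txn.commit();
      return Status.READY;
    } catch (Throwable t) {
      txn.rollback();
      throw new EventDeliveryException("Failed to deliver event", t);
    } finally {
      txn.close();
    }
  }
}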

We do not have anything to allow an admin to "tail" the log but it wouldn't be hard at all to write an application to accept Flume events via Avro and display the last "n" events as they arrive.
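
A rough sketch of such a receiver is below, assuming the AvroSourceProtocol interface shipped with the flume-ng-sdk plus Avro's Netty IPC server; the port (4545) and the size of the in-memory buffer are arbitrary choices. An extra Avro sink on the syslog agent (for example behind a replicating channel selector) could then be pointed at it for a crude live tail.

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

import org.apache.avro.ipc.NettyServer;
import org.apache.avro.ipc.Server;
import org.apache.avro.ipc.specific.SpecificResponder;
import org.apache.flume.source.avro.AvroFlumeEvent;
import org.apache.flume.source.avro.AvroSourceProtocol;
import org.apache.flume.source.avro.Status;

/** Crude "live tail" receiver: point an Avro sink at this host and port. */
public class AvroTailServer implements AvroSourceProtocol {

  private static final int MAX_EVENTS = 100;            // keep only the last "n" events
  private final Queue<String> lastEvents = new ArrayDeque<String>();

  @Override
  public synchronized Status append(AvroFlumeEvent event) {
    ByteBuffer body = event.getBody();
    byte[] bytes = new byte[body.remaining()];
    body.duplicate().get(bytes);
    String line = new String(bytes, Charset.forName("UTF-8"));
    if (lastEvents.size() >= MAX_EVENTS) {
      lastEvents.poll();                                 // drop the oldest buffered event
    }
    lastEvents.add(line);
    System.out.println(line);                            // "tail -f" style output
    return Status.OK;
  }

  @Override
  public synchronized Status appendBatch(List<AvroFlumeEvent> events) {
    for (AvroFlumeEvent event : events) {
      append(event);
    }
    return Status.OK;
  }

  public static void main(String[] args) {
    Server server = new NettyServer(
        new SpecificResponder(AvroSourceProtocol.class, new AvroTailServer()),
        new InetSocketAddress(4545));
    server.start();                                      // serves until the JVM is stopped
  }
}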

One thing I should point out: we format our events in accordance with RFC 5424 and store that in the Flume event body. We then store all the individual pieces of audit event data in Flume header fields. The RFC 5424 message is what we send to ArcSight. The event fields and the compressed body are all stored in individual columns in Cassandra.
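
To illustrate that layout (not the production format itself): an event like that could be assembled with the Flume SDK's EventBuilder as below. The header names, structured-data element and message text are invented placeholders for the individual audit fields.

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

/** Illustrative only: RFC 5424 text in the body, audit fields in the headers. */
public class AuditEventExample {
  public static void main(String[] args) {
    Map<String, String> headers = new HashMap<String, String>();
    headers.put("host", "web01.example.com");   // hypothetical header names and values
    headers.put("eventType", "limit.update");
    headers.put("userId", "jdoe");

    // RFC 5424 layout: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID [SD-ELEMENT] MSG
    String rfc5424 = "<134>1 2012-10-28T17:45:00.000Z web01.example.com audit - ID47"
        + " [audit@32473 user=\"jdoe\" action=\"update\"] Administrator updated account limits";

    Event event = EventBuilder.withBody(rfc5424.getBytes(Charset.forName("UTF-8")), headers);
    System.out.println(event.getHeaders() + " | "
        + new String(event.getBody(), Charset.forName("UTF-8")));
  }
}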

Ralph




RE: Syslog Infrastructure with Flume

Posted by Ron Thielen <rt...@uchicago.edu>.
I am exactly where you are with this, except for the problem of my not having had time to write a serializer to address the Hostname/Timestamp issue.  Questions about the use of Flume in this manner seem to recur on a regular basis, so it seems to be a common use case.

Sorry I cannot offer a solution since I am in your shoes at the moment, unfortunately looking at storing logs twice.

Ron Thielen


From: Josh West [mailto:jsw@one.com]
Sent: Friday, October 26, 2012 9:05 AM
To: user@flume.apache.org
Subject: Syslog Infrastructure with Flume

Hey folks,

I've been experimenting with Flume for a few weeks now, trying to determine an approach to designing a reliable, highly available, scalable system to store logs from various sources, including syslog.  Ideally, this system will meet the following requirements:

  1.  Logs from syslog across all servers make their way into HDFS.
  2.  Logs are stored in HDFS in a manner that is available for post-processing:
     *   Example:  HIVE partitions - with HDFS Flume Sink, can set hdfs.path to hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
     *   Example:  Custom map reduce jobs...
  3.  Logs are stored in HDFS in a manner that is available for "reading" by sysadmins:
     *   During troubleshooting/firefighting, it is quite helpful to be able to login to a central logging system and tail -f / grep logs.
     *   We need to be able to see the logs "live".

Some folks may be wondering why are we choosing Flume for syslog, instead of something like Graylog2 or Logstash?  The answer is we will be using Flume + Hadoop for the transport and processing of other types of data in addition to syslog.  For example, webserver access logs for post processing and statistical analysis.  So, we would like to make the most use of the Hadoop cluster, keeping all logs of all types in one redundant/scalable solution.  Additionally, by keeping both syslog and webserver access logs in Hadoop/HDFS, we can begin to correlate events.

I've run into some snags while attempting to implement Flume in a manner that satisfies the requirements listed in the top of this message:

  1.  Logs to HDFS:
     *   I can indeed use the Flume HDFS Sink to reliably write logs into HDFS.
     *   Needed to write custom serializer to add Hostname and Timestamp fields back to syslog messages.
     *   See:  https://issues.apache.org/jira/browse/FLUME-1666
  2.  Logs to HDFS in manner available for reading/firefighting/troubleshooting by sysadmins:
     *   Flume HDFS Sink uses the BucketWriter for recording flume events to HDFS.
     *   Creates data files like:  /flume/syslog/server=%{host}/facility=%{Facility}/FlumeData.1350997160213
     *   Each file is format of FlumeData (or custom prefix) followed by . followed by unix timestamp of when the file was created.
        *   This is somewhat necessary... As you could have multiple Flume writers, writing to the same HDFS, the files cannot be opened by more than one writer.  So each writer should write to its own file.
     *   Latest file, currently being written to, is suffixed with ".tmp".
     *   This approach is not very sysadmin-friendly....
        *   You have to find the latest (ie. the .tmp files) and hadoop fs -tail -f /path/to/file.tmp
        *   Hadoop's fs -tail -f command first prints the entire file's contents, then begins tailing.

So the sum of it all is Flume is awesome for getting syslog (and other) data into HDFS for post processing, but not the best at getting it into HDFS in a sysadmin troubleshooting/firefighting format.  In an ideal world, I have syslog data coming into Flume via one transport (i.e. SyslogTcp Source or SyslogUDP Source) and being written into HDFS in a manner that is both post-processable and sysadmin-friendly, but it looks like this isn't going to happen.

I've thus investigated some alternative approaches to meet the requirements.  One of these approaches is to have all of my servers send their syslog messages to a central box running rsyslog.  Then, rsyslog would perform one of the following actions:

  1.  Write logs to HDFS directly using 'omhdfs' module, in a format that is both post-processable and sysadmin-friendly :-)
  2.  Write logs to HDFS directly using 'hadoop-fuse-dfs' utility, which has HDFS mounted as a filesystem.
  3.  Write logs to a local filesystem and also replicate logs into a flume agent, configured with a SyslogSource and HDFS sink.

Option #1 sounds great.  But unfortunately the 'omhdfs' module for rsyslog isn't working very well.  I've gotten it to login to Hadoop/HDFS but it has issues creating/appending files.  Additionally, templating is somewhat suspect (ie. making directories /syslog/someserver/somefacility dynamically).

Option #2 sounds reasonable, but either the HDFS FUSE module doesn't support append mode (yet) or rsyslog is trying to create/open the files in a manner not compliant with HDFS.  No surprise, as we all know HDFS can be somewhat "special" at times ;-)  It's actually no matter anyways... Trying to "tail -f" a file mounted via HDFS FUSE is rather useless.  The data is only and finally fed to the tail command once a full 64MB (or whatever you use) block size of data has been written to the file.  One would only be able to use "hadoop fs -tail -f /path/to/log" which has its own issues mentioned previously.

Option #3 would definitely work.  However, now I'm storing my logs twice.  Once on some local filesystem and another time in HDFS.  It works but it's not ideal as it's a waste of space.  And you've probably noticed from this email so far, I'd prefer the ideal solution :-)

Note:  Astute flumers would probably look at option #3 and recommend making use of the RollingFileSink in addition to the HDFSSink.  Unfortunately, the RollingFileSink doesn't support templated/dynamic directory creation like the HDFSSink with its hdfs.path setting of "hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}".

So what exactly am I asking here?  Well, I'd like to know first how others are doing this.  A hybrid of rsyslog and Flume?  All and only Flume?  With custom serializers/interceptors/sinks?  Or perhaps... how would you recommend I handle this?

Thanks for any and all thoughts you can provide.



--

Josh West

Lead Systems Administrator

One.com, jsw@one.com