Posted to user@flume.apache.org by Alain RODRIGUEZ <ar...@gmail.com> on 2012/02/02 15:01:36 UTC

Logs used as Flume Sources and real-time analytics

Hi,

I'm new to Flume and I'd like to use it to get a stable flow of data into
my database (so I can handle rush hours by delaying the database writes,
without introducing any timeout or latency for the user).

My questions are:

What is the best way to create the log file that will be used as a source
for Flume?

Our production environment runs Apache servers and PHP scripts.
I can't just use the access log because some of the information is stored
in the session, so I need to build a custom source.
Another point is that writing to a file seems primitive and not very
efficient, since it hits the disk instead of memory for every event I
store (many events every second).

How can I use this system (as Facebook does with Scribe) to perform
real-time analytics?

I'm open to hearing about HDFS, HBase or whatever else could help me reach
my goals, which are a stable flow to the database and near real-time
analytics (seconds to minutes).

Thanks for your help.

Alain

Re: Logs used as Flume Sources and real-time analytics

Posted by Michal Taborsky <mi...@nrholding.com>.
I suppose you could use tailDir as you describe, but of course you have to
test it in your environment. As for the ID, we just generate it ourselves
in the app; we include some information such as the timestamp and the IP
address of the originating host as part of the ID, together with a random
part. But I think Flume also generates a unique ID for you; we just don't
use it, since we use the raw data format.
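
Just to illustrate the idea, here is a rough sketch of such an ID in Java
(our real code is different, and the class and method names below are
invented for the example):

import java.net.InetAddress;
import java.security.SecureRandom;

public class EventIds {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Combine a millisecond timestamp, the originating host's IP address
    // and a random part into one ID string.
    public static String next() throws Exception {
        long ts = System.currentTimeMillis();
        String host = InetAddress.getLocalHost().getHostAddress();
        return ts + "-" + host + "-" + Long.toHexString(RANDOM.nextLong());
    }
}

Any scheme along these lines makes collisions between hosts extremely
unlikely, which is all the deduplication needs.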

--
Michal Táborský
chief systems architect
Netretail Holding, BV
nrholding.com <http://www.nrholding.com>





Re: Logs used as Flume Sources and real-time analytics

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
What about using a "tailDir" source? Instead of rotating files, we could
just write a new log file per unit of time (let's say every minute), named
like "mylog_201202091522.log".

That way, I guess that if an agent goes down, when it comes back up it
will just reload all the logs kept in this folder, and since
Cassandra/HBase handle duplicate entries, we will end up with all the data
exactly once.

This solution is quite heavy, but we could imagine cleaning the old logs
out of the folder every hour or day, and Flume shouldn't crash often (I
hope :D). Knowing the approximate server downtime, we could also choose
manually which log files to reload by simply putting them back into the
tailed directory.

Btw, "each event has a unique generated ID" --> do you generate it
yourself, or does Flume do it for you?

Alain


Re: Logs used as Flume Sources and real-time analytics

Posted by Michal Taborsky <mi...@nrholding.com>.
Hi Alain,

each event has a unique generated ID. We use HBase for long-term storage
of the events, using the ID as the key, so the loading process does the
deduplication automatically. I have no experience with Cassandra.
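
To give a concrete picture, loading keyed by the event ID looks roughly
like this with the HBase Java client (the table and column names below are
made up; this is a sketch, not our actual loader):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class EventLoader {
    // Writing an event under its unique ID means that re-loading the same
    // event just overwrites the same row, so duplicates vanish on their own.
    public static void store(String eventId, String jsonPayload) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");    // example table name
        Put put = new Put(Bytes.toBytes(eventId));    // row key = event ID
        put.add(Bytes.toBytes("d"), Bytes.toBytes("json"), Bytes.toBytes(jsonPayload));
        table.put(put);
        table.close();
    }
}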

As for the problem of rotating logs, we do not have a solution for that.
The only case where data loss can occur is when the rotation happens while
the Flume agent is down. We check the health of the Flume agent
frequently, so the risk of this happening is acceptable for us (again, the
event collection is not business critical for us). If you need 100% event
delivery, file tailing might not be for you.

--
Michal Táborský
chief systems architect
Netretail Holding, BV
nrholding.com <http://www.nrholding.com>





Re: Logs used as Flume Sources and real-time analytics

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi again :).

1 - How do we replay all the logs after a crash, given that we rotate the
logs? I mean, we tail a specific file, so how do we tell the tail to also
replay the old logs that have already been rotated?

2 - To avoid duplicates, can't we just use checkpoints inserted into the
logs, let's say every hour? When something crashes I would just have to
erase every entry that comes after the checkpoint time and replay the logs
from that checkpoint. Is this a bad idea?

3 - What should I use to store the checkpoint separately from the real
log? Are decorators made for this kind of work?

4 - I would like to use Cassandra for log storage. I saw some plugins
providing Cassandra sinks, but I would like to store the data in a custom
way. How can I do that? Do I need to build a custom plugin/sink?

5 - My business process also uses my Cassandra DB (without Flume, directly
via Thrift), so how can I ensure that log writing won't overload my
database and introduce latency into my business process? I mean, is there
a way to manage the throughput sent by Flume's tails and slow them down
when my Cassandra cluster is overloaded?

I hope I'm not flooding you with all these (stupid?) questions.

Alain


Re: Logs used as Flume Sources and real-time analytics

Posted by alo alt <wg...@googlemail.com>.
Yes, I fully agree. Tailing is a useful mechanism, but since we also have
to deliver on time and reliably, the core team decided to remove that
feature. In your case tail makes sense; in a session-based application
(bank, travel, car rental, pizza service and so on) it does not, because
one missing token or session can do harm.

For FlumeNG there is an exec source (exec-agent). With it you can easily
run a tail command, but then you have to make sure yourself that
everything keeps running well. For new users I would point to FlumeNG,
because Flume and FlumeNG are not compatible; FlumeNG is a complete
rewrite. I think once FlumeNG releases its next milestone, support for the
old Flume will slowly wind down.
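
To give a concrete picture, a FlumeNG configuration along these lines
would look roughly like the sketch below (agent, file and path names are
invented, and property details may differ between FlumeNG milestones):

agent.sources = taillog
agent.channels = mem
agent.sinks = hdfsSink

# exec source: keep a "tail -F" running on the application log
agent.sources.taillog.type = exec
agent.sources.taillog.command = tail -F /var/log/app/events.log
agent.sources.taillog.channels = mem

# buffer events in memory between source and sink
agent.channels.mem.type = memory

# deliver the events to HDFS
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = mem
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events

Of course, if the tail process or the agent dies you are back to the
replay/duplicate question, so the sink side still has to tolerate
duplicates.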

best, and thanks for the discussion,
 Alex 

--
Alexander Lorenz
http://mapredit.blogspot.com



Re: Logs used as Flume Sources and real-time analytics

Posted by Michal Taborsky <mi...@nrholding.com>.
Hi Alex,

truth be told, I am quite satisfied with the file tailing and I'll try to
explain why I like it. The main reason is that, at least for us, the web
application itself is business critical while the event collection is not.
Writing to a plain file is something that rarely fails, and if it does
fail, it fails quickly and in a controlled fashion. But piping to a Flume
agent, for example? How sure can I be that the write will work all the
time or fail immediately? That it will not wait for some timeout or other?
Or throw some unexpected error and bring down the app?

The other aspect is simple development and debugging. Any developer can
read a plain file and check whether the data he's writing is correct, but
with any more sophisticated method you either need a more complicated
testing environment or redirection switches that write to files in
development and to Flume in testing and production, which complicates
things.

--
Michal Táborský
chief systems architect
Netretail Holding, BV
nrholding.com <http://www.nrholding.com>





Re: Logs used as Flume Sources and real-time analytics

Posted by alo alt <wg...@googlemail.com>.
Hi,

Sorry for pitching in, but FlumeNG will not support tailing sources,
because they cause a lot of problems. The first and worst problem is the
marker in the tailed file: if the agent, the server, or the collector
crashes, the marker is lost, so when you restart you get all the events
again. Sure, you can use append instead, but then you can lose events.

For an easy migration from Flume to FlumeNG, use sources which are
supported in NG, syslog for example; more sources can be found here:
https://cwiki.apache.org/FLUME/flume-ng.html

You could use Avro for the sessions and pipe directly to a local Flume
agent. Syslog with a buffering mode could also work. Also, FlumeNG now has
an HBase handler and Thrift.
Another idea for collecting the sessions could be
http://hadoop.apache.org/common/docs/r1.0.0/webhdfs.html , a REST API for
HDFS.
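
For what it's worth, a minimal sketch of creating a file through that REST
API from Java (host, port and path are placeholders; WebHDFS answers the
first PUT with a redirect to a datanode, where the data is then written):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsCreate {
    public static void create(byte[] data) throws Exception {
        // Step 1: ask the namenode where to write; do not follow the redirect.
        URL nn = new URL("http://namenode:50070/webhdfs/v1/sessions/part-0001?op=CREATE");
        HttpURLConnection ask = (HttpURLConnection) nn.openConnection();
        ask.setRequestMethod("PUT");
        ask.setInstanceFollowRedirects(false);
        String dataNodeUrl = ask.getHeaderField("Location");
        ask.disconnect();

        // Step 2: send the file content to the datanode URL we were given.
        HttpURLConnection put = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        put.setRequestMethod("PUT");
        put.setDoOutput(true);
        OutputStream out = put.getOutputStream();
        out.write(data);
        out.close();
        System.out.println("HTTP status: " + put.getResponseCode()); // 201 = created
    }
}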

- Alex


--
Alexander Lorenz
http://mapredit.blogspot.com



Re: Logs used as Flume Sources and real-time analytics

Posted by Michal Taborsky <mi...@nrholding.com>.
Hi Alain,

we rotate the log files quite often, and since Flume does not guarantee,
even in end-to-end mode, that an event will arrive exactly once, you have
to deal with duplicates in post-processing anyway. So we don't care if the
agent crashes and sends the entire file again, as long as it does not
happen too often, which it does not.

Michal


Re: Logs used as Flume Sources and real-time analytics

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Thank you for your answer, it helps me a lot to know I'm doing things the
right way.

I've got another question: how do you manage restarting the service after
a crash? I mean, you tail the log file, so if your server crashes or you
stop the tail for any reason, how do you avoid tailing all the logs from
the start again? How do you manage restarting from the exact point where
you left your tail process?

Thanks again for your help, I really appreciate it :-).

Alain


Re: Logs used as Flume Sources and real-time analytics

Posted by Michal Taborsky <mi...@nrholding.com>.
Hello Alain,

we are using Flume for probably the same purposes. We are writing
JSON-encoded event data to a flat file on every application server. Since
each application server writes only maybe tens of events per second, the
performance hit of writing to disk is negligible (and the events are
written to disk only after the content has been generated and sent to the
user, so there is no latency for the end user). This file is tailed by
Flume and delivered through collectors to HDFS. The collectors fork the
events to RabbitMQ as well. We have a Node.js application that picks up
these events and does some real-time analytics on them. The delay between
event origination and analytics is below 10 seconds, usually 1-3 seconds
in total.
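
As a rough sketch, the application-side logging amounts to appending one
JSON line per event (the file path and field names below are made up, and
our real code obviously lives in the application's own language):

import java.io.FileWriter;

public class EventLog {
    // Append one JSON-encoded event per line; this file is what Flume tails.
    public static synchronized void write(String sessionId, String eventType) throws Exception {
        String json = String.format(
            "{\"ts\":%d,\"session\":\"%s\",\"type\":\"%s\"}",
            System.currentTimeMillis(), sessionId, eventType);
        FileWriter out = new FileWriter("/var/log/app/events.log", true); // append mode
        try {
            out.write(json + "\n");
        } finally {
            out.close();
        }
    }
}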

Hope this helps.

--
Michal Táborský
chief systems architect
Netretail Holding, BV
nrholding.com <http://www.nrholding.com>



