Posted to user@flume.apache.org by Asim Zafir <as...@gmail.com> on 2014/02/05 22:22:34 UTC

distributed weblogs ingestion on HDFS via flume

Flume Users,


Here is the problem statement; I would be very much interested in your valuable input and feedback on the following:


*Assuming that we generate 200GB of logs PER DAY from 50 webservers*



The goal is to sync that data to an HDFS repository.





1) Do all the webservers in our case need to run a Flume agent?

2) Will all the webservers be acting as sources in our setup?

3) Can we sync webserver logs directly to the HDFS store, bypassing channels?

4) Do we have the option of syncing the weblogs directly to the HDFS store without having the webservers write locally? What is the best practice?

5) What setup would let Flume watch a local weblog data directory and sync the data to HDFS as soon as it arrives in that directory?

6) Do I need a dedicated Flume server for this setup?

7) If I use a memory-based channel and then sync to HDFS, do I need a dedicated server, or can I run those agents on the webservers themselves, provided there is enough memory? Or would it be recommended to place the configuration on a centralized Flume server and establish the sync from there?

8) How should we do capacity planning for a memory-based channel?

9) How should we do capacity planning for a file-based channel?



sincerely,

AZ

RE: distributed weblogs ingestion on HDFS via flume

Posted by Paul Chavez <pc...@verticalsearchworks.com>.
Hi Asim,

I have a similar use case that has been in production for about a year. We have 6 web servers, each sending about 15GB a day of web server logs to an 11-node Hadoop cluster. Additionally, those same hosts send another few GB a day of data from the web applications themselves to HDFS using Flume. I would say in total we send about 120GB a day to HDFS from those 6 hosts. The web servers each run a local Flume agent with an identical config. I will just describe our configuration, as I think it answers most of your questions.

The web logs are IIS log files that roll once a day and are written to continuously. We wanted a relatively low latency from log file to HDFS, so we use a scheduled task to create an incremental 'diff' file every minute and place it in a spool directory. From there a spoolDir source on the local agent processes the file. The same task that creates the diff files also cleans up ones that have been processed by Flume. This spoolDir source is just for weblogs and has no special routing; it uses a file channel and Avro sinks to get the events to HDFS. There are two sinks in a failover configuration using a sink group that send to Flume agents co-located on HDFS data nodes.
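
For readers who want to see what that looks like on disk, here is a minimal sketch of a web-server-side agent along the lines Paul describes (spoolDir source, file channel, two Avro sinks in a failover sink group). The agent name, hostnames, ports and paths are placeholders made up for illustration, not Paul's actual settings:

  # web-server agent: spool directory -> file channel -> failover Avro sinks
  agent.sources = weblog-spool
  agent.channels = weblog-ch
  agent.sinks = avro1 avro2
  agent.sinkgroups = g1

  agent.sources.weblog-spool.type = spooldir
  agent.sources.weblog-spool.spoolDir = /var/flume/spool/weblogs
  agent.sources.weblog-spool.channels = weblog-ch

  agent.channels.weblog-ch.type = file
  agent.channels.weblog-ch.checkpointDir = /var/flume/weblog-ch/checkpoint
  agent.channels.weblog-ch.dataDirs = /var/flume/weblog-ch/data

  agent.sinks.avro1.type = avro
  agent.sinks.avro1.channel = weblog-ch
  agent.sinks.avro1.hostname = collector1.example.com
  agent.sinks.avro1.port = 4545

  agent.sinks.avro2.type = avro
  agent.sinks.avro2.channel = weblog-ch
  agent.sinks.avro2.hostname = collector2.example.com
  agent.sinks.avro2.port = 4545

  # failover processor: prefer avro1, fall back to avro2 if it goes down
  agent.sinkgroups.g1.sinks = avro1 avro2
  agent.sinkgroups.g1.processor.type = failover
  agent.sinkgroups.g1.processor.priority.avro1 = 10
  agent.sinkgroups.g1.processor.priority.avro2 = 5
  agent.sinkgroups.g1.processor.maxpenalty = 10000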

The application logs are JSON data that is sent directly from the application to Flume via an httpSource. Again, the logs are sent to the same local agent. We have about a dozen separate log streams in total from our application, but we send all these events to the one httpSource. Using headers, we split them into 'high' and 'low' priority streams with a multiplexing channel selector. We don't really do any special handling of the high priority events; we just watch one of them a LOT closer than the other. ;)  These channels are also drained by Avro sinks and are sent to a different pair of Flume agents, again co-located with data nodes.
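
As a rough illustration of that kind of fan-out (again with made-up names, not Paul's real config), an httpSource can route events to 'high' and 'low' priority channels with a multiplexing channel selector keyed on a header, here assumed to be called 'priority':

  # http source fans out to priority channels via a multiplexing selector
  agent.sources = app-http
  agent.channels = ch-high ch-low

  agent.sources.app-http.type = http
  agent.sources.app-http.port = 8081
  agent.sources.app-http.channels = ch-high ch-low
  agent.sources.app-http.selector.type = multiplexing
  agent.sources.app-http.selector.header = priority
  agent.sources.app-http.selector.mapping.high = ch-high
  agent.sources.app-http.selector.mapping.low = ch-low
  agent.sources.app-http.selector.default = ch-low

  agent.channels.ch-high.type = file
  agent.channels.ch-high.checkpointDir = /var/flume/ch-high/checkpoint
  agent.channels.ch-high.dataDirs = /var/flume/ch-high/data
  agent.channels.ch-low.type = file
  agent.channels.ch-low.checkpointDir = /var/flume/ch-low/checkpoint
  agent.channels.ch-low.dataDirs = /var/flume/ch-low/data
  # (the Avro sinks that drain ch-high and ch-low toward the collectors are omitted)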

The Flume agents that run on the data nodes just have Avro sources, file channels and HDFS sinks. We have found that two HDFS sinks (without any sink groups) can be necessary to keep the channels from backing up in some cases. The number of file channels on these agents varies; some log streams are split into their own channels first, and many of them use a 'catch all' channel that uses tokenized paths on the sink to write the data to different locations in HDFS according to header values. The HDFS sinks all write DataStream files with format Text and bucket into Hive-friendly partitioned directories by date and hour. The JSON events are one line each and we use a custom Hive SerDe for the JSON data. We use Hive external table definitions to read the data, use Oozie to process every log stream hourly into Snappy-compressed internal Hive tables, and then drop the raw data after 8 days. We don't use Impala all that much as most of our workflows just crunch the data and push it back into a SQL DW for the data people.
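
To make the 'tokenized paths' idea concrete, here is a hedged sketch of a collector-side agent (Avro source, file channel, HDFS sink) that buckets by an event header plus date and hour. The header name 'log_type', the port and the paths are assumptions for the example, not Paul's real values:

  # collector agent on a data node: avro in -> file channel -> HDFS out
  agent.sources = avro-in
  agent.channels = catchall-ch
  agent.sinks = hdfs-out

  agent.sources.avro-in.type = avro
  agent.sources.avro-in.bind = 0.0.0.0
  agent.sources.avro-in.port = 4545
  agent.sources.avro-in.channels = catchall-ch

  agent.channels.catchall-ch.type = file
  agent.channels.catchall-ch.checkpointDir = /flume/catchall/checkpoint
  agent.channels.catchall-ch.dataDirs = /flume/catchall/data

  agent.sinks.hdfs-out.type = hdfs
  agent.sinks.hdfs-out.channel = catchall-ch
  # %{log_type} is resolved from an event header; date=%Y%m%d/hour=%H gives Hive-friendly buckets
  agent.sinks.hdfs-out.hdfs.path = /data/raw/%{log_type}/date=%Y%m%d/hour=%H
  agent.sinks.hdfs-out.hdfs.fileType = DataStream
  agent.sinks.hdfs-out.hdfs.writeFormat = Text
  agent.sinks.hdfs-out.hdfs.useLocalTimeStamp = true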

Here is a good start for capacity planning: https://cwiki.apache.org/confluence/display/FLUME/Flume%27s+Memory+Consumption.

We have gotten away with the default channel size (1 million events) so far without issue. We do try to separate the file channels onto different physical disks as much as we can to make the best use of our hardware.
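
For what that disk separation can look like, a minimal sketch (paths are placeholders, and the sources and sinks attached to these channels are omitted): give each file channel its own checkpoint and data directories on its own spindle, and leave capacity at the 1,000,000-event default unless you have a reason to change it.

  # two file channels, each kept on its own physical disk
  agent.channels = weblog-ch app-ch

  agent.channels.weblog-ch.type = file
  agent.channels.weblog-ch.checkpointDir = /disk1/flume/weblog/checkpoint
  agent.channels.weblog-ch.dataDirs = /disk1/flume/weblog/data
  agent.channels.weblog-ch.capacity = 1000000

  agent.channels.app-ch.type = file
  agent.channels.app-ch.checkpointDir = /disk2/flume/app/checkpoint
  agent.channels.app-ch.dataDirs = /disk2/flume/app/data
  agent.channels.app-ch.capacity = 1000000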

Hope that helps,
Paul Chavez



Re: distributed weblogs ingestion on HDFS via flume

Posted by ed <ed...@gmail.com>.
Hi Asim,

You would not use an Avro sink to write to HDFS; the Avro sink and the HDFS sink are both sinks, and it is the HDFS sink that actually writes to HDFS. You would only use an AvroSource and AvroSink if you need to create a multi-hop topology where you have to forward logs from one Flume agent to another. If you're using syslogd to do the forwarding then you would not need to use AvroSource and AvroSink. You can store the data in HDFS using any format you want. Avro is nice because Avro containers are splittable, the binary format is very compact, the data is more portable, and it's faster to serialize and deserialize than, say, reading in a line of text, splitting it, and type casting everything to the correct type within your MR job. A nice side benefit is that HUE will display deflate-compressed Avro files in a human-readable way via the File Browser plugin, which is pretty darn nice when you're working with your data.

I think AvroSource and AvroSink have slightly confusing names. They don't convert your data to Avro for you, other than the intermediate format they use to pass data between themselves via the Avro RPC mechanism. If you want to convert your data to Avro for storage in HDFS, you would do this using an event serializer on the HDFS sink. By default I think the HDFS sink will write your data out as SequenceFiles.

Flume has a built-in AvroEventSerializer, but you will need to write your own EventSerializer if you have a specific Avro schema you want to use (which you probably will). Otherwise SequenceFiles work well and have been around a long time, so they are supported by pretty much everything in the Hadoop ecosystem.
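
If I'm remembering the property names right, that choice lives on the HDFS sink itself; a rough sketch (the sink name and the rest of the agent are placeholders, and you would pick one of the two alternatives rather than set both):

  # alternative A (the default container): SequenceFile
  agent.sinks.hdfs-out.type = hdfs
  agent.sinks.hdfs-out.hdfs.fileType = SequenceFile

  # alternative B: write Avro containers with the built-in avro_event serializer
  # (this wraps events in Flume's generic event schema, not your own schema)
  agent.sinks.hdfs-out.hdfs.fileType = DataStream
  agent.sinks.hdfs-out.serializer = avro_event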

I've never used Impala, but Hive works just fine with deflate-compressed Avro files using the AvroSerde, and I think Snappy should work too.

~Ed




Re: distributed weblogs ingestion on HDFS via flume

Posted by Asim Zafir <as...@gmail.com>.
Ed,

thanks for the response! I was wondering, if we do use an Avro sink to HDFS, can I assume the resident file format in HDFS will be Avro? The reason I am asking is that Hive/Impala and MapReduce are supposed to have dependencies on file format and compression, as stated here:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_file_formats.html

I will be really interested to see how you have handled, or would suggest handling, these issues.

thanks

Asim



Re: distributed weblogs ingestion on HDFS via flume

Posted by ed <ed...@gmail.com>.
Hi Asim,


Here's some information that might be helpful based on my relatively new
experience with Flume:


*1) Do all the webservers in our case need to run a Flume agent?*


They could, but they don't necessarily have to.  For example, if you don't want to put a Flume agent on all your web servers, you could forward the logs using syslog to another server running a Flume agent that listens for the logs with the syslog source.  If you do want to put a Flume agent on your web servers, then you could send the logs to a local syslog source which would use the Avro sink to pass the logs to the Flume collection server, which would do the actual writing to HDFS; or you could use a file spooler source to read the logs from disk and then forward them to the collector (again using an Avro source and sink).


*Not Using Flume on the Webservers:*


[webserver1: apache -> syslogd] ==>

[webserver2: apache -> syslogd] ==> [flume collection server: flume syslog
source --> flume hdfs sink]

[webserver3: apache -> syslogd] ==>


*Using Flume on the Webservers Option1:*


[webserver1: apache -> syslogd -> flume syslog source -> flume avro sink]
==>

[webserver2: apache -> syslogd -> flume syslog source -> flume avro sink]
==>  [flume collection server: flume avro source --> flume hdfs sink]

[webserver3: apache -> syslogd -> flume syslog source -> flume avro sink]
==>


*Using Flume on Webservers Option2:*


[webserver1: apache -> filesystem -> flume file spooler source -> flume
avro sink] ==>

[webserver2: apache -> filesystem -> flume file spooler source -> flume
avro sink] ==> [flume collection server: flume avro source --> flume hdfs
sink]

[webserver3: apache -> filesystem -> flume file spooler source -> flume
avro sink] ==>


(By the way, there are probably other ways to do this, and you could even split out the collection tier from the storage tier, which are currently handled by the same final agent. A minimal config sketch for Option 1 follows below.)
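
To make Option 1 a bit more concrete, here is a minimal sketch of the per-webserver agent; the collector side would pair an Avro source with an HDFS sink, as in the diagrams. The port, hostname, paths and the choice of the TCP syslog source are assumptions for illustration only:

  # per-webserver agent (Option 1): local syslog in -> file channel -> avro out
  agent.sources = syslog-in
  agent.channels = ch1
  agent.sinks = to-collector

  agent.sources.syslog-in.type = syslogtcp
  agent.sources.syslog-in.host = 127.0.0.1
  agent.sources.syslog-in.port = 5140
  agent.sources.syslog-in.channels = ch1

  agent.channels.ch1.type = file
  agent.channels.ch1.checkpointDir = /var/flume/ch1/checkpoint
  agent.channels.ch1.dataDirs = /var/flume/ch1/data

  agent.sinks.to-collector.type = avro
  agent.sinks.to-collector.channel = ch1
  agent.sinks.to-collector.hostname = flume-collector.example.com
  agent.sinks.to-collector.port = 4545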


*2) Will all the webservers be acting as sources in our setup?*


They will be acting as a source in the general sense that you want to
ingest their logs.  However, they don't necessarily have to run a flume
agent if you have some other way to ship the logs to a listening flume
agent somewhere (most likely using syslog but we've also had success with
receiving logs via the netcat source).


*3) Can we sync webserver logs directly to the HDFS store, bypassing channels?*


Not sure exactly what you mean here, but you will need a Flume source and sink running (in this case an HDFS sink), and a sink always reads from a channel, so you can't skip the channel; likewise, you can't get the logs into HDFS using only a channel.


*4) Do we have the option of syncing the weblogs directly to the HDFS store without having the webservers write locally? What is the best practice?*


If, for example, you're using Apache, you could configure Apache to send the logs directly to syslog, which would forward them to a listening Flume syslog source on a remote server, which would then write the logs to HDFS using the HDFS sink over a memory channel.  In this case you avoid having the logs written to disk, but if one part of the data flow goes down (e.g., the Flume agent crashes) you will lose log data.  You could switch to a file channel, which is durable and would help minimize the risk of data loss.  If you don't care about potential data loss then the memory channel is much faster and a bit easier to set up.
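
A minimal sketch of that remote collector (syslog straight into Flume, memory channel, HDFS sink); the port, capacities, paths and roll interval below are placeholder values, not recommendations:

  # collector: syslog source -> memory channel -> HDFS sink (no local log files)
  agent.sources = syslog-in
  agent.channels = mem-ch
  agent.sinks = hdfs-out

  agent.sources.syslog-in.type = syslogtcp
  agent.sources.syslog-in.host = 0.0.0.0
  agent.sources.syslog-in.port = 5140
  agent.sources.syslog-in.channels = mem-ch

  agent.channels.mem-ch.type = memory
  agent.channels.mem-ch.capacity = 100000
  agent.channels.mem-ch.transactionCapacity = 1000

  agent.sinks.hdfs-out.type = hdfs
  agent.sinks.hdfs-out.channel = mem-ch
  agent.sinks.hdfs-out.hdfs.path = /data/weblogs/%Y-%m-%d/%H
  agent.sinks.hdfs-out.hdfs.fileType = DataStream
  agent.sinks.hdfs-out.hdfs.writeFormat = Text
  agent.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
  agent.sinks.hdfs-out.hdfs.rollInterval = 300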


*5) What setup would let Flume watch a local weblog data directory and sync the data to HDFS as soon as it arrives in that directory?*


You would want to use a file spooler (spooling directory) source to read the log directory and then send to a collector using an Avro sink/source pair.


*6) Do I need a dedicated Flume server for this setup?*


It depends on what else the flume server is doing.  Personally I think it's
much easier if you dedicate a box to the task as you don't have to worry
about resource contention and monitoring becomes easier.  In addition, if
you use the file channel you will want dedicated disks for that purpose.
 Note that I'm referring to your collector/storage tier.  Obviously if you
use a flume agent on the webserver it will not be a dedicated box but this
shouldn't be an issue as that agent is only responsible for collecting logs
off a single machine and forwarding them on (this blog post has some good
info on tuning and topology design:
https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1)


*7) If I use a memory-based channel and then sync to HDFS, do I need a dedicated server, or can I run those agents on the webservers themselves, provided there is enough memory? Or would it be recommended to place the configuration on a centralized Flume server and establish the sync from there?*


I would not recommend running Flume agents with the HDFS sink on all the webservers.  It seems much better to funnel the logs to one or more agents that write to HDFS, rather than have all 50 webservers writing to HDFS themselves.


*8) How should we do capacity planning for a memory-based channel?*


You have to decide how long you want to be able to hold data in the memory channel in the event a downstream agent goes down (or the HDFS sink gets backed up).  Once you have that value you need to figure out what your average event size is and the rate at which you are collecting events.  This will give you a rough idea.  I'm sure there is some per-event memory overhead as well (but I don't know the exact value for that).  If you're using Cloudera Manager you can monitor the memory channel usage directly from the Cloudera Manager interface, which is very useful.
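
As a rough worked example using the numbers from the original post, and an assumed average event size that you would need to measure yourself: 200GB/day spread over 50 webservers is about 4GB/day per server, or roughly 50KB/s.  If an average log line is around 500 bytes, that is on the order of 100 events/s per server, so buffering one hour of downstream outage would need a channel capacity of about 360,000 events and very roughly 180MB of event data held in that agent's heap, before any per-event and Flume core overhead.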


*9) How should we do capacity planning for a file-based channel?*


Assuming you're referring to heap memory, I think I saw in a different
thread that you need 32 bytes per event you want to store (the channel
capacity) + whatever Flume core will use. So if your channel capacity is 1
million events you will need ~32MB of heap space + 100-500MB for Flume
core.  You will of course need enough disk space to store the actual logs
themselves.


Best,


Ed




