Posted to user@flume.apache.org by Mohit Durgapal <du...@gmail.com> on 2014/04/03 15:20:15 UTC

Flume Configuration & topology approach

Hi,

We are setting up a Flume cluster, but we are facing some issues related to
heap size (out-of-memory errors). Is there a standard configuration for a
standard load?

If there is, could you suggest what it would be for the load stats given below?

Also, we are not sure what topology to go ahead with in our use case.

We basically have two web servers, which can generate logs at a rate of
2000 entries per second. Each entry is around 137 bytes in size.

Currently we use rsyslog (writing to a TCP port), to which a PHP script
writes these logs. We are running a local Flume agent on each web server;
these local agents listen on that TCP port and put the data directly into HDFS.

So localhost:tcpport is the Flume source and HDFS is the Flume sink.

I am confused between three approaches:

Approach 1: Web server, rsyslog & Flume agent on the same machine, and a Flume
collector running on the NameNode in the Hadoop cluster to collect the data
and dump it into HDFS.

Approach 2: Web server and rsyslog on the same machine, and a Flume collector
(listening on a remote port for events written by rsyslog on the web server)
running on the NameNode in the Hadoop cluster to collect the data and dump it
into HDFS.

Approach 3: Web server, rsyslog & Flume agent on the same machine, and all
agents writing directly to HDFS.

Also, we are using Hive, so we are writing directly into partitioned
directories, and we want an approach that allows us to write into hourly
partitions.
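For illustration, here is a rough sketch of what such a per-webserver agent
config could look like, with the HDFS sink path built from time escape
sequences so files land in hourly partition directories. The agent name, port,
and paths below are placeholders, not our actual values:

  # per-webserver agent: syslog TCP source -> channel -> HDFS sink (hourly partitions)
  agent1.sources = syslog-src
  agent1.channels = ch1
  agent1.sinks = hdfs-sink

  # rsyslog forwards log lines to this local TCP port
  agent1.sources.syslog-src.type = syslogtcp
  agent1.sources.syslog-src.host = 127.0.0.1
  agent1.sources.syslog-src.port = 5140
  agent1.sources.syslog-src.channels = ch1

  # channel in between (see the discussion below about memory vs. file channel)
  agent1.channels.ch1.type = memory
  agent1.channels.ch1.capacity = 100000
  agent1.channels.ch1.transactionCapacity = 1000

  # Hive-friendly hourly partition directories via %Y/%m/%d/%H escapes
  agent1.sinks.hdfs-sink.type = hdfs
  agent1.sinks.hdfs-sink.channel = ch1
  agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/user/hive/warehouse/weblogs/dt=%Y%m%d/hr=%H
  agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
  agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
  agent1.sinks.hdfs-sink.hdfs.rollInterval = 300
  agent1.sinks.hdfs-sink.hdfs.rollSize = 0
  agent1.sinks.hdfs-sink.hdfs.rollCount = 0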

I hope that's not too vague.



Regards
Mohit Durgapal

RE: Flume Configuration & topology approach

Posted by "Sun, Lining" <Li...@intuit.com>.
We implemented a memory-mapped channel that has memory channel performance and also file channel reliability. We plan to contribute it to open source in the near future.

Lining

From: Jeff Lord [mailto:jlord@cloudera.com]
Sent: Monday, April 07, 2014 8:48 AM
To: user@flume.apache.org
Subject: Re: Flume Configuration & topology approach

No. If you need to guarantee delivery of events please use a file channel.
https://blogs.apache.org/flume/entry/apache_flume_filechannel



Re: Flume Configuration & topology approach

Posted by Christopher Shannon <cs...@gmail.com>.
Got it. Thanx.
On Apr 7, 2014 12:39 PM, "Jeff Lord" <jl...@cloudera.com> wrote:

> No not at all. Flume's transactional model guarantees delivery between
> hops.
> https://blogs.apache.org/flume/entry/flume_ng_architecture
>
>
> On Mon, Apr 7, 2014 at 10:16 AM, Christopher Shannon <
> cshannon108@gmail.com> wrote:
>
>> So, this basically means that Flume's transactional model is also
>> unreliable. That would have to mean that the downstream agent is sending an
>> ack to the upstream agent before it actually persists the event.
>>
>> On Apr 7, 2014 10:48 AM, "Jeff Lord" <jl...@cloudera.com> wrote:
>> >
>> > No. If you need to guarantee delivery of events please use a file
>> channel.
>> > https://blogs.apache.org/flume/entry/apache_flume_filechannel
>> >
>> >
>> > On Mon, Apr 7, 2014 at 8:38 AM, Christopher Shannon <
>> cshannon108@gmail.com> wrote:
>> >>
>> >>
>> >> On Apr 7, 2014 9:35 AM, "Jeff Lord" <jl...@cloudera.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Apr 3, 2014 at 11:27 AM, Mohit Durgapal <
>> durgapalmohit@gmail.com> wrote:
>> >> >>
>> >> >> Hi Jeff,
>> >> >>
>> >> >> Yes, I am using the memory channel, and that's because I want it to
>> be more reliable and not miss any events/messages.
>> >> >> As I've read in flume documentation that the memory channel is fast
>> but there could be a chance of missing events if the in-memory buffer fills
>> up.
>> >> >
>> >> >
>> >> > Memory channel is not reliable, meaning if the flume agent goes down
>> or is restarted while there are events in the channel than this data will
>> be lost.
>> >> > For reliability please use the file channel.
>> >> >
>> >>
>> >> Jeff,
>> >>
>> >> I am using an upstream agent with a spooling directory source and a
>> memory channel, and the downstream agent uses a memory channel and an HDFS
>> sink. If my downstream agent goes down for any reason, are the entries lost
>> in the downstream agent's memory channel still preserved in the memory
>> channel / file directory of the upstream agent?
>> >>
>> >> All the best, Chris
>> >
>> >

Re: Flume Configuration & topology approach

Posted by Jeff Lord <jl...@cloudera.com>.
No, not at all. Flume's transactional model guarantees delivery between hops.
https://blogs.apache.org/flume/entry/flume_ng_architecture


On Mon, Apr 7, 2014 at 10:16 AM, Christopher Shannon
<cs...@gmail.com>wrote:

> So, this basically means that Flume's transactional model is also
> unreliable. That would have to mean that the downstream agent is sending an
> ack to the upstream agent before it actually persists the event.
>
> On Apr 7, 2014 10:48 AM, "Jeff Lord" <jl...@cloudera.com> wrote:
> >
> > No. If you need to guarantee delivery of events please use a file
> channel.
> > https://blogs.apache.org/flume/entry/apache_flume_filechannel
> >
> >
> > On Mon, Apr 7, 2014 at 8:38 AM, Christopher Shannon <
> cshannon108@gmail.com> wrote:
> >>
> >>
> >> On Apr 7, 2014 9:35 AM, "Jeff Lord" <jl...@cloudera.com> wrote:
> >> >
> >> >
> >> >
> >> >
> >> > On Thu, Apr 3, 2014 at 11:27 AM, Mohit Durgapal <
> durgapalmohit@gmail.com> wrote:
> >> >>
> >> >> Hi Jeff,
> >> >>
> >> >> Yes, I am using the memory channel, and that's because I want it to
> be more reliable and not miss any events/messages.
> >> >> As I've read in flume documentation that the memory channel is fast
> but there could be a chance of missing events if the in-memory buffer fills
> up.
> >> >
> >> >
> >> > Memory channel is not reliable, meaning if the flume agent goes down
> or is restarted while there are events in the channel than this data will
> be lost.
> >> > For reliability please use the file channel.
> >> >
> >>
> >> Jeff,
> >>
> >> I am using an upstream agent with a spooling directory source and a
> memory channel, and the downstream agent uses a memory channel and an HDFS
> sink. If my downstream agent goes down for any reason, are the entries lost
> in the downstream agent's memory channel still preserved in the memory
> channel / file directory of the upstream agent?
> >>
> >> All the best, Chris
> >
> >

Re: Flume Configuration & topology approach

Posted by Christopher Shannon <cs...@gmail.com>.
So, this basically means that Flume's transactional model is also
unreliable. That would have to mean that the downstream agent is sending an
ack to the upstream agent before it actually persists the event.

On Apr 7, 2014 10:48 AM, "Jeff Lord" <jl...@cloudera.com> wrote:
>
> No. If you need to guarantee delivery of events please use a file channel.
> https://blogs.apache.org/flume/entry/apache_flume_filechannel
>
>
> On Mon, Apr 7, 2014 at 8:38 AM, Christopher Shannon <cs...@gmail.com>
wrote:
>>
>>
>> On Apr 7, 2014 9:35 AM, "Jeff Lord" <jl...@cloudera.com> wrote:
>> >
>> >
>> >
>> >
>> > On Thu, Apr 3, 2014 at 11:27 AM, Mohit Durgapal <
durgapalmohit@gmail.com> wrote:
>> >>
>> >> Hi Jeff,
>> >>
>> >> Yes, I am using the memory channel, and that's because I want it to
be more reliable and not miss any events/messages.
>> >> As I've read in flume documentation that the memory channel is fast
but there could be a chance of missing events if the in-memory buffer fills
up.
>> >
>> >
>> > Memory channel is not reliable, meaning if the flume agent goes down
or is restarted while there are events in the channel than this data will
be lost.
>> > For reliability please use the file channel.
>> >
>>
>> Jeff,
>>
>> I am using an upstream agent with a spooling directory source and a
memory channel, and the downstream agent uses a memory channel and an HDFS
sink. If my downstream agent goes down for any reason, are the entries lost
in the downstream agent's memory channel still preserved in the memory
channel / file directory of the upstream agent?
>>
>> All the best, Chris
>
>

Re: Flume Configuration & topology approach

Posted by Jeff Lord <jl...@cloudera.com>.
No. If you need to guarantee delivery of events, please use a file channel.
https://blogs.apache.org/flume/entry/apache_flume_filechannel


On Mon, Apr 7, 2014 at 8:38 AM, Christopher Shannon
<cs...@gmail.com>wrote:

>
> On Apr 7, 2014 9:35 AM, "Jeff Lord" <jl...@cloudera.com> wrote:
> >
> >
> >
> >
> > On Thu, Apr 3, 2014 at 11:27 AM, Mohit Durgapal <du...@gmail.com>
> wrote:
> >>
> >> Hi Jeff,
> >>
> >> Yes, I am using the memory channel, and that's because I want it to be
> more reliable and not miss any events/messages.
> >> As I've read in flume documentation that the memory channel is fast but
> there could be a chance of missing events if the in-memory buffer fills up.
> >
> >
> > Memory channel is not reliable, meaning if the flume agent goes down or
> is restarted while there are events in the channel than this data will be
> lost.
> > For reliability please use the file channel.
> >
>
> Jeff,
>
> I am using an upstream agent with a spooling directory source and a memory
> channel, and the downstream agent uses a memory channel and an HDFS sink.
> If my downstream agent goes down for any reason, are the entries lost in
> the downstream agent's memory channel still preserved in the memory channel
> / file directory of the upstream agent?
>
> All the best, Chris
>

Re: Flume Configuration & topology approach

Posted by Christopher Shannon <cs...@gmail.com>.
On Apr 7, 2014 9:35 AM, "Jeff Lord" <jl...@cloudera.com> wrote:
>
>
>
>
> On Thu, Apr 3, 2014 at 11:27 AM, Mohit Durgapal <du...@gmail.com>
wrote:
>>
>> Hi Jeff,
>>
>> Yes, I am using the memory channel, and that's because I want it to be
more reliable and not miss any events/messages.
>> As I've read in flume documentation that the memory channel is fast but
there could be a chance of missing events if the in-memory buffer fills up.
>
>
> Memory channel is not reliable, meaning if the flume agent goes down or
is restarted while there are events in the channel than this data will be
lost.
> For reliability please use the file channel.
>

Jeff,

I am using an upstream agent with a spooling directory source and a memory
channel, and the downstream agent uses a memory channel and an HDFS sink.
If my downstream agent goes down for any reason, are the entries lost in
the downstream agent's memory channel still preserved in the memory channel
/ file directory of the upstream agent?

All the best, Chris

Re: Flume Configuration & topology approach

Posted by Jeff Lord <jl...@cloudera.com>.
On Thu, Apr 3, 2014 at 11:27 AM, Mohit Durgapal <du...@gmail.com>wrote:

> Hi Jeff,
>
> Yes, I am using the memory channel, and that's because I want it to be
> more reliable and not miss any events/messages.
> As I've read in flume documentation that the memory channel is fast but
> there could be a chance of missing events if the in-memory buffer fills up.
>

The memory channel is not reliable, meaning that if the Flume agent goes down
or is restarted while there are events in the channel, then this data will be lost.
For reliability, please use the file channel.
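As a hedged example, a minimal file channel definition could look like the
following; the agent name and directories are illustrative only:

  agent1.channels = fc1
  agent1.channels.fc1.type = file
  # checkpoint and data directories live on local disk and survive agent restarts
  agent1.channels.fc1.checkpointDir = /var/lib/flume/fc1/checkpoint
  agent1.channels.fc1.dataDirs = /var/lib/flume/fc1/data
  agent1.channels.fc1.capacity = 1000000
  agent1.channels.fc1.transactionCapacity = 1000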


>
> I am sorry for not mentioning the heap settings but I was running it with
> default vm settings which I increased later(to 1GB), after that I did not
> get the OOME. But then again I am not sure what is the right setting or
> maybe this is more like a hit n trial setting depending on our data load
> and environment.
>
>
> So as per your suggestion, I need to consider having two dedicated
> machines for running flume agents for two web servers and one for
> collector?  We have just started working on flume and I think your
> suggestion really makes sense because we are pretty sure that it is going
> to scale.
>
> Also, we are using rsyslog to log to a tcp port on localhost and flume
> listening to that tcp port on the same machine. Is that a good and reliable
> design? We tried the exec source with tail -F command on the log file but I
> guess that's not a very dependable(also mentioned in the flume
> documentation) way as it fetches all the rows from the file if flume
> restarts. Also, I am a little skeptical of the logrotate cron that rotates
> the logs as I did a few test and found a lot of problems with it.
>
>
Are you using the multiport syslog source?
This is definitely a better option than the exec source.
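A rough sketch of a multiport syslog TCP source, with purely illustrative
port numbers:

  agent1.sources = syslog-multi
  agent1.sources.syslog-multi.type = multiport_syslogtcp
  agent1.sources.syslog-multi.host = 0.0.0.0
  # one source can listen on several syslog ports at once
  agent1.sources.syslog-multi.ports = 10001 10002 10003
  agent1.sources.syslog-multi.channels = fc1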



> Where as rsyslog tcp option provides an option of dumping data to local
> disk if the tcp queue gets full. So even if flume goes down we don't lose
> the data.
>
> One more thing, I just installed cloudera manager a week back. But I have
> done all testing using flume from command line. I want to know if I could
> use cloudera manager to install and manage flume instances in the new
> machines. It'd be great to have one UI to manage all the agents and
> collector nodes and even change their configurations.
>
Absolutely. Cloudera Manager can be used to install, manage, and monitor
your Flume agents.



> So we are very much beginners in this field, any suggestions or
> recommendations are welcome. Thanks for your help :)
>
>
> Mohit
>
>
>

Re: Flume Configuration & topology approach

Posted by Mohit Durgapal <du...@gmail.com>.
Hi Jeff,

Yes, I am using the memory channel, and that's because I want it to be more
reliable and not miss any events/messages.
I've read in the Flume documentation that the memory channel is fast, but
that there is a chance of missing events if the in-memory buffer fills up.

I am sorry for not mentioning the heap settings. I was running it with the
default JVM settings, which I later increased (to 1 GB); after that I did not
get the OOME. But then again, I am not sure what the right setting is; maybe
this is more of a trial-and-error setting depending on our data load
and environment.
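For context, a back-of-the-envelope way to look at the sizing: at 2000 events/s
of roughly 137 bytes each, the event bodies amount to only about 270 KB/s per
server, so most of the heap pressure comes from how many events the channel is
allowed to buffer. A hedged sketch of bounding a memory channel, with made-up
numbers:

  agent1.channels.mc1.type = memory
  # ~100,000 buffered events of ~137 bytes is roughly 14 MB of bodies, plus per-event overhead
  agent1.channels.mc1.capacity = 100000
  agent1.channels.mc1.transactionCapacity = 1000
  # optional cap (in bytes) on memory used by event bodies; by default a percentage of the JVM heap
  agent1.channels.mc1.byteCapacity = 134217728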


So, as per your suggestion, I need to consider having two dedicated machines
running Flume agents for the two web servers, and one for a collector? We
have just started working on Flume, and I think your suggestion really makes
sense because we are pretty sure that it is going to have to scale.

Also, we are using rsyslog to log to a TCP port on localhost, with Flume
listening on that TCP port on the same machine. Is that a good and reliable
design? We tried the exec source with the tail -F command on the log file, but
I guess that's not a very dependable approach (as also mentioned in the Flume
documentation), since it fetches all the rows from the file again if Flume
restarts. I am also a little skeptical of the logrotate cron job that rotates
the logs, as I did a few tests and found a lot of problems with it.

The rsyslog TCP option, on the other hand, provides an option to dump data to
local disk if the TCP queue gets full, so even if Flume goes down we don't
lose the data.

One more thing: I installed Cloudera Manager a week back, but I have done all
my testing with Flume from the command line. I want to know whether I could
use Cloudera Manager to install and manage Flume instances on the new
machines. It would be great to have one UI to manage all the agents and
collector nodes and even change their configurations.

We are very much beginners in this field, so any suggestions or
recommendations are welcome. Thanks for your help :)


Mohit



On Thu, Apr 3, 2014 at 7:06 PM, Jeff Lord <jl...@cloudera.com> wrote:

> Mohit,
>
> Are you using memory channel? You mention you are getting OOME but you
> don't even say what the heap you are setting on the flume jvm is?
>
> Don't run an agent on the namenode. Occasionally you will see folks
> installing an agent on one of the datanodes in the cluster but its not
> typically recommended. It's fine to install the agent on your webserver but
> perhaps a more scaleable approach would be to dedicate two servers to flume
> agents. This will allow you to load balance your writes into the flume
> pipeline at some point. As you scale you will not want to have every agent
> writing to hdfs so at some point you may consider adding a collector tier
> that will aggregate the flow and reduce the connections going into your
> hdfs cluster.
>
> -Jeff
>
>

Re: Flume Configuration & topology approach

Posted by Jeff Lord <jl...@cloudera.com>.
Mohit,

Are you using the memory channel? You mention you are getting an OOME, but you
don't say what heap size you are setting on the Flume JVM.

Don't run an agent on the NameNode. Occasionally you will see folks
installing an agent on one of the DataNodes in the cluster, but it's not
typically recommended. It's fine to install the agent on your web server, but
perhaps a more scalable approach would be to dedicate two servers to Flume
agents. This will allow you to load balance your writes into the Flume
pipeline at some point. As you scale you will not want every agent writing to
HDFS, so at some point you may consider adding a collector tier that will
aggregate the flow and reduce the number of connections going into your
HDFS cluster.
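A hedged sketch of what that tiered layout could look like, with purely
illustrative host names, ports, and agent names: each webserver agent forwards
over Avro to a small collector tier, and only the collectors write to HDFS.
Channel definitions are omitted here (a file channel as sketched earlier would fit).

  # --- webserver agent: local source -> channel -> load-balanced Avro sinks ---
  web1.sinks = avro1 avro2
  web1.sinkgroups = g1
  web1.sinkgroups.g1.sinks = avro1 avro2
  web1.sinkgroups.g1.processor.type = load_balance

  web1.sinks.avro1.type = avro
  web1.sinks.avro1.channel = fc1
  web1.sinks.avro1.hostname = collector1.example.com
  web1.sinks.avro1.port = 4545

  web1.sinks.avro2.type = avro
  web1.sinks.avro2.channel = fc1
  web1.sinks.avro2.hostname = collector2.example.com
  web1.sinks.avro2.port = 4545

  # --- collector agent: Avro source -> channel -> HDFS sink ---
  col1.sources = avro-src
  col1.sinks = hdfs-sink

  col1.sources.avro-src.type = avro
  col1.sources.avro-src.bind = 0.0.0.0
  col1.sources.avro-src.port = 4545
  col1.sources.avro-src.channels = fc1

  col1.sinks.hdfs-sink.type = hdfs
  col1.sinks.hdfs-sink.channel = fc1
  col1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/weblogs/%Y%m%d/%H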

-Jeff


