
Posted to user@flume.apache.org by "Guyle M. Taber" <gu...@gmtech.net> on 2017/06/23 18:50:37 UTC

Where to put the flume agents within a cluster

We have a 32 data node Hadoop cluster that receives incoming flume data via three data nodes acting as flume agents. We’re using round robin DNS entries to spread incoming flume data from various external architectures to the three flume agents on those three data nodes.

It seems like historically, the three data nodes that are the flume agents always have many more blocks than other data nodes, so I’m wondering what the best approach for placement of flume agents would be within a cluster. Should all data nodes in the cluster be flume nodes, or should the flume agent be placed on a name node or other non-data node?

Thanks for any guidance.

Re: Where to put the flume agents within a cluster

Posted by Chris Horrocks <ch...@hor.rocks>.

Hi
I've seen this before. If you put a flume agent on a worker node that is also running an HDFS data node, and assuming you are using flume to write into HDFS, you will find that the worker hosting the flume agent is the data node chosen to house the first replica of the data. This can skew the distribution of data across your workers (up to the HDFS balancer threshold, anyway) and affect locality. The cause is the bias that various hadoop services have toward choosing box-local instances of a service rather than engaging in expensive operations like copying data across the network. The simple fix is to add some edge nodes that run nothing but flume.
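To make the skew concrete, here is a rough simulation sketch. It assumes a heavily simplified placement rule (first replica on the writer's own node when the writer is a data node, remaining replicas on random other nodes) and ignores rack awareness and the balancer, so it is only an illustration, not how the namenode actually decides. The function name, parameters, and block count are made up for the example; the 32 nodes and 3 agents come from the original question.

```python
import random
from collections import Counter

def simulate_block_counts(num_nodes=32, agent_nodes=(0, 1, 2), num_blocks=30000,
                          replication=3, agents_on_datanodes=True, seed=0):
    """Count block replicas per data node under a simplified placement rule.

    Assumption (not real HDFS code): if the writer runs on a data node, the
    first replica lands on that node; otherwise it lands on a random node.
    The remaining replicas go to random distinct other nodes.
    """
    rng = random.Random(seed)
    nodes = list(range(num_nodes))
    counts = Counter({n: 0 for n in nodes})
    for _ in range(num_blocks):
        if agents_on_datanodes:
            # Writer-local bias: one of the agent nodes takes the first replica.
            first = rng.choice(agent_nodes)
        else:
            # Agent on an edge node: no local data node, so pick any node.
            first = rng.choice(nodes)
        others = rng.sample([n for n in nodes if n != first], replication - 1)
        for n in [first, *others]:
            counts[n] += 1
    return counts

if __name__ == "__main__":
    colocated = simulate_block_counts(agents_on_datanodes=True)
    edge = simulate_block_counts(agents_on_datanodes=False)
    print("agents on data nodes, blocks on node 0 vs node 31:",
          colocated[0], colocated[31])
    print("agents on edge nodes, blocks on node 0 vs node 31:",
          edge[0], edge[31])
```

In the co-located case the three agent nodes end up holding several times as many replicas as the rest, which matches what was described in the question; with the agents moved to edge nodes the counts come out roughly even.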
DNS RR seems a clunky way of load sharing, btw. If you can get the data into something like Kafka, the consumer group of the flume kafka source will distribute the topic's partitions evenly across the agents.
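As a rough picture of why a consumer group shares load more evenly than DNS RR, the sketch below hands out a topic's partitions to the agents in a group one at a time. This is a simplified round-robin illustration, not Kafka's actual assignor implementation, and the agent names and partition count are hypothetical.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partition ids over consumers in a simple round-robin fashion.

    Illustration only: real Kafka assignors differ in detail, but the effect
    is similar, in that each partition goes to exactly one group member and
    member loads differ by at most one partition.
    """
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    for partition in range(num_partitions):
        owner = ordered[partition % len(ordered)]
        assignment[owner].append(partition)
    return assignment

if __name__ == "__main__":
    # Hypothetical agent names; three agents as in the original setup.
    agents = ["flume-edge-1", "flume-edge-2", "flume-edge-3"]
    for agent, partitions in assign_partitions(12, agents).items():
        print(agent, partitions)
```

If an agent drops out, the group rebalances and the remaining agents pick up its partitions, which DNS RR cannot do on its own.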

On Fri, Jun 23, 2017 at 7:50 pm, Guyle M. Taber <gu...@gmtech.net> wrote:

> We have a 32 data node Hadoop cluster that receives incoming flume data via three data nodes acting as flume agents. We’re using round robin DNS entries to spread incoming flume data from various external architectures to the three flume agents on those three data nodes. It seems like historically, the three data nodes that are the flume agents always have many more blocks than other data nodes, so I’m wondering what the best approach for placement of flume agents would be within a cluster. Should all data nodes in the cluster be flume nodes, or should the flume agent be placed on a name node or other non-data node? Thanks for any guidance.