Posted to user@chukwa.apache.org by Oded Rosen <od...@legolas-media.com> on 2010/03/01 15:43:28 UTC

Where to (physically) place the collector

I've searched the docs but couldn't find an answer.
We have some machines that produce data, and each one runs an adapter
(agent). Those machines are 'close' to each other, on the same physical
network.
The HDFS cluster runs on other machines, on another network. The two
networks are connected via the internet.
We want to know which is better network-wise: putting the collector on
the same network as the adapters, or on the same machine as the HDFS
namenode?
Option A (collector close to the adapters) seems better to me, because the
adapters send data to the collector ALL THE TIME, while the collector sends
data to HDFS only every 5 minutes, in a single write.

P.S. Our collector writes exactly what it receives from the adapters, so
data volume is not a consideration.

Any recommendations?
Thanks,
-- 
Oded

Re: Where to (physically) place the collector

Posted by Ariel Rabkin <as...@gmail.com>.

I think best practice is actually to have the collector on the
datanode[s]. There's no particular reason to funnel fs writes
through the namenode, since traffic to the nn is very small compared
to the overall volume being written.

The collector does not write only every five minutes; it writes
continuously. However, the filesystem doesn't promise that data will be
visible until a block boundary, which we force at least every five
minutes by closing files.
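
To make the distinction concrete, here is a rough sketch of a sink that
appends continuously but only closes (and so finalizes) its file on a timer.
It is not Chukwa's writer: RotatingSink, rotate_seconds and the .open/.done
suffixes are names made up for illustration, and local files stand in for
HDFS.

```python
import os
import tempfile


class RotatingSink:
    """Toy stand-in for a collector sink: appends continuously, closes on a timer.

    Hypothetical illustration only. This is not Chukwa's writer, and it uses
    local files rather than HDFS.
    """

    def __init__(self, directory, rotate_seconds=300):
        self.directory = directory
        self.rotate_seconds = rotate_seconds
        self.closed_files = []   # files that have been closed and renamed
        self._file = None
        self._opened_at = None
        self._count = 0

    def write(self, chunk, now):
        """Append one chunk at time `now` (in seconds), rotating first if due."""
        if self._file is not None and now - self._opened_at >= self.rotate_seconds:
            self.rotate()
        if self._file is None:
            path = os.path.join(self.directory, f"sink-{self._count}.open")
            self._file = open(path, "wb")
            self._opened_at = now
            self._count += 1
        self._file.write(chunk)

    def rotate(self):
        """Close the current file and mark it complete by renaming it."""
        if self._file is None:
            return
        self._file.close()
        done = self._file.name[: -len(".open")] + ".done"
        os.rename(self._file.name, done)
        self.closed_files.append(done)
        self._file = None


def demo():
    """Write 10 bytes every 60 s for 11 minutes; return sizes of closed files."""
    with tempfile.TemporaryDirectory() as tmp:
        sink = RotatingSink(tmp, rotate_seconds=300)
        for t in range(0, 660, 60):
            sink.write(b"x" * 10, now=t)
        sink.rotate()  # final close so the last file is finalized too
        return [os.path.getsize(p) for p in sink.closed_files]
```

Calling demo() returns [50, 50, 10]: bytes arrive every minute, but a reader
that only trusts closed files sees them in five-minute batches.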

--Ari

-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department