Posted to user@chukwa.apache.org by Ariel Rabkin <as...@gmail.com> on 2009/06/30 00:47:51 UTC

Re: chukwa suitability for collecting data from multiple datanodes to HDFS

Chukwa does indeed aim to solve the problem you have. Reliability is a
goal for us, but not our highest priority. The current implementation
falls short in a few places. These are known bugs, and shouldn't be
hard to fix -- they just haven't been at the top of the list.

As to reliability mechanisms:

1) Each chunk of sent data carries a sequence ID, so you can tell which
file it came from, and which part of the file it is. This allows
post-facto detection of data loss or duplication (there's a rough
sketch of that check below the list).

2) Agents checkpoint themselves periodically, so if an agent crashes,
some data might be duplicated on restart, but none will be lost (a toy
version of the checkpoint is sketched below as well).

3) There are a few windows in which a collector can crash while data
hasn't yet been committed to stable storage. This is mostly a
consequence of not having flush() in HDFS yet. It is possible to hack
around that -- for instance by rotating the sink files frequently, as
sketched below. Fixing this properly will require writing some code,
but no major architectural change.
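
To make (1) concrete, here is roughly what a post-facto audit can look
like. The ChunkRecord class and its fields are illustrative, not the
actual Chukwa Chunk API; the point is just that knowing (stream, start
offset, end offset) for every chunk that reached HDFS is enough to
spot holes and overlaps:

    import java.util.*;

    public class SeqAudit {
        static class ChunkRecord {
            final String stream;  // which file/stream the chunk came from
            final long start;     // offset of the first byte in the chunk
            final long end;       // offset just past the last byte
            ChunkRecord(String stream, long start, long end) {
                this.stream = stream; this.start = start; this.end = end;
            }
        }

        // Walk one stream's chunks in offset order and report gaps
        // (lost data) and overlaps (duplicated data).
        static void audit(String stream, List<ChunkRecord> chunks) {
            Collections.sort(chunks, new Comparator<ChunkRecord>() {
                public int compare(ChunkRecord a, ChunkRecord b) {
                    return Long.compare(a.start, b.start);
                }
            });
            long expected = 0;
            for (ChunkRecord c : chunks) {
                if (c.start > expected)
                    System.out.println(stream + ": missing bytes "
                        + expected + ".." + c.start);
                else if (c.start < expected)
                    System.out.println(stream + ": duplicate bytes "
                        + c.start + ".." + Math.min(expected, c.end));
                expected = Math.max(expected, c.end);
            }
        }

        public static void main(String[] args) {
            List<ChunkRecord> chunks = new ArrayList<ChunkRecord>();
            chunks.add(new ChunkRecord("syslog", 0, 100));
            chunks.add(new ChunkRecord("syslog", 80, 200));   // 80..100 duplicated
            chunks.add(new ChunkRecord("syslog", 300, 400));  // 200..300 lost
            audit("syslog", chunks);
        }
    }

On the sample data in main() this prints one "duplicate" line and one
"missing" line, which is exactly the after-the-fact report you'd want.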
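
For (2), the checkpoint is conceptually just "how many bytes has each
adaptor durably shipped". A toy version, with a made-up file format
and class names (the real agent checkpoint differs, but the recovery
semantics -- duplicates possible, loss not -- are the same):

    import java.io.*;
    import java.util.*;

    public class ToyCheckpoint {
        private final File file;
        private final Map<String, Long> offsets = new HashMap<String, Long>();

        ToyCheckpoint(File file) { this.file = file; }

        // Record that this adaptor has durably shipped bytes [0, offset).
        synchronized void update(String adaptorId, long offset) {
            offsets.put(adaptorId, offset);
        }

        // Called from a timer every few seconds.
        synchronized void write() throws IOException {
            File tmp = new File(file.getPath() + ".tmp");
            PrintWriter out = new PrintWriter(new FileWriter(tmp));
            for (Map.Entry<String, Long> e : offsets.entrySet())
                out.println(e.getKey() + "\t" + e.getValue());
            out.close();
            tmp.renameTo(file);  // swap in whole, so a crash never leaves half a checkpoint
        }

        // On restart, resume each adaptor at its checkpointed offset.
        // Anything sent after the last checkpoint gets sent again:
        // duplicates, but no loss.
        synchronized void load() throws IOException {
            if (!file.exists()) return;
            BufferedReader in = new BufferedReader(new FileReader(file));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                offsets.put(parts[0], Long.parseLong(parts[1]));
            }
            in.close();
        }

        synchronized long resumeOffset(String adaptorId) {
            Long o = offsets.get(adaptorId);
            return o == null ? 0L : o.longValue();
        }

        public static void main(String[] args) throws IOException {
            ToyCheckpoint cp = new ToyCheckpoint(new File("/tmp/agent.chkpt"));
            cp.load();
            cp.update("adaptor_1", 4096L);
            cp.write();
            System.out.println("adaptor_1 resumes at " + cp.resumeOffset("adaptor_1"));
        }
    }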
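
For (3), one way to hack around the missing flush() is to keep sink
files short-lived: close and rotate them every few minutes, since it's
close() that makes the data durable. A sketch against the plain Hadoop
FileSystem API -- the path names, the ".done" convention, and the
rotation policy here are my own illustration, not what the collector
actually does:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RotatingSink {
        private final FileSystem fs;
        private final Path dir;
        private FSDataOutputStream out;
        private Path current;

        public RotatingSink(Configuration conf, String dirName) throws IOException {
            this.fs = FileSystem.get(conf);
            this.dir = new Path(dirName);
            openNewFile();
        }

        private void openNewFile() throws IOException {
            current = new Path(dir, "sink-" + System.currentTimeMillis() + ".chukwa");
            out = fs.create(current);
        }

        public synchronized void append(byte[] data) throws IOException {
            out.write(data);
        }

        // Called from a timer. Until the file is closed its contents
        // aren't guaranteed to be on stable storage, so keep the
        // rotation interval (and the window of possible loss) small.
        public synchronized void rotate() throws IOException {
            out.close();
            fs.rename(current, new Path(current.toString() + ".done"));  // safe to process
            openNewFile();
        }
    }

With something like this, what a collector crash can lose is bounded
by the rotation interval, which is why closing the gap is code work
rather than an architectural change.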

On Mon, Jun 29, 2009 at 3:20 PM, Ken White <ke...@mailcatch.com> wrote:
> Hi all!
>
> I am trying to use Hadoop for processing a large amount of data (ad network
> statistics). This data is gathered on multiple nodes. Some of it is precious -
> it must not be lost before it is processed. Since HDFS doesn't work well with
> many small file uploads, we are looking for a reliable (fault-tolerant)
> solution for uploading this data to HDFS as it is generated.
>
> If I understand correctly, Chukwa does just that? The main purpose is
> different, but it collects data from multiple nodes and writes it to HDFS,
> which is basically the same.
>



-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department