Posted to user@chukwa.apache.org by Ken White <ke...@mailcatch.com> on 2009/06/30 00:20:43 UTC

chukwa suitability for collecting data from multiple datanodes to HDFS

Hi all!

I am trying to use Hadoop for processing large amounts of data (ad network 
statistics). This data is gathered on multiple nodes. Some of it is precious - 
it must not be lost before it is processed. Since HDFS doesn't work well with 
multiple small file uploads we are looking for a reliable (fault tolerant) 
solution to upload this data to HDFS as it is generated.

If I understand correctly, Chukwa does just that? The main purpose is 
different, but it collects data from multiple nodes and writes it to HDFS, 
which is basically the same. 

I was wondering what measures (if any) Chukwa takes to make sure no data 
is lost. (What happens if a collector dies - fire, flood, axe through the CPU, ...?) 
In other words, is reliability important for Chukwa or is it not a primary 
concern (because of its different usage)? 

I would appreciate other ideas on how to handle small incremental data 
uploads too, of course. I am a bit new in this field, but I guess I am not 
the first one to have this kind of problem. :)

Thank you!

Kind regards,

Ken


Re: chukwa suitability for collecting data from multiple datanodes to HDFS

Posted by Ariel Rabkin <as...@gmail.com>.
Alright.

We're in the final stages of rolling a 0.2 release.  You're probably
better off playing with trunk, though; I think the two are about
equally reliable, but there are a few bug fixes that are trunk-only.
And if you're doing any coding, it should be against trunk.

One warning. Documentation is still sparse. Don't hesitate to email us
if you have any questions.  Anything that's confusing to you, we need
to document or fix.

On Tue, Jun 30, 2009 at 1:09 AM, Ken White<ke...@mailcatch.com> wrote:

> Thank you for the pointers! I will take a look at the code and see if anything
> can be done to improve reliability.

Quite a bit can be done.  There are basically four outstanding issues I
know of that prevent us from saying "we have a really robust
pipeline."

- CHUKWA-284 [patch available] fixes one of them.
- The second is that agents assume that once a collector returns an OK
to a post, the data is the collector's problem; but collectors assume
they can still lose data after acknowledging it.
- The third, related issue is that the collector itself has only a
loose notion of when data is stable, because HDFS hasn't got a real
flush().
- The last is that we don't do duplicate detection where you want us
to. We handle duplicates at the very end of the pipeline, where we
update the database that drives the visualizations. But you probably
want to do it at the archiving stage. I just opened CHUKWA-338 for
this; there's a rough sketch of the idea right after this list.
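
For what it's worth, here is a rough, hypothetical sketch of what dedup at
the archiving stage could look like, keyed on a chunk's stream name and
sequence ID. None of this is existing Chukwa code (the class and method
names are made up), and the real thing would probably be a MapReduce job
keyed the same way rather than an in-memory set, but the keying idea is
the same:

import java.util.HashSet;
import java.util.Set;

// Hypothetical: not part of Chukwa. Illustrates dedup keyed on
// (stream name, sequence ID), the metadata every chunk already carries.
public class ArchiveDeduper {

  static final class ChunkKey {
    final String streamName;  // which file/stream the chunk came from
    final long seqId;         // offset / sequence ID within that stream

    ChunkKey(String streamName, long seqId) {
      this.streamName = streamName;
      this.seqId = seqId;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof ChunkKey)) return false;
      ChunkKey k = (ChunkKey) o;
      return seqId == k.seqId && streamName.equals(k.streamName);
    }

    @Override
    public int hashCode() {
      return 31 * streamName.hashCode() + (int) (seqId ^ (seqId >>> 32));
    }
  }

  private final Set<ChunkKey> seen = new HashSet<ChunkKey>();

  // Returns true if the chunk is new and should be archived,
  // false if it has been seen before and is a duplicate.
  public boolean shouldArchive(String streamName, long seqId) {
    return seen.add(new ChunkKey(streamName, seqId));
  }
}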

> If nothing else, I could be running two instances of chukwa on different nodes
> and write the important data to both of them.

Hrm.  That's actually both harder and easier than you think.
Harder, because collectors have pretty strong expectations about what
their input looks like: it needs to be a list of Chukwa chunks in the
same format used by ChukwaHttpSender, a format which I don't think has
any documentation more accurate than the code. And because the duplicate
detection won't do the right thing in that scenario.  (See above about
CHUKWA-338.)

Easier, because I think you can actually reuse ChukwaHttpSender.
Probably, the thing to do is to create a new class of Chukwa Connector
that contains multiple senders.
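
Something like this, purely as a hypothetical sketch (the class and the
little Sender interface below are made up for illustration; a real
version would implement Chukwa's Connector interface and wrap two actual
ChukwaHttpSender instances, whose exact API you would want to check
against the code):

import java.util.List;

// Hypothetical sketch of a connector that fans each batch out to two
// collectors, so losing either collector alone cannot lose data.
public class DualSenderConnector {

  // Stand-in for "send a batch of chunks to one collector".
  public interface Sender {
    void send(List<byte[]> chunks) throws Exception;
  }

  private final Sender primary;
  private final Sender secondary;

  public DualSenderConnector(Sender primary, Sender secondary) {
    this.primary = primary;
    this.secondary = secondary;
  }

  // Post the same batch to both collectors and only treat it as
  // committed once both have acknowledged it. If either send fails,
  // the caller retries the whole batch, so duplicates are possible --
  // which is exactly why dedup downstream (CHUKWA-338) matters.
  public void post(List<byte[]> chunks) throws Exception {
    primary.send(chunks);
    secondary.send(chunks);
  }
}

Whether you want "both must ack" or "either ack is enough" is a policy
choice: the first trades availability for durability, the second the
other way around.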

--Ari

-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department

Re: chukwa suitability for collecting data from multiple datanodes to HDFS

Posted by Ken White <ke...@mailcatch.com>.
>Chukwa does indeed aim to solve the problem you have. Reliability is a
>goal for us, but not our highest priority.  The current implementation
>falls short in a few places. These are known bugs, and shouldn't be
>hard to fix -- they just haven't been priorities.

Ari, thank you for the answer, I appreciate it! 

Great, that saves me a bit of coding; I was going to write the collectors 
myself. I will install Chukwa and try to set it up. 

>3) There are a few times when a collector can crash, and data hasn't
>yet been committed to stable storage. This is mostly a consequence of
>not having flush() in HDFS yet.  It is possible to hack around that.
>Fixing this will require writing some code, but not any major
>architectural change.

Thank you for the pointers! I will take a look at the code and see if anything 
can be done to improve reliability. 

If nothing else, I could be running two instances of chukwa on different nodes 
and write the important data to both of them. 

Thank you again, kind regards!

Ken

Re: chukwa suitability for collecting data from multiple datanodes to HDFS

Posted by Ariel Rabkin <as...@gmail.com>.
Chukwa does indeed aim to solve the problem you have. Reliability is a
goal for us, but not our highest priority.  The current implementation
falls short in a few places. These are known bugs, and shouldn't be
hard to fix -- they just haven't been priorities.

As to reliability mechanisms:

1) Chunks of sent data have sequence IDs, so you can tell which file a
chunk came from, and which part of the file it is. This allows post-facto
detection of data loss or duplication.

2) Agents checkpoint themselves periodically, so if they crash, data
might be duplicated, but won't be lost. (There's a rough sketch of this
idea right after this list.)

3) There are a few times when a collector can crash, and data hasn't
yet been committed to stable storage. This is mostly a consequence of
not having flush() in HDFS yet.  It is possible to hack around that.
Fixing this will require writing some code, but not any major
architectural change.
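
To make point 2 concrete, here is a toy sketch of the checkpoint idea
(this is not the actual agent code and all the names are invented): the
shipped offset only becomes durable when the checkpoint is written, so a
crash replays at most the data sent since the last checkpoint --
duplication, but no loss.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not Chukwa's agent implementation.
public class CheckpointingTailer {

  private final Path checkpointFile;
  private long shippedOffset;  // bytes of the log already handed to a collector

  public CheckpointingTailer(Path checkpointFile) throws IOException {
    this.checkpointFile = checkpointFile;
    // Resume from the last durable checkpoint, or from the start if none exists.
    if (Files.exists(checkpointFile)) {
      shippedOffset = Long.parseLong(
          new String(Files.readAllBytes(checkpointFile), StandardCharsets.UTF_8).trim());
    } else {
      shippedOffset = 0L;
    }
  }

  // Read and ship everything past the current offset, then advance it in memory.
  public void shipNewData(Path log) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(log.toFile(), "r")) {
      raf.seek(shippedOffset);
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = raf.read(buf)) > 0) {
        // sendToCollector(buf, n);  // hypothetical call, omitted here
        shippedOffset += n;
      }
    }
  }

  // Called periodically. Only offsets written here survive a crash;
  // anything shipped after the last call may be shipped again on restart.
  public void checkpoint() throws IOException {
    Files.write(checkpointFile, Long.toString(shippedOffset).getBytes(StandardCharsets.UTF_8));
  }
}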

On Mon, Jun 29, 2009 at 3:20 PM, Ken White<ke...@mailcatch.com> wrote:
> Hi all!
>
> I am trying to use Hadoop for processing large amounts of data (ad network
> statistics). This data is gathered on multiple nodes. Some of it is precious -
> it must not be lost before it is processed. Since HDFS doesn't work well with
> multiple small file uploads we are looking for a reliable (fault tolerant)
> solution to upload this data to HDFS as it is generated.
>
> If I understand correctly, Chukwa does just that? The main purpose is
> different, but it collects data from multiple nodes and writes it to HDFS,
> which is basically the same.
>



-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department