Posted to dev@chukwa.apache.org by Ariel Rabkin <as...@gmail.com> on 2010/10/12 07:49:51 UTC

a more fault-tolerant collector

Howdy.

This is an answer to a question Bill asked me recently: can we
redesign the Collector process to behave better if the filesystem is
unavailable?

I think we can do this by backpressure. If the write fails, the
collector should return an error to the agent. And the agent should
treat it like a post failure, and retry.  Thoughts?
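
A minimal sketch of the backpressure idea, not the actual Chukwa code. ChunkSink, BackpressureCollector and RetryingAgent are hypothetical names made up for illustration; the real collector and agent talk over HTTP, which is reduced to a boolean return value here. The point is only that the collector reports a failed write instead of exiting, and the agent keeps the chunk queued until it is acknowledged.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for the collector's writer; not the real Chukwa interface. */
interface ChunkSink {
    void write(String chunk) throws IOException;
}

/** Sketch of a collector that reports write failures instead of exiting. */
class BackpressureCollector {
    private final ChunkSink sink;

    BackpressureCollector(ChunkSink sink) {
        this.sink = sink;
    }

    /** Returns true only if the chunk was written; false tells the agent to retry. */
    boolean post(String chunk) {
        try {
            sink.write(chunk);
            return true;
        } catch (IOException e) {
            // Surface the failure to the agent rather than shutting the collector down.
            return false;
        }
    }
}

/** Sketch of an agent that keeps each chunk queued until a collector accepts it. */
class RetryingAgent {
    private final Deque<String> pending = new ArrayDeque<>();

    void enqueue(String chunk) {
        pending.addLast(chunk);
    }

    int pendingCount() {
        return pending.size();
    }

    /** Sends queued chunks in order and stops at the first failure, so nothing is dropped. */
    int flush(BackpressureCollector collector) {
        int sent = 0;
        while (!pending.isEmpty()) {
            if (!collector.post(pending.peekFirst())) {
                break; // treated like a failed post: keep the chunk and retry later
            }
            pending.removeFirst();
            sent++;
        }
        return sent;
    }
}
```

Calling flush() again once the filesystem is back resends the chunks that were held, in their original order.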

--Ar

-- 
Ari Rabkin asrabkin@gmail.com
UC Berkeley Computer Science Department

Re: a more fault-tolerant collector

Posted by Bill Graham <bi...@gmail.com>.
See below.

On Tue, Oct 12, 2010 at 10:40 AM, Ariel Rabkin <as...@gmail.com> wrote:
> There is support for assuming writes are asynchronous. I don't know if
> it's ever been tried in production though.  It also doesn't quite
> solve the problem.

I didn't mean to imply that this solves the problem. My point was that
it makes the problem much more complex: if the collector is writing
asynchronously, it can't tell the agent about a failure.
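
To illustrate the difficulty, here is a small sketch that models the pending HDFS write as a CompletableFuture. AsyncAckSketch and its method names are hypothetical, and the 200 and 500 status values are assumptions for the example rather than what the collector does today. If the response is sent before the write completes, a later failure can only be seen on the collector side.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicInteger;

class AsyncAckSketch {
    /** Counts failures that surfaced only after a response had already been sent. */
    static final AtomicInteger lateFailures = new AtomicInteger();

    /** Acknowledge immediately; a later write failure never reaches the agent. */
    static int respondBeforeCommit(CompletableFuture<Void> pendingWrite) {
        pendingWrite.exceptionally(e -> {
            lateFailures.incrementAndGet(); // only the collector learns about this
            return null;
        });
        return 200;
    }

    /** Wait for the write to finish, so the outcome can be reported to the agent. */
    static int respondAfterCommit(CompletableFuture<Void> pendingWrite) {
        try {
            pendingWrite.join(); // blocks until the write succeeds or fails
            return 200;
        } catch (CompletionException e) {
            return 500;
        }
    }
}
```

The second variant gives the agent an accurate answer at the cost of holding the request open until the write finishes.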

>
> Right now, collectors only return on success; on a write failure, they exit.
> We would need to figure out if a write failure means a 200 OK with a
> special body of the response, or a different HTTP response code.
> Likewise, agents need to be modified to handle that return code.

+1 on an HTTP 500 response in this case.

>
> And we have a bit of a design question. If a collector is up, but
> can't write, should the agent fail over to a new one, or just retry
> later?

+1 on failing over to a new collector. AFAIK a collector will always
attempt to write data to its own node as one of the replicas, due to
how HDFS works. If the DN is having problems, it's possible that this
collector will always have problems, and another should be tried.
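
A rough sketch of what agent-side failover could look like, assuming the collector signals a failed write with a non-200 status. FailoverPoster and postWithFailover are hypothetical names, and the actual HTTP post is abstracted as a function from collector address to status code so the example stays self-contained.

```java
import java.util.List;
import java.util.function.ToIntFunction;

class FailoverPoster {
    /**
     * Tries each collector in order and returns the index of the first one
     * that answers 200, or -1 if none of them accepted the post.
     */
    static int postWithFailover(List<String> collectors, ToIntFunction<String> post) {
        for (int i = 0; i < collectors.size(); i++) {
            int status = post.applyAsInt(collectors.get(i));
            if (status == 200) {
                return i;
            }
            // Any other status (for example 500 on a write failure): try the next collector.
        }
        return -1;
    }
}
```

When -1 comes back, every collector refused the data, and the agent would fall back to retrying later.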


Re: a more fault-tolerant collector

Posted by Ariel Rabkin <as...@gmail.com>.
There is support for assuming writes are asynchronous. I don't know if
it's ever been tried in production though.  It also doesn't quite
solve the problem.

Right now, collectors only return on success; on a write failure, they exit.
We would need to figure out if a write failure means a 200 OK with a
special body of the response, or a different HTTP response code.
Likewise, agents need to be modified to handle that return code.
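
The two signalling options can be compared from the agent's side with a small sketch. CollectorReply, WriteOutcome and the WRITE_FAILED marker are all hypothetical; nothing here reflects an existing Chukwa response format.

```java
/** Hypothetical view of a collector's HTTP response as seen by the agent. */
class CollectorReply {
    final int status;
    final String body;

    CollectorReply(int status, String body) {
        this.status = status;
        this.body = body;
    }
}

class WriteOutcome {
    /** Made-up marker for the "200 OK with a special body" option. */
    static final String FAILURE_MARKER = "WRITE_FAILED";

    /** Option 1: a different response code. Only the status line needs checking. */
    static boolean failedByStatus(CollectorReply reply) {
        return reply.status != 200;
    }

    /** Option 2: 200 OK with a special body. The agent must also inspect the body. */
    static boolean failedByBody(CollectorReply reply) {
        return reply.status != 200 || reply.body.contains(FAILURE_MARKER);
    }
}
```

With a distinct response code the agent only has to look at the status, while the special-body option means it must parse every successful response as well.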

And we have a bit of a design question. If a collector is up, but
can't write, should the agent fail over to a new one, or just retry
later?

--Ari


Re: a more fault-tolerant collector

Posted by Bill Graham <bi...@gmail.com>.
We had problems with a single DN the other day and all collectors
ultimately died after trying N failed attempts. I believe at least one
of the failures was during a commit.

I think backpressure sounds like the right approach, but it seems like
there would be some practical challenges, particularly around async
writes or commits to HDFS. Ari, how does this behavior work currently,
and would this be difficult to handle?

In oahc.datacollection.writer.SeqFileWriter there's an exception block
with this in it:

// We don't want to lose anything
log.fatal("IOException when trying to write a chunk, Collector is going to exit!", e);
DaemonWatcher.bailout(-1);
isRunning = false;




Re: a more fault-tolerant collector

Posted by Eric Yang <ey...@yahoo-inc.com>.
I thought that is what it currently does, with one twist: the commit and the response are asynchronous.  The collector exits if the filesystem is unavailable for an extended period of time.  If it is not doing what's described above, then we should definitely fix it.

Regards,
Eric

