Posted to common-user@hadoop.apache.org by Keith Wiley <kw...@keithwiley.com> on 2011/03/02 22:01:21 UTC

Speculative execution

I realize that the intended purpose of speculative execution is to overcome individual slow tasks...and I have read that it explicitly is *not* intended to start copies of a task simultaneously and to then race them, but rather to start copies of tasks that "seem slow" after running for a while.

...but aside from merely being slow, sometimes tasks arbitrarily fail, and not in data-driven or otherwise deterministic ways.  A task may fail and then succeed on a subsequent attempt...but the total job time is extended by the time wasted during the initial failed task attempt.

It would be super-swell to run copies of a task simultaneously from the starting line and simply kill the copies after the winner finishes.  While this is "wasteful" in some sense (that is the argument offered for not running speculative execution this way to begin with), it would be more precise to say that different users may have different priorities under various use-case scenarios.  The "wasting" of duplicate tasks on extra cores may be an acceptable cost toward the higher priority of minimizing job times for a given application.

Is there any notion of this in Hadoop?
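
(Purely for illustration, and not something Hadoop provides: in plain Java the race-the-copies-and-kill-the-losers pattern described above is what java.util.concurrent's ExecutorService.invokeAny gives you; it returns the result of the first copy that completes successfully and cancels the rest.  A minimal single-JVM sketch:)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class RaceTask {
        // Run `copies` identical attempts concurrently; return the result of the
        // first one that succeeds and cancel the unfinished losers.
        // Assumes the Callable is safe to run concurrently with itself.
        static <T> T race(Callable<T> attempt, int copies) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(copies);
            try {
                List<Callable<T>> clones = new ArrayList<Callable<T>>();
                for (int i = 0; i < copies; i++) {
                    clones.add(attempt);
                }
                // invokeAny blocks until some copy returns normally, then
                // cancels the attempts that have not yet completed.
                return pool.invokeAny(clones);
            } finally {
                pool.shutdownNow();
            }
        }
    }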

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________


Re: Speculative execution

Posted by Keith Wiley <kw...@keithwiley.com>.
On Mar 3, 2011, at 3:29 PM, Jacob R Rideout wrote:

> On Thu, Mar 3, 2011 at 2:04 PM, Keith Wiley <kw...@keithwiley.com> wrote:
>> On Mar 3, 2011, at 2:51 AM, Steve Loughran wrote:
>> 
>>> yes, but the problem is determining which one will fail. Ideally you should find the root cause, which is often some race condition or hardware fault. If it's the same server every time, turn it off.
>> 
>>> You can play with the specex parameters, maybe change when they get kicked off. The assumption in the code is that the slowness is caused by H/W problems (especially HDD issues) and it tries to avoid duplicate work. If every Map was duplicated, you'd be doubling the effective cost of each query, and annoying everyone else in the cluster. Plus increased disk and network IO might slow things down.
>>> 
>>> Look at the options, have a play and see. If it doesn't have the feature, you can always try coding it in - if the scheduler API lets you do it, you won't be breaking anyone else's code.
>>> 
>>> -steve
>> 
>> 
>> Thanks.  I'll take it under consideration.  In my case, it would be really beneficial to duplicate the work.  The task in question is a single task on a single node (numerous mappers feed data into a single reducer), so duplicating the reducer represents very little duplicated effort while mitigating a potential bottleneck in the job's performance, since the job simply is not done until the single reducer finishes.  I would really like to be able to do what I am suggesting: duplicate the reducer and kill the clones after the winner finishes.
>> 
>> Anyway, thanks.
>> 
> 
> What is your reason for needing a single reducer? I'd first try to see
> how I could parallelize that work, if possible.


No no no, I don't want to debate my high-level design.  I'm just trying to hone and sharpen the current approach.  First and foremost, the reducer algorithm is not very amenable to parallelization.  It is theoretically parallelizable, but only at the cost of significant overhead that would probably offset the benefits.  I appreciate your curiosity, but my goal here was simply to inquire about a seemingly obvious optimization, i.e., racing concurrent tasks so that if one fails, the total job time is not extended.

If I find some free time I'll try to write up a longer description of our program, but I don't have time for that now, I'm sorry.  I don't mean to sound rude, I just don't have the time...kinda the same reason I was hoping to make the reducer more failure-tolerant in the first place: I'm trying to get this data processed super fast.

Anyway, thanks for the input; it sounds like I've got the answer: Hadoop does not natively support what I'm suggesting, although if I want to try to patch it perhaps I can find a way...at some point.
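
(For anyone reading this later: the closest built-in lever appears to be the per-job speculative-execution flag, which only launches a backup attempt once the original looks slow rather than racing copies from the start.  A minimal sketch, assuming the 0.20-era org.apache.hadoop.mapred.JobConf API and a hypothetical driver class passed in by the caller:)

    import org.apache.hadoop.mapred.JobConf;

    public class SpecExJobSetup {
        // `driver` stands in for the job's real driver class (hypothetical here).
        public static JobConf configure(Class<?> driver) {
            JobConf conf = new JobConf(driver);
            conf.setNumReduceTasks(1);  // the single-reducer stage discussed above
            // Allow the framework to launch a speculative (backup) attempt of the
            // reduce task if the original appears slow; note this is not a
            // from-the-start race.
            conf.setReduceSpeculativeExecution(true);
            return conf;
        }
    }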

Cheers!  Thanks for the input on the matter.

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei
________________________________________________________________________________


Re: Speculative execution

Posted by Jacob R Rideout <ap...@jacobrideout.net>.
On Thu, Mar 3, 2011 at 2:04 PM, Keith Wiley <kw...@keithwiley.com> wrote:
> On Mar 3, 2011, at 2:51 AM, Steve Loughran wrote:
>
>> yes, but the problem is determining which one will fail. Ideally you should find the root cause, which is often some race condition or hardware fault. If it's the same server every time, turn it off.
>
>> You can play with the specex parameters, maybe change when they get kicked off. The assumption in the code is that the slowness is caused by H/W problems (especially HDD issues) and it tries to avoid duplicate work. If every Map was duplicated, you'd be doubling the effective cost of each query, and annoying everyone else in the cluster. Plus increased disk and network IO might slow things down.
>>
>> Look at the options, have a play and see. If it doesn't have the feature, you can always try coding it in - if the scheduler API lets you do it, you won't be breaking anyone else's code.
>>
>> -steve
>
>
> Thanks.  I'll take it under consideration.  In my case, it would be really beneficial to duplicate the work.  The task in question is a single task on a single node (numerous mappers feed data into a single reducer), so duplicating the reducer represents very little duplicated effort while mitigating a potential bottleneck in the job's performance, since the job simply is not done until the single reducer finishes.  I would really like to be able to do what I am suggesting: duplicate the reducer and kill the clones after the winner finishes.
>
> Anyway, thanks.
>

What is your reason for needing a single reducer? I'd first try to see
how I could parallelize that work, if possible.

Jacob

Re: Speculative execution

Posted by Keith Wiley <kw...@keithwiley.com>.
On Mar 3, 2011, at 2:51 AM, Steve Loughran wrote:

> yes, but the problem is determining which one will fail. Ideally you should find the root cause, which is often some race condition or hardware fault. If it's the same server every time, turn it off.

> You can play with the specex parameters, maybe change when they get kicked off. The assumption in the code is that the slowness is caused by H/W problems (especially HDD issues) and it tries to avoid duplicate work. If every Map was duplicated, you'd be doubling the effective cost of each query, and annoying everyone else in the cluster. Plus increased disk and network IO might slow things down.
> 
> Look at the options, have a play and see. If it doesn't have the feature, you can always try coding it in - if the scheduler API lets you do it, you won't be breaking anyone else's code.
> 
> -steve


Thanks.  I'll take it under consideration.  In my case, it would be really beneficial to duplicate the work.  The task in question is a single task on a single node (numerous mappers feed data into a single reducer), so duplicating the reducer represents very little duplicated effort while mitigating a potential bottleneck in the job's performance, since the job simply is not done until the single reducer finishes.  I would really like to be able to do what I am suggesting: duplicate the reducer and kill the clones after the winner finishes.

Anyway, thanks.

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________


Re: Speculative execution

Posted by Steve Loughran <st...@apache.org>.
On 02/03/11 21:01, Keith Wiley wrote:
> I realize that the intended purpose of speculative execution is to overcome individual slow tasks...and I have read that it explicitly is *not* intended to start copies of a task simultaneously and to then race them, but rather to start copies of tasks that "seem slow" after running for a while.
>
> ...but aside from merely being slow, sometimes tasks arbitrarily fail, and not in data-driven or otherwise deterministic ways.  A task may fail and then succeed on a subsequent attempt...but the total job time is extended by the time wasted during the initial failed task attempt.

yes, but the problem is determining which one will fail. Ideally you 
should find the root cause, which is often some race condition or 
hardware fault. If it's the same server every time, turn it off.

>
> It would be super-swell to run copies of a task simultaneously from the starting line and simply kill the copies after the winner finishes.  While this is "wasteful" in some sense (that is the argument offered for not running speculative execution this way to begin with), it would be more precise to say that different users may have different priorities under various use-case scenarios.  The "wasting" of duplicate tasks on extra cores may be an acceptable cost toward the higher priority of minimizing job times for a given application.
>
> Is there any notion of this in Hadoop?

You can play with the specex parameters, maybe change when they get 
kicked off. The assumption in the code is that the slowness is caused by 
H/W problems (especially HDD issues) and it tries to avoid duplicate 
work. If every Map was duplicated, you'd be doubling the effective cost 
of each query, and annoying everyone else in the cluster. Plus increased 
disk and network IO might slow things down.
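
(The parameters in question are presumably the standard speculative-execution keys; a hedged sketch using the 0.20-era configuration key names, toggled per job:)

    import org.apache.hadoop.conf.Configuration;

    public class SpecExFlags {
        // Classic (0.20-era) key names; later releases rename them to
        // mapreduce.map.speculative and mapreduce.reduce.speculative.
        public static void tuneSpeculation(Configuration conf) {
            // Keep speculative map attempts off so every map is not duplicated...
            conf.setBoolean("mapred.map.tasks.speculative.execution", false);
            // ...but allow a backup attempt of a slow reduce task.
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
        }
    }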

Look at the options, have a play and see. If it doesn't have the 
feature, you can always try coding it in - if the scheduler API lets 
you do it, you won't be breaking anyone else's code.

-steve