Posted to yarn-dev@hadoop.apache.org by Tim St Clair <ts...@redhat.com> on 2013/04/19 16:47:02 UTC

Omega vs. YARN

I recently read Google's Omega paper, and I am wondering if any of the
YARN developers are planning to address some of the items it considers
key points.

http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf

Cheers,
Tim

Re: Omega vs. YARN

Posted by Robert Evans <ev...@yahoo-inc.com>.
Tim,

Answers inline below

On 4/19/13 1:42 PM, "Tim St Clair" <ts...@redhat.com> wrote:

>Robert, 
>
>Thank you for your response.
>I've placed some questions and comments inline below.
>
>Cheers,
>Tim
>
>----- Original Message -----
>> From: "Robert Evans" <ev...@yahoo-inc.com>
>> To: yarn-dev@hadoop.apache.org
>> Sent: Friday, April 19, 2013 12:34:52 PM
>> Subject: Re: Omega vs. YARN
>> 
>> Tim,
>> 
>> They are very interesting points.  From a scalability point of view I
>> don't think we have really run into those situations yet, but they are
>> coming.  YARN currently has some very "simplistic" scheduling for the
>> RM.  All of the complexity comes out in the AM.  There have been a
>> number of JIRAs to make requests more complex, to help support more
>> "picky" applications like the paper describes.  These would shift YARN a
>> bit from a two-level scheduler towards a monolithic one, reducing some
>> of the scalability of the system but letting it support more complex
>> scheduling patterns.  The largest YARN cluster I know of right now is
>> about 4000 nodes.  On it we are hitting some bottlenecks with the
>> current scheduler.  We have looked at some ways to speed it up with more
>> conventional approaches, like allowing the scheduler to be
>> multithreaded.  We expect to be able to easily support 4000-6000 nodes
>> through YARN with a few optimizations.  Going to tens of thousands of
>> nodes would require some more significant changes.
>
>If there are JIRA(s) which outline the limitations, I would be
>interested in knowing more.

YARN-397 is kind of a roll-up JIRA for some of the scheduler API
enhancements, but there are also YARN-314, YARN-56, YARN-110, and
YARN-238.  None of these covers the one I am most interested in, which is
gang scheduling; I just haven't filed a JIRA for that yet.

There is also preemption as an option in YARN-397.  It is not strictly
part of the scheduling request API, but I believe it includes informing
the AM that resources are going to be taken back if it does not release
some of them.


>
>> 
>> As far as utilization is concerned, the presented architecture does
>> provide some very interesting points, but all of that can be addressed
>> with a monolithic scheduler so long as we don't have to scale very
>> large.  It also would probably require a complete redesign of YARN and
>> the MR AM, which is not a small undertaking.  There is also the question
>> of trusted code.  In a shared-state system where all of the various
>> schedulers are peers, how would we enforce resource constraints?
>
>I think the biggest open questions I have with a distributed approach
>are priority, preemption policies, and fragmentation.

Yes, those become more difficult in a distributed environment, but I don't
think they are overwhelmingly difficult.  These are hard problems to solve
for any scheduler, because we are trying to come up with heuristics for a
problem that is practically impossible to solve.  I am not a
mathematician, but I believe that optimally scheduling resources is an
NP-hard problem.  What is more, we don't know what the resource
utilization is going to be up front, despite the user's resource
request/hint, so it is an NP-hard problem even once we have solved the
halting problem.  This is where priority and preemption come in: they
offload some of the complexity onto the user, and they let us fix mistakes
that the heuristic made while scheduling.
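
To make the heuristics point concrete, here is a toy sketch of the kind of
greedy fallback every scheduler ends up with.  This is hypothetical
illustration code, not anything in YARN, and the node sizes and requests
are made up:

    import java.util.Arrays;

    class FirstFitDecreasing {
        // nodeFreeMb[i] = free memory on node i; asks[j] = container request j.
        // Returns ask -> node index, or -1 where nothing fit.
        static int[] place(int[] nodeFreeMb, int[] asks) {
            Integer[] order = new Integer[asks.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            // Place the biggest asks first; they are the hardest to fit.
            Arrays.sort(order, (a, b) -> asks[b] - asks[a]);
            int[] placement = new int[asks.length];
            Arrays.fill(placement, -1);
            for (int r : order) {
                for (int n = 0; n < nodeFreeMb.length; n++) {
                    if (nodeFreeMb[n] >= asks[r]) { // first node it fits on wins
                        nodeFreeMb[n] -= asks[r];
                        placement[r] = n;
                        break;
                    }
                }
            }
            return placement;
        }

        public static void main(String[] args) {
            int[] free = {2048, 2048};
            int[] asks = {1536, 1024, 512};
            System.out.println(Arrays.toString(place(free, asks))); // [0, 1, 0]
        }
    }

First-fit decreasing is a classic bin-packing heuristic: cheap, decent in
practice, and provably not optimal, which is exactly the trade-off above.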

Moving this to a distributed environment, you can solve it in a number of
ways.  The paper talks about optimistic scheduling vs. pessimistic
scheduling.  With optimistic scheduling, each scheduler acts roughly as if
it were the only one there is, and then cleans things up afterwards if
there is a collision.  Pessimistic scheduling works hard to avoid all
collisions, most likely through locking.  The paper also talks about
auditing the schedulers after the fact to detect whether any of them are
doing something that does not fit the policy, instead of trying to enforce
it up front.  If we wanted to go with enforcement, we could have specific
schedulers with priority over other schedulers, so that by convention
lower-priority schedulers could not preempt higher-priority ones; but then
the resource utilization, in theory, would go down.
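
Here is a toy sketch of the optimistic, shared-state idea (again
hypothetical code, nothing like the real Omega implementation): every
scheduler reads a snapshot of the cell state, plans against it, and tries
to commit atomically, re-planning on collision:

    import java.util.concurrent.atomic.AtomicReference;

    class OptimisticScheduler {
        // Shared cell state: free memory per node, published as immutable snapshots.
        static final AtomicReference<int[]> cell =
            new AtomicReference<>(new int[] {8192, 8192, 8192});

        // Try to claim memMb on one node.  Returns false if the plan no longer
        // fits or if a peer scheduler committed first and we must re-plan.
        static boolean tryClaim(int node, int memMb) {
            int[] snapshot = cell.get();               // read a snapshot
            if (snapshot[node] < memMb) return false;  // plan does not fit
            int[] next = snapshot.clone();
            next[node] -= memMb;
            // Commit succeeds only if nobody changed the state since our read.
            return cell.compareAndSet(snapshot, next);
        }

        public static void main(String[] args) {
            System.out.println(tryClaim(1, 4096) ? "committed" : "collision, re-plan");
        }
    }

The appeal is that schedulers never block each other; the cost is wasted
work whenever two of them race for the same resources.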

On fragmentation, I don't think any of the Hadoop schedulers right now do
anything about it, except pretend it does not exist.  In fact we have seen
a very rare livelock situation where the MR AM thinks there is enough
headroom to schedule a map task, so it does not bother to shoot a reducer;
but because the headroom is fragmented across various machines, the map
task will never actually be scheduled.
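
The livelock is easy to show with numbers.  A hypothetical illustration
of what the AM's headroom check misses (the sizes are made up):

    class HeadroomLivelock {
        public static void main(String[] args) {
            int[] freePerNodeMb = {1024, 1024, 1024, 1024}; // 4 GB total headroom
            int mapNeedsMb = 3072;                          // one 3 GB map task

            int aggregate = 0, maxOnOneNode = 0;
            for (int free : freePerNodeMb) {
                aggregate += free;
                maxOnOneNode = Math.max(maxOnOneNode, free);
            }

            // The AM effectively checks aggregate headroom only.
            System.out.println("AM thinks the map fits: " + (aggregate >= mapNeedsMb));    // true
            // But a container has to fit on a single node.
            System.out.println("Some node can host it:  " + (maxOnOneNode >= mapNeedsMb)); // false
            // true + false: the AM never shoots a reducer, so the map never runs.
        }
    }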

>
>> Each of the schedulers would have to enforce them themselves, and as
>> such would have to be trusted code.  This makes adding in new
>> application types on the fly difficult.
>> 
>> I suppose we could do a hybrid approach, where the RM is a single type
>> of scheduler among many.  It would provide the same API that currently
>> exists for YARN applications, but MR applications could have one or more
>> "JobTracker"-like schedulers that share state with the RM and with
>> whatever other "schedulers" are out there.  That would be something fun
>> to try out, but sadly I really don't have time to even get started
>> thinking about a proof of concept on something like that, at least not
>> until we hit a significant business use case that would drive it over
>> the architecture we already have.
>>
>> For example, needing 10s of thousands of nodes in a cluster, or a huge
>> shift of different types of jobs on to YARN so that we are doing a lot
>> more than just MR on the same cluster.
>
>Something tells me it may come fast, if/when the YARN application space
>expands.

I agree that it may come fast.  I just don't know if the mix of
applications that Google has will actually come to Hadoop.  I can see a
lot of batch processing applications being run on top of YARN, because
even though YARN is generic it makes a lot of batch processing
assumptions.  Because of this I just don't know about other types of
processing.  It is a bit of a chicken/egg problem.

I am in the process of doing a basic port of Storm to run on top of YARN.
Because the resource scheduling/isolation is not that great on YARN (Mesos
too, for that matter), we request entire nodes from YARN and bring up a
predefined number of machines instead of letting the cluster grow and
shrink on demand, because we need to be sure we can get the resources when
we need them.  Honestly, it is a lot simpler to do what we are doing
through OpenStack or some other VM management system than through YARN.
Even looking at tools like Impala or HBase, it would be very difficult
with their current architectures to think about using YARN for
scheduling/deployment.  For example, when security is enabled, HDFS sets
limits on how long a delegation token is good for.  Once the two weeks
(configurable) are up, your HDFS delegation token is done and no new
containers will be able to be launched, because the distributed cache will
no longer be able to download your jars.
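
For the curious, the whole-node trick looks roughly like this from the AM
side.  This is a sketch against the Hadoop 2.x AMRMClient API with a
made-up node size and pool count, not the actual Storm-on-YARN code:

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    class WholeNodeRequests {
        public static void main(String[] args) {
            AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
            amClient.init(new YarnConfiguration());
            amClient.start();

            // One request per "machine" in the predefined pool; the capability
            // is the full node size so nothing else lands next to us.
            Resource wholeNode = Resource.newInstance(48 * 1024, 16); // hypothetical node
            int poolSize = 10;                                        // fixed, no elasticity
            for (int i = 0; i < poolSize; i++) {
                amClient.addContainerRequest(
                    new ContainerRequest(wholeNode, null, null, Priority.newInstance(0)));
            }
            // ... allocate() heartbeats and container launch omitted.
        }
    }

Sizing the ask to the full node substitutes for the isolation YARN does
not give us, at the obvious cost of utilization.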

I just see a lot of difficulty in using YARN for long-running processes,
and it being a lot simpler to use something else like OpenStack for that.
Now with that said, if we could somehow have a distributed scheduler with
both Hadoop and OpenStack sharing the same large cluster, that would be
awesome.  But again, I would have to convince my management that it is the
right way to go and that it will save us X million dollars a year, and I
would have to convince myself that it is worth spending my time on that
instead of the other fun stuff I have been doing. :)

>
>> 
>> --Bobby
>> 
>> On 4/19/13 9:47 AM, "Tim St Clair" <ts...@redhat.com> wrote:
>> 
>> >I recently read Google's Omega paper, and I am wondering if any of the
>> >YARN developers are planning to address some of the items it considers
>> >key points.
>> >
>> >http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
>> >
>> >Cheers,
>> >Tim

