Posted to dev@slider.apache.org by "hsy541@gmail.com" <hs...@gmail.com> on 2014/11/05 21:21:00 UTC

Is it able to make AM to restart from previous state?

Hi guys,

I noticed in the code that when a container fails, Slider tries to relaunch
it on the same node. My question is: if I restart the whole application
(e.g. the AM gets killed, or I manually restart the app), does Slider try to
launch all containers on the nodes where they were running?

Thanks!

Best,
Siyuan

Re: Is it able to make AM to restart from previous state?

Posted by Steve Loughran <st...@hortonworks.com>.
On 5 November 2014 21:44, hsy541@gmail.com <hs...@gmail.com> wrote:

> Thanks Steve,
>
> Is No. 1 a new feature in YARN (not released yet)?
>

It came out in Hadoop 2.4 but only got working in Hadoop 2.5. We were the
first users.


>
> And you mentioned Slider saves the locations in history files. What are the
> history files and where are they stored? Are they in HDFS?
>
> If one of the previous machines is gone, will it try to get resources
> from a new labeled machine?
>

Yes. When it asks for a machine it doesn't say "must be on this machine",
but "we'd prefer it on this machine". If the component is being requested
on a labelled set of nodes, YARN will restrict scheduling to the other nodes
with those labels.

Labels are new in Hadoop 2.6, of course.
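
To illustrate the "prefer, don't require" behaviour described above, here is a toy Python model. It is not YARN's scheduler or Slider's code; the function name, its parameters and the tie-break rule (alphabetical order) are made up for the example.

```python
from typing import Dict, Optional, Set


def place_container(preferred: str,
                    free_nodes: Dict[str, Set[str]],
                    label: Optional[str] = None) -> Optional[str]:
    """Pick a node for a request that prefers, but does not require, one host.

    free_nodes maps each node with spare capacity to its set of labels.
    (Hypothetical data model, for illustration only.)
    """
    # Only nodes carrying the requested label (if any) are eligible.
    eligible = {node for node, labels in free_nodes.items()
                if label is None or label in labels}
    if preferred in eligible:
        return preferred  # the preference can be honoured
    # Preferred node is gone or full: fall back to another eligible node.
    # min() is an arbitrary but deterministic tie-break for this sketch.
    return min(eligible) if eligible else None
```

If the preferred machine has gone away, the request still succeeds on another node carrying the same label; it only fails when no labelled node has capacity.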



-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Is it able to make AM to restart from previous state?

Posted by Steve Loughran <st...@hortonworks.com>.
On 6 November 2014 00:31, hsy541@gmail.com <hs...@gmail.com> wrote:

> Steve,
>
> I found out from the code that everything is kept in the history folder in
> HDFS.
>
> You mentioned that if I add a new component, the history layout would be
> discarded. What if I add more component instances in the configuration? Do
> you try to launch instances on the previous nodes and add the new instances
> on new nodes?
> What if you decrease the instance count?
>
>
We don't care about the number of instances; it's just the indexing of
component types that is weak.


   1. It doesn't matter if you change the instance count and ask for more;
   Slider just asks for more containers without specifying a machine.
   2. If there is already an instance of that component on a given
   machine, it doesn't explicitly ask for a second one there. This is to
   boost anti-affinity and spread the load.
   3. If you flex down the cluster size, Slider remembers the containers
   last used, and keeps track of when a node was last used for a role. When
   you flex back up, it asks for those nodes back. This should hold even if
   you stop and restart the cluster in between.
   4. We do have some extra logic to try to detect and avoid unreliable
   nodes. If we get a container from YARN and it fails during startup, we do
   not re-request a container on that node. This is to avoid developing a bias
   towards one particularly unreliable machine in a cluster.
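
The four rules above can be sketched roughly as follows. This is an illustrative Python model, not Slider's role history code: `NodeEntry`, its fields and `pick_preferred_node` are hypothetical names, and choosing the most recently used node first is an assumption about how "last used" is applied.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NodeEntry:
    """What is remembered about one node for one role (hypothetical)."""
    last_used: int = 0            # time the role last ran here, 0 = never
    active: int = 0               # instances of the role running here now
    startup_failed: bool = False  # a container failed during startup here


def pick_preferred_node(history: Dict[str, NodeEntry]) -> Optional[str]:
    """Return the node to ask for first, or None to let YARN choose."""
    candidates: List[str] = [
        node for node, entry in history.items()
        if entry.active == 0           # rule 2: no second instance per node
        and not entry.startup_failed   # rule 4: avoid nodes that failed
        and entry.last_used > 0        # rule 3: only nodes used before
    ]
    if not candidates:
        return None                    # rule 1: no specific machine
    # Assumption: the most recently used node is asked for first.
    return max(candidates, key=lambda n: history[n].last_used)
```

Returning `None` stands for a request with no machine named, which is what happens when the instance count grows beyond the nodes already known.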

The failure tracking and blacklisting is an interesting problem. I'm
confident we could do better; I'm hoping to get experience from real-world
use to see what problems occur and how best to address them. In the MR layer,
they detect and blacklist "slow" machines, those whose disks are getting
slow. That's a sign of imminent HDD failure, and it kills MR job performance
as those stragglers block the whole workload. For Slider we'd need to think
about something application-specific: for example, whether requests made
against an HBase region server on node 26 took longer than requests made
against others. Measuring things like that is a hard problem, and so
app-specific that I'd almost say "let other layers deal with it". If we
delegate it to application-layer monitoring tools, we could add some
operations to let those layers tell Slider that a specific node is slow or
unreliable, and have Slider react by (a) releasing that container, (b) asking
for a new one elsewhere and (c) telling YARN it's having problems with that
node.
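
As a sketch of what such an operation might look like, here is a small Python model of reactions (a) to (c). No such operation exists in the thread; `react_to_bad_node` and the action names are purely hypothetical and only show the order of the steps.

```python
from typing import Dict, List, Tuple


def react_to_bad_node(bad_node: str,
                      containers: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the actions to take when a monitor flags bad_node.

    containers maps container id -> node it is running on.
    (Hypothetical operation, for illustration only.)
    """
    actions: List[Tuple[str, str]] = []
    for container_id, node in sorted(containers.items()):
        if node == bad_node:
            actions.append(("release", container_id))       # (a)
            actions.append(("request_avoiding", bad_node))  # (b)
    if actions:
        actions.append(("report_to_yarn", bad_node))        # (c)
    return actions
```

The node is only reported when a container was actually running on it, so a report about an unused node is a no-op in this model.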



Re: Is it able to make AM to restart from previous state?

Posted by "hsy541@gmail.com" <hs...@gmail.com>.
Steve,

I found out from the code that everything is kept in the history folder in
HDFS.

You mentioned that if I add a new component, the history layout would be
discarded. What if I add more component instances in the configuration? Do
you try to launch instances on the previous nodes and add the new instances
on new nodes?
What if you decrease the instance count?

Thanks!

Best,
Siyuan


Re: Is it able to make AM to restart from previous state?

Posted by "hsy541@gmail.com" <hs...@gmail.com>.
Thanks Steve,

Is No. 1 a new feature in YARN (not released yet)?

And you mentioned Slider saves the locations in history files. What are the
history files and where are they stored? Are they in HDFS?

If one of the previous machines is gone, will it try to get resources
from a new labeled machine?

Thanks



Re: Is it able to make AM to restart from previous state?

Posted by Steve Loughran <st...@hortonworks.com>.
On 5 November 2014 20:21, hsy541@gmail.com <hs...@gmail.com> wrote:

> Hi guys,
>
> I noticed in the code that when a container fails, Slider tries to relaunch
> it on the same node. My question is: if I restart the whole application
> (e.g. the AM gets killed, or I manually restart the app), does Slider try to
> launch all containers on the nodes where they were running?
>
>
1. If the AM crashes then YARN will restart it. The containers will keep
working. When the AM comes back up it will work out its state, and all
running containers will stay live. Any containers that were part way
through starting will be released and new ones requested (there's no record
of what state they were in, so a clean destroy is simpler).
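
A rough Python model of that recovery step, under the assumption that the restarted AM can tell each surviving container apart as either "running" or "starting". The function and the state names are hypothetical, not Slider's API.

```python
from typing import Dict, List, Tuple


def rebuild_after_am_restart(
        containers: Dict[str, str]) -> Tuple[List[str], List[str], int]:
    """Decide what to do with each container after an AM restart.

    containers maps container id -> "running" or "starting" (assumed states).
    Returns (kept, released, number of new requests to make).
    """
    kept = sorted(c for c, state in containers.items() if state == "running")
    # Anything not fully running is destroyed cleanly and asked for again.
    released = sorted(c for c, state in containers.items()
                      if state != "running")
    return kept, released, len(released)
```

Every released container is matched by one new request, so the target instance count is unchanged by the restart.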


If you stop/start the app then it asks for the containers back on the same
machines they were on. It saves the locations; look in the history subdir
to see the history files.

Slider tries to read the last entry, going back to previous ones if the
last one doesn't load. It then asks YARN for containers on those machines.
There's no guarantee you get them, though.
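
The "read the last entry, fall back to earlier ones" behaviour can be sketched like this. It is a minimal Python model that assumes each history entry is a JSON document; the thread does not state the file format, and `load_latest_history` is a made-up name.

```python
import json
from typing import Any, Dict, List, Optional


def load_latest_history(entries: List[str]) -> Optional[Dict[str, Any]]:
    """Return the newest history entry that loads, or None if none do.

    entries holds the raw text of each saved entry, oldest first.
    JSON is an assumption made for this sketch.
    """
    for raw in reversed(entries):          # newest entry first
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue                       # corrupt entry: try an older one
        if isinstance(parsed, dict):
            return parsed
    return None
```

With no loadable entry the result is `None`, which corresponds to asking YARN for containers with no placement preference at all.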

Looking at the history code last week I noticed one little quirk: it
doesn't reload the histories if the number of component types has
increased. It just indexes the entries; more entries means it doesn't know
how to handle them.

To avoid this problem, define all your components from the outset, setting
the instance count to 0 for the ones you don't currently want.
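
To show why the workaround helps, here is a toy Python model of positional indexing. It is not the actual history code; the data layout, `reload_history` and the component names are assumptions for illustration.

```python
from typing import Dict, List, Optional


def reload_history(saved: List[List[str]],
                   components: Dict[str, int]) -> Optional[Dict[str, List[str]]]:
    """Map saved per-type node lists back onto the current component types.

    saved[i] is the node list stored for the i-th component type.
    components maps component name -> instance count, in definition order.
    """
    names = list(components)
    if len(names) > len(saved):
        # More component types than the history has entries for: discard it.
        return None
    return {name: saved[i] for i, name in enumerate(names)}


# History saved with two types is lost once a third type is added later...
two_types = [["node1"], ["node2", "node3"]]
lost = reload_history(two_types, {"master": 1, "worker": 2, "monitor": 1})

# ...but survives if "monitor" was defined from the outset with 0 instances.
three_types = [["node1"], ["node2", "node3"], []]
kept = reload_history(three_types, {"master": 1, "worker": 2, "monitor": 1})
```

Here `lost` is `None`, while `kept` still maps `master` and `worker` to their old nodes and `monitor` to an empty list.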

