Posted to dev@apex.apache.org by Bhupesh Chawda <bh...@datatorrent.com> on 2017/03/14 12:59:38 UTC

Re: Improving Apex relaunch time.

Hi All,

The PR https://github.com/apache/apex-core/pull/422 to solve this issue
looks good to me.
If there are no other comments, I will merge this PR soon.

~ Bhupesh


_______________________________________________________

Bhupesh Chawda

E: bhupesh@datatorrent.com | Twitter: @bhupeshsc

www.datatorrent.com  |  apex.apache.org



On Wed, Sep 21, 2016 at 8:16 PM, Sandesh Hegde <sa...@datatorrent.com>
wrote:

> Relaunching from the same location can be one of the options.
>
> On Tue, Sep 20, 2016, 10:17 PM Tushar Gosavi <tu...@datatorrent.com>
> wrote:
>
> > In case of application failure, we would like to have the ability to
> > quickly restart the application while keeping the old state for failure
> > analysis. The problem also remains the same when we want to start from
> > a savepoint, where we will need to copy state from the savepoint to the
> > application.
> >
> > -Tushar.
> >
> >
> >
> > On Tue, Sep 20, 2016 at 8:34 PM, Sandesh Hegde <sa...@datatorrent.com>
> > wrote:
> > > How about re-launching the app from the same location?
> > >
> > > If they do want to store the state, we can provide a savepoint feature.
> > >
> > > On Tue, Sep 20, 2016 at 4:39 AM Tushar Gosavi <tu...@datatorrent.com>
> > > wrote:
> > >
> > >> We have observed that application relaunch takes a long time.
> > >> One major reason for the delay in application startup during relaunch
> > >> is the time taken to copy the state of the existing application to the
> > >> new application. This state can grow to GBs, and the copy is performed
> > >> in a single thread before the new application is submitted to YARN.
> > >>
> > >> The state of the previous application consists of:
> > >> - jars
> > >> - stram checkpoint/recovery file.
> > >> - events
> > >> - container file
> > >> - stats recording if enabled.
> > >> - operator checkpoints
> > >> - operator data.
> > >>
> > >> We could avoid copying debugging data such as stats recordings, which
> > >> can run into TBs for a long-running application and is not required
> > >> for the new application to function.
> > >>
> > >> Similarly, operator checkpoints could be read in parallel when the
> > >> operators are launched for the first time. This would also help in
> > >> copying only the required checkpoints, and the copy would be done in
> > >> parallel by multiple containers/threads.
> > >>
> > >> For operator data stored in the application directory, we could copy it
> > >> completely for now, but in future we could provide a callback that
> > >> allows an operator partition to read only the required state from the
> > >> previous location.
> > >>
> > >> Let me know your thoughts on this.
> > >>
> > >> Regards,
> > >> - Tushar.
> > >>
> >
>
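
For readers following the thread, here is a minimal sketch of the idea in the original proposal: skip debugging data during relaunch and copy the remaining state with several threads instead of one. It is an illustration only and is not taken from the PR. It uses java.nio on a local filesystem as a stand-in for the Hadoop FileSystem API that an Apex application on HDFS would go through, and the class name StateCopier, the method copyState, and any directory names passed to it are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Apex.
public class StateCopier {

    /**
     * Copies every regular file under src to dst, except files whose first
     * path component (relative to src) is listed in skipTopLevel.
     * Returns the number of files copied.
     */
    public static int copyState(Path src, Path dst, Set<String> skipTopLevel, int threads)
            throws IOException, InterruptedException, ExecutionException {
        // Collect the files to copy, leaving out skipped top-level entries
        // (for example a directory holding stats recordings).
        List<Path> files;
        try (Stream<Path> walk = Files.walk(src)) {
            files = walk.filter(Files::isRegularFile)
                    .filter(p -> !skipTopLevel.contains(src.relativize(p).getName(0).toString()))
                    .collect(Collectors.toList());
        }

        // Copy the selected files on a fixed-size thread pool.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Path file : files) {
                Path target = dst.resolve(src.relativize(file));
                pending.add(pool.submit(() -> {
                    try {
                        Files.createDirectories(target.getParent());
                        Files.copy(file, target);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // surfaces any copy failure as an ExecutionException
            }
            return files.size();
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller would pass the old and new application directories, the set of top-level names to leave behind, and a thread count. Reading operator checkpoints lazily from the containers, as also suggested in the proposal, would need support inside the platform and is not covered by this sketch.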