Posted to user@flink.apache.org by John Smith <ja...@gmail.com> on 2019/06/12 17:02:29 UTC

How to restart/recover on reboot?

The installation instructions do not indicate how to create systemd
services.

1- When task nodes fail, will the job leader detect this and ssh and
restart the task node? From my testing it doesn't seem like it.
2- How do we recover a lost node? Do we simply go back to the master node
and run start-cluster.sh and the script is smart enough to figure out what
is missing?
3- Or do we need to create systemd services, and if so, which command should
we start the service with?

Re: [EXTERNAL] Re: How to restart/recover on reboot?

Posted by John Smith <ja...@gmail.com>.
Ok, I tried it and it works! I can set up my cluster with Terraform and enable the
systemd services! I think I got confused when I looked earlier because it was doing
leader election and all the services came up so quickly!
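For reference, a systemd unit for the TaskManager along these lines is one way such a setup might look. This is a sketch only: the install path /opt/flink, the flink user, and the unit name are assumptions, not details from the thread.

```ini
# /etc/systemd/system/flink-taskmanager.service (illustrative sketch;
# path, user, and unit name are assumptions)
[Unit]
Description=Apache Flink TaskManager
After=network-online.target
Wants=network-online.target

[Service]
# taskmanager.sh start daemonizes, hence Type=forking
Type=forking
User=flink
ExecStart=/opt/flink/bin/taskmanager.sh start
ExecStop=/opt/flink/bin/taskmanager.sh stop
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enabling it with `systemctl enable flink-taskmanager` makes the TaskManager come back after a reboot, which is the behavior the thread asks about.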



On Tue, 18 Jun 2019 at 22:35, John Smith <ja...@gmail.com> wrote:

> Ah ok, we need to pass --host. The command line help says jobmanager.sh
> <host>?!?! If I recall. I have to go check tomorrow...
>
> On Tue., Jun. 18, 2019, 10:05 p.m. PoolakkalMukkath, Shakir, <
> Shakir_PoolakkalMukkath@comcast.com> wrote:
>
>> Hi Nick,
>>
>>
>>
>> It works that way by explicitly setting the –host. I got misled by the
>> *“only”* word in the doc and did not try. Thanks for the help.
>>
>>
>>
>> Thanks,
>>
>> Shakir
>>
>> *From: *"Martin, Nick" <Ni...@ngc.com>
>> *Date: *Tuesday, June 18, 2019 at 6:31 PM
>> *To: *"PoolakkalMukkath, Shakir" <Sh...@comcast.com>,
>> Till Rohrmann <tr...@apache.org>, John Smith <ja...@gmail.com>
>> *Cc: *user <us...@flink.apache.org>
>> *Subject: *RE: [EXTERNAL] Re: How to restart/recover on reboot?
>>
>>
>>
>> Jobmanager.sh takes an optional argument for the hostname to bind to, and
>> start-cluster uses it. If you leave it blank, the script will use
>> whatever is in flink-conf.yaml (localhost is the default value that ships
>> with Flink).
>>
>>
>>
>> The dockerized version of flink runs pretty much the way you’re trying to
>> operate (i.e. each node starts itself), so the entrypoint script out of
>> that is probably a good source of information about how to set it up.
>>
>>
>>
>> *From:* PoolakkalMukkath, Shakir [mailto:
>> Shakir_PoolakkalMukkath@comcast.com]
>> *Sent:* Tuesday, June 18, 2019 2:15 PM
>> *To:* Till Rohrmann <tr...@apache.org>; John Smith <
>> java.dev.mtl@gmail.com>
>> *Cc:* user <us...@flink.apache.org>
>> *Subject:* EXT :Re: [EXTERNAL] Re: How to restart/recover on reboot?
>>
>>
>>
>> Hi Till, John,
>>
>>
>>
>> I do agree with the issue John mentioned and have the same problem.
>>
>>
>>
>> We can only *start* a standalone HA cluster with the ./start-cluster.sh
>> script. Then, when there are failures, we can *restart* those
>> components individually by calling jobmanager.sh / taskmanager.sh. This
>> works great.
>>
>> But, like John mentioned, if we want to start the cluster initially
>> by running jobmanager.sh on each JobManager node, it is not
>> working. It binds to localhost and does not form the HA cluster.
>>
>>
>>
>> Thanks,
>>
>> Shakir
>>
>>
>>
>> *From: *Till Rohrmann <tr...@apache.org>
>> *Date: *Tuesday, June 18, 2019 at 4:23 PM
>> *To: *John Smith <ja...@gmail.com>
>> *Cc: *user <us...@flink.apache.org>
>> *Subject: *[EXTERNAL] Re: How to restart/recover on reboot?
>>
>>
>>
>> I guess it should work if you installed a systemd service which simply
>> calls `jobmanager.sh start` or `taskmanager.sh start`.
>>
>>
>>
>> Cheers,
>>
>> Till
>>
>>
>>
>> On Tue, Jun 18, 2019 at 4:29 PM John Smith <ja...@gmail.com>
>> wrote:
>>
>> Yes, that is understood. But I don't see why we cannot call jobmanager.sh
>> and taskmanager.sh to build the cluster and have them run as systemd units.
>>
>> I looked at start-cluster.sh and all it does is SSH in and call
>> jobmanager.sh, which then cascades to taskmanager.sh. I just have to pinpoint
>> what's missing to have the systemd service working. In fact, calling
>> jobmanager.sh as a systemd service actually sees the shared masters, slaves,
>> and flink-conf.yaml. But it binds to localhost.
>>
>>
>>
>> Maybe one way to do it would be to bootstrap the cluster with
>> ./start-cluster.sh and then install systemd services for jobmanager.sh and
>> taskmanager.sh.
>>
>>
>>
>> Like I said, I don't want to have to put a process in place to remind admins
>> that they need to manually start a node every time they patch or a host goes
>> down for whatever reason.
>>
>>
>>
>> On Tue, 18 Jun 2019 at 04:31, Till Rohrmann <tr...@apache.org> wrote:
>>
>> When a single machine fails you should rather call `taskmanager.sh
>> start`/`jobmanager.sh start` to start a single process. `start-cluster.sh`
>> will start multiple processes on different machines.
>>
>>
>>
>> Cheers,
>>
>> Till
>>
>>
>>
>> On Mon, Jun 17, 2019 at 4:30 PM John Smith <ja...@gmail.com>
>> wrote:
>>
>> Well, some reasons: machine reboots/maintenance etc... Host/VM crashes and
>> restarts. And the same goes for the job manager. I don't want/need to have to
>> document/remember some start process for sysadmins/devops.
>>
>> So far I have looked at ./start-cluster.sh, and all it seems to do is SSH
>> into all the specified nodes and start the processes using the jobmanager
>> and taskmanager scripts. I don't see anything special in any of the sh
>> scripts.
>> I configured passwordless ssh through Terraform, and all of that works great;
>> it's only the manual start through systemd that doesn't. I may have
>> something missing...
>>
>>
>>
>> On Mon, 17 Jun 2019 at 09:41, Till Rohrmann <tr...@apache.org> wrote:
>>
>> Hi John,
>>
>>
>>
>> I don't have much experience with setting Flink up via systemd services. Why
>> do you want to do it like that?
>>
>>
>>
>> 1. In standalone mode, Flink won't automatically restart TaskManagers.
>> This only works on Yarn and Mesos atm.
>>
>> 2. In case of a lost TaskManager, you should run `taskmanager.sh start`.
>> This script simply starts a new TaskManager process.
>>
>> 3. I guess you could use systemd to bring up a Flink TaskManager process
>> on start up.
>>
>>
>>
>> Cheers,
>>
>> Till
>>
>>
>>
>> On Fri, Jun 14, 2019 at 5:56 PM John Smith <ja...@gmail.com>
>> wrote:
>>
>> I looked into start-cluster.sh and I don't see anything special. So
>> technically it should be as easy as installing systemd services to run
>> jobmanager.sh and taskmanager.sh respectively?
>>
>>
>>
>> On Wed, 12 Jun 2019 at 13:02, John Smith <ja...@gmail.com> wrote:
>>
>> The installation instructions do not indicate how to create systemd
>> services.
>>
>>
>>
>> 1- When task nodes fail, will the job leader detect this and ssh and
>> restart the task node? From my testing it doesn't seem like it.
>>
>> 2- How do we recover a lost node? Do we simply go back to the master node
>> and run start-cluster.sh and the script is smart enough to figure out what
>> is missing?
>>
>> 3- Or do we need to create systemd services, and if so, which command should
>> we start the service with?
>>
>>
>> ------------------------------
>>
>> Notice: This e-mail is intended solely for use of the individual or
>> entity to which it is addressed and may contain information that is
>> proprietary, privileged and/or exempt from disclosure under applicable law.
>> If the reader is not the intended recipient or agent responsible for
>> delivering the message to the intended recipient, you are hereby notified
>> that any dissemination, distribution or copying of this communication is
>> strictly prohibited. This communication may also contain data subject to
>> U.S. export laws. If so, data subject to the International Traffic in Arms
>> Regulation cannot be disseminated, distributed, transferred, or copied,
>> whether incorporated or in its original form, to foreign nationals residing
>> in the U.S. or abroad, absent the express prior approval of the U.S.
>> Department of State. Data subject to the Export Administration Act may not
>> be disseminated, distributed, transferred or copied contrary to U. S.
>> Department of Commerce regulations. If you have received this communication
>> in error, please notify the sender by reply e-mail and destroy the e-mail
>> message and any physical copies made of the communication.
>>  Thank you.
>> *********************
>>
>

Re: [EXTERNAL] Re: How to restart/recover on reboot?

Posted by John Smith <ja...@gmail.com>.
Ah ok, we need to pass --host. The command line help says jobmanager.sh
<host>?!?! If I recall. I have to go check tomorrow...


Re: [EXTERNAL] Re: How to restart/recover on reboot?

Posted by "PoolakkalMukkath, Shakir" <Sh...@comcast.com>.
Hi Nick,

It works that way by explicitly setting the –host. I got misled by the “only” word in the doc and did not try. Thanks for the help.

Thanks,
Shakir

RE: [EXTERNAL] Re: How to restart/recover on reboot?

Posted by "Martin, Nick" <Ni...@ngc.com>.
Jobmanager.sh takes an optional argument for the hostname to bind to, and start-cluster uses it. If you leave it blank, the script will use whatever is in flink-conf.yaml (localhost is the default value that ships with Flink).

The dockerized version of flink runs pretty much the way you’re trying to operate (i.e. each node starts itself), so the entrypoint script out of that is probably a good source of information about how to set it up.
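The behavior Nick describes can be sketched as a small start script. This assumes an install under /opt/flink (a hypothetical path) and that this Flink version takes the bind host as a positional argument to jobmanager.sh, as the usage string quoted elsewhere in the thread suggests; check `jobmanager.sh --help` on your version.

```shell
# Sketch: start a JobManager bound to this node's name instead of the
# localhost default from flink-conf.yaml. FLINK_HOME is an assumed path.
FLINK_HOME="${FLINK_HOME:-/opt/flink}"
BIND_HOST="$(hostname -f)"   # this node's resolvable hostname
echo "starting jobmanager bound to ${BIND_HOST}"
# Guarded so the sketch is a no-op on machines where Flink is not installed.
if [ -x "${FLINK_HOME}/bin/jobmanager.sh" ]; then
  "${FLINK_HOME}/bin/jobmanager.sh" start "${BIND_HOST}"
fi
```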


Re: [EXTERNAL] Re: How to restart/recover on reboot?

Posted by "PoolakkalMukkath, Shakir" <Sh...@comcast.com>.
Hi Till, John,

I do agree with the issue John mentioned and have the same problem.

We can only start a standalone HA cluster with the ./start-cluster.sh script. Then, when there are failures, we can restart those components individually by calling jobmanager.sh / taskmanager.sh. This works great.

But, like John mentioned, if we want to start the cluster initially by running jobmanager.sh on each JobManager node, it is not working. It binds to localhost and does not form the HA cluster.

Thanks,
Shakir
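What start-cluster.sh does under the hood can be sketched roughly like this. It is a simplification with assumed paths and hypothetical host names; the real script also handles workers, ports, and config parsing.

```shell
# Simplified sketch of the start-cluster.sh idea: read the masters file and
# start a JobManager on each listed host, passing the host name so it does
# not bind to localhost. /opt/flink is an assumed path.
read_masters() {
  # first field of each non-blank, non-comment line is host[:webui-port]
  awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

masters_file="$(mktemp)"
printf 'node1.example.com:8081\nnode2.example.com:8081\n' > "${masters_file}"
for entry in $(read_masters "${masters_file}"); do
  host="${entry%%:*}"   # drop the web UI port if present
  echo "would run: ssh ${host} /opt/flink/bin/jobmanager.sh start ${host}"
done
rm -f "${masters_file}"
```

The point of the loop is the one John observed: each SSH invocation passes the target host name along, which is exactly what a per-node systemd start has to replicate.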


Re: How to restart/recover on reboot?

Posted by Till Rohrmann <tr...@apache.org>.
I guess it should work if you installed a systemd service which simply
calls `jobmanager.sh start` or `taskmanager.sh start`.

Cheers,
Till
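
[Editor's note: a minimal unit along those lines might look like the sketch below. The unit name, install path, and service user are assumptions, not something prescribed by Flink; and since taskmanager.sh daemonizes the process, `Type=forking` may need a `PIDFile=` or similar tuning on your distribution.]

```ini
# /etc/systemd/system/flink-taskmanager.service  (hypothetical name and path)
[Unit]
Description=Apache Flink TaskManager
After=network-online.target

[Service]
Type=forking
User=flink
# taskmanager.sh start launches the TaskManager as a background daemon.
ExecStart=/opt/flink/bin/taskmanager.sh start
ExecStop=/opt/flink/bin/taskmanager.sh stop
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After placing the file, `sudo systemctl daemon-reload` followed by `sudo systemctl enable --now flink-taskmanager` would start the service and bring it back up on every boot.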


Re: How to restart/recover on reboot?

Posted by John Smith <ja...@gmail.com>.
Yes, that is understood. But I don't see why we can't call jobmanager.sh
and taskmanager.sh to build the cluster and have them run as systemd units.

I looked at start-cluster.sh, and all it does is SSH to each node and call
jobmanager.sh, which then cascades to taskmanager.sh. I just have to
pinpoint what's missing to get the systemd service working. In fact,
calling jobmanager.sh as a systemd service does see the shared masters,
slaves and flink-conf.yaml files. But it binds to localhost.

Maybe one way to do it would be to bootstrap the cluster with
./start-cluster.sh and then install systemd services for jobmanager.sh and
taskmanager.sh.

Like I said, I don't want to have to put a process in place to remind
admins that they need to manually start a node every time they patch or a
host goes down for whatever reason.
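
[Editor's note: on the localhost binding, later messages in this thread report that passing the host to jobmanager.sh explicitly makes it bind correctly. A hedged sketch of a matching unit follows; the unit name, install path, and user are assumptions, `%H` is systemd's specifier for the machine's hostname, and depending on the Flink version the host may be a positional argument rather than `--host`, so check the script's usage output.]

```ini
# /etc/systemd/system/flink-jobmanager.service  (hypothetical name and path)
[Unit]
Description=Apache Flink JobManager
After=network-online.target

[Service]
Type=forking
User=flink
# %H expands to this machine's hostname, so the JobManager binds to it
# instead of the localhost default.
ExecStart=/opt/flink/bin/jobmanager.sh start --host %H
ExecStop=/opt/flink/bin/jobmanager.sh stop
Restart=on-failure

[Install]
WantedBy=multi-user.target
```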


Re: How to restart/recover on reboot?

Posted by Till Rohrmann <tr...@apache.org>.
When a single machine fails you should instead call `taskmanager.sh
start`/`jobmanager.sh start` to start a single process. `start-cluster.sh`
will start multiple processes on different machines.

Cheers,
Till


Re: How to restart/recover on reboot?

Posted by John Smith <ja...@gmail.com>.
Well, some reasons: machine reboots/maintenance, host/VM crashes and
restarts, etc. And the same goes for the job manager. I don't want to have
to document/remember some start process for sysadmins/devops.

So far I have looked at ./start-cluster.sh, and all it seems to do is SSH
into all the specified nodes and start the processes using the jobmanager
and taskmanager scripts. I don't see anything special in any of the sh
scripts.
I configured passwordless SSH through Terraform, and all that works great;
it's only the start through systemd that doesn't. I may have something
missing...
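
[Editor's note: as a rough illustration of the behaviour described above, the core of start-cluster.sh can be sketched like this. This is a simplified, non-authoritative reading: the real script also parses flink-conf.yaml, supports HA masters, and so on, and here the ssh calls are only echoed so nothing is actually started.]

```shell
# Hypothetical sketch of start-cluster.sh's core loop; the Flink home
# directory and the slaves file are stand-ins supplied by the caller.
sketch_start_cluster() {
    flink_home=$1
    slaves_file=$2
    # Start the JobManager on the local machine...
    echo "$flink_home/bin/jobmanager.sh start"
    # ...then SSH into every host listed in conf/slaves and start a TaskManager.
    while read -r host; do
        echo "ssh $host $flink_home/bin/taskmanager.sh start"
    done < "$slaves_file"
}

# Demo with a stand-in slaves file:
tmp=$(mktemp)
printf 'worker-1\nworker-2\n' > "$tmp"
sketch_start_cluster /opt/flink "$tmp"
rm -f "$tmp"
```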




Re: How to restart/recover on reboot?

Posted by Till Rohrmann <tr...@apache.org>.
Hi John,

I don't have much experience setting Flink up via systemd services. Why
do you want to do it that way?

1. In standalone mode, Flink won't automatically restart TaskManagers. At
the moment this only works on YARN and Mesos.
2. In case of a lost TaskManager, you should run `taskmanager.sh start`.
This script simply starts a new TaskManager process.
3. I guess you could use systemd to bring up a Flink TaskManager process on
start-up.

Cheers,
Till


Re: How to restart/recover on reboot?

Posted by John Smith <ja...@gmail.com>.
I looked into start-cluster.sh and I don't see anything special. So
technically it should be as easy as installing systemd services to run
jobmanager.sh and taskmanager.sh respectively?
