Posted to user@flink.apache.org by Julio Biason <ju...@azion.com> on 2018/05/02 12:52:29 UTC

Cannot submit jobs on a HA Standalone JobManager

Hello all,

I'm building a standalone cluster with HA JobManager. So far, everything
seems to work, but when I try to `flink run` my job, it fails with the
following error:

Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
Could not retrieve the leader gateway.

So far, I have two different machines running the JobManager and, looking
at the logs, I can't see any problem whatsoever to explain why the flink
command is refusing to run the pipeline...

Any ideas where I should look?

-- 
*Julio Biason*, Software Engineer
*AZION*  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*

Re: Cannot submit jobs on a HA Standalone JobManager

Posted by Gary Yao <ga...@data-artisans.com>.
Hi Julio,

I agree that the job submission should work in HA mode if you manually
specify the JobManager. At a minimum, a proper error message should be
shown. Feel free to open an issue in JIRA.

You already stated that you can maintain multiple configuration directories
as a workaround. It is possible to switch between them by setting the
FLINK_CONF_DIR environment variable, e.g.,

  FLINK_CONF_DIR=/path/to/conf-dir-1 bin/flink run ...
  FLINK_CONF_DIR=/path/to/conf-dir-2 bin/flink run ...
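
To cut down on the switching friction, a small wrapper can pick the conf
directory by environment name. This is only a sketch; the base path and the
dev/prod directory names below are made up for illustration:

```shell
#!/bin/sh
# Sketch of an environment-aware wrapper around `bin/flink`.
# CONF_BASE and the per-environment directory names are hypothetical.
CONF_BASE="${CONF_BASE:-/opt/flink}"

conf_dir_for() {
    # Map an environment name to its Flink configuration directory.
    case "$1" in
        dev)  echo "$CONF_BASE/conf-dev" ;;   # local setup, no HA
        prod) echo "$CONF_BASE/conf-ha" ;;    # HA setup with ZooKeeper
        *)    echo "$CONF_BASE/conf" ;;       # fall back to the default dir
    esac
}

# Example invocation (assumes bin/flink exists on the client machine):
#   FLINK_CONF_DIR="$(conf_dir_for prod)" bin/flink run my-job.jar
conf_dir_for dev
```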

Beginning with 1.5 this should be a non-issue, because job submission
happens through HTTP and every non-leading master redirects requests to
the leading master.

Best,
Gary

On Thu, May 3, 2018 at 10:23 PM, Julio Biason <ju...@azion.com>
wrote:

> Hey Gary (again),
>
> Yup, that worked. Now I can launch apps again.
>
> ... but that's not something actually good.
>
> I mean, I have my own test environment, which doesn't need HA -- after
> all, I don't need to worry about this, this is a framework job, not my
> pipeline job. Which means now I'll need to either keep two different
> configuration files and keep switching between them -- because the `flink`
> command doesn't accept a configuration file (or, at least, it's not
> listed on `--help`) or I'll have to first copy to the prod/staging machines
> and then run there -- which seems a waste, since it seems using `flink run
> -m` already adds the file in the blob server and then runs and, doing the
> copy-to-machine step means there are two copies going on.
>
> I mean, if I say "hey flink, run this job _there_", flink should be smart
> enough to read how "there" is running things and adjust itself. The
> environment which started the run may not follow the same rules as the
> target run machine.
>
> ... and, in the end, it seems this is mostly useless discussion, as
> JobManager is changing completely on 1.5 -- but I kinda worry if I will
> have the same issue with the new ResourceManager...
>
> On Thu, May 3, 2018 at 11:00 AM, Julio Biason <ju...@azion.com>
> wrote:
>
>> Hey Gary,
>>
>> Yes, I was still running with the `-m` flag on my dev machine --
>> partially configured like prod, but without the HA stuff. I never thought
>> it could be a problem, since even the web interface can redirect from the
>> secondary back to primary.
>>
>> Currently I'm still running 1.4.0 (and I plan to upgrade to 1.4.2 as soon
>> as I can fix this).
>>
>> I'll try again with the HA/ZooKeeper properly set up on my machine and,
>> if it still balks, I'll send the (updated) logs.
>>
>> On Thu, May 3, 2018 at 9:36 AM, Gary Yao <ga...@data-artisans.com> wrote:
>>
>>> Hi Julio,
>>>
>>> Are you using the -m flag of "bin/flink run" by any chance? In HA mode,
>>> you
>>> cannot manually specify the JobManager address. The client determines
>>> the leader
>>> through ZooKeeper. If you did not configure the ZooKeeper quorum in the
>>> flink-conf.yaml on the machine from which you are submitting, this might
>>> explain
>>> the error message.
>>>
>>> > But that didn't solve my problem. So far, the `flink run` still fails
>>> with the same message (I'm adding the full stacktrace of the failure in the
>>> end, just in case), but now I'm also seeing this message in the JobManager
>>> logs:
>>> Unfortunately, the error message in your previous email is different. If
>>> the
>>> above does not solve your problem, can you attach the logs of the client
>>> and
>>> JobManager?
>>>
>>> Lastly, what Flink version are you running?
>>>
>>> Best,
>>> Gary
>>>
>>> On Wed, May 2, 2018 at 6:51 PM, Julio Biason <ju...@azion.com>
>>> wrote:
>>>
>>>> Hey guys and gals,
>>>>
>>>> So, after a bit more digging, I found out that once HA is enabled,
>>>> `jobmanager.rpc.port` is also ignored (along with `jobmanager.rpc.address`,
>>>> but I was expecting this). Because I set the `high-availability.jobmanager.port`
>>>> to `50010-50015`, my RPC port also changed (the docs made me think this
>>>> would only affect the HA communication, not ALL communications). This can
>>>> be checked on the Dashboard, under the JobManager configuration option.
>>>>
>>>> But that didn't solve my problem. So far, the `flink run` still fails
>>>> with the same message (I'm adding the full stacktrace of the failure in the
>>>> end, just in case), but now I'm also seeing this message in the JobManager
>>>> logs:
>>>>
>>>> 2018-05-02 16:44:32,373 WARN  org.apache.flink.runtime.jobmanager.JobManager                - Discard message
>>>> LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId:
>>>> 42a25752ab085117a21c02d3db54777e),DETACHED)) because the expected leader
>>>> session ID c01eba4f-44e2-4c65-85d5-a9a05ceba28e did not equal the received
>>>> leader session ID 00000000-0000-0000-0000-000000000000.
>>>>
>>>>
>>>> So, I'm still lost on where to go forward.
>>>>
>>>>
>>>> Failure when using `flink run`:
>>>>
>>>> org.apache.flink.client.program.ProgramInvocationException: The
>>>> program execution failed: JobManager did not respond within 60000 ms
>>>>
>>>>         at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
>>>>         at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
>>>>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
>>>>         at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
>>>>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
>>>>         at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
>>>>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
>>>>         at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
>>>>         at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
>>>>         at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>         at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>>>>         at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>         at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
>>>> Caused by: org.apache.flink.runtime.client.JobTimeoutException:
>>>> JobManager did not respond within 60000 ms
>>>>         at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
>>>>         at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
>>>>         ... 14 more
>>>> Caused by: java.util.concurrent.TimeoutException
>>>>         at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>>>>         at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>>>>         at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
>>>>         ... 15 more
>>>>
>>>>
>>>> On Wed, May 2, 2018 at 9:52 AM, Julio Biason <ju...@azion.com>
>>>> wrote:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I'm building a standalone cluster with HA JobManager. So far,
>>>>> everything seems to work, but when I try to `flink run` my job, it fails
>>>>> with the following error:
>>>>>
>>>>> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
>>>>> Could not retrieve the leader gateway.
>>>>>
>>>>> So far, I have two different machines running the JobManager and,
>>>>> looking at the logs, I can't see any problem whatsoever to explain why the
>>>>> flink command is refusing to run the pipeline...
>>>>>
>>>>> Any ideas where I should look?
>>>>>
>>>>> --
>>>>> *Julio Biason*, Software Engineer
>>>>> *AZION*  |  Deliver. Accelerate. Protect.
>>>>> Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *Julio Biason*, Software Engineer
>>>> *AZION*  |  Deliver. Accelerate. Protect.
>>>> Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*
>>>>
>>>
>>>
>>
>>
>> --
>> *Julio Biason*, Software Engineer
>> *AZION*  |  Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*
>>
>
>
>
> --
> *Julio Biason*, Software Engineer
> *AZION*  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*
>

Re: Cannot submit jobs on a HA Standalone JobManager

Posted by Julio Biason <ju...@azion.com>.
Hey Gary (again),

Yup, that worked. Now I can launch apps again.

... but that's not actually a good thing.

I mean, I have my own test environment, which doesn't need HA -- after all,
that's the framework's concern, not my pipeline's. Which means I'll now need
to either keep two different configuration files and keep switching between
them -- because the `flink` command doesn't accept a configuration file (or,
at least, one isn't listed in `--help`) -- or first copy the jar to the
prod/staging machines and run it there. That seems wasteful, since `flink
run -m` apparently already uploads the file to the blob server before
running it, so the copy-to-machine step means two copies are going on.

I mean, if I say "hey flink, run this job _there_", flink should be smart
enough to read how "there" is running things and adjust itself. The
environment that starts the run may not follow the same rules as the target
machine.

... and, in the end, it seems this is mostly a moot discussion, as the
JobManager is changing completely in 1.5 -- but I kinda worry that I'll have
the same issue with the new ResourceManager...

On Thu, May 3, 2018 at 11:00 AM, Julio Biason <ju...@azion.com>
wrote:

> Hey Gary,
>
> Yes, I was still running with the `-m` flag on my dev machine -- partially
> configured like prod, but without the HA stuff. I never thought it could be
> a problem, since even the web interface can redirect from the secondary
> back to primary.
>
> Currently I'm still running 1.4.0 (and I plan to upgrade to 1.4.2 as soon
> as I can fix this).
>
> I'll try again with the HA/ZooKeeper properly set up on my machine and, if
> it still balks, I'll send the (updated) logs.
>
> On Thu, May 3, 2018 at 9:36 AM, Gary Yao <ga...@data-artisans.com> wrote:
>
>> Hi Julio,
>>
>> Are you using the -m flag of "bin/flink run" by any chance? In HA mode,
>> you
>> cannot manually specify the JobManager address. The client determines the
>> leader
>> through ZooKeeper. If you did not configure the ZooKeeper quorum in the
>> flink-conf.yaml on the machine from which you are submitting, this might
>> explain
>> the error message.
>>
>> > But that didn't solve my problem. So far, the `flink run` still fails
>> with the same message (I'm adding the full stacktrace of the failure in the
>> end, just in case), but now I'm also seeing this message in the JobManager
>> logs:
>> Unfortunately, the error message in your previous email is different. If
>> the
>> above does not solve your problem, can you attach the logs of the client
>> and
>> JobManager?
>>
>> Lastly, what Flink version are you running?
>>
>> Best,
>> Gary
>>
>> On Wed, May 2, 2018 at 6:51 PM, Julio Biason <ju...@azion.com>
>> wrote:
>>
>>> Hey guys and gals,
>>>
>>> So, after a bit more digging, I found out that once HA is enabled,
>>> `jobmanager.rpc.port` is also ignored (along with `jobmanager.rpc.address`,
>>> but I was expecting this). Because I set the `high-availability.jobmanager.port`
>>> to `50010-50015`, my RPC port also changed (the docs made me think this
>>> would only affect the HA communication, not ALL communications). This can
>>> be checked on the Dashboard, under the JobManager configuration option.
>>>
>>> But that didn't solve my problem. So far, the `flink run` still fails
>>> with the same message (I'm adding the full stacktrace of the failure in the
>>> end, just in case), but now I'm also seeing this message in the JobManager
>>> logs:
>>>
>>> 2018-05-02 16:44:32,373 WARN  org.apache.flink.runtime.jobmanager.JobManager                - Discard message
>>> LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId:
>>> 42a25752ab085117a21c02d3db54777e),DETACHED)) because the expected leader
>>> session ID c01eba4f-44e2-4c65-85d5-a9a05ceba28e did not equal the received
>>> leader session ID 00000000-0000-0000-0000-000000000000.
>>>
>>>
>>> So, I'm still lost on where to go forward.
>>>
>>>
>>> Failure when using `flink run`:
>>>
>>> org.apache.flink.client.program.ProgramInvocationException: The program
>>> execution failed: JobManager did not respond within 60000 ms
>>>
>>>         at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
>>>         at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
>>>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
>>>         at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
>>>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
>>>         at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
>>>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
>>>         at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
>>>         at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
>>>         at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>         at javax.security.auth.Subject.doAs(Subject.java:422)
>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>>>         at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>         at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
>>> Caused by: org.apache.flink.runtime.client.JobTimeoutException:
>>> JobManager did not respond within 60000 ms
>>>         at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
>>>         at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
>>>         ... 14 more
>>> Caused by: java.util.concurrent.TimeoutException
>>>         at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>>>         at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>>>         at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
>>>         ... 15 more
>>>
>>>
>>> On Wed, May 2, 2018 at 9:52 AM, Julio Biason <ju...@azion.com>
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I'm building a standalone cluster with HA JobManager. So far,
>>>> everything seems to work, but when I try to `flink run` my job, it fails
>>>> with the following error:
>>>>
>>>> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
>>>> Could not retrieve the leader gateway.
>>>>
>>>> So far, I have two different machines running the JobManager and,
>>>> looking at the logs, I can't see any problem whatsoever to explain why the
>>>> flink command is refusing to run the pipeline...
>>>>
>>>> Any ideas where I should look?
>>>>
>>>> --
>>>> *Julio Biason*, Software Engineer
>>>> *AZION*  |  Deliver. Accelerate. Protect.
>>>> Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*
>>>>
>>>
>>>
>>>
>>> --
>>> *Julio Biason*, Software Engineer
>>> *AZION*  |  Deliver. Accelerate. Protect.
>>> Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*
>>>
>>
>>
>
>
> --
> *Julio Biason*, Software Engineer
> *AZION*  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*
>



-- 
*Julio Biason*, Software Engineer
*AZION*  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*

Re: Cannot submit jobs on a HA Standalone JobManager

Posted by Julio Biason <ju...@azion.com>.
Hey Gary,

Yes, I was still running with the `-m` flag on my dev machine -- partially
configured like prod, but without the HA stuff. I never thought it could be
a problem, since even the web interface can redirect from the secondary
back to the primary.

Currently I'm still running 1.4.0 (and I plan to upgrade to 1.4.2 as soon
as I can fix this).

I'll try again with the HA/ZooKeeper properly set up on my machine and, if
it still balks, I'll send the (updated) logs.

On Thu, May 3, 2018 at 9:36 AM, Gary Yao <ga...@data-artisans.com> wrote:

> Hi Julio,
>
> Are you using the -m flag of "bin/flink run" by any chance? In HA mode, you
> cannot manually specify the JobManager address. The client determines the
> leader
> through ZooKeeper. If you did not configure the ZooKeeper quorum in the
> flink-conf.yaml on the machine from which you are submitting, this might
> explain
> the error message.
>
> > But that didn't solve my problem. So far, the `flink run` still fails
> with the same message (I'm adding the full stacktrace of the failure in the
> end, just in case), but now I'm also seeing this message in the JobManager
> logs:
> Unfortunately, the error message in your previous email is different. If
> the
> above does not solve your problem, can you attach the logs of the client
> and
> JobManager?
>
> Lastly, what Flink version are you running?
>
> Best,
> Gary
>
> On Wed, May 2, 2018 at 6:51 PM, Julio Biason <ju...@azion.com>
> wrote:
>
>> Hey guys and gals,
>>
>> So, after a bit more digging, I found out that once HA is enabled,
>> `jobmanager.rpc.port` is also ignored (along with `jobmanager.rpc.address`,
>> but I was expecting this). Because I set the `high-availability.jobmanager.port`
>> to `50010-50015`, my RPC port also changed (the docs made me think this
>> would only affect the HA communication, not ALL communications). This can
>> be checked on the Dashboard, under the JobManager configuration option.
>>
>> But that didn't solve my problem. So far, the `flink run` still fails
>> with the same message (I'm adding the full stacktrace of the failure in the
>> end, just in case), but now I'm also seeing this message in the JobManager
>> logs:
>>
>> 2018-05-02 16:44:32,373 WARN  org.apache.flink.runtime.jobmanager.JobManager                - Discard message
>> LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId:
>> 42a25752ab085117a21c02d3db54777e),DETACHED)) because the expected leader
>> session ID c01eba4f-44e2-4c65-85d5-a9a05ceba28e did not equal the received
>> leader session ID 00000000-0000-0000-0000-000000000000.
>>
>>
>> So, I'm still lost on where to go forward.
>>
>>
>> Failure when using `flink run`:
>>
>> org.apache.flink.client.program.ProgramInvocationException: The program
>> execution failed: JobManager did not respond within 60000 ms
>>
>>         at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
>>         at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
>>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
>>         at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
>>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
>>         at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
>>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
>>         at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
>>         at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
>>         at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:422)
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>>         at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>         at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
>> Caused by: org.apache.flink.runtime.client.JobTimeoutException:
>> JobManager did not respond within 60000 ms
>>         at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
>>         at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
>>         ... 14 more
>> Caused by: java.util.concurrent.TimeoutException
>>         at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>>         at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>>         at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
>>         ... 15 more
>>
>>
>> On Wed, May 2, 2018 at 9:52 AM, Julio Biason <ju...@azion.com>
>> wrote:
>>
>>> Hello all,
>>>
>>> I'm building a standalone cluster with HA JobManager. So far, everything
>>> seems to work, but when I try to `flink run` my job, it fails with the
>>> following error:
>>>
>>> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
>>> Could not retrieve the leader gateway.
>>>
>>> So far, I have two different machines running the JobManager and,
>>> looking at the logs, I can't see any problem whatsoever to explain why the
>>> flink command is refusing to run the pipeline...
>>>
>>> Any ideas where I should look?
>>>
>>> --
>>> *Julio Biason*, Software Engineer
>>> *AZION*  |  Deliver. Accelerate. Protect.
>>> Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*
>>>
>>
>>
>>
>> --
>> *Julio Biason*, Software Engineer
>> *AZION*  |  Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*
>>
>
>


-- 
*Julio Biason*, Software Engineer
*AZION*  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*

Re: Cannot submit jobs on a HA Standalone JobManager

Posted by Gary Yao <ga...@data-artisans.com>.
Hi Julio,

Are you using the -m flag of "bin/flink run" by any chance? In HA mode,
you cannot manually specify the JobManager address. The client determines
the leader through ZooKeeper. If you did not configure the ZooKeeper quorum
in the flink-conf.yaml on the machine from which you are submitting, this
might explain the error message.
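
For reference, a minimal HA section in flink-conf.yaml looks roughly like
the following on a 1.4 standalone cluster; the hostnames and the storage
path are placeholders, not values from this thread:

```yaml
# Enable ZooKeeper-based high availability.
high-availability: zookeeper
# The ZooKeeper quorum the client uses to look up the current leader.
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
# Directory (e.g. on HDFS or NFS) where JobManager metadata is persisted.
high-availability.storageDir: hdfs:///flink/ha/
```

The client reads the same file, so this section must also be present on the
machine running `bin/flink run`.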

> But that didn't solve my problem. So far, the `flink run` still fails
> with the same message (I'm adding the full stacktrace of the failure in the
> end, just in case), but now I'm also seeing this message in the JobManager
> logs:

Unfortunately, the error message in your previous email is different. If the
above does not solve your problem, can you attach the logs of the client and
JobManager?

Lastly, what Flink version are you running?

Best,
Gary

On Wed, May 2, 2018 at 6:51 PM, Julio Biason <ju...@azion.com> wrote:

> Hey guys and gals,
>
> So, after a bit more digging, I found out that once HA is enabled,
> `jobmanager.rpc.port` is also ignored (along with `jobmanager.rpc.address`,
> but I was expecting this). Because I set the `high-availability.jobmanager.port`
> to `50010-50015`, my RPC port also changed (the docs made me think this
> would only affect the HA communication, not ALL communications). This can
> be checked on the Dashboard, under the JobManager configuration option.
>
> But that didn't solve my problem. So far, the `flink run` still fails with
> the same message (I'm adding the full stacktrace of the failure in the end,
> just in case), but now I'm also seeing this message in the JobManager logs:
>
> 2018-05-02 16:44:32,373 WARN  org.apache.flink.runtime.jobmanager.JobManager                - Discard message
> LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId:
> 42a25752ab085117a21c02d3db54777e),DETACHED)) because the expected leader
> session ID c01eba4f-44e2-4c65-85d5-a9a05ceba28e did not equal the received
> leader session ID 00000000-0000-0000-0000-000000000000.
>
>
> So, I'm still lost on where to go forward.
>
>
> Failure when using `flink run`:
>
> org.apache.flink.client.program.ProgramInvocationException: The program
> execution failed: JobManager did not respond within 60000 ms
>
>         at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
>         at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
>         at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
>         at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
>         at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
>         at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
>         at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
>         at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>         at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>         at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
> Caused by: org.apache.flink.runtime.client.JobTimeoutException:
> JobManager did not respond within 60000 ms
>         at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
>         at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
>         ... 14 more
> Caused by: java.util.concurrent.TimeoutException
>         at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>         at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>         at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
>         ... 15 more
>
>
> On Wed, May 2, 2018 at 9:52 AM, Julio Biason <ju...@azion.com>
> wrote:
>
>> Hello all,
>>
>> I'm building a standalone cluster with HA JobManager. So far, everything
>> seems to work, but when I try to `flink run` my job, it fails with the
>> following error:
>>
>> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
>> Could not retrieve the leader gateway.
>>
>> So far, I have two different machines running the JobManager and, looking
>> at the logs, I can't see any problem whatsoever to explain why the flink
>> command is refusing to run the pipeline...
>>
>> Any ideas where I should look?
>>
>> --
>> *Julio Biason*, Software Engineer
>> *AZION*  |  Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*
>>
>
>
>
> --
> *Julio Biason*, Software Engineer
> *AZION*  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*
>

Re: Cannot submit jobs on a HA Standalone JobManager

Posted by Julio Biason <ju...@azion.com>.
Hey guys and gals,

So, after a bit more digging, I found out that once HA is enabled,
`jobmanager.rpc.port` is also ignored (along with `jobmanager.rpc.address`,
but I was expecting that). Because I set
`high-availability.jobmanager.port` to `50010-50015`, my RPC port also
changed (the docs made me think this would only affect the HA
communication, not ALL communication). This can be checked on the
Dashboard, under the JobManager configuration option.
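
In config terms, the behaviour described above corresponds to a fragment
like this (a sketch reconstructed from the description, not a paste of the
actual file):

```yaml
# With HA enabled, this range is used for the JobManager RPC port too,
# not only for HA-internal communication -- which is what surprised me.
high-availability.jobmanager.port: 50010-50015
# Ignored once high availability is enabled:
jobmanager.rpc.port: 6123
```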

But that didn't solve my problem. So far, the `flink run` still fails with
the same message (I'm adding the full stacktrace of the failure in the end,
just in case), but now I'm also seeing this message in the JobManager logs:

2018-05-02 16:44:32,373 WARN  org.apache.flink.runtime.jobmanager.JobManager                - Discard message
LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId:
42a25752ab085117a21c02d3db54777e),DETACHED)) because the expected leader
session ID c01eba4f-44e2-4c65-85d5-a9a05ceba28e did not equal the received
leader session ID 00000000-0000-0000-0000-000000000000.


So, I'm still lost on where to go forward.


Failure when using `flink run`:

org.apache.flink.client.program.ProgramInvocationException: The program
execution failed: JobManager did not respond within 60000 ms

        at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
        at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
        at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
        at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
        at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
        at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.client.JobTimeoutException: JobManager
did not respond within 60000 ms
        at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
        at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
        ... 14 more
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
        at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
        ... 15 more


On Wed, May 2, 2018 at 9:52 AM, Julio Biason <ju...@azion.com> wrote:

> Hello all,
>
> I'm building a standalone cluster with HA JobManager. So far, everything
> seems to work, but when I try to `flink run` my job, it fails with the
> following error:
>
> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
> Could not retrieve the leader gateway.
>
> So far, I have two different machines running the JobManager and, looking
> at the logs, I can't see any problem whatsoever to explain why the flink
> command is refusing to run the pipeline...
>
> Any ideas where I should look?
>
> --
> *Julio Biason*, Software Engineer
> *AZION*  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*
>



-- 
*Julio Biason*, Software Engineer
*AZION*  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 *99907 0554*