Posted to user@flink.apache.org by Javier Vegas <jv...@strava.com> on 2021/09/27 17:37:48 UTC

Unable to connect to Mesos on mesos-appmaster.sh start

I am trying to start Flink 1.13.2 on Mesos following the instructions in
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
and using Marathon to deploy a Docker image with both the Flink and my
binaries.

My entrypoint for the Docker image is:


/opt/flink/bin/mesos-appmaster.sh \
      -Djobmanager.rpc.address=$HOSTNAME \
      -Dmesos.resourcemanager.framework.user=flink \
      -Dmesos.master=10.0.18.246:5050 \
      -Dmesos.resourcemanager.tasks.cpus=6



When mesos-appmaster.sh starts, in the stderr I see this:


I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker executor on 10.0.20.177
I0927 16:50:32.311394 801345 executor.cpp:186] Starting task tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
I0927 16:50:43.624439   328 sched.cpp:336] New master detected at master@10.0.18.246:5050
I0927 16:50:43.624779   328 sched.cpp:356] No credentials provided. Attempting to register without authentication


where the "New master detected" line is promising.

However, on the Flink UI I see only the jobmanager started, and there are
no task managers.  Getting into the Docker container, I see this in the log:

WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to connect to Mesos; still trying...


I have verified that from the container I can reach the Mesos master at
10.0.18.246:5050.


Does any other port besides the web UI port 5050 need to be open for
mesos-appmaster to connect with the Mesos master?
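
The reachability check mentioned above can be sketched as follows. This is a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are available in the image; can_connect is a hypothetical helper name, not part of Flink or Mesos. It only shows that the container can open a TCP connection to the master; it says nothing about whether the master can connect back to the container.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a TCP connection to host:port can be opened.
can_connect() {
  local host="$1" port="$2"
  # bash opens a TCP socket when redirecting to /dev/tcp/<host>/<port>;
  # timeout bounds the attempt so a filtered port does not hang the check.
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example: check the Mesos master address used in this thread.
if can_connect 10.0.18.246 5050; then
  echo "master reachable"
else
  echo "master NOT reachable"
fi
```

Running the same check in the opposite direction (from the master's host towards the address and port the container advertises) would cover the return path as well.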


In the appmaster log (attached) I see one exception, and I don't know
whether it is related to the Mesos connection problem:


java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
        at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
        at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
        at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
        at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
        at org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
        at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
        at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
        at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
        at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.base/java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
        at org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
        at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)




I am not trying (yet) to run in high availability mode, so I am not sure if
I need to have HADOOP_HOME set or not, but I don't see anything about
HADOOP_HOME in the Flink docs.



Any tips on how I can fix my Docker+Marathon+Mesos environment so Flink can
connect to my Mesos master?


Thanks,


Javier Vegas

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Javier Vegas <jv...@strava.com>.
Update: I fixed my HADOOP_HOME issue. Previously I was using the
flink-shaded-hadoop-2-uber jar; I dropped that and installed a full Hadoop
distribution in my Docker image. Unfortunately that didn't fix the Mesos
connection issue: I am still getting "Unable to connect to Mesos; still
trying". The new log is attached.


Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Javier Vegas <jv...@strava.com>.
So I think I am starting to understand the problem: it lies in the
jobmanager.rpc.* settings.

I am deploying Flink to a Docker container using Marathon. If on the Docker
container I execute

/opt/flink/bin/mesos-appmaster.sh -Djobmanager.rpc.address=$HOSTNAME

Flink binds to port 6123 (the default jobmanager port) on the Docker
container IP. The problem is that the Mesos master has no way to reach
the jobmanager, because that IP is on the Docker network and the Mesos
master can reach it only through the port mapping on the Docker host.

So if I run /opt/flink/bin/mesos-appmaster.sh
-Djobmanager.rpc.address=$HOST -Djobmanager.rpc.port=$PORT1, it looks
like Flink tries to bind to the mapped port (say 31114, mapped to
6123) on the host IP. Of course it cannot do that, because that IP
belongs to the host, not to the container. So it seems a Mesos+Docker
setup would need two sets of parameters: one for the address the
jobmanager binds to, and another for the address the jobmanager advertises
to the Mesos master. A high-availability setup would not solve this problem;
we would still need to publish the host IP and port to ZooKeeper, which
differ from the IP and port in use inside the container. Looking through the
Mesos configuration parameters for Flink, I don't see anything that could be
useful for this (there is a mesos.resourcemanager.tasks.hostname, but nothing
for the jobmanager).
Does my thinking make sense? Any suggestions for running Flink on
Marathon+Mesos+Docker? (By the way, I saw that Mesos support was removed in
1.14, so it looks like I should be looking for a different way to deploy Flink
anyway.)
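
The split between a bind address and an advertised address described above might look roughly like this. It is an untested sketch, assuming the jobmanager.bind-host and jobmanager.rpc.bind-port options from the Flink configuration reference are available in the Flink version in use, and that Marathon sets $HOST and $PORT1 as in this thread; build_args is a hypothetical helper, and the fallback values are the ones logged earlier in the thread. It only concerns Flink's own RPC endpoint; whether it also resolves the "Unable to connect to Mesos" warning is not established here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the address-related flags, one per line.
# Advertised values come from Marathon ($HOST, $PORT1); bind values are the
# container-local interface and the container port that $PORT1 maps to.
build_args() {
  local advertised_host="$1" advertised_port="$2"
  printf '%s\n' \
    "-Djobmanager.rpc.address=${advertised_host}" \
    "-Djobmanager.rpc.port=${advertised_port}" \
    "-Djobmanager.bind-host=0.0.0.0" \
    "-Djobmanager.rpc.bind-port=6123"
}

# Print the flags that would be passed to /opt/flink/bin/mesos-appmaster.sh.
build_args "${HOST:-10.0.20.25}" "${PORT1:-31608}"
```

The idea is that the first two flags carry what other processes should connect to, while the last two carry what the process inside the container can actually bind.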
Thanks,

Javier


On Thu, Sep 30, 2021 at 9:41 AM Javier Vegas <jv...@strava.com> wrote:

> This is my Marathon network configuration:
>
>   "portMappings": [
>     {
>       "containerPort": 8081,
>       "hostPort": 0,
>       "labels": {},
>       "protocol": "tcp",
>       "servicePort": 10756
>     },
>     {
>       "containerPort": 6123,
>       "hostPort": 0,
>       "labels": {},
>       "protocol": "tcp",
>       "servicePort": 10757
>     }
>
>
> so $PORT0 is the port mapped to 8081, and $PORT1 is the port mapped to
> 6123, which is the jobmanager.rpc.port default
>
> I am also using bridge for network mode
>
> "mode": "container/bridge"
>
>
> On Thu, Sep 30, 2021 at 12:18 AM Matthias Pohl <ma...@ververica.com>
> wrote:
>
>> Thanks for sharing. I was wondering why you don't use $PORT0 in your
>> command. And: Are the ports properly configured in the Marathon network
>> configuration [1]? But the error seems to be unrelated to that setting.
>> Other than that, I cannot see any other issue with the configuration. It
>> could be that the HOST IP is blocked?
>>
>> [1]
>> https://mesosphere.github.io/marathon/docs/ports.html#specifying-ports
>>
>> On Wed, Sep 29, 2021 at 7:07 PM Javier Vegas <jv...@strava.com> wrote:
>>
>>>
>>> Full appmaster log in debug mode is attached.
>>> My startup command was
>>> /opt/flink/bin/mesos-appmaster.sh \
>>>       -Drest.bind-port=8081 \
>>>       -Drest.port=8081 \
>>>       -Djobmanager.rpc.address=$HOST \
>>>       -Djobmanager.rpc.port=$PORT1 \
>>>       -Dmesos.resourcemanager.framework.user=flink \
>>>       -Dmesos.resourcemanager.framework.name=timeline-flink-populator \
>>>       -Dmesos.master=10.0.18.246:5050 \
>>>       -Dmesos.resourcemanager.tasks.cpus=4 \
>>>       -Dmesos.resourcemanager.tasks.container.type=docker \
>>>       -Dmesos.resourcemanager.tasks.container.image.name=docker.strava.com/strava/timeline-populator2:jv-mesos \
>>>       -Dtaskmanager.numberOfTaskSlots=4 ;
>>>
>>> where $PORT1 refers to my second host open port, mapped to 6123 on the
>>> Docker container (first port is mapped to 8081).
>>> I can see in the log that $HOST and $PORT1 resolve to the correct
>>> values, 10.0.20.25 and 31608
>>>
>>> On Wed, Sep 29, 2021 at 9:41 AM Matthias Pohl <ma...@ververica.com>
>>> wrote:
>>>
>>>> ...and if possible, it would be helpful to provide debug logs as well.
>>>>
>>>> On Wed, Sep 29, 2021 at 6:33 PM Matthias Pohl <ma...@ververica.com>
>>>> wrote:
>>>>
>>>>> Could you provide the entire JobManager logs so that we can see what's
>>>>> going on?
>>>>>
>>>>> On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas <jv...@strava.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks again, Matthias!
>>>>>>
>>>>>> Putting -Djobmanager.rpc.address=$HOST and
>>>>>> -Djobmanager.rpc.port=$PORT0 as params for mesos-appmaster.sh,
>>>>>> I see in the log that they resolve to the correct values
>>>>>>
>>>>>> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>>>>>>
>>>>>> but a bit later the appmaster dies with this new error. It is unclear
>>>>>> which address it is trying to bind to. I added explicit params
>>>>>> -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port
>>>>>> was somehow interfering, but that didn't help.
>>>>>>
>>>>>> 2021-09-29 10:29:59.845 [main] INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting MesosSessionClusterEntrypoint down with application status FAILED. Diagnostics java.net.BindException: Cannot assign requested address
>>>>>> 	at java.base/sun.nio.ch.Net.bind0(Native Method)
>>>>>> 	at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>>>>> 	at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>>>>> 	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>>>>> 	at java.base/java.lang.Thread.run(Unknown Source)
>>>>>>
>>>>>> On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl <ma...@ververica.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The port has its separate configuration parameter
>>>>>>> jobmanager.rpc.port [1]
>>>>>>>
>>>>>>> [1]
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>>>>>>>
>>>>>>> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas <jv...@strava.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Matthias, thanks for the suggestion! I changed my
>>>>>>>> jobmanager.rpc.address param from $HOSTNAME to $HOST:$PORT0 which in the
>>>>>>>> log I see resolves properly to the host IP and port mapped to 8081
>>>>>>>>
>>>>>>>> 2021-09-29 07:58:05.452 [main] INFO
>>>>>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
>>>>>>>> -Djobmanager.rpc.address=10.0.22.114:31894
>>>>>>>>
>>>>>>>> which is very promising. But sadly, a little later the appmaster
>>>>>>>> dies with this error:
>>>>>>>>
>>>>>>>> 2021-09-29 07:58:05.648 [main] INFO
>>>>>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
>>>>>>>> cluster services.
>>>>>>>> 2021-09-29 07:58:05.674 [main] INFO
>>>>>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
>>>>>>>> MesosSessionClusterEntrypoint down with application status FAILED.
>>>>>>>> Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
>>>>>>>> The configured hostname is not valid
>>>>>>>> at
>>>>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>>>>>> at
>>>>>>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>>>>>> at
>>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>>>>>> at java.base/java.security.AccessController.doPrivileged(Native
>>>>>>>> Method)
>>>>>>>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>>>>>> at
>>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>>>>>>> Caused by: java.lang.IllegalArgumentException
>>>>>>>> at
>>>>>>>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>>>>>> at
>>>>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>>>>>> ... 17 more
>>>>>>>> .
>>>>>>>> 2021-09-29 07:58:05.685 [main] ERROR
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Could not start
>>>>>>>> cluster entrypoint MesosSessionClusterEntrypoint.
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypointException:
>>>>>>>> Failed to initialize the cluster entrypoint MesosSessionClusterEntrypoint.
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>>>>>> at
>>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>>>>>>> Caused by:
>>>>>>>> org.apache.flink.configuration.IllegalConfigurationException: The
>>>>>>>> configured hostname is not valid
>>>>>>>> at
>>>>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>>>>>> at
>>>>>>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>>>>>> at
>>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>>>>>> at java.base/java.security.AccessController.doPrivileged(Native
>>>>>>>> Method)
>>>>>>>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>>>>>> at
>>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>>>>> at
>>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>>>>>> ... 2 common frames omitted
>>>>>>>> Caused by: java.lang.IllegalArgumentException: null
>>>>>>>> at
>>>>>>>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>>>>>> at
>>>>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>>>>>> ... 17 common frames omitted
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl <
>>>>>>>> matthias@ververica.com> wrote:
>>>>>>>>
>>>>>>>>> One thing that was puzzling me yesterday when reading your post:
>>>>>>>>> Have you tried $HOST instead of $HOSTNAME in the Marathon configuration?
>>>>>>>>> When I played around with Mesos, I remember using HOST to resolve the
>>>>>>>>> host's IP address instead of the host's name. It could be that the hostname
>>>>>>>>> itself cannot be resolved to the right IP address. But I struggled to find
>>>>>>>>> proper documentation to back that up. Only in the recipes section of the
>>>>>>>>> Marathon docs [1], HOST was used as well.
>>>>>>>>>
>>>>>>>>> Matthias
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
>>>>>>>>>
>>>>>>>>> On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas <jv...@strava.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Another update:  Looking more carefully in my appmaster log, I
>>>>>>>>>> see the following
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> - Registering as new framework.
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> - --------------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> -  Mesos Info:
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> -     Master URL: 10.0.18.246:5050
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> -  Framework Info:
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> -     ID: (none)
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> -     Name: flink-test
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> -     Failover Timeout (secs): 604800.0
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> -     Role: *
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> -     Capabilities: (none)
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> -     Principal: (none)
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> -     Host: 311dcf7fd77c
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> -     Web UI: http://311dcf7fd77c:8081
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> - --------------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> which is picking up the mesos.master and
>>>>>>>>>> mesos.resourcemanager.framework.name params I am passing to
>>>>>>>>>> mesos-appmaster.sh
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In my Mesos dashboard I can see the framework has been created
>>>>>>>>>> with the right name, but has no associated agents/tasks to it. So at least
>>>>>>>>>> Flink has been able to connect to the Mesos master to create the framework
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Later in the mesos-appmaster log is when I see the Mesos
>>>>>>>>>> connection errors:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> DEBUG o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager
>>>>>>>>>> - Starting the slot manager.
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2]
>>>>>>>>>> DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor  -
>>>>>>>>>> State change (StoppedState -> StoppedState) with data ()
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager
>>>>>>>>>> - Trigger heartbeat request.
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> DEBUG org.apache.flink.mesos.scheduler.ReconciliationCoordinator
>>>>>>>>>> - State change (Suspended -> Suspended) with data
>>>>>>>>>> ReconciliationData(Map(),0)
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager
>>>>>>>>>> - Trigger heartbeat request.
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  org.apache.flink.mesos.scheduler.ConnectionMonitor  -
>>>>>>>>>> Connecting to Mesos...
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor  -
>>>>>>>>>> State change (StoppedState -> ConnectingState) with data ()
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>>> - Mesos resource manager started.
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4]
>>>>>>>>>> DEBUG org.apache.flink.mesos.scheduler.LaunchCoordinator  -
>>>>>>>>>> State change (Suspended -> Suspended) with data GatherData(List(),List())
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4]
>>>>>>>>>> WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  -
>>>>>>>>>> Unable to connect to Mesos; still trying...
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager
>>>>>>>>>> - Trigger heartbeat request.
>>>>>>>>>>
>>>>>>>>>> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager
>>>>>>>>>> - Trigger heartbeat request.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> So why was the appmaster able to connect to the Mesos master to
>>>>>>>>>> create the framework, but then failed to connect for whatever it
>>>>>>>>>> does afterwards?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> One possible issue I see is that the framework is set with web UI
>>>>>>>>>> in http://311dcf7fd77c:8081 which can not be resolved from the
>>>>>>>>>> Mesos master. 311dcf7fd77c is the result of doing hostname on
>>>>>>>>>> the Docker container, and the Mesos master can not resolve that name. I
>>>>>>>>>> could try to replace the Docker container hostname with the Docker host
>>>>>>>>>> hostname, but the host port that gets mapped to 8081 on the container is a
>>>>>>>>>> random port that I can not know beforehand. Does Mesos master try to reach
>>>>>>>>>> Flink using that Web UI setting? Could this be the issue causing my
>>>>>>>>>> connection problem, or is this a red herring and the problem is a different
>>>>>>>>>> one?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Javier Vegas
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jv...@strava.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Matthias!
>>>>>>>>>>>
>>>>>>>>>>> There are lots of apps deployed to the Mesos cluster, the task
>>>>>>>>>>> manager itself is deployed to Mesos via Marathon.  In the Mesos log I can
>>>>>>>>>>> see the Job manager agent starting, but no error messages related to it. As
>>>>>>>>>>> you say, TaskManagers don't even have the chance to get confused about
>>>>>>>>>>> variables, since the Job Manager can not connect to the Mesos master to
>>>>>>>>>>> tell it to start the Task Managers.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Javier
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <
>>>>>>>>>>> matthias@ververica.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Javier,
>>>>>>>>>>>> I don't see anything that's configured in the wrong way based
>>>>>>>>>>>> on the jobmanager logs you've provided. Have you been able to deploy other
>>>>>>>>>>>> applications to this Mesos cluster? Do the Mesos master logs reveal
>>>>>>>>>>>> anything? The variable resolution on the TaskManager side is a valid
>>>>>>>>>>>> concern shared by Roman since it's easy to run into such an issue. But the
>>>>>>>>>>>> JobManager logs indicate that the JobManager is not able to contact the
>>>>>>>>>>>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>>>>>>>>>>>> not coming up.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Matthias
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <
>>>>>>>>>>>> roman@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> No additional ports need to be open as far as I know.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Probably, $HOSTNAME is substituted for something not
>>>>>>>>>>>>> resolvable on TMs?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please also make sure that the following gets executed before
>>>>>>>>>>>>> mesos-appmaster.sh:
>>>>>>>>>>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>>>>>>>>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>>>>>>>>>>> (as per the documentation you linked)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Roman
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <
>>>>>>>>>>>>> jvegas@strava.com> wrote:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I am trying to start Flink 1.13.2 on Mesos following the
>>>>>>>>>>>>> instructions in
>>>>>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>>>>>>>>>>>> and using Marathon to deploy a Docker image with both the Flink and my
>>>>>>>>>>>>> binaries.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > My entrypoint for the Docker image is:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > /opt/flink/bin/mesos-appmaster.sh \
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >       -Djobmanager.rpc.address=$HOSTNAME \
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >       -Dmesos.resourcemanager.framework.user=flink \
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >       -Dmesos.master=10.0.18.246:5050 \
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >       -Dmesos.resourcemanager.tasks.cpus=6
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor
>>>>>>>>>>>>> registered on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered
>>>>>>>>>>>>> docker executor on 10.0.20.177
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
>>>>>>>>>>>>> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > WARNING: Your kernel does not support swap limit
>>>>>>>>>>>>> capabilities or the cgroup is not mounted. Memory limited without swap.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > WARNING: An illegal reflective access operation has occurred
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > WARNING: Illegal reflective access by
>>>>>>>>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>>>>>>>> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
>>>>>>>>>>>>> sun.security.krb5.Config.getInstance()
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > WARNING: Please consider reporting this to the maintainers
>>>>>>>>>>>>> of org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > WARNING: Use --illegal-access=warn to enable warnings of
>>>>>>>>>>>>> further illegal reflective access operations
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > WARNING: All illegal access operations will be denied in a
>>>>>>>>>>>>> future release
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I0927 16:50:43.624439   328 sched.cpp:336] New master
>>>>>>>>>>>>> detected at master@10.0.18.246:5050
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials
>>>>>>>>>>>>> provided. Attempting to register without authentication
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > where the "New master detected" line is promising.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > However, on the Flink UI I see only the jobmanager started,
>>>>>>>>>>>>> and there are no task managers.  Getting into the Docker container, I see
>>>>>>>>>>>>> this in the log:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  -
>>>>>>>>>>>>> Unable to connect to Mesos; still trying...
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I have verified that from the container I can access the
>>>>>>>>>>>>> Mesos container 10.0.18.246:5050
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Does any other port besides the web UI port 5050 need to be
>>>>>>>>>>>>> open for mesos-appmaster to connect with the Mesos master?
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > In the appmaster log (attached) I see one exception, and I
>>>>>>>>>>>>> don't know whether it is related to the Mesos connection problem:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > java.io.FileNotFoundException: HADOOP_HOME and
>>>>>>>>>>>>> hadoop.home.dir are unset.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>>>>>>>>>>> Method)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
>>>>>>>>>>>>> Source)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
>>>>>>>>>>>>> Source)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at java.base/java.lang.reflect.Method.invoke(Unknown
>>>>>>>>>>>>> Source)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >         at
>>>>>>>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I am not trying (yet) to run in high availability mode, so I
>>>>>>>>>>>>> am not sure if I need to have HADOOP_HOME set or not, but I don't see
>>>>>>>>>>>>> anything about HADOOP_HOME in the Flink docs.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Any tips on how I can fix my Docker+Marathon+Mesos
>>>>>>>>>>>>> environment so Flink can connect to my Mesos master?
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Javier Vegas
>>>>>>>>>>>>> >
>>>>>>>>>>>>> >
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>
>>

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Javier Vegas <jv...@strava.com>.
This is my Marathon network configuration:

  "portMappings": [
    {
      "containerPort": 8081,
      "hostPort": 0,
      "labels": {},
      "protocol": "tcp",
      "servicePort": 10756
    },
    {
      "containerPort": 6123,
      "hostPort": 0,
      "labels": {},
      "protocol": "tcp",
      "servicePort": 10757
    }
  ]


So $PORT0 is the host port mapped to container port 8081, and $PORT1 is the
host port mapped to 6123, which is the default for jobmanager.rpc.port.
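
As a rough illustration of that numbering (a sketch only, not part of the actual deployment): Marathon numbers the host-port variables in the order of the portMappings entries, so the first entry becomes $PORT0 and the second $PORT1. The function name port_var_for and the CONTAINER_PORTS list below are made up for this example.

```shell
#!/bin/sh
# Container ports, in the same order as the portMappings entries above.
# (Hypothetical variable, written by hand for this sketch.)
CONTAINER_PORTS="8081 6123"

# Print the name of the Marathon port variable (PORT0, PORT1, ...) that
# holds the host port mapped to the given container port.
port_var_for() {
  wanted="$1"
  index=0
  for container_port in $CONTAINER_PORTS; do
    if [ "$container_port" = "$wanted" ]; then
      echo "PORT${index}"
      return 0
    fi
    index=$((index + 1))
  done
  echo "no mapping for container port ${wanted}" >&2
  return 1
}

port_var_for 8081   # prints PORT0 (Flink REST / web UI)
port_var_for 6123   # prints PORT1 (jobmanager.rpc.port default)
```

With the mappings above, the host port for the JobManager RPC endpoint is therefore the value of $PORT1, not $PORT0.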

I am also using bridge network mode:

"mode": "container/bridge"
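
One thing that might be worth checking in bridge mode (an assumption, not a confirmed diagnosis): $HOST is the agent's address, and it may not be assigned to any interface inside the container. Binding a socket to a non-local address normally fails with "Cannot assign requested address". The sketch below compares $HOST with the container's own addresses. The helper name is_local_address is made up, and `hostname -I` is assumed to be available in the image.

```shell
#!/bin/sh
# Return 0 if the first argument equals one of the remaining arguments.
is_local_address() {
  candidate="$1"
  shift
  for addr in "$@"; do
    if [ "$addr" = "$candidate" ]; then
      return 0
    fi
  done
  return 1
}

# Assumption: `hostname -I` exists in the image and lists the container's IPs.
# HOST is the variable Marathon sets to the agent's address.
if is_local_address "${HOST:-}" $(hostname -I 2>/dev/null); then
  echo "HOST=${HOST:-} is assigned to an interface in this container"
else
  echo "HOST=${HOST:-} is not a local address here; binding to it is expected to fail"
fi
```

Run inside the container, this only reports whether $HOST is a local address there; it does not change any Flink setting.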


On Thu, Sep 30, 2021 at 12:18 AM Matthias Pohl <ma...@ververica.com>
wrote:

> Thanks for sharing. I was wondering why you don't use $PORT0 in your
> command. And: Are the ports properly configured in the Marathon network
> configuration [1]? But the error seems to be unrelated to that setting.
> Other than that, I cannot see any other issue with the configuration. It
> could be that the HOST IP is blocked?
>
> [1] https://mesosphere.github.io/marathon/docs/ports.html#specifying-ports
>
> On Wed, Sep 29, 2021 at 7:07 PM Javier Vegas <jv...@strava.com> wrote:
>
>>
>> Full appmaster log in debug mode is attached.
>> My startup command was
>> /opt/flink/bin/mesos-appmaster.sh \
>>       -Drest.bind-port=8081 \
>>       -Drest.port=8081 \
>>       -Djobmanager.rpc.address=$HOST \
>>       -Djobmanager.rpc.port=$PORT1 \
>>       -Dmesos.resourcemanager.framework.user=flink \
>>       -Dmesos.resourcemanager.framework.name=timeline-flink-populator \
>>       -Dmesos.master=10.0.18.246:5050 \
>>       -Dmesos.resourcemanager.tasks.cpus=4 \
>>       -Dmesos.resourcemanager.tasks.container.type=docker \
>>       -Dmesos.resourcemanager.tasks.container.image.name=docker.strava.com/strava/timeline-populator2:jv-mesos \
>>       -Dtaskmanager.numberOfTaskSlots=4 ;
>>
>> where $PORT1 refers to my second host open port, mapped to 6123 on the
>> Docker container (first port is mapped to 8081).
>> I can see in the log that $HOST and $PORT1 resolve to the correct values, 10.0.20.25
>> and 31608
>>
>> On Wed, Sep 29, 2021 at 9:41 AM Matthias Pohl <ma...@ververica.com>
>> wrote:
>>
>>> ...and if possible, it would be helpful to provide debug logs as well.
>>>
>>> On Wed, Sep 29, 2021 at 6:33 PM Matthias Pohl <ma...@ververica.com>
>>> wrote:
>>>
>>>> Could you provide the entire JobManager log so that we can see what's
>>>> going on?
>>>>
>>>> On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas <jv...@strava.com>
>>>> wrote:
>>>>
>>>>> Thanks again, Matthias!
>>>>>
>>>>> Passing -Djobmanager.rpc.address=$HOST and
>>>>> -Djobmanager.rpc.port=$PORT0 as params to appmaster.sh,
>>>>> I see in the log that they resolve to the correct values:
>>>>>
>>>>> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>>>>>
>>>>> but a bit later the appmaster dies with this new error. It is unclear
>>>>> what address it is trying to bind to. I added explicit params
>>>>> -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port
>>>>> was somehow interfering, but that didn't help.
>>>>>
>>>>> 2021-09-29 10:29:59.845 [main] INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting MesosSessionClusterEntrypoint down with application status FAILED. Diagnostics java.net.BindException: Cannot assign requested address
>>>>> 	at java.base/sun.nio.ch.Net.bind0(Native Method)
>>>>> 	at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>>>> 	at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>>>> 	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>>>> 	at java.base/java.lang.Thread.run(Unknown Source)
>>>>>
>>>>> On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl <ma...@ververica.com>
>>>>> wrote:
>>>>>
>>>>>> The port has its separate configuration parameter jobmanager.rpc.port
>>>>>> [1]
>>>>>>
>>>>>> [1]
>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>>>>>>
>>>>>> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas <jv...@strava.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Matthias, thanks for the suggestion! I changed my
>>>>>>> jobmanager.rpc.address param from $HOSTNAME to $HOST:$PORT0 which in the
>>>>>>> log I see resolves properly to the host IP and port mapped to 8081
>>>>>>>
>>>>>>> 2021-09-29 07:58:05.452 [main] INFO
>>>>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
>>>>>>> -Djobmanager.rpc.address=10.0.22.114:31894
>>>>>>>
>>>>>>> which is very promising. But sadly a little bit later appmaster dies
>>>>>>> with this error:
>>>>>>>
>>>>>>> 2021-09-29 07:58:05.648 [main] INFO
>>>>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
>>>>>>> cluster services.
>>>>>>> 2021-09-29 07:58:05.674 [main] INFO
>>>>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
>>>>>>> MesosSessionClusterEntrypoint down with application status FAILED.
>>>>>>> Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
>>>>>>> The configured hostname is not valid
>>>>>>> at
>>>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>>>>> at
>>>>>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>>>>> at
>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>>>>> at java.base/java.security.AccessController.doPrivileged(Native
>>>>>>> Method)
>>>>>>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>>>>> at
>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>>>>> at
>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>>>>>> Caused by: java.lang.IllegalArgumentException
>>>>>>> at
>>>>>>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>>>>> at
>>>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>>>>> ... 17 more
>>>>>>> .
>>>>>>> 2021-09-29 07:58:05.685 [main] ERROR
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Could not start
>>>>>>> cluster entrypoint MesosSessionClusterEntrypoint.
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypointException:
>>>>>>> Failed to initialize the cluster entrypoint MesosSessionClusterEntrypoint.
>>>>>>> at
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>>>>> at
>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>>>>>> Caused by:
>>>>>>> org.apache.flink.configuration.IllegalConfigurationException: The
>>>>>>> configured hostname is not valid
>>>>>>> at
>>>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>>>>> at
>>>>>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>>>>> at
>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>>>>> at java.base/java.security.AccessController.doPrivileged(Native
>>>>>>> Method)
>>>>>>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>>>>> at
>>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>>>> at
>>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>>>>> ... 2 common frames omitted
>>>>>>> Caused by: java.lang.IllegalArgumentException: null
>>>>>>> at
>>>>>>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>>>>> at
>>>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>>>>> ... 17 common frames omitted
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl <
>>>>>>> matthias@ververica.com> wrote:
>>>>>>>
>>>>>>>> One thing that was puzzling me yesterday when reading your post:
>>>>>>>> Have you tried $HOST instead of $HOSTNAME in the Marathon configuration?
>>>>>>>> When I played around with Mesos, I remember using HOST to resolve the
>>>>>>>> host's IP address instead of the host's name. It could be that the hostname
>>>>>>>> itself cannot be resolved to the right IP address. But I struggled to find
>>>>>>>> proper documentation to back that up. Only in the recipes section of the
>>>>>>>> Marathon docs [1], HOST was used as well.
>>>>>>>>
>>>>>>>> Matthias
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
>>>>>>>>
>>>>>>>> On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas <jv...@strava.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Another update:  Looking more carefully in my appmaster log, I see
>>>>>>>>> the following
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> - Registering as new framework.
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -
>>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -  Mesos Info:
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -     Master URL: 10.0.18.246:5050
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -  Framework Info:
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -     ID: (none)
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -     Name: flink-test
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -     Failover Timeout (secs): 604800.0
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -     Role: *
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -     Capabilities: (none)
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -     Principal: (none)
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -     Host: 311dcf7fd77c
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -     Web UI: http://311dcf7fd77c:8081
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> -
>>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> which is picking up the mesos.master and
>>>>>>>>> mesos.resourcemanager.framework.name params I am passing to
>>>>>>>>> mesos-appmaster.sh
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In my Mesos dashboard I can see the framework has been created
>>>>>>>>> with the right name, but has no associated agents/tasks to it. So at least
>>>>>>>>> Flink has been able to connect to the Mesos master to create the framework
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Later in the mesos-appmaster log is when I see the Mesos
>>>>>>>>> connection errors:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> DEBUG o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager  -
>>>>>>>>> Starting the slot manager.
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2]
>>>>>>>>> DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor  - State
>>>>>>>>> change (StoppedState -> StoppedState) with data ()
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>>>>> Trigger heartbeat request.
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> DEBUG org.apache.flink.mesos.scheduler.ReconciliationCoordinator  -
>>>>>>>>> State change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>>>>> Trigger heartbeat request.
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  org.apache.flink.mesos.scheduler.ConnectionMonitor  -
>>>>>>>>> Connecting to Mesos...
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor  - State
>>>>>>>>> change (StoppedState -> ConnectingState) with data ()
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver
>>>>>>>>> - Mesos resource manager started.
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4]
>>>>>>>>> DEBUG org.apache.flink.mesos.scheduler.LaunchCoordinator  - State
>>>>>>>>> change (Suspended -> Suspended) with data GatherData(List(),List())
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4]
>>>>>>>>> WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  -
>>>>>>>>> Unable to connect to Mesos; still trying...
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>>>>> Trigger heartbeat request.
>>>>>>>>>
>>>>>>>>> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3]
>>>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>>>>> Trigger heartbeat request.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So why was the appmaster able to connect to the Mesos master to create
>>>>>>>>> the framework, but unable to connect later for whatever it does next?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> One possible issue I see is that the framework is set with web UI
>>>>>>>>> in http://311dcf7fd77c:8081 which can not be resolved from the
>>>>>>>>> Mesos master. 311dcf7fd77c is the result of doing hostname on the
>>>>>>>>> Docker container, and the Mesos master can not resolve that name. I could
>>>>>>>>> try to replace the Docker container hostname with the Docker host hostname,
>>>>>>>>> but the host port that gets mapped to 8081 on the container is a random
>>>>>>>>> port that I can not know beforehand. Does Mesos master try to reach Flink
>>>>>>>>> using that Web UI setting? Could this be the issue causing my connection
>>>>>>>>> problem, or is this a red herring and the problem is a different one?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Javier Vegas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jv...@strava.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks, Matthias!
>>>>>>>>>>
>>>>>>>>>> There are lots of apps deployed to the Mesos cluster, the task
>>>>>>>>>> manager itself is deployed to Mesos via Marathon.  In the Mesos log I can
>>>>>>>>>> see the Job manager agent starting, but no error messages related to it. As
>>>>>>>>>> you say, TaskManagers don't even have the chance to get confused about
>>>>>>>>>> variables, since the Job Manager can not connect to the Mesos master to
>>>>>>>>>> tell it to start the Task Managers.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Javier
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <
>>>>>>>>>> matthias@ververica.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Javier,
>>>>>>>>>>> I don't see anything that's configured in the wrong way based on
>>>>>>>>>>> the jobmanager logs you've provided. Have you been able to deploy other
>>>>>>>>>>> applications to this Mesos cluster? Do the Mesos master logs reveal
>>>>>>>>>>> anything? The variable resolution on the TaskManager side is a valid
>>>>>>>>>>> concern shared by Roman since it's easy to run into such an issue. But the
>>>>>>>>>>> JobManager logs indicate that the JobManager is not able to contact the
>>>>>>>>>>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>>>>>>>>>>> not coming up.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Matthias
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <
>>>>>>>>>>> roman@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> No additional ports need to be open as far as I know.
>>>>>>>>>>>>
>>>>>>>>>>>> Probably, $HOSTNAME is substituted for something not resolvable
>>>>>>>>>>>> on TMs?
>>>>>>>>>>>>
>>>>>>>>>>>> Please also make sure that the following gets executed before
>>>>>>>>>>>> mesos-appmaster.sh:
>>>>>>>>>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>>>>>>>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>>>>>>>>>> (as per the documentation you linked)
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Roman
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jv...@strava.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > I am trying to start Flink 1.13.2 on Mesos following the
>>>>>>>>>>>> instructions in
>>>>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>>>>>>>>>>> and using Marathon to deploy a Docker image with both the Flink and my
>>>>>>>>>>>> binaries.
>>>>>>>>>>>> >
>>>>>>>>>>>> > My entrypoint for the Docker image is:
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > /opt/flink/bin/mesos-appmaster.sh \
>>>>>>>>>>>> >
>>>>>>>>>>>> >       -Djobmanager.rpc.address=$HOSTNAME \
>>>>>>>>>>>> >
>>>>>>>>>>>> >       -Dmesos.resourcemanager.framework.user=flink \
>>>>>>>>>>>> >
>>>>>>>>>>>> >       -Dmesos.master=10.0.18.246:5050 \
>>>>>>>>>>>> >
>>>>>>>>>>>> >       -Dmesos.resourcemanager.tasks.cpus=6
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>>>>>>>>>>> >
>>>>>>>>>>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor
>>>>>>>>>>>> registered on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>>>>>>>>>>> >
>>>>>>>>>>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered
>>>>>>>>>>>> docker executor on 10.0.20.177
>>>>>>>>>>>> >
>>>>>>>>>>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
>>>>>>>>>>>> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>>>>>>>>>>> >
>>>>>>>>>>>> > WARNING: Your kernel does not support swap limit capabilities
>>>>>>>>>>>> or the cgroup is not mounted. Memory limited without swap.
>>>>>>>>>>>> >
>>>>>>>>>>>> > WARNING: An illegal reflective access operation has occurred
>>>>>>>>>>>> >
>>>>>>>>>>>> > WARNING: Illegal reflective access by
>>>>>>>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>>>>>>> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
>>>>>>>>>>>> sun.security.krb5.Config.getInstance()
>>>>>>>>>>>> >
>>>>>>>>>>>> > WARNING: Please consider reporting this to the maintainers of
>>>>>>>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>>>>>>> >
>>>>>>>>>>>> > WARNING: Use --illegal-access=warn to enable warnings of
>>>>>>>>>>>> further illegal reflective access operations
>>>>>>>>>>>> >
>>>>>>>>>>>> > WARNING: All illegal access operations will be denied in a
>>>>>>>>>>>> future release
>>>>>>>>>>>> >
>>>>>>>>>>>> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
>>>>>>>>>>>> >
>>>>>>>>>>>> > I0927 16:50:43.624439   328 sched.cpp:336] New master
>>>>>>>>>>>> detected at master@10.0.18.246:5050
>>>>>>>>>>>> >
>>>>>>>>>>>> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials
>>>>>>>>>>>> provided. Attempting to register without authentication
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > where the "New master detected" line is promising.
>>>>>>>>>>>> >
>>>>>>>>>>>> > However, on the Flink UI I see only the jobmanager started,
>>>>>>>>>>>> and there are no task managers.  Getting into the Docker container, I see
>>>>>>>>>>>> this in the log:
>>>>>>>>>>>> >
>>>>>>>>>>>> > WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  -
>>>>>>>>>>>> Unable to connect to Mesos; still trying...
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > I have verified that from the container I can access the
>>>>>>>>>>>> Mesos container 10.0.18.246:5050
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > Does any other port besides the web UI port 5050 need to be
>>>>>>>>>>>> open for mesos-appmaster to connect with the Mesos master?
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > In the appmaster log (attached) I see one exception, and I
>>>>>>>>>>>> don't know whether it is related to the Mesos connection problem:
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > java.io.FileNotFoundException: HADOOP_HOME and
>>>>>>>>>>>> hadoop.home.dir are unset.
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>>>>>>>>>> Method)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
>>>>>>>>>>>> Source)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
>>>>>>>>>>>> Source)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at java.base/java.lang.reflect.Method.invoke(Unknown
>>>>>>>>>>>> Source)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>>>>>>>>>>>> >
>>>>>>>>>>>> >         at
>>>>>>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > I am not trying (yet) to run in high availability mode, so I
>>>>>>>>>>>> am not sure if I need to have HADOOP_HOME set or not, but I don't see
>>>>>>>>>>>> anything about HADOOP_HOME in the Flink docs.
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > Any tips on how I can fix my Docker+Marathon+Mesos
>>>>>>>>>>>> environment so Flink can connect to my Mesos master?
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > Javier Vegas
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Matthias Pohl <ma...@ververica.com>.
Thanks for sharing. I was wondering why you don't use $PORT0 in your
command. And: Are the ports properly configured in the Marathon network
configuration [1]? But the error seems to be unrelated to that setting.
Other than that, I cannot see any other issue with the configuration. It
could be that the HOST IP is blocked?

[1] https://mesosphere.github.io/marathon/docs/ports.html#specifying-ports
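
For reference, a minimal sketch of a pre-flight check for the variables discussed in this thread. It assumes that Marathon injects HOST and PORT1 into the container environment, as in the command quoted below. build_appmaster_args is a hypothetical helper, not part of Flink or Marathon. It rejects a host:port value in HOST, since passing $HOST:$PORT0 as jobmanager.rpc.address produced "The configured hostname is not valid" earlier in this thread, and it prints the address and the port as separate -D options.

```shell
# Sketch only: build_appmaster_args is a hypothetical helper, not a Flink tool.
# Assumes HOST and PORT1 are provided by Marathon in the container environment.

build_appmaster_args() {
  # Both variables must be present before they are expanded into -D options.
  if [ -z "${HOST:-}" ] || [ -z "${PORT1:-}" ]; then
    echo "error: HOST and PORT1 must both be set" >&2
    return 1
  fi

  # jobmanager.rpc.address takes a host only; the port has its own option.
  # Note: this simple check would also reject a bare IPv6 literal.
  case "$HOST" in
    *:*) echo "error: HOST must not contain a port: $HOST" >&2; return 1 ;;
  esac

  # The port must be a plain decimal number.
  case "$PORT1" in
    *[!0-9]*) echo "error: PORT1 is not numeric: $PORT1" >&2; return 1 ;;
  esac

  # Print one option per line so the caller can inspect or log them.
  printf '%s\n' \
    "-Djobmanager.rpc.address=${HOST}" \
    "-Djobmanager.rpc.port=${PORT1}"
}
```

The printed options can then be passed to mesos-appmaster.sh together with the remaining -D settings. The check only catches malformed values; it does not tell you whether the address can be bound inside the container.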

On Wed, Sep 29, 2021 at 7:07 PM Javier Vegas <jv...@strava.com> wrote:

>
> Full appmaster log in debug mode is attached.
> My startup command was
> /opt/flink/bin/mesos-appmaster.sh \
>       -Drest.bind-port=8081 \
>       -Drest.port=8081 \
>       -Djobmanager.rpc.address=$HOST \
>       -Djobmanager.rpc.port=$PORT1 \
>       -Dmesos.resourcemanager.framework.user=flink \
>       -Dmesos.resourcemanager.framework.name=timeline-flink-populator \
>       -Dmesos.master=10.0.18.246:5050 \
>       -Dmesos.resourcemanager.tasks.cpus=4 \
>       -Dmesos.resourcemanager.tasks.container.type=docker \
>       -Dmesos.resourcemanager.tasks.container.image.name=docker.strava.com/strava/timeline-populator2:jv-mesos \
>       -Dtaskmanager.numberOfTaskSlots=4 ;
>
> where $PORT1 is the second open port on my host, mapped to 6123 in the
> Docker container (the first port is mapped to 8081).
> I can see in the log that $HOST and $PORT1 resolve to the correct values,
> 10.0.20.25 and 31608.
>
> On Wed, Sep 29, 2021 at 9:41 AM Matthias Pohl <ma...@ververica.com>
> wrote:
>
>> ...and if possible, it would be helpful to provide debug logs as well.
>>
>> On Wed, Sep 29, 2021 at 6:33 PM Matthias Pohl <ma...@ververica.com>
>> wrote:
>>
>>> Could you provide the entire JobManager logs so that we can see what's
>>> going on?
>>>
>>> On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas <jv...@strava.com> wrote:
>>>
>>>> Thanks again, Matthias!
>>>>
>>>> Putting -Djobmanager.rpc.address=$HOST and
>>>> -Djobmanager.rpc.port=$PORT0 as params for mesos-appmaster.sh,
>>>> I see in the log that they resolve to the correct values
>>>>
>>>> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>>>>
>>>> but a bit later the appmaster dies with this new error. It is unclear
>>>> which address it is trying to bind. I added explicit params
>>>> -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port
>>>> was somehow interfering, but that didn't help.
>>>>
>>>> 2021-09-29 10:29:59.845 [main] INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting MesosSessionClusterEntrypoint down with application status FAILED. Diagnostics java.net.BindException: Cannot assign requested address
>>>> 	at java.base/sun.nio.ch.Net.bind0(Native Method)
>>>> 	at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>>> 	at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>>> 	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>>> 	at java.base/java.lang.Thread.run(Unknown Source)
>>>>
>>>> On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl <ma...@ververica.com>
>>>> wrote:
>>>>
>>>>> The port has its separate configuration parameter jobmanager.rpc.port
>>>>> [1]
>>>>>
>>>>> [1]
>>>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>>>>>
>>>>> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas <jv...@strava.com>
>>>>> wrote:
>>>>>
>>>>>> Matthias, thanks for the suggestion! I changed my
>>>>>> jobmanager.rpc.address param from $HOSTNAME to $HOST:$PORT0, which in the
>>>>>> log I see resolves properly to the host IP and the port mapped to 8081:
>>>>>>
>>>>>> 2021-09-29 07:58:05.452 [main] INFO
>>>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
>>>>>> -Djobmanager.rpc.address=10.0.22.114:31894
>>>>>>
>>>>>> which is very promising. But sadly, a little bit later the appmaster dies
>>>>>> with this error:
>>>>>>
>>>>>> 2021-09-29 07:58:05.648 [main] INFO
>>>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
>>>>>> cluster services.
>>>>>> 2021-09-29 07:58:05.674 [main] INFO
>>>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
>>>>>> MesosSessionClusterEntrypoint down with application status FAILED.
>>>>>> Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
>>>>>> The configured hostname is not valid
>>>>>> at
>>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>>>> at
>>>>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>>>> at
>>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>>>> at
>>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>>>> at
>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>>>> at
>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>>>> at
>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>>>> at
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>>>> at
>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>>>> at
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>>>> at
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>>>> at java.base/java.security.AccessController.doPrivileged(Native
>>>>>> Method)
>>>>>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>>>> at
>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>>>> at
>>>>>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>>> at
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>>>> at
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>>>> at
>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>>>>> Caused by: java.lang.IllegalArgumentException
>>>>>> at
>>>>>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>>>> at
>>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>>>> ... 17 more
>>>>>> .
>>>>>> 2021-09-29 07:58:05.685 [main] ERROR
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Could not start
>>>>>> cluster entrypoint MesosSessionClusterEntrypoint.
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypointException:
>>>>>> Failed to initialize the cluster entrypoint MesosSessionClusterEntrypoint.
>>>>>> at
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
>>>>>> at
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>>>> at
>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>>>>> Caused by:
>>>>>> org.apache.flink.configuration.IllegalConfigurationException: The
>>>>>> configured hostname is not valid
>>>>>> at
>>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>>>> at
>>>>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>>>> at
>>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>>>> at
>>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>>>> at
>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>>>> at
>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>>>> at
>>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>>>> at
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>>>> at
>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>>>> at
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>>>> at
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>>>> at java.base/java.security.AccessController.doPrivileged(Native
>>>>>> Method)
>>>>>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>>>> at
>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>>>> at
>>>>>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>>> at
>>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>>>> ... 2 common frames omitted
>>>>>> Caused by: java.lang.IllegalArgumentException: null
>>>>>> at
>>>>>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>>>> at
>>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>>>> ... 17 common frames omitted
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl <
>>>>>> matthias@ververica.com> wrote:
>>>>>>
>>>>>>> One thing that was puzzling me yesterday when reading your post:
>>>>>>> Have you tried $HOST instead of $HOSTNAME in the Marathon configuration?
>>>>>>> When I played around with Mesos, I remember using HOST to resolve the
>>>>>>> host's IP address instead of the host's name. It could be that the hostname
>>>>>>> itself cannot be resolved to the right IP address. But I struggled to find
>>>>>>> proper documentation to back that up. Only in the recipes section of the
>>>>>>> Marathon docs [1], HOST was used as well.
>>>>>>>
>>>>>>> Matthias
>>>>>>>
>>>>>>> [1]
>>>>>>> https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
>>>>>>>
>>>>>>> On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas <jv...@strava.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Another update:  Looking more carefully in my appmaster log, I see
>>>>>>>> the following
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>>>>> Registering as new framework.
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Mesos
>>>>>>>> Info:
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Master
>>>>>>>> URL: 10.0.18.246:5050
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Framework
>>>>>>>> Info:
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     ID:
>>>>>>>> (none)
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Name:
>>>>>>>> flink-test
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Failover
>>>>>>>> Timeout (secs): 604800.0
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Role:
>>>>>>>> *
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Capabilities:
>>>>>>>> (none)
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Principal:
>>>>>>>> (none)
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Host:
>>>>>>>> 311dcf7fd77c
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Web
>>>>>>>> UI: http://311dcf7fd77c:8081
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>>
>>>>>>>> which is picking up the mesos.master and
>>>>>>>> mesos.resourcemanager.framework.name params I am passing to
>>>>>>>> mesos-appmaster.sh
>>>>>>>>
>>>>>>>>
>>>>>>>> In my Mesos dashboard I can see the framework has been created with
>>>>>>>> the right name, but has no associated agents/tasks to it. So at least Flink
>>>>>>>> has been able to connect to the Mesos master to create the framework
>>>>>>>>
>>>>>>>>
>>>>>>>> Later in the mesos-appmaster log is when I see the Mesos connection
>>>>>>>> errors:
>>>>>>>>
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3]
>>>>>>>> DEBUG o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager  -
>>>>>>>> Starting the slot manager.
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2]
>>>>>>>> DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor  - State
>>>>>>>> change (StoppedState -> StoppedState) with data ()
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3]
>>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>>>> Trigger heartbeat request.
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3]
>>>>>>>> DEBUG org.apache.flink.mesos.scheduler.ReconciliationCoordinator  -
>>>>>>>> State change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3]
>>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>>>> Trigger heartbeat request.
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Connecting
>>>>>>>> to Mesos...
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3]
>>>>>>>> DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor  - State
>>>>>>>> change (StoppedState -> ConnectingState) with data ()
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>>>>> Mesos resource manager started.
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4]
>>>>>>>> DEBUG org.apache.flink.mesos.scheduler.LaunchCoordinator  - State
>>>>>>>> change (Suspended -> Suspended) with data GatherData(List(),List())
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN
>>>>>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to
>>>>>>>> connect to Mesos; still trying...
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3]
>>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>>>> Trigger heartbeat request.
>>>>>>>>
>>>>>>>> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3]
>>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>>>> Trigger heartbeat request.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> So why was the appmaster able to connect to the Mesos master to create
>>>>>>>> the framework, but then failed to connect for whatever it does later?
>>>>>>>>
>>>>>>>>
>>>>>>>> One possible issue I see is that the framework is set with web UI
>>>>>>>> in http://311dcf7fd77c:8081 which can not be resolved from the
>>>>>>>> Mesos master. 311dcf7fd77c is the result of doing hostname on the
>>>>>>>> Docker container, and the Mesos master can not resolve that name. I could
>>>>>>>> try to replace the Docker container hostname with the Docker host hostname,
>>>>>>>> but the host port that gets mapped to 8081 on the container is a random
>>>>>>>> port that I can not know beforehand. Does Mesos master try to reach Flink
>>>>>>>> using that Web UI setting? Could this be the issue causing my connection
>>>>>>>> problem, or is this a red herring and the problem is a different one?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>> Javier Vegas
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jv...@strava.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Matthias!
>>>>>>>>>
>>>>>>>>> There are lots of apps deployed to the Mesos cluster, the task
>>>>>>>>> manager itself is deployed to Mesos via Marathon.  In the Mesos log I can
>>>>>>>>> see the Job manager agent starting, but no error messages related to it. As
>>>>>>>>> you say, TaskManagers don't even have the chance to get confused about
>>>>>>>>> variables, since the Job Manager can not connect to the Mesos master to
>>>>>>>>> tell it to start the Task Managers.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Javier
>>>>>>>>>
>>>>>>>>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <
>>>>>>>>> matthias@ververica.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Javier,
>>>>>>>>>> I don't see anything that's configured in the wrong way based on
>>>>>>>>>> the jobmanager logs you've provided. Have you been able to deploy other
>>>>>>>>>> applications to this Mesos cluster? Do the Mesos master logs reveal
>>>>>>>>>> anything? The variable resolution on the TaskManager side is a valid
>>>>>>>>>> concern shared by Roman since it's easy to run into such an issue. But the
>>>>>>>>>> JobManager logs indicate that the JobManager is not able to contact the
>>>>>>>>>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>>>>>>>>>> not coming up.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Matthias
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <
>>>>>>>>>> roman@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> No additional ports need to be open as far as I know.
>>>>>>>>>>>
>>>>>>>>>>> Probably, $HOSTNAME is substituted for something not resolvable
>>>>>>>>>>> on TMs?
>>>>>>>>>>>
>>>>>>>>>>> Please also make sure that the following gets executed before
>>>>>>>>>>> mesos-appmaster.sh:
>>>>>>>>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>>>>>>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>>>>>>>>> (as per the documentation you linked)
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Roman
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jv...@strava.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > I am trying to start Flink 1.13.2 on Mesos following the
>>>>>>>>>>> instructions in
>>>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>>>>>>>>>> and using Marathon to deploy a Docker image with both the Flink and my
>>>>>>>>>>> binaries.
>>>>>>>>>>> >
>>>>>>>>>>> > My entrypoint for the Docker image is:
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > /opt/flink/bin/mesos-appmaster.sh \
>>>>>>>>>>> >
>>>>>>>>>>> >       -Djobmanager.rpc.address=$HOSTNAME \
>>>>>>>>>>> >
>>>>>>>>>>> >       -Dmesos.resourcemanager.framework.user=flink \
>>>>>>>>>>> >
>>>>>>>>>>> >       -Dmesos.master=10.0.18.246:5050 \
>>>>>>>>>>> >
>>>>>>>>>>> >       -Dmesos.resourcemanager.tasks.cpus=6
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>>>>>>>>>> >
>>>>>>>>>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered
>>>>>>>>>>> on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>>>>>>>>>> >
>>>>>>>>>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered
>>>>>>>>>>> docker executor on 10.0.20.177
>>>>>>>>>>> >
>>>>>>>>>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
>>>>>>>>>>> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>>>>>>>>>> >
>>>>>>>>>>> > WARNING: Your kernel does not support swap limit capabilities
>>>>>>>>>>> or the cgroup is not mounted. Memory limited without swap.
>>>>>>>>>>> >
>>>>>>>>>>> > WARNING: An illegal reflective access operation has occurred
>>>>>>>>>>> >
>>>>>>>>>>> > WARNING: Illegal reflective access by
>>>>>>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>>>>>> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
>>>>>>>>>>> sun.security.krb5.Config.getInstance()
>>>>>>>>>>> >
>>>>>>>>>>> > WARNING: Please consider reporting this to the maintainers of
>>>>>>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>>>>>> >
>>>>>>>>>>> > WARNING: Use --illegal-access=warn to enable warnings of
>>>>>>>>>>> further illegal reflective access operations
>>>>>>>>>>> >
>>>>>>>>>>> > WARNING: All illegal access operations will be denied in a
>>>>>>>>>>> future release
>>>>>>>>>>> >
>>>>>>>>>>> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
>>>>>>>>>>> >
>>>>>>>>>>> > I0927 16:50:43.624439   328 sched.cpp:336] New master detected
>>>>>>>>>>> at master@10.0.18.246:5050
>>>>>>>>>>> >
>>>>>>>>>>> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials
>>>>>>>>>>> provided. Attempting to register without authentication
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > where the "New master detected" line is promising.
>>>>>>>>>>> >
>>>>>>>>>>> > However, on the Flink UI I see only the jobmanager started,
>>>>>>>>>>> and there are no task managers.  Getting into the Docker container, I see
>>>>>>>>>>> this in the log:
>>>>>>>>>>> >
>>>>>>>>>>> > WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  -
>>>>>>>>>>> Unable to connect to Mesos; still trying...
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > I have verified that from the container I can access the Mesos
>>>>>>>>>>> container 10.0.18.246:5050
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > Does any other port besides the web UI port 5050 need to be
>>>>>>>>>>> open for mesos-appmaster to connect with the Mesos master?
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > In the appmaster log (attached) I see one exception; I don't
>>>>>>>>>>> know whether it is related to the Mesos connection problem:
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir
>>>>>>>>>>> are unset.
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>>>>>>>>> Method)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
>>>>>>>>>>> Source)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
>>>>>>>>>>> Source)
>>>>>>>>>>> >
>>>>>>>>>>> >         at java.base/java.lang.reflect.Method.invoke(Unknown
>>>>>>>>>>> Source)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>>>>>>>>>>> >
>>>>>>>>>>> >         at
>>>>>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > I am not trying (yet) to run in high availability mode, so I
>>>>>>>>>>> am not sure if I need to have HADOOP_HOME set or not, but I don't see
>>>>>>>>>>> anything about HADOOP_HOME in the Flink docs.
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > Any tips on how I can fix my Docker+Marathon+Mesos environment
>>>>>>>>>>> so Flink can connect to my Mesos master?
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks,
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > Javier Vegas
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Javier Vegas <jv...@strava.com>.
Full appmaster log in debug mode is attached.
My startup command was
/opt/flink/bin/mesos-appmaster.sh \
      -Drest.bind-port=8081 \
      -Drest.port=8081 \
      -Djobmanager.rpc.address=$HOST \
      -Djobmanager.rpc.port=$PORT1 \
      -Dmesos.resourcemanager.framework.user=flink \
      -Dmesos.resourcemanager.framework.name=timeline-flink-populator \
      -Dmesos.master=10.0.18.246:5050 \
      -Dmesos.resourcemanager.tasks.cpus=4 \
      -Dmesos.resourcemanager.tasks.container.type=docker \
      -Dmesos.resourcemanager.tasks.container.image.name=docker.strava.com/strava/timeline-populator2:jv-mesos \
      -Dtaskmanager.numberOfTaskSlots=4 ;

where $PORT1 refers to the second open port on my host, mapped to 6123 in
the Docker container (the first port is mapped to 8081).
I can see in the log that $HOST and $PORT1 resolve to the correct
values, 10.0.20.25 and 31608.
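
For anyone following along, the two failures quoted further down in this thread can be checked before mesos-appmaster.sh is launched. The sketch below is only an illustration under assumptions: the function names are made up for this example and are not part of Flink, it assumes a POSIX shell, and it takes the container's local addresses as an argument (on Linux they could come from `hostname -I`). The first check rejects a host:port value in jobmanager.rpc.address, which is what produced "The configured hostname is not valid". The second reports whether the address is assigned inside the container, since binding to an address that is not local is one common cause of "Cannot assign requested address".

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; names are illustrative, not part of Flink.

# Succeeds only if $1 is a bare host (no ":port" suffix) and $2 is a
# number in the valid TCP port range. Note: this simple check also
# rejects IPv6 literals, because they contain colons.
check_rpc_endpoint() {
    host=$1
    port=$2
    case $host in
        ''|*:*) echo "rpc.address must be a bare host, got '$host'" >&2; return 1 ;;
    esac
    case $port in
        ''|*[!0-9]*) echo "rpc.port must be numeric, got '$port'" >&2; return 1 ;;
    esac
    [ "$port" -ge 1 ] && [ "$port" -le 65535 ]
}

# Succeeds if address $1 appears in the whitespace-separated list $2.
is_local_address() {
    for addr in $2; do
        [ "$addr" = "$1" ] && return 0
    done
    return 1
}
```

A possible use in the entrypoint, before the startup command, would be `check_rpc_endpoint "$HOST" "$PORT1" || exit 1`, followed by `is_local_address "$HOST" "$(hostname -I)" || echo "warning: $HOST is not a local address in this container" >&2`.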

On Wed, Sep 29, 2021 at 9:41 AM Matthias Pohl <ma...@ververica.com>
wrote:

> ...and if possible, it would be helpful to provide debug logs as well.
>
> On Wed, Sep 29, 2021 at 6:33 PM Matthias Pohl <ma...@ververica.com>
> wrote:
>
>> Could you provide the entire JobManager logs so that we can see what's
>> going on?
>>
>> On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas <jv...@strava.com> wrote:
>>
>>> Thanks again, Matthias!
>>>
>>> Putting -Djobmanager.rpc.address=$HOST and
>>> -Djobmanager.rpc.port=$PORT0 as params for mesos-appmaster.sh,
>>> I see in the log that they resolve to the correct values
>>>
>>> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>>>
>>> but a bit later the appmaster dies with this new error. It is unclear
>>> what address it is trying to bind. I added explicit params
>>> -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port
>>> was somehow interfering, but that didn't help.
>>>
>>> 2021-09-29 10:29:59.845 [main] INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting MesosSessionClusterEntrypoint down with application status FAILED. Diagnostics java.net.BindException: Cannot assign requested address
>>> 	at java.base/sun.nio.ch.Net.bind0(Native Method)
>>> 	at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>> 	at java.base/sun.nio.ch.Net.bind(Unknown Source)
>>> 	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>>> 	at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>> 	at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>> 	at java.base/java.lang.Thread.run(Unknown Source)
>>>
>>> On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl <ma...@ververica.com>
>>> wrote:
>>>
>>>> The port has its separate configuration parameter jobmanager.rpc.port
>>>> [1]
>>>>
>>>> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>>>>
>>>> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas <jv...@strava.com>
>>>> wrote:
>>>>
>>>>> Matthias, thanks for the suggestion! I changed my
>>>>> jobmanager.rpc.address param from $HOSTNAME to $HOST:$PORT0 which in the
>>>>> log I see resolves properly to the host IP and port mapped to 8081
>>>>>
>>>>> 2021-09-29 07:58:05.452 [main] INFO
>>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
>>>>> -Djobmanager.rpc.address=10.0.22.114:31894
>>>>>
>>>>> which is very promising. But sadly a little bit later the appmaster
>>>>> dies with this error:
>>>>>
>>>>> 2021-09-29 07:58:05.648 [main] INFO
>>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
>>>>> cluster services.
>>>>> 2021-09-29 07:58:05.674 [main] INFO
>>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
>>>>> MesosSessionClusterEntrypoint down with application status FAILED.
>>>>> Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
>>>>> The configured hostname is not valid
>>>>> at
>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>>> at
>>>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>>> at
>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>>> at
>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>>> at
>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>>> at
>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>>> at
>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>>> at
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>>> at
>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>>> at
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>>> at
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>>> at java.base/java.security.AccessController.doPrivileged(Native Method)
>>>>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>>> at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>>> at
>>>>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>> at
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>>> at
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>>> at
>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>>>> Caused by: java.lang.IllegalArgumentException
>>>>> at
>>>>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>>> at
>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>>> ... 17 more
>>>>> .
>>>>> 2021-09-29 07:58:05.685 [main] ERROR
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Could not start
>>>>> cluster entrypoint MesosSessionClusterEntrypoint.
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed
>>>>> to initialize the cluster entrypoint MesosSessionClusterEntrypoint.
>>>>> at
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
>>>>> at
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>>> at
>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>>>> Caused by:
>>>>> org.apache.flink.configuration.IllegalConfigurationException: The
>>>>> configured hostname is not valid
>>>>> at
>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>>> at
>>>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>>> at
>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>>> at
>>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>>> at
>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>>> at
>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>>> at
>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>>> at
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>>> at
>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>>> at
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>>> at
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>>> at java.base/java.security.AccessController.doPrivileged(Native Method)
>>>>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>>> at
>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>>> at
>>>>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>>> at
>>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>>> ... 2 common frames omitted
>>>>> Caused by: java.lang.IllegalArgumentException: null
>>>>> at
>>>>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>>> at
>>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>>> ... 17 common frames omitted
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl <ma...@ververica.com>
>>>>> wrote:
>>>>>
>>>>>> One thing that was puzzling me yesterday when reading your post: Have
>>>>>> you tried $HOST instead of $HOSTNAME in the Marathon configuration? When I
>>>>>> played around with Mesos, I remember using HOST to resolve the host's IP
>>>>>> address instead of the host's name. It could be that the hostname itself
>>>>>> cannot be resolved to the right IP address. But I struggled to find proper
>>>>>> documentation to back that up. Only in the recipes section of the Marathon
>>>>>> docs [1], HOST was used as well.
>>>>>>
>>>>>> Matthias
>>>>>>
>>>>>> [1]
>>>>>> https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
>>>>>>
>>>>>> On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas <jv...@strava.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Another update:  Looking more carefully in my appmaster log, I see
>>>>>>> the following
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>>>> Registering as new framework.
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>>>> -----------------------------------------------------------------------------
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Mesos
>>>>>>> Info:
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Master
>>>>>>> URL: 10.0.18.246:5050
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Framework
>>>>>>> Info:
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     ID:
>>>>>>> (none)
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Name:
>>>>>>> flink-test
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Failover
>>>>>>> Timeout (secs): 604800.0
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Role:
>>>>>>> *
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Capabilities:
>>>>>>> (none)
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Principal:
>>>>>>> (none)
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Host:
>>>>>>> 311dcf7fd77c
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Web
>>>>>>> UI: http://311dcf7fd77c:8081
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>>>> -----------------------------------------------------------------------------
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>>
>>>>>>> which is picking up the mesos.master and
>>>>>>> mesos.resourcemanager.framework.name params I am passing to
>>>>>>> mesos-appmaster.sh
>>>>>>>
>>>>>>>
>>>>>>> In my Mesos dashboard I can see the framework has been created with
>>>>>>> the right name, but has no associated agents/tasks to it. So at least Flink
>>>>>>> has been able to connect to the Mesos master to create the framework
>>>>>>>
>>>>>>>
>>>>>>> Later in the mesos-appmaster log is when I see the Mesos connection
>>>>>>> errors:
>>>>>>>
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3]
>>>>>>> DEBUG o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager  -
>>>>>>> Starting the slot manager.
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2]
>>>>>>> DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor  - State
>>>>>>> change (StoppedState -> StoppedState) with data ()
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3]
>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>>> Trigger heartbeat request.
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3]
>>>>>>> DEBUG org.apache.flink.mesos.scheduler.ReconciliationCoordinator  -
>>>>>>> State change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3]
>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>>> Trigger heartbeat request.
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Connecting to
>>>>>>> Mesos...
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3]
>>>>>>> DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor  - State
>>>>>>> change (StoppedState -> ConnectingState) with data ()
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>>>> Mesos resource manager started.
>>>>>>>
>>>>>>> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4]
>>>>>>> DEBUG org.apache.flink.mesos.scheduler.LaunchCoordinator  - State
>>>>>>> change (Suspended -> Suspended) with data GatherData(List(),List())
>>>>>>>
>>>>>>> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN
>>>>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to
>>>>>>> connect to Mesos; still trying...
>>>>>>>
>>>>>>> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3]
>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>>> Trigger heartbeat request.
>>>>>>>
>>>>>>> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3]
>>>>>>> DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>>> Trigger heartbeat request.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> So why was the appmaster able to connect to the Mesos master to
>>>>>>> create the framework, but then failed to connect for whatever it
>>>>>>> does afterwards?
>>>>>>>
>>>>>>>
>>>>>>> One possible issue I see is that the framework is set with web UI in
>>>>>>> http://311dcf7fd77c:8081 which can not be resolved from the Mesos
>>>>>>> master. 311dcf7fd77c is the result of doing hostname on the Docker
>>>>>>> container, and the Mesos master can not resolve that name. I could try to
>>>>>>> replace the Docker container hostname with the Docker host hostname, but
>>>>>>> the host port that gets mapped to 8081 on the container is a random port
>>>>>>> that I can not know beforehand. Does Mesos master try to reach Flink using
>>>>>>> that Web UI setting? Could this be the issue causing my connection problem,
>>>>>>> or is this a red herring and the problem is a different one?
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>>
>>>>>>> Javier Vegas
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jv...@strava.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks, Matthias!
>>>>>>>>
>>>>>>>> There are lots of apps deployed to the Mesos cluster, the task
>>>>>>>> manager itself is deployed to Mesos via Marathon.  In the Mesos log I can
>>>>>>>> see the Job manager agent starting, but no error messages related to it. As
>>>>>>>> you say, TaskManagers don't even have the chance to get confused about
>>>>>>>> variables, since the Job Manager can not connect to the Mesos master to
>>>>>>>> tell it to start the Task Managers.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Javier
>>>>>>>>
>>>>>>>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <
>>>>>>>> matthias@ververica.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Javier,
>>>>>>>>> I don't see anything that's configured in the wrong way based on
>>>>>>>>> the jobmanager logs you've provided. Have you been able to deploy other
>>>>>>>>> applications to this Mesos cluster? Do the Mesos master logs reveal
>>>>>>>>> anything? The variable resolution on the TaskManager side is a valid
>>>>>>>>> concern shared by Roman since it's easy to run into such an issue. But the
>>>>>>>>> JobManager logs indicate that the JobManager is not able to contact the
>>>>>>>>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>>>>>>>>> not coming up.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Matthias
>>>>>>>>>
>>>>>>>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <
>>>>>>>>> roman@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> No additional ports need to be open as far as I know.
>>>>>>>>>>
>>>>>>>>>> Probably, $HOSTNAME is substituted for something not resolvable
>>>>>>>>>> on TMs?
>>>>>>>>>>
>>>>>>>>>> Please also make sure that the following gets executed before
>>>>>>>>>> mesos-appmaster.sh:
>>>>>>>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>>>>>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>>>>>>>> (as per the documentation you linked)
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Roman
>>>>>>>>>>

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Matthias Pohl <ma...@ververica.com>.
...and if possible, it would be helpful to provide debug logs as well.
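
As a rough sketch of how debug logging might be switched on before collecting
those logs: assuming the image uses the stock log4j.properties with a
rootLogger.level line (an assumption; the abbreviated logger names in the
quoted logs hint at a different logging setup, in which case the equivalent
setting lives in that configuration instead), the level can be flipped with
sed. CONF_DIR below is a hypothetical temporary directory standing in for the
Flink conf directory, and the sample file contents are illustrative only.

```shell
#!/bin/sh
# Hypothetical stand-in for the Flink conf directory (e.g. /opt/flink/conf in the image).
CONF_DIR="$(mktemp -d)"

# Sample lines, assumed to resemble the stock log4j.properties.
printf 'rootLogger.level = INFO\nrootLogger.appenderRef.file.ref = MainAppender\n' \
  > "$CONF_DIR/log4j.properties"

# Switch the root logger to DEBUG. A temporary file is used instead of
# "sed -i" because the in-place flag differs between sed implementations.
sed 's/^rootLogger\.level *= *.*/rootLogger.level = DEBUG/' \
  "$CONF_DIR/log4j.properties" > "$CONF_DIR/log4j.properties.new"
mv "$CONF_DIR/log4j.properties.new" "$CONF_DIR/log4j.properties"

# Show the resulting level line.
LEVEL_LINE="$(grep '^rootLogger\.level' "$CONF_DIR/log4j.properties")"
echo "$LEVEL_LINE"
```

In a real deployment the same edit would have to be made before
mesos-appmaster.sh starts, for example as a step in the Docker image build.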

On Wed, Sep 29, 2021 at 6:33 PM Matthias Pohl <ma...@ververica.com>
wrote:

> Could you provide the entire JobManager logs so that we can see what's going
> on?
>
> On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas <jv...@strava.com> wrote:
>
>> Thanks again, Matthias!
>>
>> After passing -Djobmanager.rpc.address=$HOST and -Djobmanager.rpc.port=$PORT0
>> as params to mesos-appmaster.sh, I see in the log that they resolve to the
>> correct values:
>>
>> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>>
>> but a bit later the appmaster dies with this new error. It is unclear
>> what address it is trying to bind to. I added the explicit params
>> -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port
>> was somehow interfering, but that didn't help.
>>
>> 2021-09-29 10:29:59.845 [main] INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting MesosSessionClusterEntrypoint down with application status FAILED. Diagnostics java.net.BindException: Cannot assign requested address
>> 	at java.base/sun.nio.ch.Net.bind0(Native Method)
>> 	at java.base/sun.nio.ch.Net.bind(Unknown Source)
>> 	at java.base/sun.nio.ch.Net.bind(Unknown Source)
>> 	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
>> 	at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
>> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
>> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
>> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
>> 	at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>> 	at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>> 	at java.base/java.lang.Thread.run(Unknown Source)
>>
>> On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl <ma...@ververica.com>
>> wrote:
>>
>>> The port has its separate configuration parameter jobmanager.rpc.port [1]
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>>>
>>> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas <jv...@strava.com> wrote:
>>>
>>>> Matthias, thanks for the suggestion! I changed my
>>>> jobmanager.rpc.address param from $HOSTNAME to $HOST:$PORT0 which in the
>>>> log I see resolves properly to the host IP and port mapped to 8081
>>>>
>>>> 2021-09-29 07:58:05.452 [main] INFO
>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
>>>> -Djobmanager.rpc.address=10.0.22.114:31894
>>>>
>>>> which is very promising. But sadly, a little later the appmaster dies
>>>> with this error:
>>>>
>>>> 2021-09-29 07:58:05.648 [main] INFO
>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
>>>> cluster services.
>>>> 2021-09-29 07:58:05.674 [main] INFO
>>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
>>>> MesosSessionClusterEntrypoint down with application status FAILED.
>>>> Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
>>>> The configured hostname is not valid
>>>> at
>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>> at
>>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>> at
>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>> at
>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>> at
>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>> at
>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>> at
>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>> at
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>> at
>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>> at
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>> at
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>> at java.base/java.security.AccessController.doPrivileged(Native Method)
>>>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>> at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>> at
>>>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>> at
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>> at
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>> at
>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>>> Caused by: java.lang.IllegalArgumentException
>>>> at
>>>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>> at
>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>> ... 17 more
>>>> .
>>>> 2021-09-29 07:58:05.685 [main] ERROR
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Could not start
>>>> cluster entrypoint MesosSessionClusterEntrypoint.
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed
>>>> to initialize the cluster entrypoint MesosSessionClusterEntrypoint.
>>>> at
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
>>>> at
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>>> at
>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>>> Caused by:
>>>> org.apache.flink.configuration.IllegalConfigurationException: The
>>>> configured hostname is not valid
>>>> at
>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>>> at
>>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>>> at
>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>>> at
>>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>>> at
>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>>> at
>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>>> at
>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>>> at
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>>> at
>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>>> at
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>>> at
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>>> at java.base/java.security.AccessController.doPrivileged(Native Method)
>>>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>>> at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>>> at
>>>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>> at
>>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>>> ... 2 common frames omitted
>>>> Caused by: java.lang.IllegalArgumentException: null
>>>> at
>>>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>>> at
>>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>>> ... 17 common frames omitted
>>>>
>>>>
>>>>
>>>> On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl <ma...@ververica.com>
>>>> wrote:
>>>>
>>>>> One thing that was puzzling me yesterday when reading your post: Have
>>>>> you tried $HOST instead of $HOSTNAME in the Marathon configuration? When I
>>>>> played around with Mesos, I remember using HOST to resolve the host's IP
>>>>> address instead of the host's name. It could be that the hostname itself
>>>>> cannot be resolved to the right IP address. But I struggled to find proper
>>>>> documentation to back that up. Only in the recipes section of the Marathon
>>>>> docs [1], HOST was used as well.
>>>>>
>>>>> Matthias
>>>>>
>>>>> [1]
>>>>> https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
>>>>>
>>>>> On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas <jv...@strava.com>
>>>>> wrote:
>>>>>
>>>>>> Another update:  Looking more carefully in my appmaster log, I see
>>>>>> the following
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>>> Registering as new framework.
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>>> -----------------------------------------------------------------------------
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Mesos
>>>>>> Info:
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Master
>>>>>> URL: 10.0.18.246:5050
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Framework
>>>>>> Info:
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     ID:
>>>>>> (none)
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Name:
>>>>>> flink-test
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Failover
>>>>>> Timeout (secs): 604800.0
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Role:
>>>>>> *
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Capabilities:
>>>>>> (none)
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Principal:
>>>>>> (none)
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Host:
>>>>>> 311dcf7fd77c
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Web
>>>>>> UI: http://311dcf7fd77c:8081
>>>>>>
>>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>>> -----------------------------------------------------------------------------
>>>>>>
>>>>>> ---
>>>>>>
>>>>>>
>>>>>> which is picking up the mesos.master and
>>>>>> mesos.resourcemanager.framework.name params I am passing to
>>>>>> mesos-appmaster.sh
>>>>>>
>>>>>>
>>>>>> In my Mesos dashboard I can see the framework has been created with
>>>>>> the right name, but has no associated agents/tasks to it. So at least Flink
>>>>>> has been able to connect to the Mesos master to create the framework
>>>>>>
>>>>>>
>>>>>> Later in the mesos-appmaster log is when I see the Mesos connection
>>>>>> errors:
>>>>>>
>>>>>>
>>>>>> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>>> o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager  -
>>>>>> Starting the slot manager.
>>>>>>
>>>>>> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG
>>>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
>>>>>> (StoppedState -> StoppedState) with data ()
>>>>>>
>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>> Trigger heartbeat request.
>>>>>>
>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>>> org.apache.flink.mesos.scheduler.ReconciliationCoordinator  - State
>>>>>> change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
>>>>>>
>>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>> Trigger heartbeat request.
>>>>>>
>>>>>> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Connecting to
>>>>>> Mesos...
>>>>>>
>>>>>> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
>>>>>> (StoppedState -> ConnectingState) with data ()
>>>>>>
>>>>>> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO
>>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Mesos
>>>>>> resource manager started.
>>>>>>
>>>>>> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG
>>>>>> org.apache.flink.mesos.scheduler.LaunchCoordinator  - State change
>>>>>> (Suspended -> Suspended) with data GatherData(List(),List())
>>>>>>
>>>>>> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN
>>>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to
>>>>>> connect to Mesos; still trying...
>>>>>>
>>>>>> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>> Trigger heartbeat request.
>>>>>>
>>>>>> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  -
>>>>>> Trigger heartbeat request.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> So why was the appmaster able to connect to the Mesos master to create
>>>>>> the framework, but then unable to connect for whatever it does afterwards?
>>>>>>
>>>>>>
>>>>>> One possible issue I see is that the framework is registered with the web
>>>>>> UI at http://311dcf7fd77c:8081, which cannot be resolved from the Mesos
>>>>>> master. 311dcf7fd77c is the output of running hostname in the Docker
>>>>>> container, and the Mesos master can not resolve that name. I could try to
>>>>>> replace the Docker container hostname with the Docker host hostname, but
>>>>>> the host port that gets mapped to 8081 on the container is a random port
>>>>>> that I can not know beforehand. Does Mesos master try to reach Flink using
>>>>>> that Web UI setting? Could this be the issue causing my connection problem,
>>>>>> or is this a red herring and the problem is a different one?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>> Javier Vegas
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jv...@strava.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks, Matthias!
>>>>>>>
>>>>>>> There are lots of apps deployed to the Mesos cluster, the task
>>>>>>> manager itself is deployed to Mesos via Marathon.  In the Mesos log I can
>>>>>>> see the Job manager agent starting, but no error messages related to it. As
>>>>>>> you say, TaskManagers don't even have the chance to get confused about
>>>>>>> variables, since the Job Manager can not connect to the Mesos master to
>>>>>>> tell it to start the Task Managers.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Javier
>>>>>>>
>>>>>>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <
>>>>>>> matthias@ververica.com> wrote:
>>>>>>>
>>>>>>>> Hi Javier,
>>>>>>>> I don't see anything that's configured in the wrong way based on
>>>>>>>> the jobmanager logs you've provided. Have you been able to deploy other
>>>>>>>> applications to this Mesos cluster? Do the Mesos master logs reveal
>>>>>>>> anything? The variable resolution on the TaskManager side is a valid
>>>>>>>> concern shared by Roman since it's easy to run into such an issue. But the
>>>>>>>> JobManager logs indicate that the JobManager is not able to contact the
>>>>>>>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>>>>>>>> not coming up.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Matthias
>>>>>>>>
>>>>>>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> No additional ports need to be open as far as I know.
>>>>>>>>>
>>>>>>>>> Probably, $HOSTNAME is substituted for something not resolvable on
>>>>>>>>> TMs?
>>>>>>>>>
>>>>>>>>> Please also make sure that the following gets executed before
>>>>>>>>> mesos-appmaster.sh:
>>>>>>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>>>>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>>>>>>> (as per the documentation you linked)
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Roman
>>>>>>>>>
>>>>>>>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jv...@strava.com>
>>>>>>>>> wrote:
>>>>>>>>> >
>>>>>>>>> > I am trying to start Flink 1.13.2 on Mesos following the
>>>>>>>>> instructions in
>>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>>>>>>>> and using Marathon to deploy a Docker image with both the Flink and my
>>>>>>>>> binaries.
>>>>>>>>> >
>>>>>>>>> > My entrypoint for the Docker image is:
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > /opt/flink/bin/mesos-appmaster.sh \
>>>>>>>>> >
>>>>>>>>> >       -Djobmanager.rpc.address=$HOSTNAME \
>>>>>>>>> >
>>>>>>>>> >       -Dmesos.resourcemanager.framework.user=flink \
>>>>>>>>> >
>>>>>>>>> >       -Dmesos.master=10.0.18.246:5050 \
>>>>>>>>> >
>>>>>>>>> >       -Dmesos.resourcemanager.tasks.cpus=6
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>>>>>>>> >
>>>>>>>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered
>>>>>>>>> on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>>>>>>>> >
>>>>>>>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker
>>>>>>>>> executor on 10.0.20.177
>>>>>>>>> >
>>>>>>>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
>>>>>>>>> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>>>>>>>> >
>>>>>>>>> > WARNING: Your kernel does not support swap limit capabilities or
>>>>>>>>> the cgroup is not mounted. Memory limited without swap.
>>>>>>>>> >
>>>>>>>>> > WARNING: An illegal reflective access operation has occurred
>>>>>>>>> >
>>>>>>>>> > WARNING: Illegal reflective access by
>>>>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>>>> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
>>>>>>>>> sun.security.krb5.Config.getInstance()
>>>>>>>>> >
>>>>>>>>> > WARNING: Please consider reporting this to the maintainers of
>>>>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>>>> >
>>>>>>>>> > WARNING: Use --illegal-access=warn to enable warnings of further
>>>>>>>>> illegal reflective access operations
>>>>>>>>> >
>>>>>>>>> > WARNING: All illegal access operations will be denied in a
>>>>>>>>> future release
>>>>>>>>> >
>>>>>>>>> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
>>>>>>>>> >
>>>>>>>>> > I0927 16:50:43.624439   328 sched.cpp:336] New master detected
>>>>>>>>> at master@10.0.18.246:5050
>>>>>>>>> >
>>>>>>>>> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials
>>>>>>>>> provided. Attempting to register without authentication
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > where the "New master detected" line is promising.
>>>>>>>>> >
>>>>>>>>> > However, on the Flink UI I see only the jobmanager started, and
>>>>>>>>> there are no task managers.  Getting into the Docker container, I see this
>>>>>>>>> in the log:
>>>>>>>>> >
>>>>>>>>> > WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  -
>>>>>>>>> Unable to connect to Mesos; still trying...
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > I have verified that from the container I can access the Mesos
>>>>>>>>> container 10.0.18.246:5050
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Does any other port besides the web UI port 5050 need to be open
>>>>>>>>> for mesos-appmaster to connect with the Mesos master?
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > In the appmaster log (attached) I see one exception, and I don't
>>>>>>>>> know whether it is related to the Mesos connection problem:
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir
>>>>>>>>> are unset.
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>>>>>>>>> >
>>>>>>>>> >         at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>>>>>>> Method)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
>>>>>>>>> Source)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
>>>>>>>>> Source)
>>>>>>>>> >
>>>>>>>>> >         at java.base/java.lang.reflect.Method.invoke(Unknown
>>>>>>>>> Source)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>>>>>>>>> >
>>>>>>>>> >         at
>>>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > I am not trying (yet) to run in high availability mode, so I am
>>>>>>>>> not sure if I need to have HADOOP_HOME set or not, but I don't see anything
>>>>>>>>> about HADOOP_HOME in the Flink docs.
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Any tips on how I can fix my Docker+Marathon+Mesos environment
>>>>>>>>> so Flink can connect to my Mesos master?
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Javier Vegas
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>
>>>>>>>>
>>>>>

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Matthias Pohl <ma...@ververica.com>.
Could you provide the entire JobManager logs so that we can see what's going
on?
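
A note on the "configured hostname is not valid" failure quoted further down:
it showed up when jobmanager.rpc.address was given a combined host:port value,
and the quoted follow-up passes the port separately via jobmanager.rpc.port.
A minimal shell sketch of splitting such a value is below. ADDRESS, JM_HOST,
JM_PORT, and JM_ARGS are illustrative names that are not part of Flink or
Marathon, and the sample value is copied from the quoted log line.

```shell
#!/bin/sh
# Example value from the quoted log; under Marathon it would be "$HOST:$PORT0".
ADDRESS="10.0.22.114:31894"

# Split on the last colon using POSIX parameter expansion.
JM_HOST="${ADDRESS%:*}"    # everything before the last ':'
JM_PORT="${ADDRESS##*:}"   # everything after the last ':'

# Build two separate options instead of one combined host:port value.
JM_ARGS="-Djobmanager.rpc.address=$JM_HOST -Djobmanager.rpc.port=$JM_PORT"
echo "$JM_ARGS"
```

In the thread this change only got past the hostname validation; the appmaster
then failed with the BindException quoted below, so the sketch is not a
complete fix.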

On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas <jv...@strava.com> wrote:

> Thanks again, Matthias!
>
> After passing -Djobmanager.rpc.address=$HOST and -Djobmanager.rpc.port=$PORT0
> as params to mesos-appmaster.sh, I see in the log that they resolve to the
> correct values:
>
> -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009
>
> but a bit later the appmaster dies with this new error. It is unclear what
> address it is trying to bind to. I added the explicit params
> -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port was
> somehow interfering, but that didn't help.
>
> 2021-09-29 10:29:59.845 [main] INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting MesosSessionClusterEntrypoint down with application status FAILED. Diagnostics java.net.BindException: Cannot assign requested address
> 	at java.base/sun.nio.ch.Net.bind0(Native Method)
> 	at java.base/sun.nio.ch.Net.bind(Unknown Source)
> 	at java.base/sun.nio.ch.Net.bind(Unknown Source)
> 	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
> 	at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> 	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> 	at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> 	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> 	at java.base/java.lang.Thread.run(Unknown Source)
>
> On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl <ma...@ververica.com>
> wrote:
>
>> The port has its separate configuration parameter jobmanager.rpc.port [1]
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>>
>> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas <jv...@strava.com> wrote:
>>
>>> Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address
>>> param from $HOSTNAME to $HOST:$PORT0 which in the log I see resolves
>>> properly to the host IP and port mapped to 8081
>>>
>>> 2021-09-29 07:58:05.452 [main] INFO
>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
>>> -Djobmanager.rpc.address=10.0.22.114:31894
>>>
>>> which is very promising. But sadly, a little later the appmaster dies
>>> with this error:
>>>
>>> 2021-09-29 07:58:05.648 [main] INFO
>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
>>> cluster services.
>>> 2021-09-29 07:58:05.674 [main] INFO
>>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
>>> MesosSessionClusterEntrypoint down with application status FAILED.
>>> Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
>>> The configured hostname is not valid
>>> at
>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>> at
>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>> at
>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>> at
>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>> at
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>> at
>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>> at
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>> at
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>> at java.base/java.security.AccessController.doPrivileged(Native Method)
>>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>> at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>> at
>>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>> at
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>> at
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>> at
>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>> Caused by: java.lang.IllegalArgumentException
>>> at
>>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>> at
>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>> ... 17 more
>>> .
>>> 2021-09-29 07:58:05.685 [main] ERROR
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Could not start
>>> cluster entrypoint MesosSessionClusterEntrypoint.
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed
>>> to initialize the cluster entrypoint MesosSessionClusterEntrypoint.
>>> at
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
>>> at
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>>> at
>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>>> Caused by: org.apache.flink.configuration.IllegalConfigurationException:
>>> The configured hostname is not valid
>>> at
>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>>> at
>>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>>> at
>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>>> at
>>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>>> at
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>>> at
>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>>> at
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>>> at
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>>> at java.base/java.security.AccessController.doPrivileged(Native Method)
>>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>>> at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>>> at
>>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>> at
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>>> ... 2 common frames omitted
>>> Caused by: java.lang.IllegalArgumentException: null
>>> at
>>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>>> at
>>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>>> ... 17 common frames omitted
>>>
>>>
>>>
>>> On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl <ma...@ververica.com>
>>> wrote:
>>>
>>>> One thing that was puzzling me yesterday when reading your post: Have
>>>> you tried $HOST instead of $HOSTNAME in the Marathon configuration? When I
>>>> played around with Mesos, I remember using HOST to resolve the host's IP
>>>> address instead of the host's name. It could be that the hostname itself
>>>> cannot be resolved to the right IP address. But I struggled to find proper
>>>> documentation to back that up. Only in the recipes section of the Marathon
>>>> docs [1], HOST was used as well.
>>>>
>>>> Matthias
>>>>
>>>> [1]
>>>> https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
>>>>
>>>> On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas <jv...@strava.com> wrote:
>>>>
>>>>> Another update:  Looking more carefully in my appmaster log, I see the
>>>>> following
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>> Registering as new framework.
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>> -----------------------------------------------------------------------------
>>>>>
>>>>> ---
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Mesos
>>>>> Info:
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Master
>>>>> URL: 10.0.18.246:5050
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Framework
>>>>> Info:
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     ID:
>>>>> (none)
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Name:
>>>>> flink-test
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Failover
>>>>> Timeout (secs): 604800.0
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Role:
>>>>> *
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Capabilities:
>>>>> (none)
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Principal:
>>>>> (none)
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Host:
>>>>> 311dcf7fd77c
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Web
>>>>> UI: http://311dcf7fd77c:8081
>>>>>
>>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>>> -----------------------------------------------------------------------------
>>>>>
>>>>> ---
>>>>>
>>>>>
>>>>> which is picking up the mesos.master and
>>>>> mesos.resourcemanager.framework.name params I am passing to
>>>>> mesos-appmaster.sh
>>>>>
>>>>>
>>>>> In my Mesos dashboard I can see the framework has been created with
>>>>> the right name, but has no associated agents/tasks to it. So at least Flink
>>>>> has been able to connect to the Mesos master to create the framework
>>>>>
>>>>>
>>>>> Later in the mesos-appmaster log is when I see the Mesos connection
>>>>> errors:
>>>>>
>>>>>
>>>>> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>> o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager  -
>>>>> Starting the slot manager.
>>>>>
>>>>> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG
>>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
>>>>> (StoppedState -> StoppedState) with data ()
>>>>>
>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
>>>>> heartbeat request.
>>>>>
>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>> org.apache.flink.mesos.scheduler.ReconciliationCoordinator  - State
>>>>> change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
>>>>>
>>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
>>>>> heartbeat request.
>>>>>
>>>>> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Connecting to
>>>>> Mesos...
>>>>>
>>>>> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
>>>>> (StoppedState -> ConnectingState) with data ()
>>>>>
>>>>> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO
>>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Mesos
>>>>> resource manager started.
>>>>>
>>>>> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG
>>>>> org.apache.flink.mesos.scheduler.LaunchCoordinator  - State change
>>>>> (Suspended -> Suspended) with data GatherData(List(),List())
>>>>>
>>>>> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN
>>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to
>>>>> connect to Mesos; still trying...
>>>>>
>>>>> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
>>>>> heartbeat request.
>>>>>
>>>>> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
>>>>> heartbeat request.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> So why was the appmaster able to connect to the Mesos master to create
>>>>> the framework, but then failed to connect later for whatever it does next?
>>>>>
>>>>>
>>>>> One possible issue I see is that the framework is set with web UI in
>>>>> http://311dcf7fd77c:8081 which can not be resolved from the Mesos
>>>>> master. 311dcf7fd77c is the result of doing hostname on the Docker
>>>>> container, and the Mesos master can not resolve that name. I could try to
>>>>> replace the Docker container hostname with the Docker host hostname, but
>>>>> the host port that gets mapped to 8081 on the container is a random port
>>>>> that I can not know beforehand. Does Mesos master try to reach Flink using
>>>>> that Web UI setting? Could this be the issue causing my connection problem,
>>>>> or is this a red herring and the problem is a different one?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> Javier Vegas
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jv...@strava.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks, Matthias!
>>>>>>
>>>>>> There are lots of apps deployed to the Mesos cluster, the task
>>>>>> manager itself is deployed to Mesos via Marathon.  In the Mesos log I can
>>>>>> see the Job manager agent starting, but no error messages related to it. As
>>>>>> you say, TaskManagers don't even have the chance to get confused about
>>>>>> variables, since the Job Manager can not connect to the Mesos master to
>>>>>> tell it to start the Task Managers.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Javier
>>>>>>
>>>>>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <ma...@ververica.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Javier,
>>>>>>> I don't see anything that's configured in the wrong way based on the
>>>>>>> jobmanager logs you've provided. Have you been able to deploy other
>>>>>>> applications to this Mesos cluster? Do the Mesos master logs reveal
>>>>>>> anything? The variable resolution on the TaskManager side is a valid
>>>>>>> concern shared by Roman since it's easy to run into such an issue. But the
>>>>>>> JobManager logs indicate that the JobManager is not able to contact the
>>>>>>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>>>>>>> not coming up.
>>>>>>>
>>>>>>> Best,
>>>>>>> Matthias
>>>>>>>
>>>>>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> No additional ports need to be open as far as I know.
>>>>>>>>
>>>>>>>> Probably, $HOSTNAME is substituted for something not resolvable on
>>>>>>>> TMs?
>>>>>>>>
>>>>>>>> Please also make sure that the following gets executed before
>>>>>>>> mesos-appmaster.sh:
>>>>>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>>>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>>>>>> (as per the documentation you linked)
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Roman
>>>>>>>>
>>>>>>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jv...@strava.com>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > I am trying to start Flink 1.13.2 on Mesos following the
>>>>>>>> instructions in
>>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>>>>>>> and using Marathon to deploy a Docker image with both the Flink and my
>>>>>>>> binaries.
>>>>>>>> >
>>>>>>>> > My entrypoint for the Docker image is:
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > /opt/flink/bin/mesos-appmaster.sh \
>>>>>>>> >
>>>>>>>> >       -Djobmanager.rpc.address=$HOSTNAME \
>>>>>>>> >
>>>>>>>> >       -Dmesos.resourcemanager.framework.user=flink \
>>>>>>>> >
>>>>>>>> >       -Dmesos.master=10.0.18.246:5050 \
>>>>>>>> >
>>>>>>>> >       -Dmesos.resourcemanager.tasks.cpus=6
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>>>>>>> >
>>>>>>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on
>>>>>>>> agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>>>>>>> >
>>>>>>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker
>>>>>>>> executor on 10.0.20.177
>>>>>>>> >
>>>>>>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
>>>>>>>> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>>>>>>> >
>>>>>>>> > WARNING: Your kernel does not support swap limit capabilities or
>>>>>>>> the cgroup is not mounted. Memory limited without swap.
>>>>>>>> >
>>>>>>>> > WARNING: An illegal reflective access operation has occurred
>>>>>>>> >
>>>>>>>> > WARNING: Illegal reflective access by
>>>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>>> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
>>>>>>>> sun.security.krb5.Config.getInstance()
>>>>>>>> >
>>>>>>>> > WARNING: Please consider reporting this to the maintainers of
>>>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>>> >
>>>>>>>> > WARNING: Use --illegal-access=warn to enable warnings of further
>>>>>>>> illegal reflective access operations
>>>>>>>> >
>>>>>>>> > WARNING: All illegal access operations will be denied in a future
>>>>>>>> release
>>>>>>>> >
>>>>>>>> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
>>>>>>>> >
>>>>>>>> > I0927 16:50:43.624439   328 sched.cpp:336] New master detected at
>>>>>>>> master@10.0.18.246:5050
>>>>>>>> >
>>>>>>>> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials
>>>>>>>> provided. Attempting to register without authentication
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > where the "New master detected" line is promising.
>>>>>>>> >
>>>>>>>> > However, on the Flink UI I see only the jobmanager started, and
>>>>>>>> there are no task managers.  Getting into the Docker container, I see this
>>>>>>>> in the log:
>>>>>>>> >
>>>>>>>> > WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  -
>>>>>>>> Unable to connect to Mesos; still trying...
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > I have verified that from the container I can access the Mesos
>>>>>>>> container 10.0.18.246:5050
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Does any other port besides the web UI port 5050 need to be open
>>>>>>>> for mesos-appmaster to connect with the Mesos master?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > In the appmaster log (attached) I see one exception, and I don't
>>>>>>>> know if it is related to the Mesos connection problem:
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir
>>>>>>>> are unset.
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>>>>>>>> >
>>>>>>>> >         at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>>>>>> Method)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
>>>>>>>> Source)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
>>>>>>>> Source)
>>>>>>>> >
>>>>>>>> >         at java.base/java.lang.reflect.Method.invoke(Unknown
>>>>>>>> Source)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>>>>>>>> >
>>>>>>>> >         at
>>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > I am not trying (yet) to run in high availability mode, so I am
>>>>>>>> not sure if I need to have HADOOP_HOME set or not, but I don't see anything
>>>>>>>> about HADOOP_HOME in the Flink docs.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Any tips on how I can fix my Docker+Marathon+Mesos environment so
>>>>>>>> Flink can connect to my Mesos master?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Javier Vegas
>>>>>>>> >
>>>>>>>> >
>>>>>>>
>>>>>>>
>>>>

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Javier Vegas <jv...@strava.com>.
Thanks again, Matthias!

After passing -Djobmanager.rpc.address=$HOST and -Djobmanager.rpc.port=$PORT0
as params to mesos-appmaster.sh, I see in the log that they resolve to the
correct values:

-Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009

But a bit later the appmaster dies with this new error. It is unclear what
address it is trying to bind. I added the explicit params
-Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port was
somehow interfering, but that didn't help.
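
For reference, the split between address and port can be sketched as below.
This is only an illustration in Python, not part of Flink or Marathon: the
helper name appmaster_flags is hypothetical, and it assumes that HOST and
PORT0 are supplied by Marathon as environment variables and that the host is
an IPv4 address or a plain hostname. It rejects a combined host:port value,
which is what the earlier $HOST:$PORT0 attempt in this thread produced.

```python
def appmaster_flags(env):
    """Build the two -D flags from Marathon-style variables (hypothetical helper).

    Assumes env contains HOST (IPv4 address or hostname, no port) and
    PORT0 (a decimal port number), as in the messages of this thread.
    """
    host = env["HOST"]
    port = env["PORT0"]

    # A value such as "10.0.22.114:31894" is not a bare host; the port
    # belongs in jobmanager.rpc.port instead. (IPv6 is out of scope here.)
    if ":" in host:
        raise ValueError(f"host must not contain a port: {host!r}")
    if not port.isdigit() or not 0 < int(port) < 65536:
        raise ValueError(f"invalid port: {port!r}")

    return [
        f"-Djobmanager.rpc.address={host}",
        f"-Djobmanager.rpc.port={port}",
    ]


if __name__ == "__main__":
    # Sample values taken from the log excerpt in this message.
    print(" ".join(appmaster_flags({"HOST": "10.0.23.35", "PORT0": "31009"})))
```

Keeping the check in one place makes the failure explicit before the
appmaster starts, rather than surfacing later as a configuration exception.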

2021-09-29 10:29:59.845 [main] INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
MesosSessionClusterEntrypoint down with application status FAILED.
Diagnostics java.net.BindException: Cannot assign requested address
	at java.base/sun.nio.ch.Net.bind0(Native Method)
	at java.base/sun.nio.ch.Net.bind(Unknown Source)
	at java.base/sun.nio.ch.Net.bind(Unknown Source)
	at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
	at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
	at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)


.
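
One possible reading of the BindException, offered as an assumption rather
than a confirmed diagnosis: "Cannot assign requested address" is the message
the operating system returns when a process tries to bind a socket to an IP
that is not assigned to any local network interface. Inside a Docker
container using bridge networking, the agent's IP (the value of $HOST) would
typically not be a local address. The Python sketch below is independent of
Flink; the function name can_bind is hypothetical, and it only checks whether
a given IPv4 address can be bound from where it runs.

```python
import errno
import socket


def can_bind(address, port=0):
    """Return True if a TCP socket can be bound to address, else False.

    Port 0 lets the OS pick any free port, so only the address is tested.
    Returns False only for EADDRNOTAVAIL ("Cannot assign requested
    address"); any other error is re-raised unchanged.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((address, port))
        return True
    except OSError as exc:
        if exc.errno == errno.EADDRNOTAVAIL:
            return False
        raise
    finally:
        sock.close()


if __name__ == "__main__":
    # 10.0.23.35 is the address from the log above; run this inside the
    # container to see whether it is local there.
    for candidate in ("127.0.0.1", "0.0.0.0", "10.0.23.35"):
        print(candidate, can_bind(candidate))
```

If the address taken from $HOST reports False inside the container while
0.0.0.0 reports True, that would be consistent with the bind failure in the
log, although it does not by itself show which Flink service attempted the
bind.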


On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl <ma...@ververica.com>
wrote:

> The port has its separate configuration parameter jobmanager.rpc.port [1]
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>
> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas <jv...@strava.com> wrote:
>
>> Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address
>> param from $HOSTNAME to $HOST:$PORT0 which in the log I see resolves
>> properly to the host IP and port mapped to 8081
>>
>> 2021-09-29 07:58:05.452 [main] INFO
>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
>> -Djobmanager.rpc.address=10.0.22.114:31894
>>
>> which is very promising. But sadly a little bit later appmaster dies with
>> this error:
>>
>> 2021-09-29 07:58:05.648 [main] INFO
>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
>> cluster services.
>> 2021-09-29 07:58:05.674 [main] INFO
>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
>> MesosSessionClusterEntrypoint down with application status FAILED.
>> Diagnostics
>> org.apache.flink.configuration.IllegalConfigurationException: The configured hostname is not valid
>> at
>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>> at
>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>> at
>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>> at
>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>> at
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>> at
>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>> at
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>> at
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>> at java.base/java.security.AccessController.doPrivileged(Native Method)
>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>> at
>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>> at
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>> at
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>> at
>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>> Caused by: java.lang.IllegalArgumentException
>> at
>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>> at
>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>> ... 17 more
>> .
>> 2021-09-29 07:58:05.685 [main] ERROR
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Could not start
>> cluster entrypoint MesosSessionClusterEntrypoint.
>> org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to
>> initialize the cluster entrypoint MesosSessionClusterEntrypoint.
>> at
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
>> at
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
>> at
>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
>> Caused by: org.apache.flink.configuration.IllegalConfigurationException:
>> The configured hostname is not valid
>> at
>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>> at
>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>> at
>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>> at
>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
>> at
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
>> at
>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
>> at
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
>> at
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
>> at java.base/java.security.AccessController.doPrivileged(Native Method)
>> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>> at
>> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>> at
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
>> ... 2 common frames omitted
>> Caused by: java.lang.IllegalArgumentException: null
>> at
>> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
>> at
>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
>> ... 17 common frames omitted
>>
>>
>>
>> On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl <ma...@ververica.com>
>> wrote:
>>
>>> One thing that was puzzling me yesterday when reading your post: Have
>>> you tried $HOST instead of $HOSTNAME in the Marathon configuration? When I
>>> played around with Mesos, I remember using HOST to resolve the host's IP
>>> address instead of the host's name. It could be that the hostname itself
>>> cannot be resolved to the right IP address. But I struggled to find proper
>>> documentation to back that up. Only in the recipes section of the Marathon
>>> docs [1], HOST was used as well.
>>>
>>> Matthias
>>>
>>> [1]
>>> https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
>>>
>>> On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas <jv...@strava.com> wrote:
>>>
>>>> Another update:  Looking more carefully in my appmaster log, I see the
>>>> following
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>> Registering as new framework.
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>> -----------------------------------------------------------------------------
>>>>
>>>> ---
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Mesos
>>>> Info:
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Master
>>>> URL: 10.0.18.246:5050
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Framework
>>>> Info:
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     ID:
>>>> (none)
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Name:
>>>> flink-test
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Failover
>>>> Timeout (secs): 604800.0
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Role:
>>>> *
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Capabilities:
>>>> (none)
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Principal:
>>>> (none)
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Host:
>>>> 311dcf7fd77c
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Web
>>>> UI: http://311dcf7fd77c:8081
>>>>
>>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>>> -----------------------------------------------------------------------------
>>>>
>>>> ---
>>>>
>>>>
>>>> which is picking up the mesos.master and
>>>> mesos.resourcemanager.framework.name params I am passing to
>>>> mesos-appmaster.sh
>>>>
>>>>
>>>> In my Mesos dashboard I can see the framework has been created with the
>>>> right name, but has no associated agents/tasks to it. So at least Flink has
>>>> been able to connect to the Mesos master to create the framework
>>>>
>>>>
>>>> Later in the mesos-appmaster log is when I see the Mesos connection
>>>> errors:
>>>>
>>>>
>>>> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>> o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager  - Starting
>>>> the slot manager.
>>>>
>>>> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG
>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
>>>> (StoppedState -> StoppedState) with data ()
>>>>
>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
>>>> heartbeat request.
>>>>
>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>> org.apache.flink.mesos.scheduler.ReconciliationCoordinator  - State
>>>> change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
>>>>
>>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
>>>> heartbeat request.
>>>>
>>>> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO
>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Connecting to
>>>> Mesos...
>>>>
>>>> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
>>>> (StoppedState -> ConnectingState) with data ()
>>>>
>>>> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO
>>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Mesos
>>>> resource manager started.
>>>>
>>>> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG
>>>> org.apache.flink.mesos.scheduler.LaunchCoordinator  - State change
>>>> (Suspended -> Suspended) with data GatherData(List(),List())
>>>>
>>>> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN
>>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to
>>>> connect to Mesos; still trying...
>>>>
>>>> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
>>>> heartbeat request.
>>>>
>>>> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG
>>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
>>>> heartbeat request.
>>>>
>>>>
>>>>
>>>>
>>>> So why was the appmaster able to connect to the Mesos master to create the
>>>> framework, but unable to connect later for whatever it does next?
>>>>
>>>>
>>>> One possible issue I see is that the framework is set with web UI in
>>>> http://311dcf7fd77c:8081 which can not be resolved from the Mesos
>>>> master. 311dcf7fd77c is the result of doing hostname on the Docker
>>>> container, and the Mesos master can not resolve that name. I could try to
>>>> replace the Docker container hostname with the Docker host hostname, but
>>>> the host port that gets mapped to 8081 on the container is a random port
>>>> that I can not know beforehand. Does Mesos master try to reach Flink using
>>>> that Web UI setting? Could this be the issue causing my connection problem,
>>>> or is this a red herring and the problem is a different one?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> Javier Vegas
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jv...@strava.com>
>>>> wrote:
>>>>
>>>>> Thanks, Matthias!
>>>>>
>>>>> There are lots of apps deployed to the Mesos cluster, the task manager
>>>>> itself is deployed to Mesos via Marathon.  In the Mesos log I can see the
>>>>> Job manager agent starting, but no error messages related to it. As you
>>>>> say, TaskManagers don't even have the chance to get confused about
>>>>> variables, since the Job Manager can not connect to the Mesos master to
>>>>> tell it to start the Task Managers.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Javier
>>>>>
>>>>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <ma...@ververica.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Javier,
>>>>>> I don't see anything that's configured in the wrong way based on the
>>>>>> jobmanager logs you've provided. Have you been able to deploy other
>>>>>> applications to this Mesos cluster? Do the Mesos master logs reveal
>>>>>> anything? The variable resolution on the TaskManager side is a valid
>>>>>> concern shared by Roman since it's easy to run into such an issue. But the
>>>>>> JobManager logs indicate that the JobManager is not able to contact the
>>>>>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>>>>>> not coming up.
>>>>>>
>>>>>> Best,
>>>>>> Matthias
>>>>>>
>>>>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> No additional ports need to be open as far as I know.
>>>>>>>
>>>>>>> Probably, $HOSTNAME is substituted for something not resolvable on
>>>>>>> TMs?
>>>>>>>
>>>>>>> Please also make sure that the following gets executed before
>>>>>>> mesos-appmaster.sh:
>>>>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>>>>> (as per the documentation you linked)
>>>>>>>
>>>>>>> Regards,
>>>>>>> Roman
>>>>>>>
>>>>>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jv...@strava.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > I am trying to start Flink 1.13.2 on Mesos following the
>>>>>>> instructions in
>>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>>>>>> and using Marathon to deploy a Docker image with both the Flink and my
>>>>>>> binaries.
>>>>>>> >
>>>>>>> > My entrypoint for the Docker image is:
>>>>>>> >
>>>>>>> >
>>>>>>> > /opt/flink/bin/mesos-appmaster.sh \
>>>>>>> >
>>>>>>> >       -Djobmanager.rpc.address=$HOSTNAME \
>>>>>>> >
>>>>>>> >       -Dmesos.resourcemanager.framework.user=flink \
>>>>>>> >
>>>>>>> >       -Dmesos.master=10.0.18.246:5050 \
>>>>>>> >
>>>>>>> >       -Dmesos.resourcemanager.tasks.cpus=6
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>>>>>> >
>>>>>>> >
>>>>>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>>>>>> >
>>>>>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on
>>>>>>> agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>>>>>> >
>>>>>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker
>>>>>>> executor on 10.0.20.177
>>>>>>> >
>>>>>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
>>>>>>> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>>>>>> >
>>>>>>> > WARNING: Your kernel does not support swap limit capabilities or
>>>>>>> the cgroup is not mounted. Memory limited without swap.
>>>>>>> >
>>>>>>> > WARNING: An illegal reflective access operation has occurred
>>>>>>> >
>>>>>>> > WARNING: Illegal reflective access by
>>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
>>>>>>> sun.security.krb5.Config.getInstance()
>>>>>>> >
>>>>>>> > WARNING: Please consider reporting this to the maintainers of
>>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>>> >
>>>>>>> > WARNING: Use --illegal-access=warn to enable warnings of further
>>>>>>> illegal reflective access operations
>>>>>>> >
>>>>>>> > WARNING: All illegal access operations will be denied in a future
>>>>>>> release
>>>>>>> >
>>>>>>> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
>>>>>>> >
>>>>>>> > I0927 16:50:43.624439   328 sched.cpp:336] New master detected at
>>>>>>> master@10.0.18.246:5050
>>>>>>> >
>>>>>>> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials
>>>>>>> provided. Attempting to register without authentication
>>>>>>> >
>>>>>>> >
>>>>>>> > where the "New master detected" line is promising.
>>>>>>> >
>>>>>>> > However, on the Flink UI I see only the jobmanager started, and
>>>>>>> there are no task managers.  Getting into the Docker container, I see this
>>>>>>> in the log:
>>>>>>> >
>>>>>>> > WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable
>>>>>>> to connect to Mesos; still trying...
>>>>>>> >
>>>>>>> >
>>>>>>> > I have verified that from the container I can access the Mesos
>>>>>>> container 10.0.18.246:5050
>>>>>>> >
>>>>>>> >
>>>>>>> > Does any other port besides the web UI port 5050 need to be open
>>>>>>> for mesos-appmaster to connect with the Mesos master?
>>>>>>> >
>>>>>>> >
>>>>>>> > In the appmaster log (attached) I see one exception, and I don't
>>>>>>> know whether it is related to the Mesos connection problem:
>>>>>>> >
>>>>>>> >
>>>>>>> > java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are
>>>>>>> unset.
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>>>>>>> >
>>>>>>> >         at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>>>>>>> >
>>>>>>> >         at
>>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>>>>> Method)
>>>>>>> >
>>>>>>> >         at
>>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
>>>>>>> Source)
>>>>>>> >
>>>>>>> >         at
>>>>>>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
>>>>>>> Source)
>>>>>>> >
>>>>>>> >         at java.base/java.lang.reflect.Method.invoke(Unknown
>>>>>>> Source)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>>>>>>> >
>>>>>>> >         at
>>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > I am not trying (yet) to run in high availability mode, so I am
>>>>>>> not sure if I need to have HADOOP_HOME set or not, but I don't see anything
>>>>>>> about HADOOP_HOME in the Flink docs.
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > Any tips on how I can fix my Docker+Marathon+Mesos environment so
>>>>>>> Flink can connect to my Mesos master?
>>>>>>> >
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> >
>>>>>>> >
>>>>>>> > Javier Vegas
>>>>>>> >
>>>>>>> >
>>>>>>
>>>>>>
>>>

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Matthias Pohl <ma...@ververica.com>.
The port has its separate configuration parameter jobmanager.rpc.port [1]

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
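
As a rough illustration of keeping the two settings apart, here is a minimal
shell sketch. It assumes the endpoint arrives as a single "host:port" string
(like the 10.0.22.114:31894 value in the quoted log) and splits it into one
option per parameter. The helper name split_rpc_endpoint is hypothetical and
not part of Flink or Marathon. The "configured hostname is not valid" error in
the quoted log looks consistent with the address value still carrying the
":port" suffix, but that is an assumption worth verifying.

```shell
#!/bin/sh
# Sketch: split a "host:port" endpoint into the two separate Flink options.
# split_rpc_endpoint is a hypothetical helper, not part of Flink.
split_rpc_endpoint() (
  endpoint="$1"
  case "$endpoint" in
    *:*)
      # Text before the last colon is the host, text after it is the port.
      printf '%s\n' "-Djobmanager.rpc.address=${endpoint%:*}" \
                    "-Djobmanager.rpc.port=${endpoint##*:}"
      ;;
    *)
      # No port given: pass only the address and leave the port unset.
      printf '%s\n' "-Djobmanager.rpc.address=${endpoint}"
      ;;
  esac
)

# Example with the value from the quoted log.
split_rpc_endpoint "10.0.22.114:31894"
# prints:
# -Djobmanager.rpc.address=10.0.22.114
# -Djobmanager.rpc.port=31894
```

Which port is the right one to pass depends on the container's network setup.
With Docker bridge networking the host port assigned by Marathon may differ
from the port Flink binds to inside the container, so treat the port value
here as a placeholder.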

On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas <jv...@strava.com> wrote:

> Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address
> param from $HOSTNAME to $HOST:$PORT0 which in the log I see resolves
> properly to the host IP and port mapped to 8081
>
> 2021-09-29 07:58:05.452 [main] INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
> -Djobmanager.rpc.address=10.0.22.114:31894
>
> which is very promising. But sadly a little bit later appmaster dies with
> this error:
>
> 2021-09-29 07:58:05.648 [main] INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
> cluster services.
> 2021-09-29 07:58:05.674 [main] INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
> MesosSessionClusterEntrypoint down with application status FAILED.
> Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
> The configured hostname is not valid
> at
> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
> at
> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
> at
> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
> at
> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
> at
> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at
> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
> at
> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
> Caused by: java.lang.IllegalArgumentException
> at
> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
> at
> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
> ... 17 more
> .
> 2021-09-29 07:58:05.685 [main] ERROR
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Could not start
> cluster entrypoint MesosSessionClusterEntrypoint.
> org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to
> initialize the cluster entrypoint MesosSessionClusterEntrypoint.
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
> at
> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
> Caused by: org.apache.flink.configuration.IllegalConfigurationException:
> The configured hostname is not valid
> at
> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
> at
> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
> at
> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
> at
> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
> at
> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Unknown Source)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at
> org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
> ... 2 common frames omitted
> Caused by: java.lang.IllegalArgumentException: null
> at
> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
> at
> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
> ... 17 common frames omitted
>
>
>
> On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl <ma...@ververica.com>
> wrote:
>
>> One thing that was puzzling me yesterday when reading your post: Have you
>> tried $HOST instead of $HOSTNAME in the Marathon configuration? When I
>> played around with Mesos, I remember using HOST to resolve the host's IP
>> address instead of the host's name. It could be that the hostname itself
>> cannot be resolved to the right IP address. But I struggled to find proper
>> documentation to back that up. Only in the recipes section of the Marathon
>> docs [1], HOST was used as well.
>>
>> Matthias
>>
>> [1]
>> https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
>>
>> On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas <jv...@strava.com> wrote:
>>
>>> Another update:  Looking more carefully in my appmaster log, I see the
>>> following
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>> Registering as new framework.
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>> -----------------------------------------------------------------------------
>>>
>>> ---
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Mesos
>>> Info:
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Master
>>> URL: 10.0.18.246:5050
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Framework
>>> Info:
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     ID:
>>> (none)
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Name:
>>> flink-test
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Failover
>>> Timeout (secs): 604800.0
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Role:
>>> *
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Capabilities:
>>> (none)
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Principal:
>>> (none)
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Host:
>>> 311dcf7fd77c
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Web
>>> UI: http://311dcf7fd77c:8081
>>>
>>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>>> -----------------------------------------------------------------------------
>>>
>>> ---
>>>
>>>
>>> which is picking up the mesos.master and
>>> mesos.resourcemanager.framework.name params I am passing to
>>> mesos-appmaster.sh
>>>
>>>
>>> In my Mesos dashboard I can see the framework has been created with the
>>> right name, but has no associated agents/tasks to it. So at least Flink has
>>> been able to connect to the Mesos master to create the framework
>>>
>>>
>>> Later in the mesos-appmaster log is when I see the Mesos connection
>>> errors:
>>>
>>>
>>> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG
>>> o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager  - Starting
>>> the slot manager.
>>>
>>> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG
>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
>>> (StoppedState -> StoppedState) with data ()
>>>
>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
>>> heartbeat request.
>>>
>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
>>> org.apache.flink.mesos.scheduler.ReconciliationCoordinator  - State
>>> change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
>>>
>>> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
>>> heartbeat request.
>>>
>>> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO
>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Connecting to
>>> Mesos...
>>>
>>> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG
>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
>>> (StoppedState -> ConnectingState) with data ()
>>>
>>> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO
>>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Mesos
>>> resource manager started.
>>>
>>> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG
>>> org.apache.flink.mesos.scheduler.LaunchCoordinator  - State change
>>> (Suspended -> Suspended) with data GatherData(List(),List())
>>>
>>> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN
>>> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to connect
>>> to Mesos; still trying...
>>>
>>> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG
>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
>>> heartbeat request.
>>>
>>> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG
>>> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
>>> heartbeat request.
>>>
>>>
>>>
>>>
>>> So why was the appmaster able to connect to the Mesos master to create the
>>> framework, but unable to connect later for whatever it does next?
>>>
>>>
>>> One possible issue I see is that the framework is set with web UI in
>>> http://311dcf7fd77c:8081 which can not be resolved from the Mesos
>>> master. 311dcf7fd77c is the result of doing hostname on the Docker
>>> container, and the Mesos master can not resolve that name. I could try to
>>> replace the Docker container hostname with the Docker host hostname, but
>>> the host port that gets mapped to 8081 on the container is a random port
>>> that I can not know beforehand. Does Mesos master try to reach Flink using
>>> that Web UI setting? Could this be the issue causing my connection problem,
>>> or is this a red herring and the problem is a different one?
>>>
>>>
>>> Thanks,
>>>
>>>
>>> Javier Vegas
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jv...@strava.com> wrote:
>>>
>>>> Thanks, Matthias!
>>>>
>>>> There are lots of apps deployed to the Mesos cluster, the task manager
>>>> itself is deployed to Mesos via Marathon.  In the Mesos log I can see the
>>>> Job manager agent starting, but no error messages related to it. As you
>>>> say, TaskManagers don't even have the chance to get confused about
>>>> variables, since the Job Manager can not connect to the Mesos master to
>>>> tell it to start the Task Managers.
>>>>
>>>> Thanks,
>>>>
>>>> Javier
>>>>
>>>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <ma...@ververica.com>
>>>> wrote:
>>>>
>>>>> Hi Javier,
>>>>> I don't see anything that's configured in the wrong way based on the
>>>>> jobmanager logs you've provided. Have you been able to deploy other
>>>>> applications to this Mesos cluster? Do the Mesos master logs reveal
>>>>> anything? The variable resolution on the TaskManager side is a valid
>>>>> concern shared by Roman since it's easy to run into such an issue. But the
>>>>> JobManager logs indicate that the JobManager is not able to contact the
>>>>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>>>>> not coming up.
>>>>>
>>>>> Best,
>>>>> Matthias
>>>>>
>>>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> No additional ports need to be open as far as I know.
>>>>>>
>>>>>> Probably, $HOSTNAME is substituted for something not resolvable on
>>>>>> TMs?
>>>>>>
>>>>>> Please also make sure that the following gets executed before
>>>>>> mesos-appmaster.sh:
>>>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>>>> (as per the documentation you linked)
>>>>>>
>>>>>> Regards,
>>>>>> Roman
>>>>>>
>>>>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jv...@strava.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > I am trying to start Flink 1.13.2 on Mesos following the
>>>>>> instructions in
>>>>>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>>>>> and using Marathon to deploy a Docker image with both the Flink and my
>>>>>> binaries.
>>>>>> >
>>>>>> > My entrypoint for the Docker image is:
>>>>>> >
>>>>>> >
>>>>>> > /opt/flink/bin/mesos-appmaster.sh \
>>>>>> >
>>>>>> >       -Djobmanager.rpc.address=$HOSTNAME \
>>>>>> >
>>>>>> >       -Dmesos.resourcemanager.framework.user=flink \
>>>>>> >
>>>>>> >       -Dmesos.master=10.0.18.246:5050 \
>>>>>> >
>>>>>> >       -Dmesos.resourcemanager.tasks.cpus=6
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>>>>> >
>>>>>> >
>>>>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>>>>> >
>>>>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on
>>>>>> agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>>>>> >
>>>>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker
>>>>>> executor on 10.0.20.177
>>>>>> >
>>>>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
>>>>>> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>>>>> >
>>>>>> > WARNING: Your kernel does not support swap limit capabilities or
>>>>>> the cgroup is not mounted. Memory limited without swap.
>>>>>> >
>>>>>> > WARNING: An illegal reflective access operation has occurred
>>>>>> >
>>>>>> > WARNING: Illegal reflective access by
>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
>>>>>> sun.security.krb5.Config.getInstance()
>>>>>> >
>>>>>> > WARNING: Please consider reporting this to the maintainers of
>>>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>>>> >
>>>>>> > WARNING: Use --illegal-access=warn to enable warnings of further
>>>>>> illegal reflective access operations
>>>>>> >
>>>>>> > WARNING: All illegal access operations will be denied in a future
>>>>>> release
>>>>>> >
>>>>>> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
>>>>>> >
>>>>>> > I0927 16:50:43.624439   328 sched.cpp:336] New master detected at
>>>>>> master@10.0.18.246:5050
>>>>>> >
>>>>>> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials provided.
>>>>>> Attempting to register without authentication
>>>>>> >
>>>>>> >
>>>>>> > where the "New master detected" line is promising.
>>>>>> >
>>>>>> > However, on the Flink UI I see only the jobmanager started, and
>>>>>> there are no task managers.  Getting into the Docker container, I see this
>>>>>> in the log:
>>>>>> >
>>>>>> > WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable
>>>>>> to connect to Mesos; still trying...
>>>>>> >
>>>>>> >
>>>>>> > I have verified that from the container I can access the Mesos
>>>>>> container 10.0.18.246:5050
>>>>>> >
>>>>>> >
>>>>>> > Does any other port besides the web UI port 5050 need to be open
>>>>>> for mesos-appmaster to connect with the Mesos master?
>>>>>> >
>>>>>> >
>>>>>> > In the appmaster log (attached) I see one exception, and I don't
>>>>>> know whether it is related to the Mesos connection problem:
>>>>>> >
>>>>>> >
>>>>>> > java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are
>>>>>> unset.
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>>>>>> >
>>>>>> >         at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>>>>>> >
>>>>>> >         at
>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>>>> Method)
>>>>>> >
>>>>>> >         at
>>>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
>>>>>> Source)
>>>>>> >
>>>>>> >         at
>>>>>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
>>>>>> Source)
>>>>>> >
>>>>>> >         at java.base/java.lang.reflect.Method.invoke(Unknown Source)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>>>>>> >
>>>>>> >         at
>>>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > I am not trying (yet) to run in high availability mode, so I am not
>>>>>> sure if I need to have HADOOP_HOME set or not, but I don't see anything
>>>>>> about HADOOP_HOME in the Flink docs.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > Any tips on how I can fix my Docker+Marathon+Mesos environment so
>>>>>> Flink can connect to my Mesos master?
>>>>>> >
>>>>>> >
>>>>>> > Thanks,
>>>>>> >
>>>>>> >
>>>>>> > Javier Vegas
>>>>>> >
>>>>>> >
>>>>>
>>>>>
>>

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Javier Vegas <jv...@strava.com>.
Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address
param from $HOSTNAME to $HOST:$PORT0, which, according to the log, resolves
correctly to the host IP and the host port mapped to 8081:

2021-09-29 07:58:05.452 [main] INFO
 org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
-Djobmanager.rpc.address=10.0.22.114:31894

which is very promising. But sadly, a little later the appmaster dies with
this error:

2021-09-29 07:58:05.648 [main] INFO
 org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
cluster services.
2021-09-29 07:58:05.674 [main] INFO
 org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
MesosSessionClusterEntrypoint down with application status FAILED.
Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
The configured hostname is not valid
at
org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
at
org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
at
org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
at
org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
at
org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Unknown Source)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at
org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
at
org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
Caused by: java.lang.IllegalArgumentException
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
at
org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
... 17 more
.
2021-09-29 07:58:05.685 [main] ERROR
org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Could not start
cluster entrypoint MesosSessionClusterEntrypoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to
initialize the cluster entrypoint MesosSessionClusterEntrypoint.
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:212)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:600)
at
org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:114)
Caused by: org.apache.flink.configuration.IllegalConfigurationException:
The configured hostname is not valid
at
org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
at
org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
at
org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
at
org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRpcServiceUtils.java:92)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:294)
at
org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.initializeServices(MesosSessionClusterEntrypoint.java:61)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:239)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:189)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Unknown Source)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at
org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
... 2 common frames omitted
Caused by: java.lang.IllegalArgumentException: null
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
at
org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
... 17 common frames omitted
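
The trace above fails inside NetUtils.unresolvedHostToNormalizedString, and
the configured value 10.0.22.114:31894 carries a port suffix, so one plausible
reading is that jobmanager.rpc.address expects a bare host. A minimal sketch,
assuming that reading is right: strip a trailing ":port" before passing the
value on. The helper name strip_port is hypothetical, and the sample address
is the one from the log above.

```shell
# Hypothetical helper: drop a trailing ":<digits>" from a host value.
# Intended for hostnames and IPv4 addresses only (not IPv6 literals).
strip_port() {
  host_part="${1%:*}"    # everything before the last colon
  port_part="${1##*:}"   # everything after the last colon
  case "$port_part" in
    ''|*[!0-9]*)
      # No purely numeric suffix: return the input unchanged.
      printf '%s\n' "$1" ;;
    *)
      if [ "$host_part" = "$1" ]; then
        # No colon at all (input is all digits): return it unchanged.
        printf '%s\n' "$1"
      else
        printf '%s\n' "$host_part"
      fi ;;
  esac
}

# Sample value taken from the log line above.
RPC_ADDRESS="$(strip_port "10.0.22.114:31894")"
echo "-Djobmanager.rpc.address=${RPC_ADDRESS}"
```

In the entrypoint the same effect could be had by passing $HOST alone. Ports
would then go to their own options (for example jobmanager.rpc.port or
rest.port); which Marathon port maps to which is deployment-specific and
would need to be verified.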




Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Matthias Pohl <ma...@ververica.com>.
One thing that was puzzling me yesterday when reading your post: Have you
tried $HOST instead of $HOSTNAME in the Marathon configuration? When I
played around with Mesos, I remember using HOST to resolve the host's IP
address instead of the host's name. It could be that the hostname itself
cannot be resolved to the right IP address. However, I struggled to find
proper documentation to back that up; the only place I found HOST being used
is the recipes section of the Marathon docs [1].

Matthias

[1]
https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
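
A minimal sketch of the suggestion above, assuming Marathon exports HOST for
the task and that the container's own HOSTNAME is only a fallback. The helper
name advertised_host is hypothetical, and the sample values are taken from
the logs in this thread.

```shell
# Hypothetical helper: choose the address to advertise.
# $1 = value of Marathon's HOST (may be empty), $2 = container HOSTNAME.
advertised_host() {
  if [ -n "$1" ]; then
    printf '%s\n' "$1"
  else
    printf '%s\n' "$2"
  fi
}

# Sample values from this thread: agent IP and Docker container hostname.
ADDR="$(advertised_host "10.0.22.114" "311dcf7fd77c")"
echo "-Djobmanager.rpc.address=${ADDR}"
```

In a real entrypoint the call would be
advertised_host "${HOST:-}" "${HOSTNAME:-}", so that the container hostname
is used only when HOST is not set.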

On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas <jv...@strava.com> wrote:

> Another update:  Looking more carefully in my appmaster log, I see the
> following
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
> Registering as new framework.
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
> -----------------------------------------------------------------------------
>
> ---
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Mesos
> Info:
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Master
> URL: 10.0.18.246:5050
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Framework
> Info:
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     ID:
> (none)
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Name:
> flink-test
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Failover
> Timeout (secs): 604800.0
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Role: *
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Capabilities:
> (none)
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Principal:
> (none)
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Host:
> 311dcf7fd77c
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Web
> UI: http://311dcf7fd77c:8081
>
> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
> -----------------------------------------------------------------------------
>
> ---
>
>
> which is picking up the mesos.master and
> mesos.resourcemanager.framework.name params I am passing to
> mesos-appmaster.sh
>
>
> In my Mesos dashboard I can see the framework has been created with the
> right name, but has no associated agents/tasks to it. So at least Flink has
> been able to connect to the Mesos master to create the framework
>
>
> Later in the mesos-appmaster log is when I see the Mesos connection errors:
>
>
> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG
> o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager  - Starting
> the slot manager.
>
> 2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG
> org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
> (StoppedState -> StoppedState) with data ()
>
> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
> heartbeat request.
>
> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
> org.apache.flink.mesos.scheduler.ReconciliationCoordinator  - State
> change (Suspended -> Suspended) with data ReconciliationData(Map(),0)
>
> 2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG
> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
> heartbeat request.
>
> 2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO
> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Connecting to
> Mesos...
>
> 2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG
> org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change
> (StoppedState -> ConnectingState) with data ()
>
> 2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO
> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Mesos
> resource manager started.
>
> 2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG
> org.apache.flink.mesos.scheduler.LaunchCoordinator  - State change
> (Suspended -> Suspended) with data GatherData(List(),List())
>
> 2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN
> org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to connect
> to Mesos; still trying...
>
> 2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG
> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
> heartbeat request.
>
> 2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG
> o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger
> heartbeat request.
>
>
>
>
> So why was the appmaster able to connect to the Mesos master to create the
> framework, but unable to connect later for whatever it does next?
>
>
> One possible issue I see is that the framework is set with web UI in
> http://311dcf7fd77c:8081 which can not be resolved from the Mesos master. 311dcf7fd77c
> is the result of doing hostname on the Docker container, and the Mesos
> master can not resolve that name. I could try to replace the Docker
> container hostname with the Docker host hostname, but the host port that
> gets mapped to 8081 on the container is a random port that I can not know
> beforehand. Does Mesos master try to reach Flink using that Web UI setting?
> Could this be the issue causing my connection problem, or is this a red
> herring and the problem is a different one?
>
>
> Thanks,
>
>
> Javier Vegas
>
>
>
>
>
>
>
>
> On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jv...@strava.com> wrote:
>
>> Thanks, Matthias!
>>
>> There are lots of apps deployed to the Mesos cluster, the task manager
>> itself is deployed to Mesos via Marathon.  In the Mesos log I can see the
>> Job manager agent starting, but no error messages related to it. As you
>> say, TaskManagers don't even have the chance to get confused about
>> variables, since the Job Manager can not connect to the Mesos master to
>> tell it to start the Task Managers.
>>
>> Thanks,
>>
>> Javier
>>
>> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <ma...@ververica.com>
>> wrote:
>>
>>> Hi Javier,
>>> I don't see anything that's configured in the wrong way based on the
>>> jobmanager logs you've provided. Have you been able to deploy other
>>> applications to this Mesos cluster? Do the Mesos master logs reveal
>>> anything? The variable resolution on the TaskManager side is a valid
>>> concern shared by Roman since it's easy to run into such an issue. But the
>>> JobManager logs indicate that the JobManager is not able to contact the
>>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>>> not coming up.
>>>
>>> Best,
>>> Matthias
>>>
>>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> No additional ports need to be open as far as I know.
>>>>
>>>> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>>>>
>>>> Please also make sure that the following gets executed before
>>>> mesos-appmaster.sh:
>>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>>> (as per the documentation you linked)
>>>>
>>>> Regards,
>>>> Roman
>>>>
>>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jv...@strava.com> wrote:
>>>> >
>>>> > I am trying to start Flink 1.13.2 on Mesos following the instructions
>>>> in
>>>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>>> and using Marathon to deploy a Docker image with both the Flink and my
>>>> binaries.
>>>> >
>>>> > My entrypoint for the Docker image is:
>>>> >
>>>> >
>>>> > /opt/flink/bin/mesos-appmaster.sh \
>>>> >
>>>> >       -Djobmanager.rpc.address=$HOSTNAME \
>>>> >
>>>> >       -Dmesos.resourcemanager.framework.user=flink \
>>>> >
>>>> >       -Dmesos.master=10.0.18.246:5050 \
>>>> >
>>>> >       -Dmesos.resourcemanager.tasks.cpus=6
>>>> >
>>>> >
>>>> >
>>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>>> >
>>>> >
>>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>>> >
>>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on
>>>> agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>>> >
>>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker
>>>> executor on 10.0.20.177
>>>> >
>>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
>>>> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>>> >
>>>> > WARNING: Your kernel does not support swap limit capabilities or the
>>>> cgroup is not mounted. Memory limited without swap.
>>>> >
>>>> > WARNING: An illegal reflective access operation has occurred
>>>> >
>>>> > WARNING: Illegal reflective access by
>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
>>>> sun.security.krb5.Config.getInstance()
>>>> >
>>>> > WARNING: Please consider reporting this to the maintainers of
>>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>>> >
>>>> > WARNING: Use --illegal-access=warn to enable warnings of further
>>>> illegal reflective access operations
>>>> >
>>>> > WARNING: All illegal access operations will be denied in a future
>>>> release
>>>> >
>>>> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
>>>> >
>>>> > I0927 16:50:43.624439   328 sched.cpp:336] New master detected at
>>>> master@10.0.18.246:5050
>>>> >
>>>> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials provided.
>>>> Attempting to register without authentication
>>>> >
>>>> >
>>>> > where the "New master detected" line is promising.
>>>> >
>>>> > However, on the Flink UI I see only the jobmanager started, and there
>>>> are no task managers.  Getting into the Docker container, I see this in the
>>>> log:
>>>> >
>>>> > WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to
>>>> connect to Mesos; still trying...
>>>> >
>>>> >
>>>> > I have verified that from the container I can access the Mesos
>>>> container 10.0.18.246:5050
>>>> >
>>>> >
>>>> > Does any other port besides the web UI port 5050 need to be open for
>>>> mesos-appmaster to connect with the Mesos master?
>>>> >
>>>> >
>>>> > In the appmaster log (attached) I see one exception, and I don't know
>>>> if it is related to the Mesos connection problem:
>>>> >
>>>> >
>>>> > java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are
>>>> unset.
>>>> >
>>>> >         at
>>>> org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>>>> >
>>>> >         at
>>>> org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>>>> >
>>>> >         at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>>>> >
>>>> >         at
>>>> org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>>>> >
>>>> >         at
>>>> org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>>>> >
>>>> >         at
>>>> org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>>>> >
>>>> >         at
>>>> org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>>>> >
>>>> >         at
>>>> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>>>> >
>>>> >         at
>>>> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>>>> >
>>>> >         at
>>>> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>>>> >
>>>> >         at
>>>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>>>> >
>>>> >         at
>>>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>>>> >
>>>> >         at
>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
>>>> Method)
>>>> >
>>>> >         at
>>>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
>>>> Source)
>>>> >
>>>> >         at
>>>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
>>>> Source)
>>>> >
>>>> >         at java.base/java.lang.reflect.Method.invoke(Unknown Source)
>>>> >
>>>> >         at
>>>> org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>>>> >
>>>> >         at
>>>> org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>>>> >
>>>> >         at
>>>> org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > I am not trying (yet) to run in high availability mode, so I am not
>>>> sure if I need to have HADOOP_HOME set or not, but I don't see anything
>>>> about HADOOP_HOME in the Flink docs.
>>>> >
>>>> >
>>>> >
>>>> > Any tips on how I can fix my Docker+Marathon+Mesos environment so
>>>> Flink can connect to my Mesos master?
>>>> >
>>>> >
>>>> > Thanks,
>>>> >
>>>> >
>>>> > Javier Vegas
>>>> >
>>>> >
>>>
>>>

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Javier Vegas <jv...@strava.com>.
Another update: looking more carefully at my appmaster log, I see the
following:

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Registering as new framework.

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - --------------------------------------------------------------------------------

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Mesos Info:

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Master URL: 10.0.18.246:5050

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Framework Info:

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     ID: (none)

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Name: flink-test

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Failover Timeout (secs): 604800.0

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Role: *

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Capabilities: (none)

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Principal: (none)

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Host: 311dcf7fd77c

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -     Web UI: http://311dcf7fd77c:8081

2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - --------------------------------------------------------------------------------


which is picking up the mesos.master and
mesos.resourcemanager.framework.name params I am passing to
mesos-appmaster.sh.


In my Mesos dashboard I can see the framework has been created with the
right name, but it has no agents/tasks associated with it. So at least Flink
has been able to connect to the Mesos master to create the framework.


Later in the mesos-appmaster log is when I see the Mesos connection errors:


2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager  - Starting the slot manager.

2021-09-29 01:15:39.815 [flink-akka.actor.default-dispatcher-2] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change (StoppedState -> StoppedState) with data ()

2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger heartbeat request.

2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ReconciliationCoordinator  - State change (Suspended -> Suspended) with data ReconciliationData(Map(),0)

2021-09-29 01:15:39.823 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger heartbeat request.

2021-09-29 01:15:39.824 [flink-akka.actor.default-dispatcher-3] INFO  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Connecting to Mesos...

2021-09-29 01:15:39.825 [flink-akka.actor.default-dispatcher-3] DEBUG org.apache.flink.mesos.scheduler.ConnectionMonitor  - State change (StoppedState -> ConnectingState) with data ()

2021-09-29 01:15:39.826 [flink-akka.actor.default-dispatcher-3] INFO  o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Mesos resource manager started.

2021-09-29 01:15:39.831 [flink-akka.actor.default-dispatcher-4] DEBUG org.apache.flink.mesos.scheduler.LaunchCoordinator  - State change (Suspended -> Suspended) with data GatherData(List(),List())

2021-09-29 01:15:44.843 [flink-akka.actor.default-dispatcher-4] WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to connect to Mesos; still trying...

2021-09-29 01:15:49.843 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger heartbeat request.

2021-09-29 01:15:49.844 [flink-akka.actor.default-dispatcher-3] DEBUG o.a.f.runtime.resourcemanager.active.ActiveResourceManager  - Trigger heartbeat request.
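
To follow just the connection attempts without the heartbeat noise, the log can be filtered on the ConnectionMonitor logger name. This is only a sketch: the helper names connection_events and count_connect_failures are made up for illustration, and it assumes the log file on disk has one entry per line (unlike the wrapped copy pasted into this email).

```shell
#!/usr/bin/env bash
# Print only the ConnectionMonitor entries of a Flink log file.
# $1: path to the appmaster log (hypothetical path, adjust to your setup).
connection_events() {
    grep -F 'org.apache.flink.mesos.scheduler.ConnectionMonitor' "$1"
}

# Count how many times the monitor reported a failed connection attempt.
# grep -c prints 0 (and returns non-zero) when nothing matches.
count_connect_failures() {
    grep -c -F 'Unable to connect to Mesos' "$1"
}
```

For example, connection_events /opt/flink/log/appmaster.log would show the StoppedState -> ConnectingState transition followed by the repeated warnings, where the log path is again an assumption.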




So why was the appmaster able to connect to the Mesos master to create the
framework, but then unable to connect later for whatever it does next?


One possible issue I see is that the framework is registered with its web UI at
http://311dcf7fd77c:8081, which cannot be resolved from the Mesos master.
311dcf7fd77c is the output of running hostname in the Docker container, and
the Mesos master cannot resolve that name. I could try to replace the Docker
container hostname with the Docker host's hostname, but the host port that
gets mapped to 8081 on the container is a random port that I cannot know
beforehand. Does the Mesos master try to reach Flink using that Web UI setting?
Could this be the issue causing my connection problem, or is this a red
herring and the problem is a different one?
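
As a sketch of the workaround considered above: Marathon normally exports the agent's hostname and the first allocated host port to the task as $HOST and $PORT0, so an externally resolvable address could be assembled from those instead of from the container's $HOSTNAME. The helper name advertised_url and the fallback values are hypothetical, and whether the Mesos master actually uses this address is exactly the open question here.

```shell
#!/usr/bin/env bash
# Build an address that machines outside the container can resolve.
# Assumption: Marathon sets HOST (agent hostname) and PORT0 (first mapped
# host port) in the task environment.
advertised_url() {
    # Fall back to the container hostname and Flink's default REST port
    # when the Marathon variables are not present.
    local host="${HOST:-${HOSTNAME:-localhost}}"
    local port="${PORT0:-8081}"
    printf 'http://%s:%s\n' "$host" "$port"
}

advertised_url
```

Run inside the Marathon task, this prints the address as seen from outside the container; run anywhere else it falls back to the local hostname and port 8081.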


Thanks,


Javier Vegas








On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas <jv...@strava.com> wrote:

> Thanks, Matthias!
>
> There are lots of apps deployed to the Mesos cluster, the task manager
> itself is deployed to Mesos via Marathon.  In the Mesos log I can see the
> Job manager agent starting, but no error messages related to it. As you
> say, TaskManagers don't even have the chance to get confused about
> variables, since the Job Manager can not connect to the Mesos master to
> tell it to start the Task Managers.
>
> Thanks,
>
> Javier
>
> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <ma...@ververica.com>
> wrote:
>
>> Hi Javier,
>> I don't see anything that's configured in the wrong way based on the
>> jobmanager logs you've provided. Have you been able to deploy other
>> applications to this Mesos cluster? Do the Mesos master logs reveal
>> anything? The variable resolution on the TaskManager side is a valid
>> concern shared by Roman since it's easy to run into such an issue. But the
>> JobManager logs indicate that the JobManager is not able to contact the
>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>> not coming up.
>>
>> Best,
>> Matthias
>>
>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org>
>> wrote:
>>
>>> Hi,
>>>
>>> No additional ports need to be open as far as I know.
>>>
>>> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>>>
>>> Please also make sure that the following gets executed before
>>> mesos-appmaster.sh:
>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>> (as per the documentation you linked)
>>>
>>> Regards,
>>> Roman
>>>

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Javier Vegas <jv...@strava.com>.
Thanks, Matthias!

There are lots of apps deployed to the Mesos cluster; the task manager
itself is deployed to Mesos via Marathon.  In the Mesos log I can see the
Job manager agent starting, but no error messages related to it. As you
say, TaskManagers don't even have the chance to get confused about
variables, since the Job Manager cannot connect to the Mesos master to
tell it to start the Task Managers.

Thanks,

Javier

On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl <ma...@ververica.com>
wrote:

> Hi Javier,
> I don't see anything that's configured in the wrong way based on the
> jobmanager logs you've provided. Have you been able to deploy other
> applications to this Mesos cluster? Do the Mesos master logs reveal
> anything? The variable resolution on the TaskManager side is a valid
> concern shared by Roman since it's easy to run into such an issue. But the
> JobManager logs indicate that the JobManager is not able to contact the
> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
> not coming up.
>
> Best,
> Matthias
>
> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org>
> wrote:
>
>> Hi,
>>
>> No additional ports need to be open as far as I know.
>>
>> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>>
>> Please also make sure that the following gets executed before
>> mesos-appmaster.sh:
>> export HADOOP_CLASSPATH=$(hadoop classpath)
>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>> (as per the documentation you linked)
>>
>> Regards,
>> Roman
>>

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Matthias Pohl <ma...@ververica.com>.
Hi Javier,
I don't see anything that's configured in the wrong way based on the
jobmanager logs you've provided. Have you been able to deploy other
applications to this Mesos cluster? Do the Mesos master logs reveal
anything? The variable resolution on the TaskManager side is a valid
concern shared by Roman since it's easy to run into such an issue. But the
JobManager logs indicate that the JobManager is not able to contact the
Mesos master. Hence, I'd assume that it's not related to the TaskManagers
not coming up.
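
For checking what the master itself knows about the framework, here is a minimal sketch. The Mesos master serves a JSON list of registered frameworks over HTTP at /master/frameworks, so it can be queried with curl from the container. The helper name frameworks_url is hypothetical; the address is the one used in this thread.

```shell
#!/usr/bin/env bash
# Build the URL of the Mesos master's framework listing.
# $1: master address in host:port form, e.g. the value of mesos.master.
frameworks_url() {
    printf 'http://%s/master/frameworks\n' "$1"
}

# Example (needs network access to the master, so it is left commented out):
#   curl -s "$(frameworks_url 10.0.18.246:5050)"
```

If the framework shows up there but is not marked as connected or active, that would point at the scheduler connection rather than at the TaskManagers.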

Best,
Matthias

On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan <ro...@apache.org> wrote:

> Hi,
>
> No additional ports need to be open as far as I know.
>
> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>
> Please also make sure that the following gets executed before
> mesos-appmaster.sh:
> export HADOOP_CLASSPATH=$(hadoop classpath)
> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
> (as per the documentation you linked)
>
> Regards,
> Roman
>

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Javier Vegas <jv...@strava.com>.
Thanks, Roman!

Looking at the log, it seems that the TaskManager can resolve $HOSTNAME to its
own hostname (07a6b681ee0f), as seen in these lines:

2021-09-27 22:02:41.067 [main] INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - -Djobmanager.rpc.address=*07a6b681ee0f*

2021-09-27 22:02:43.025 [main] INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint  - Rest endpoint listening at *07a6b681ee0f*:8081

2021-09-27 22:02:43.025 [main] INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint  - http://*07a6b681ee0f*:8081 was granted leadership with leaderSessionID=00000000-0000-0000-0000-000000000000

2021-09-27 22:02:43.026 [main] INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint  - Web frontend listening at http://*07a6b681ee0f*:8081.


I am deploying to Mesos with Marathon, so I have no way other than
$HOSTNAME to indicate the host that will execute mesos-appmaster.sh.

The environment variables are set; this is what I see if I hop into the
Docker container:

root@07a6b681ee0f:/opt/flink# echo $HADOOP_CLASSPATH

/opt/flink/hadoop-3.2.2/etc/hadoop:/opt/flink/hadoop-3.2.2/share/hadoop/common/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/common/*:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs/*:/opt/flink/hadoop-3.2.2/share/hadoop/mapreduce/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/mapreduce/*:/opt/flink/hadoop-3.2.2/share/hadoop/yarn:/opt/flink/hadoop-3.2.2/share/hadoop/yarn/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/yarn/*:/opt/flink/lib


root@07a6b681ee0f:/opt/flink# echo $MESOS_NATIVE_JAVA_LIBRARY

/usr/lib/libmesos.so
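
The same check can be scripted so it runs in the entrypoint before mesos-appmaster.sh starts. This is a sketch only: the function name check_mesos_env is made up, and it verifies nothing beyond HADOOP_CLASSPATH being non-empty and MESOS_NATIVE_JAVA_LIBRARY pointing at an existing file.

```shell
#!/usr/bin/env bash
# Return 0 if both variables mentioned in the Flink Mesos docs look usable,
# 1 otherwise. Problems are reported on stderr.
check_mesos_env() {
    local status=0
    if [ -z "${HADOOP_CLASSPATH:-}" ]; then
        echo "HADOOP_CLASSPATH is empty or unset" >&2
        status=1
    fi
    if [ ! -f "${MESOS_NATIVE_JAVA_LIBRARY:-}" ]; then
        echo "MESOS_NATIVE_JAVA_LIBRARY does not point to a file" >&2
        status=1
    fi
    return "$status"
}
```

In the entrypoint this could be used as: check_mesos_env || exit 1, placed before the call to mesos-appmaster.sh.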




On Tue, Sep 28, 2021 at 5:45 AM Roman Khachatryan <ro...@apache.org> wrote:

> Hi,
>
> No additional ports need to be open as far as I know.
>
> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>
> Please also make sure that the following gets executed before
> mesos-appmaster.sh:
> export HADOOP_CLASSPATH=$(hadoop classpath)
> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
> (as per the documentation you linked)
>
> Regards,
> Roman

Re: Unable to connect to Mesos on mesos-appmaster.sh start

Posted by Roman Khachatryan <ro...@apache.org>.
Hi,

No additional ports need to be open as far as I know.

Probably, $HOSTNAME is substituted for something not resolvable on TMs?

Please also make sure that the following gets executed before
mesos-appmaster.sh:
export HADOOP_CLASSPATH=$(hadoop classpath)
export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
(as per the documentation you linked)
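
A minimal sketch of a wrapper that does this before launching the appmaster is
below. The function names `require_env` and `start_appmaster` are made up for
illustration, /usr/lib/libmesos.so is only an assumed fallback for the library
location, and it assumes the `hadoop` command is on the PATH whenever
HADOOP_CLASSPATH is not already set.

```shell
#!/bin/sh
# Hypothetical entrypoint helpers; names and default paths are assumptions.

# require_env NAME: fail if the variable called NAME is empty or unset.
require_env() {
    name=$1
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "missing required variable: $name" >&2
        return 1
    fi
}

# start_appmaster ARGS...: export the two variables, then launch the appmaster.
start_appmaster() {
    # Keep a value that is already set; otherwise ask the hadoop CLI for it.
    if [ -z "${HADOOP_CLASSPATH:-}" ]; then
        HADOOP_CLASSPATH=$(hadoop classpath) || return 1
    fi
    export HADOOP_CLASSPATH

    # Assumed default location; adjust to where libmesos.so actually lives.
    export MESOS_NATIVE_JAVA_LIBRARY="${MESOS_NATIVE_JAVA_LIBRARY:-/usr/lib/libmesos.so}"

    require_env HADOOP_CLASSPATH || return 1
    require_env MESOS_NATIVE_JAVA_LIBRARY || return 1

    # Replace the shell with the appmaster so it receives signals directly.
    exec /opt/flink/bin/mesos-appmaster.sh "$@"
}
```

The Docker entrypoint would then call `start_appmaster` with the same -D
options that are currently passed to mesos-appmaster.sh.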

Regards,
Roman

On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas <jv...@strava.com> wrote:
>
> I am trying to start Flink 1.13.2 on Mesos following the instructions in https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/ and using Marathon to deploy a Docker image with both Flink and my binaries.
>
> My entrypoint for the Docker image is:
>
>
> /opt/flink/bin/mesos-appmaster.sh \
>       -Djobmanager.rpc.address=$HOSTNAME \
>       -Dmesos.resourcemanager.framework.user=flink \
>       -Dmesos.master=10.0.18.246:5050 \
>       -Dmesos.resourcemanager.tasks.cpus=6
>
>
>
> When mesos-appmaster.sh starts, in the stderr I see this:
>
>
> I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
> I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
> I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker executor on 10.0.20.177
> I0927 16:50:32.311394 801345 executor.cpp:186] Starting task tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
> WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method sun.security.krb5.Config.getInstance()
> WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
> I0927 16:50:43.624439   328 sched.cpp:336] New master detected at master@10.0.18.246:5050
> I0927 16:50:43.624779   328 sched.cpp:356] No credentials provided. Attempting to register without authentication
>
>
> where the "New master detected" line is promising.
>
> However, on the Flink UI I see only the jobmanager started, and there are no task managers.  Getting into the Docker container, I see this in the log:
>
> WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to connect to Mesos; still trying...
>
>
> I have verified that from the container I can access the Mesos container 10.0.18.246:5050
>
>
> Does any other port besides the web UI port 5050 need to be open for mesos-appmaster to connect with the Mesos master?
>
>
> In the appmaster log (attached) I see one exception, and I don't know whether it is related to the Mesos connection problem:
>
>
> java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
>         at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)
>         at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)
>         at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)
>         at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
>         at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)
>         at org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)
>         at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)
>         at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)
>         at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)
>         at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)
>         at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)
>         at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)
>         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>         at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>         at java.base/java.lang.reflect.Method.invoke(Unknown Source)
>         at org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)
>         at org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)
>         at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)
>
>
>
>
> I am not trying (yet) to run in high availability mode, so I am not sure whether I need to have HADOOP_HOME set, but I don't see anything about HADOOP_HOME in the Flink docs.
>
>
>
> Any tips on how I can fix my Docker+Marathon+Mesos environment so Flink can connect to my Mesos master?
>
>
> Thanks,
>
>
> Javier Vegas
>
>