Posted to user@flink.apache.org by Biswajit Das <bi...@gmail.com> on 2017/07/30 19:42:43 UTC

Flink -mesos-app master hang

Hi All,
I'm trying to run a Flink Docker image from Marathon with the Mesos app
master. It goes into a continuous loop and fails to launch the task
manager. In the Mesos master UI I can reach the job manager web UI, but it
shows zero task managers.

I have checked pretty much every log I could find, starting with docker.log
on the Ubuntu machine and the Mesos master/slave logs, and there is no
information other than the failed task. The only relevant output is the
Flink log below. However, I am able to run the same Docker image if I run
the jobmanager and taskmanager as separate tasks in Marathon and let them
connect via the jobmanager RPC port.

For the Mesos config, I'm using the following settings from the yml:
mesos.master: ${MESOS_MASTER}
mesos.failover-timeout: 60
mesos.initial-tasks: ${INITIAL_TASK_MANAGERS}
mesos.resourcemanager.tasks.mem: ${RESOURCEMANAGER_TASKS_MEM:-4096}
mesos.resourcemanager.tasks.cpus:${RESOURCEMANAGER_TASKS_CPU:-1}
mesos.resourcemanager.tasks.container.type: docker
mesos.resourcemanager.tasks.container.image.name: ${IMAGE_NAME}
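
As a side check on settings like the ones above, here is a minimal Python sketch that flags lines without a space after the key's colon. It assumes (this thread does not confirm it) that the Flink configuration loader of that era splits each line on a colon followed by a space, so a `key:value` line may be skipped. The function name `find_suspect_lines` and the sample text are hypothetical.

```python
def find_suspect_lines(text):
    """Return (line_number, line) pairs that do not look like 'key: value'."""
    suspects = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        # Assumption: the loader needs "colon + space" between key and value.
        key, sep, value = line.partition(": ")
        if not sep or not key or not value.strip():
            suspects.append((number, line))
    return suspects


# Hypothetical sample modelled on the settings listed above.
sample = """\
mesos.failover-timeout: 60
mesos.resourcemanager.tasks.cpus:${RESOURCEMANAGER_TASKS_CPU:-1}
mesos.resourcemanager.tasks.container.type: docker
"""

for number, line in find_suspect_lines(sample):
    print(f"line {number}: no space after the key's colon: {line}")
```

If a line is reported, adding the space after the colon is a cheap thing to rule out before digging further into Mesos.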

---------------------------
2017-07-30 02:05:48,351 WARN
org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task
taskmanager-00002 failed unexpectedly.
2017-07-30 02:05:48,352 INFO
org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
- Mesos task taskmanager-00002 failed, with a TaskManager in launch or
registration. State: TASK_FAILED Reason: REASON_COMMAND_EXECUTOR_FAILED
(Container exited with status 127)
-----------------------------------------------------
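
To compare failures across attempts, the state, reason, and message can be pulled out of an entry like the one above. This is a minimal Python sketch; the regular expression is only inferred from the log text shown in this thread, and `parse_task_failure` is a hypothetical helper, not part of Flink or Mesos.

```python
import re

# Pattern inferred from the log entry above; it is an assumption, not a spec.
FAILURE_PATTERN = re.compile(
    r"State: (?P<state>\w+) Reason: (?P<reason>\w+) \((?P<message>.*)\)"
)


def parse_task_failure(entry):
    """Extract state, reason, and message from one failure log entry."""
    flattened = " ".join(entry.split())  # undo the mail client's line wrapping
    match = FAILURE_PATTERN.search(flattened)
    return match.groupdict() if match else None


entry = """2017-07-30 02:05:48,352 INFO
org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
- Mesos task taskmanager-00002 failed, with a TaskManager in launch or
registration. State: TASK_FAILED Reason: REASON_COMMAND_EXECUTOR_FAILED
(Container exited with status 127)"""

print(parse_task_failure(entry))
```

By shell convention an exit status of 127 usually means that a command was not found, so the command or entrypoint started inside the container may be worth checking.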

Please let me know if anyone has any pointers for debugging this further.


~ Biswajit

Re: Flink -mesos-app master hang

Posted by Biswajit Das <bi...@gmail.com>.

Hi Till,

Thank you for the reply. I have posted some logs in the initial email
chain. I think the issue has more to do with the Docker private registry
when authorization is involved. I can run Docker containers for the job
manager and task manager as separate Marathon tasks and connect them via
the RPC port. I was trying to run via the Mesos app master so that the job
manager itself launches the task managers as part of the framework.
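
To see which authentication scheme the registry is asking for, the `WWW-Authenticate` value reported in the logs quoted below can be split into its scheme and parameters. This is a minimal Python sketch; `parse_www_authenticate` is a hypothetical helper, and the idea that the Mesos image fetcher in this version expects a Bearer token challenge is an assumption, not something confirmed in this thread.

```python
import re


def parse_www_authenticate(header):
    """Split a WWW-Authenticate value into (scheme, params)."""
    scheme, _, rest = header.strip().partition(" ")
    # Collect key="value" pairs such as realm="Registry Realm".
    params = dict(re.findall(r'(\w+)="([^"]*)"', rest))
    return scheme, params


scheme, params = parse_www_authenticate('Basic realm="Registry Realm"')
print(scheme, params)

# Assumption: a Bearer challenge is what the fetcher expects.
if scheme.lower() != "bearer":
    print("registry asks for", scheme, "authentication rather than a Bearer token")
```

If the scheme printed is Basic, the registry's auth setup is the first place to look, separately from anything in the Flink configuration.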

Thank you again

~ Biswajit

On Fri, Aug 4, 2017 at 3:17 AM, Till Rohrmann <tr...@apache.org> wrote:

> Hi Biswajit,
>
> are there any Mesos logs which might help us pinpoint the problem? I've
> actually never run Flink on Mesos with Docker images. But it could be that
> Flink does not set things up properly for running Docker images. I'll try
> to run Flink based on Docker images over the weekend in order to see
> whether I can reproduce the problem.
>
> Cheers,
> Till
>
> On Wed, Aug 2, 2017 at 8:48 PM, Biswajit Das <bi...@gmail.com>
> wrote:
>
>> Hi There,
>>
>> I have posted this here in the group a few days back and after that I
>> have been exchanging email with Eron, thanks to Eron for all the tips.
>> Now  I see this basic auth error, I'm little confused how come Job Manager
>> launched fine and task manager failing to auth.
>> Also, mesos doc says by default authenticate is false so it should not
>> have gone there,  do I have to disable somewhere inside flink ??? I don't
>> see any config or property in code.
>>
>> This is kind of blocker for me now for mesos deployment , really
>> appreciate for any inputs/suggestion
>>
>> ~ Biswajit
>>
>> ---------- Forwarded message ----------
>> From: Eron Wright <ew...@live.com>
>> Date: Wed, Aug 2, 2017 at 10:51 AM
>> ------------------------------
>> *From:* Biswajit Das <bi...@gmail.com>
>> *Sent:* Wednesday, August 2, 2017 10:19:45 AM
>> *To:* Eron Wright
>> *Subject:* Re: Flink -mesos-app master hang
>>
>> Hi Eron ,
>>
>> Good morning , I'm really sorry for flooding question . I'll post this
>> one to user group also .
>> I could narrow down the actual error thrown by mesos , seems like JM some
>> how not able to authenticate . I'm little confused if it is *docker
>> private registry tls error *or some thing else , I have started slave
>> even with --docker_config , previously mostly I was using  docker.tar.gz
>> with container for private repo authentication .
>>
>> 017-08-02 03:32:54,163 WARN  org.apache.flink.mesos.schedul
>> er.TaskMonitor                  - Mesos task taskmanager-00003 failed
>> unexpectedly.
>> 2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager * - Mesos task
>> taskmanager-00003 failed, with a TaskManager in launch or registration.
>> State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch
>> container: Unexpected WWW-Authenticate header format: 'Basic
>> realm="Registry Realm"')*
>> 2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Diagnostics for task
>> taskmanager-00003 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED
>> message=Failed to launch container: Unexpected WWW-Authenticate header
>> format: 'Basic realm="Registry Realm"'
>> 2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Total number of failed
>> tasks so far: 3
>> 2017-08-02 03:32:54,164 ERROR org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Stopping Mesos session
>> because the number of failed tasks (3) exceeded the maximum failed tasks
>> (2). This number is controlled by the 'mesos.maximum-failed-tasks'
>> configuration setting. By default its the number of requested tasks.
>> 2017-08-02 03:32:54,164 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Shutting down cluster
>> with status FAILED : Stopping Mesos session because the number of failed
>> tasks (3) exceeded the maximum failed tasks (2). This number is controlled
>> by the 'mesos.maximum-failed-tasks' configuration setting. By default its
>> the number of requested tasks.
>> 2017-08-02 03:32:54,164 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Shutting down and
>> unregistering as a Mesos framework.
>> 2017-08-02 03:32:54,171 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Stopping Mesos resource
>> master
>> root@ip-172-31-4-44:/etc/me
>>
>> On Tue, Aug 1, 2017 at 1:53 PM, Eron Wright <ew...@live.com> wrote:
>>
>>> I think you're on the right track, in trying to configure the docker
>>> image provider.  This is on Linux right, and you definitely restarted the
>>> agents?
>>>
>>>
>>> An important difference between the JM and the TM is that the JM is a
>>> task launched by the Marathon framework, whereas the TM is a task launched
>>> by the JM framework.  The respective configurations and behaviors are
>>> different.   For example, I see that Marathon is launching the JM with the
>>> Docker containerizer, whereas the JM is launching the TM with the Mesos
>>> containerizer (with Docker image provider support).     The Mesos
>>> containerizer is more modern and preferred, and I don't think Flink
>>> supports anything else.
>>>
>>>
>>> The doc I linked to shows how to launch a docker image-based container
>>> with mesos-execute.   Using mesos-execute to verify your cluster
>>> configuration is a good idea, to isolate any issue.  For example, see if
>>> you can launch a container using the Mesos containerizer and the Docker
>>> image provider, executing a simple command such as 'sleep'.
>>>
>>>
>>> Eron
>>> ------------------------------
>>> *From:* Biswajit Das <bi...@gmail.com>
>>> *Sent:* Tuesday, August 1, 2017 10:02:51 AM
>>> *To:* Eron Wright
>>>
>>> *Subject:* Re: Flink -mesos-app master hang
>>>
>>> Hi Eron ,
>>>
>>> Thank you for the email , I really appreciate your reply.
>>>
>>> That's what is confusing me. I have been running Mesos with containers
>>> on both staging and production for almost a year now, mostly with
>>> Spark/Presto load, everything containerized, on a fairly big cluster.
>>> Here is one of my slave configs. One interesting part is that the app
>>> master is launched and I can access the job manager web UI from the
>>> Mesos framework; I can also see that it has registered itself as the
>>> `flink` framework. The only problem I see is that the task manager count
>>> shows `0`, although I asked it to create 2 instances.
>>>
>>> /usr/sbin/mesos-slave --master=zk://XXX/mesos --log_dir=/var/log/mesos
>>> --attributes=environment:dev;agent_role:generic
>>> --containerizers=docker,mesos --executor_registration_timeout=10mins
>>> --hostname=XXX --image_providers=appc,docker --ip=XXX
>>> --isolation=filesystem/linux,docker/runtime
>>> --resources=ports(*):[0-65535] --work_dir=/var/lib/mesos
>>>
>>>
>>> Previously I never had --image_providers and --isolation set; after
>>> seeing this error I added these two, but it did not help much. I'm
>>> running Mesos 1.1.0 on Ubuntu and submitting the job with Marathon.
>>>
>>>
>>> I have tried toggling the Mesos debug log, but there is not much info
>>> other than the signal to kill the framework.
>>>
>>> marathon json task
>>>
>>>> {
>>>>   "id": "/flink-app-master",
>>>>   "cmd": null,
>>>>   "cpus": 2,
>>>>   "mem": 4096,
>>>>   "disk": 10000,
>>>>   "instances": 1,
>>>>   "constraints": [
>>>>     [
>>>>       "hostname",
>>>>       "LIKE",
>>>>       "xxx" ->>> restricted to some host for debugging as I have a
>>>> fairly big cluster
>>>>     ]
>>>>   ],
>>>>   "acceptedResourceRoles": [
>>>>     "*"
>>>>   ],
>>>>   "container": {
>>>>     "type": "DOCKER",
>>>>     "volumes": [],
>>>>     "docker": {
>>>>       "image": "docker.xx.xx/flink:1.8.0",
>>>>       "network": "HOST",
>>>>       "portMappings": [],
>>>>       "privileged": false,
>>>>       "parameters": [],
>>>>       "forcePullImage": false
>>>>     }
>>>>   },
>>>>   "env": {
>>>>     "MESOS_MASTER": "zk://XX/mesos"
>>>>   },
>>>>   "portDefinitions": [
>>>>     {
>>>>       "port": 9081,
>>>>       "protocol": "tcp",
>>>>       "name": "default",
>>>>       "labels": {}
>>>>     }
>>>>   ],
>>>>   "uris": [
>>>>     "file:///etc/docker.tar.gz"
>>>>   ],
>>>>   "fetch": [
>>>>     {
>>>>       "uri": "file:///etc/docker.tar.gz",
>>>>       "extract": true,
>>>>       "executable": false,
>>>>       "cache": false
>>>>     }
>>>>   ]
>>>> }
>>>>
>>>
>>> On Tue, Aug 1, 2017 at 7:22 AM, Eron Wright <ew...@live.com> wrote:
>>>
>>>> From the error message it seems that your Mesos cluster doesn't have
>>>> the docker image provisioner installed.   The message originates from Mesos
>>>> anyway so the problem lies there.   Note that docker image support is
>>>> provided in Linux only.  You can also use the Flink on Mesos support
>>>> without images, if you make sure that JAVA_HOME is set on all executors.
>>>>
>>>> Hope this helps!
>>>>
>>>> http://mesos.apache.org/documentation/latest/container-image/
>>>>
>>>>
>>>>
>>>> From: Biswajit Das
>>>> Sent: Tuesday, August 1, 1:24 AM
>>>> Subject: Re: Flink -mesos-app master hang
>>>> To: ewright@live.com
>>>>
>>>>
>>>> Hi Eron, I came across some of your comments in JIRA and wanted to
>>>> clarify this. I'm rather clueless at this point; any pointers on where
>>>> to look?
>>>>
>>>>
>>>> -----------------------------------------------
>>>> 2017-08-01 07:26:34,688 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator  - Waiting for more offers; 1 task(s) are not yet launched.
>>>> 2017-08-01 07:26:34,717 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Launching Mesos task taskmanager-00039 on host 172.31.5.212.
>>>> 2017-08-01 07:26:34,731 WARN  org.apache.flink.mesos.scheduler.TaskMonitor  - Mesos task taskmanager-00039 failed unexpectedly.
>>>> 2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-00039 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unsupported container image type: DOCKER)
>>>> 2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Diagnostics for task taskmanager-00039 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Unsupported container image type: DOCKER
>>>> 2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Total number of failed tasks so far: 3
>>>> 2017-08-01 07:26:34,734 ERROR org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
>>>> 2017-08-01 07:26:34,734 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down cluster with status FAILED : Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
>>>> 2017-08-01 07:26:34,734 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down and unregistering as a Mesos framework.
>>>> 2017-08-01 07:26:34,745 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos resource master
>>>> 2017-08-01 07:26:34,745 INFO  org.apache.f
>>>> ---------------------------------------------------
>>>>
>>>> Thank you in advance .
>>>> ~Biswajit
>>>>
>>>
>>
>>
>

Re: Flink -mesos-app master hang

Posted by Till Rohrmann <tr...@apache.org>.

Hi Biswajit,

are there any Mesos logs which might help us pinpoint the problem? I've
actually never run Flink on Mesos with Docker images. But it could be that
Flink does not set things up properly for running Docker images. I'll try
to run Flink based on Docker images over the weekend in order to see
whether I can reproduce the problem.

Cheers,
Till


Fwd: Flink -mesos-app master hang

Posted by Biswajit Das <bi...@gmail.com>.

Hi There,

I posted this to the group a few days back, and since then I have been
exchanging emails with Eron; thanks to Eron for all the tips. Now I see
this basic auth error. I'm a little confused about how the Job Manager
launched fine while the task manager fails to authenticate.
Also, the Mesos docs say that authentication is off by default, so it
should not have gone down that path. Do I have to disable it somewhere
inside Flink? I don't see any config or property for it in the code.

This is kind of a blocker for my Mesos deployment now; I would really
appreciate any input or suggestions.

~ Biswajit

---------- Forwarded message ----------
From: Eron Wright <ew...@live.com>
Date: Wed, Aug 2, 2017 at 10:51 AM
------------------------------
*From:* Biswajit Das <bi...@gmail.com>
*Sent:* Wednesday, August 2, 2017 10:19:45 AM
*To:* Eron Wright
*Subject:* Re: Flink -mesos-app master hang

Hi Eron ,

Good morning. I'm really sorry for flooding you with questions. I'll post
this one to the user group as well.
I was able to narrow down the actual error thrown by Mesos; it seems the JM
is somehow not able to authenticate. I'm a little confused about whether it
is a Docker private registry TLS error or something else. I have even
started the slave with --docker_config; previously I mostly used
docker.tar.gz with the container for private repo authentication.

2017-08-02 03:32:54,163 WARN  org.apache.flink.mesos.scheduler.TaskMonitor  - Mesos task taskmanager-00003 failed unexpectedly.
2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-00003 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unexpected WWW-Authenticate header format: 'Basic realm="Registry Realm"')
2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Diagnostics for task taskmanager-00003 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Unexpected WWW-Authenticate header format: 'Basic realm="Registry Realm"'
2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Total number of failed tasks so far: 3
2017-08-02 03:32:54,164 ERROR org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
2017-08-02 03:32:54,164 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down cluster with status FAILED : Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
2017-08-02 03:32:54,164 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down and unregistering as a Mesos framework.
2017-08-02 03:32:54,171 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos resource master
root@ip-172-31-4-44:/etc/me
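
To make the error above easier to read: the rejected header is an HTTP
WWW-Authenticate challenge using the Basic scheme. Below is a minimal Python
sketch for telling challenge schemes apart. The helper names parse_challenge
and describe are made up for illustration, and the idea that the Mesos image
fetcher wanted a Bearer (token) challenge instead is only a guess from the
wording of the error, not something confirmed in this thread.

```python
import re


def parse_challenge(header):
    """Split a WWW-Authenticate value into (scheme, params)."""
    scheme, _, rest = header.strip().partition(" ")
    # Collect key="value" pairs such as realm="Registry Realm".
    params = dict(re.findall(r'(\w+)="([^"]*)"', rest))
    return scheme, params


def describe(header):
    """Return a short human-readable summary of the challenge."""
    scheme, params = parse_challenge(header)
    realm = params.get("realm", "?")
    if scheme.lower() == "bearer":
        return "token auth via " + realm
    if scheme.lower() == "basic":
        return "basic auth (realm: " + realm + ")"
    return "unknown scheme: " + scheme


# Header value copied from the Flink log above.
print(describe('Basic realm="Registry Realm"'))
# prints: basic auth (realm: Registry Realm)
```

If the registry can be switched to token authentication, the challenge would
start with "Bearer" instead; whether that resolves the launch failure on
Mesos 1.1.0 is something to verify, not a given.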

On Tue, Aug 1, 2017 at 1:53 PM, Eron Wright <ew...@live.com> wrote:

> I think you're on the right track in trying to configure the docker image
> provider.  This is on Linux, right, and you definitely restarted the agents?
>
>
> An important difference between the JM and the TM is that the JM is a task
> launched by the Marathon framework, whereas the TM is a task launched by
> the JM framework.  The respective configurations and behaviors are
> different.   For example, I see that Marathon is launching the JM with the
> Docker containerizer, whereas the JM is launching the TM with the Mesos
> containerizer (with Docker image provider support).     The Mesos
> containerizer is more modern and preferred, and I don't think Flink
> supports anything else.
>
>
> The doc I linked to shows how to launch a docker image-based container
> with mesos-execute.   Using mesos-execute to verify your cluster
> configuration is a good idea, to isolate any issue.  For example, see if
> you can launch a container using the Mesos containerizer and the Docker
> image provider, executing a simple command such as 'sleep'.
>
>
> Eron
> ------------------------------
> *From:* Biswajit Das <bi...@gmail.com>
> *Sent:* Tuesday, August 1, 2017 10:02:51 AM
> *To:* Eron Wright
>
> *Subject:* Re: Flink -mesos-app master hang
>
> Hi Eron ,
>
> Thank you for the email , I really appreciate your reply.
>
> That's what is confusing me. I have been running Mesos with containers on
> both staging and production for almost a year now, mostly with Spark/Presto
> load, everything containerized, on a fairly big cluster. Here is one of my
> slave configs. One interesting part is that the app master is launched and
> I can access the job manager web UI from the Mesos framework; I can also
> see it has registered itself as the `flink` framework. The only problem I'm
> seeing is that the task manager count shows `0`. I have asked it to create
> 2 instances.
>
>
> /usr/sbin/mesos-slave --master=zk://XXX/mesos --log_dir=/var/log/mesos
> --attributes=environment:dev;agent_role:generic *--containerizers=docker,mesos
> * --executor_registration_timeout=10mins --hostname=XXX *--image_providers=appc,docker
> --ip=XXX --isolation=filesystem/linux,docker/runtime*
> --resources=ports(*):[0-65535] --work_dir=/var/lib/mesos
>
>
> Previously I never had *--image_providers and --isolation*; after seeing
> this error I added these two, but it did not help much. I'm running on
> Ubuntu with Mesos 1.1.0 and submitting the job with Marathon.
>
>
> I have tried toggling the Mesos debug log; there is not much info other
> than that it got a signal to kill the framework.
>
> marathon json task
>
>> {
>>   "id": "/flink-app-master",
>>   "cmd": null,
>>   "cpus": 2,
>>   "mem": 4096,
>>   "disk": 10000,
>>   "instances": 1,
>>   "constraints": [
>>     [
>>       "hostname",
>>       "LIKE",
>>       "xxx" ->>> restricted to some host for debugging as I have a fairly
>> big cluster
>>     ]
>>   ],
>>   "acceptedResourceRoles": [
>>     "*"
>>   ],
>>   "container": {
>>     "type": "DOCKER",
>>     "volumes": [],
>>     "docker": {
>>       "image": "docker.xx.xx/flink:1.8.0",
>>       "network": "HOST",
>>       "portMappings": [],
>>       "privileged": false,
>>       "parameters": [],
>>       "forcePullImage": false
>>     }
>>   },
>>   "env": {
>>     "MESOS_MASTER": "zk://XX/mesos"
>>   },
>>   "portDefinitions": [
>>     {
>>       "port": 9081,
>>       "protocol": "tcp",
>>       "name": "default",
>>       "labels": {}
>>     }
>>   ],
>>   "uris": [
>>     "file:///etc/docker.tar.gz"
>>   ],
>>   "fetch": [
>>     {
>>       "uri": "file:///etc/docker.tar.gz",
>>       "extract": true,
>>       "executable": false,
>>       "cache": false
>>     }
>>   ]
>> }
>>
>
> On Tue, Aug 1, 2017 at 7:22 AM, Eron Wright <ew...@live.com> wrote:
>
>> From the error message it seems that your Mesos cluster doesn't have the
>> docker image provisioner installed.   The message originates from Mesos
>> anyway so the problem lies there.   Note that docker image support is
>> provided in Linux only.  You can also use the Flink on Mesos support
>> without images, if you make sure that JAVA_HOME is set on all executors.
>>
>> Hope this helps!
>>
>> http://mesos.apache.org/documentation/latest/container-image/
>>
>>
>>
>> From: Biswajit Das
>> Sent: Tuesday, August 1, 1:24 AM
>> Subject: Re: Flink -mesos-app master hang
>> To: ewright@live.com
>>
>>
>> Hi Eron, I came across some of your comments in JIRA and wanted to
>> clarify this ^^. I'm running a little clueless; any pointers on where to
>> look?
>>
>>
>> -----------------------------------------------
>> 2017-08-01 07:26:34,688 INFO  org.apache.flink.mesos.schedul
>> er.LaunchCoordinator            - Waiting for more offers; 1 task(s) are
>> not yet launched.
>> 2017-08-01 07:26:34,717 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Launching Mesos task
>> taskmanager-00039 on host 172.31.5.212.
>> 2017-08-01 07:26:34,731 WARN  org.apache.flink.mesos.schedul
>> er.TaskMonitor                  - Mesos task taskmanager-00039 failed
>> unexpectedly.
>> *2017-08-01 07:26:34,733 INFO
>> org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
>> - Mesos task taskmanager-00039 failed, with a TaskManager in launch or
>> registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED
>> (Failed to launch container: Unsupported container image type: DOCKER)*
>> 2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Diagnostics for task
>> taskmanager-00039 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED
>> message=Failed to launch container: Unsupported container image type: DOCKER
>> 2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Total number of failed
>> tasks so far: 3
>> 2017-08-01 07:26:34,734 ERROR org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Stopping Mesos session
>> because the number of failed tasks (3) exceeded the maximum failed tasks
>> (2). This number is controlled by the 'mesos.maximum-failed-tasks'
>> configuration setting. By default its the number of requested tasks.
>> 2017-08-01 07:26:34,734 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Shutting down cluster
>> with status FAILED : Stopping Mesos session because the number of failed
>> tasks (3) exceeded the maximum failed tasks (2). This number is controlled
>> by the 'mesos.maximum-failed-tasks' configuration setting. By default its
>> the number of requested tasks.
>> 2017-08-01 07:26:34,734 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Shutting down and
>> unregistering as a Mesos framework.
>> 2017-08-01 07:26:34,745 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Stopping Mesos resource
>> master
>> 2017-08-01 07:26:34,745 INFO  org.apache.f
>> ---------------------------------------------------
>>
>> Thank you in advance .
>> ~Biswajit
>>
>> On Sun, Jul 30, 2017 at 12:42 PM, Biswajit Das <bi...@gmail.com>
>> wrote:
>>
>> Hi All,
>> I'm trying to run a Flink docker image from Marathon with the Mesos app
>> master; I can see it goes into a continuous loop and fails to launch the
>> task manager. If I go to the Mesos master UI I can see the job manager web
>> UI with zero task managers.
>>
>> I have checked pretty much every possible log, starting from the Ubuntu
>> machine's docker.log and the Mesos master/slave logs; there is no
>> information other than the failed task. I can see the log below in Flink.
>> However, I'm able to run the same docker image if I run the jobmanager and
>> taskmanager by themselves in Marathon and let them connect via the
>> jobmanager RPC port.
>>
>> For the Mesos config, I'm using the details below from the yml:
>> mesos.master: ${MESOS_MASTER}
>> mesos.failover-timeout: 60
>> mesos.initial-tasks: ${INITIAL_TASK_MANAGERS}
>> mesos.resourcemanager.tasks.mem: ${RESOURCEMANAGER_TASKS_MEM:-4096}
>> mesos.resourcemanager.tasks.cpus: ${RESOURCEMANAGER_TASKS_CPU:-1}
>> mesos.resourcemanager.tasks.container.type: docker
>> mesos.resourcemanager.tasks.container.image.name: ${IMAGE_NAME}
>>
>> ---------------------------
>> 07-30 02:05:48,351 WARN  org.apache.flink.mesos.schedul
>> er.TaskMonitor                  - Mesos task taskmanager-00002 failed
>> unexpectedly.
>> 2017-07-30 02:05:48,352 INFO  org.apache.flink.mesos.runtime
>> .clusterframework.MesosFlinkResourceManager  - Mesos task
>> taskmanager-00002 failed, with a TaskManager in launch or registration.
>> State: TASK_FAILED Reason: REASON_COMMAND_EXECUTOR_FAILED (Container exited
>> with status 127)
>> -----------------------------------------------------
>>
>> Please let me know if anyone has any pointers to debug further.
>>
>>
>> ~ Biswajit
>>
>>
>>
>>
>>
>
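
As a quick sanity check of the agent flags discussed in this thread, here is
a small Python sketch that parses an agent command line and reports which of
the settings named in the Mesos container-image documentation are missing
(--image_providers containing docker, --isolation containing
filesystem/linux and docker/runtime). The function names are hypothetical,
and the sample command line is the one quoted earlier in the thread with the
hosts still masked as XXX.

```python
def parse_flags(cmdline):
    """Collect --key=value tokens from a command line into a dict."""
    flags = {}
    for token in cmdline.split():
        if token.startswith("--") and "=" in token:
            key, _, value = token[2:].partition("=")
            flags[key] = value
    return flags


def missing_for_docker_images(flags):
    """List settings the Mesos containerizer needs for Docker images."""
    problems = []
    if "docker" not in flags.get("image_providers", "").split(","):
        problems.append("image_providers lacks 'docker'")
    isolation = flags.get("isolation", "").split(",")
    for needed in ("filesystem/linux", "docker/runtime"):
        if needed not in isolation:
            problems.append("isolation lacks '%s'" % needed)
    return problems


# Agent command line from the thread (emphasis asterisks removed).
AGENT_CMD = (
    "/usr/sbin/mesos-slave --master=zk://XXX/mesos --log_dir=/var/log/mesos "
    "--attributes=environment:dev;agent_role:generic "
    "--containerizers=docker,mesos --executor_registration_timeout=10mins "
    "--hostname=XXX --image_providers=appc,docker --ip=XXX "
    "--isolation=filesystem/linux,docker/runtime "
    "--resources=ports(*):[0-65535] --work_dir=/var/lib/mesos"
)

problems = missing_for_docker_images(parse_flags(AGENT_CMD))
print(problems or "agent flags look OK for Docker images")
# prints: agent flags look OK for Docker images
```

This only checks the flags as written; it cannot tell whether the agent was
actually restarted with them, which is the other thing Eron asked about.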