Posted to user@flink.apache.org by Manuel Montesino <ma...@piksel.com> on 2017/10/18 13:52:00 UTC

Problems with taskmanagers in Mesos Cluster

Hi,

We have deployed a Mesos cluster with Marathon, and we deploy Flink sessions through Marathon with multiple taskmanagers configured. In our earlier stages we sometimes change the configuration in the Marathon JSON (memory and other settings). When we redeploy the Flink session, the jobmanagers stop and start with the new configuration, but the taskmanagers do not: the existing ones are reused as they were configured before. So we have to kill/stop the Docker container of each taskmanager task ourselves.

Is there a way to kill or stop the taskmanagers when the session is redeployed?

Some environment configuration from the Marathon JSON file related to the taskmanagers:

```
"flink_akka.ask.timeout": "1min",
"flink_akka.framesize": "102400k",
"flink_high-availability": "zookeeper",
"flink_high-availability.zookeeper.path.root": "/flink",
"flink_jobmanager.web.history": "200",
"flink_mesos.failover-timeout": "86400",
"flink_mesos.initial-tasks": "16",
"flink_mesos.maximum-failed-tasks": "-1",
"flink_mesos.resourcemanager.tasks.container.type": "docker",
"flink_mesos.resourcemanager.tasks.mem": "6144",
"flink_metrics.reporters": "jmx",
"flink_metrics.reporter.jmx.class": "org.apache.flink.metrics.jmx.JMXReporter",
"flink_state.backend": "org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory",
"flink_taskmanager.maxRegistrationDuration": "10 min",
"flink_taskmanager.network.numberOfBuffers": "8192",
"flink_jobmanager.heap.mb": "768",
"flink_taskmanager.debug.memory.startLogThread": "true",
"flink_mesos.resourcemanager.tasks.cpus": "1.3",
"flink_env.java.opts.taskmanager": "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:ConcGCThreads=1 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80 -XX:+DisableExplicitGC -Djava.awt.headless=true -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=10M",
"flink_containerized.heap-cutoff-ratio": "0.67"
```

Thanks in advance and kind regards,

Manuel Montesino
Devops Engineer

E manuel.montesino@piksel(dot)com

Marie Curie,1. Ground Floor. Campanillas, Malaga 29590
liberating viewing | piksel.com

This message is private and confidential. If you have received this message in error, please notify the sender or servicedesk@piksel.com and remove it from your system.

Piksel Inc is a company registered in the United States, 2100 Powers Ferry Road SE, Suite 400, Atlanta, GA 30339

Re: Problems with taskmanagers in Mesos Cluster

Posted by Manuel Montesino <ma...@piksel.com>.

Sorry, please disregard my comment about the API methods; that applies to Flink jobs.


For a Flink session, we deploy directly to Marathon, and it is Marathon that manages the job. That is why the jobmanager is restarted but the taskmanagers are not: the taskmanagers are created by Flink connecting to Mesos directly, and Marathon is not aware of any relation between the Marathon job and the Mesos tasks of the Flink taskmanagers.
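
To make the point above concrete, here is a minimal sketch of how the leftover taskmanager tasks could be identified from the Mesos side. It assumes you have already fetched and parsed the JSON from the Mesos master's state endpoint, and that it lists frameworks with a `name` and a `tasks` array. The framework name "Flink" and the task IDs in the sample are hypothetical; use whatever name your session registers with.

```python
def running_tasks(master_state, framework_name="Flink"):
    """Return the IDs of running tasks owned by frameworks with the given name.

    master_state is assumed to be the parsed JSON of the Mesos master state,
    shaped like {"frameworks": [{"name": ..., "tasks": [{"id": ..., "state": ...}]}]}.
    """
    task_ids = []
    for framework in master_state.get("frameworks", []):
        if framework.get("name") != framework_name:
            continue  # e.g. Marathon's own framework, which only owns the jobmanager app
        for task in framework.get("tasks", []):
            if task.get("state") == "TASK_RUNNING":
                task_ids.append(task["id"])
    return task_ids


# Hypothetical sample: Marathon owns the session app, Flink owns the taskmanagers.
sample_state = {
    "frameworks": [
        {"name": "Flink", "tasks": [
            {"id": "taskmanager-00001", "state": "TASK_RUNNING"},
            {"id": "taskmanager-00002", "state": "TASK_STAGING"},
        ]},
        {"name": "marathon", "tasks": [
            {"id": "flink-session.abc", "state": "TASK_RUNNING"},
        ]},
    ]
}

print(running_tasks(sample_state))
```

The sketch only lists tasks; how they are then stopped (for example by stopping their Docker containers, as we do today) is left out on purpose.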



Re: Problems with taskmanagers in Mesos Cluster

Posted by Manuel Montesino <ma...@piksel.com>.

Hi Eron,


Thanks for your response.


Maybe I did not explain it well. The issue is that when we redeploy a Flink session, the active taskmanagers are not killed or stopped, and new ones (with the new configuration) are not created/started. A full redeploy like that is what we want. So these are not recovered TMs; they are still the same ones, with the same configuration.


If we change the ZooKeeper high-availability path, the old TMs will be orphaned in Mesos and new ones will be created, and we don't want that.


Another thing is the way we redeploy. We have developed a script that deploys Flink jobs through Flink's API (we have a pipeline for all these operations); in this script we stop/kill the session with the /cancel or /cancel-with-savepoint API methods.


Is it clearer now?


Thanks in advance.



Re: Problems with taskmanagers in Mesos Cluster

Posted by Eron Wright <er...@gmail.com>.

If I understand you correctly, the high-availability path isn't being
changed but other TM-related settings are, and the recovered TMs aren't
picking up the new configuration.   I don't think that Flink supports
on-the-fly reconfiguration of a Task Manager at this time.

As a workaround, to achieve a clean new session when you reconfigure Flink
via Marathon, update the HA path accordingly.

Would that work for you?
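
As a rough illustration of the workaround, the sketch below derives the HA root path from a hash of the other settings, so that any configuration change yields a new path and therefore a fresh session. The key names come from the Marathon environment posted in this thread; the helper name `with_versioned_ha_root` and the `/flink-<hash>` naming scheme are assumptions made for the example, not something Flink or Marathon provides.

```python
import hashlib
import json

HA_ROOT_KEY = "flink_high-availability.zookeeper.path.root"


def with_versioned_ha_root(env, base="/flink"):
    """Return a copy of env whose HA root path encodes a hash of the other settings."""
    # Exclude the HA root itself so the hash depends only on the real settings.
    settings = {k: v for k, v in env.items() if k != HA_ROOT_KEY}
    canonical = json.dumps(settings, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(canonical).hexdigest()[:8]
    updated = dict(env)
    updated[HA_ROOT_KEY] = f"{base}-{digest}"
    return updated


old_env = with_versioned_ha_root({
    "flink_mesos.resourcemanager.tasks.mem": "6144",
    HA_ROOT_KEY: "/flink",
})
new_env = with_versioned_ha_root({
    "flink_mesos.resourcemanager.tasks.mem": "8192",
    HA_ROOT_KEY: "/flink",
})

# Different settings produce different HA paths; identical settings keep the same one.
print(old_env[HA_ROOT_KEY] != new_env[HA_ROOT_KEY])
```

Note that this only gives the new session a clean HA state. It does not stop the task managers that belong to the previous session, so those would still need to be cleaned up separately.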


