Posted to user@flink.apache.org by Manuel Montesino <ma...@piksel.com> on 2017/10/18 13:52:00 UTC
Problems with taskmanagers in Mesos Cluster
Hi,
We have deployed a Mesos cluster with Marathon, and we deploy Flink sessions through Marathon with multiple taskmanagers configured. In our earlier (pre-production) stages we often change settings in the Marathon JSON, such as memory. When we redeploy the Flink session, the jobmanagers stop and start again with the new configuration, but the taskmanagers do not: they keep running with the configuration they were originally started with. So we have to kill/stop the Docker container of each taskmanager task.
Is there a way to kill or stop the taskmanagers when the session is redeployed?
Some environment configuration from marathon json file related to taskmanagers:
```
"flink_akka.ask.timeout": "1min",
"flink_akka.framesize": "102400k",
"flink_high-availability": "zookeeper",
"flink_high-availability.zookeeper.path.root": "/flink",
"flink_jobmanager.web.history": "200",
"flink_mesos.failover-timeout": "86400",
"flink_mesos.initial-tasks": "16",
"flink_mesos.maximum-failed-tasks": "-1",
"flink_mesos.resourcemanager.tasks.container.type": "docker",
"flink_mesos.resourcemanager.tasks.mem": "6144",
"flink_metrics.reporters": "jmx",
"flink_metrics.reporter.jmx.class": "org.apache.flink.metrics.jmx.JMXReporter",
"flink_state.backend": "org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory",
"flink_taskmanager.maxRegistrationDuration": "10 min",
"flink_taskmanager.network.numberOfBuffers": "8192",
"flink_jobmanager.heap.mb": "768",
"flink_taskmanager.debug.memory.startLogThread": "true",
"flink_mesos.resourcemanager.tasks.cpus": "1.3",
"flink_env.java.opts.taskmanager": "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:ConcGCThreads=1 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80 -XX:+DisableExplicitGC -Djava.awt.headless=true -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=10M",
"flink_containerized.heap-cutoff-ratio": "0.67"
```
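A note for readers unfamiliar with this kind of setup: the `flink_` prefix suggests that the container entrypoint turns these environment variables into flink-conf.yaml entries. That is an assumption about the image in use, not something stated in this thread. Under that assumption, the mapping could be sketched roughly like this (`env_to_flink_conf` is a hypothetical helper, not part of Flink or Marathon):

```python
def env_to_flink_conf(env, prefix="flink_"):
    """Render prefixed environment entries as flink-conf.yaml style lines.

    Assumption: the container image strips `prefix` and writes the rest
    of the variable name as the Flink configuration key.
    """
    lines = []
    for name in sorted(env):
        if name.startswith(prefix):
            key = name[len(prefix):]
            lines.append(f"{key}: {env[name]}")
    return "\n".join(lines)


# A few of the entries from the Marathon JSON above, plus one unrelated variable.
env = {
    "flink_mesos.resourcemanager.tasks.mem": "6144",
    "flink_mesos.resourcemanager.tasks.cpus": "1.3",
    "flink_mesos.initial-tasks": "16",
    "OTHER_VAR": "ignored",
}
print(env_to_flink_conf(env))
```

If the mapping works this way, it would explain the behaviour described above: the values are read when a container starts, so taskmanagers that are already running never see the changed `mesos.resourcemanager.tasks.*` entries.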
Thanks in advance and kind regards,
Manuel Montesino
Devops Engineer
E manuel.montesino@piksel(dot)com
Marie Curie,1. Ground Floor. Campanillas, Malaga 29590
liberating viewing | piksel.com
This message is private and confidential. If you have received this message in error, please notify the sender or servicedesk@piksel.com and remove it from your system.
Piksel Inc is a company registered in the United States, 2100 Powers Ferry Road SE, Suite 400, Atlanta, GA 30339
Re: Problems with taskmanagers in Mesos Cluster
Posted by Manuel Montesino <ma...@piksel.com>.
Sorry, please disregard my comment about the API methods; those apply to Flink jobs.
For the Flink session, we deploy directly to Marathon, and it is Marathon that manages that job. That is why a redeploy restarts the jobmanager but not the taskmanagers: the taskmanagers are created by Flink connecting to Mesos directly, and Marathon knows of no relation between the Marathon job and the Mesos tasks of the Flink taskmanagers.
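Since the taskmanagers belong to Flink's own Mesos framework rather than to Marathon, one way to stop them all at once is to tear down that framework through the Mesos master. This is not something proposed in this thread; it is a minimal Python sketch under several assumptions: the master is reachable over HTTP, the framework is registered under the name "Flink" (believed to be the default of mesos.resourcemanager.framework.name, please verify), and the helper names are hypothetical. It uses the Mesos master's /master/state and /master/teardown endpoints.

```python
import json
import urllib.parse
import urllib.request


def find_framework_ids(state, name="Flink"):
    """Return the IDs of active frameworks with the given name.

    `state` is the parsed JSON document served by the master's /master/state.
    """
    return [fw["id"] for fw in state.get("frameworks", []) if fw.get("name") == name]


def build_teardown_request(master_url, framework_id):
    """Build the POST request that asks the master to tear down one framework."""
    body = urllib.parse.urlencode({"frameworkId": framework_id}).encode()
    url = master_url.rstrip("/") + "/master/teardown"
    return urllib.request.Request(url, data=body, method="POST")


def teardown_frameworks(master_url, name="Flink"):
    """Tear down every active framework called `name` (kills all of its tasks)."""
    with urllib.request.urlopen(master_url.rstrip("/") + "/master/state") as resp:
        state = json.load(resp)
    for framework_id in find_framework_ids(state, name):
        urllib.request.urlopen(build_teardown_request(master_url, framework_id)).close()
```

Teardown is final on the Mesos side: a framework that has been torn down cannot register again under the same ID. If Flink keeps its framework ID in ZooKeeper under the high-availability path, a jobmanager restarted with the same path would probably fail to re-register, so this likely only makes sense together with a fresh HA path. Treat that interaction as something to verify on a test cluster first.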
Manuel Montesino
Re: Problems with taskmanagers in Mesos Cluster
Posted by Manuel Montesino <ma...@piksel.com>.
Hi Eron,
Thanks for your response.
Maybe I did not explain it well. The problem is that when we redeploy a Flink session, the active taskmanagers are not killed or stopped, and new ones (with the new configuration) are not created and started. A full redeploy is what we want. So these are not recovered TMs; they are still the same ones, with the same configuration.
If we change the ZooKeeper high-availability path, the old TMs will be orphaned in Mesos and new ones will be created, and we don't want that.
Another point is the way we redeploy. We have developed a script that deploys Flink jobs through Flink's API (we have a pipeline that runs all these operations); in this script we stop/kill the session with the /cancel or /cancel-with-savepoint API methods.
Is it clearer now?
Thanks in advance.
Manuel Montesino
Re: Problems with taskmanagers in Mesos Cluster
Posted by Eron Wright <er...@gmail.com>.
If I understand you correctly, the high-availability path isn't being
changed but other TM-related settings are, and the recovered TMs aren't
picking up the new configuration. I don't think that Flink supports
on-the-fly reconfiguration of a Task Manager at this time.
As a workaround, to achieve a clean new session when you reconfigure Flink
via Marathon, update the HA path accordingly.
Would that work for you?
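The workaround above could be scripted as part of the deploy pipeline. The following is a minimal Python sketch, assuming the Marathon app definition carries the Flink settings in its `env` map as in the original post; the app id, the deploy identifier and the helper name `with_fresh_ha_root` are hypothetical.

```python
import copy

HA_ROOT_KEY = "flink_high-availability.zookeeper.path.root"


def with_fresh_ha_root(app, deploy_id, base="/flink"):
    """Return a copy of a Marathon app definition with a per-deploy HA root path.

    The input definition is left untouched so the previous value stays available.
    """
    updated = copy.deepcopy(app)
    updated.setdefault("env", {})[HA_ROOT_KEY] = f"{base.rstrip('/')}-{deploy_id}"
    return updated


# Hypothetical app definition, reduced to two of the settings from the original post.
app = {
    "id": "/flink-session",
    "env": {
        HA_ROOT_KEY: "/flink",
        "flink_mesos.resourcemanager.tasks.mem": "6144",
    },
}
new_app = with_fresh_ha_root(app, "deploy-42")
print(new_app["env"][HA_ROOT_KEY])
```

With a new root path the restarted jobmanager finds no stored session state and should start a clean session with fresh taskmanagers. The taskmanagers of the previous session are not touched by this change, so they still have to be stopped separately.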