Posted to user@flink.apache.org by Chakravarthy varaga <ch...@gmail.com> on 2017/02/22 11:30:33 UTC

Deployment Architecture for Flink Applications

Hi Team,

    We are analysing different deployment options for managing Flink Jobs
on AWS EC2 instances.

     Basically, the options (Resource Managers) in front of us are:
     -> Standalone cluster
     -> On YARN
     -> Deploy using Mesos/Marathon
     -> Deploy using Kubernetes/Docker

     The Resource Manager options are a bit confusing, and we are unable to
decide which one to go with. The criteria we are using as inputs to our
analysis are:
    ->  Dynamic Scaling of resources
    ->  Resource Allocation
    ->  Job Scheduling
    ->  No-Downtime upgrades
    ->  Monitoring & Metrics

    Right now our plan is to do a paper-based study evaluating these
options.

    I'm sure a lot of you running Flink in production/support have
encountered issues around these. Can someone point us to blogs/research
papers/material focusing on the approach taken and the considerations for
evaluation?

    Any help here is highly appreciated!

Best Regards
CVP

Re: Deployment Architecture for Flink Applications

Posted by Kostas Kloudas <k....@data-artisans.com>.

Hi CVP,

For how people use Flink in production, you can check this blog post to see how Alibaba does it:
http://data-artisans.com/blink-flink-alibaba-search/

In addition, you can find more information on the matter in the talks from
the last Flink Forward conference: http://berlin.flink-forward.org/program/sessions/

For example, Netflix also shares some information here:
http://berlin.flink-forward.org/kb_sessions/beaming-flink-to-the-cloud-netflix/

Now for how things work under the hood, I will provide links to the Flink documentation. 
I hope that this will also help you figure out what fits your needs best:

For deployment and operations, the main resource is the Flink documentation:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cluster_setup.html

and for what is about to come on that front, you can check out the FLIP-6 page:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077

To dynamically scale your Flink job, you have to take a savepoint and restart your job from it with a different parallelism.
You can find some details here: https://www.slideshare.net/tillrohrmann/dynamic-scaling-how-apache-flink-adapts-to-changing-workloads
Unfortunately, this talk is a little bit outdated. We will update our documentation on dynamic scaling soon.
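
As a rough sketch of that savepoint-and-restart cycle, the snippet below only builds the two Flink CLI invocations involved (cancel with a savepoint, then run from that savepoint with a new parallelism); it does not execute them. The job ID, savepoint directory, savepoint path, and jar name are placeholders made up for illustration. The real savepoint path is printed by the cancel command and has to be filled in from there, and the flags are worth checking against the CLI docs of your Flink version.

```python
# Sketch only: builds (does not run) the Flink CLI calls for a
# savepoint-based rescale. All IDs, paths, and jar names are placeholders.

def rescale_commands(job_id, savepoint_dir, savepoint_path, jar,
                     new_parallelism, flink_bin="bin/flink"):
    """Return the two commands: stop with a savepoint, then resume rescaled."""
    if new_parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return [
        # 1. Cancel the running job and write a savepoint into savepoint_dir.
        [flink_bin, "cancel", "-s", savepoint_dir, job_id],
        # 2. Resubmit the job from the savepoint with the new parallelism.
        [flink_bin, "run", "-s", savepoint_path,
         "-p", str(new_parallelism), jar],
    ]


if __name__ == "__main__":
    for cmd in rescale_commands("myJobId", "hdfs:///savepoints",
                                "hdfs:///savepoints/savepoint-xyz",
                                "my-job.jar", 8):
        print(" ".join(cmd))
```

Each command comes back as an argument list, so it could be handed to something like subprocess.run once the real job ID and savepoint path are known.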

For resource allocation and job scheduling, you can check the links I included for deployment and operations,
and also this: https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/job_scheduling.html
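
For a first back-of-the-envelope sizing: with the default slot sharing, a job needs as many task slots as its highest operator parallelism. Under that assumption (and ignoring custom slot sharing groups and several jobs sharing one cluster), the number of TaskManagers can be estimated as below. The function name and the example numbers are only illustrative.

```python
# Sketch only: estimates TaskManager count for a single job, assuming
# default slot sharing (slots needed == highest operator parallelism).

def task_managers_needed(highest_parallelism, slots_per_task_manager):
    """Round up highest_parallelism / slots_per_task_manager."""
    if highest_parallelism < 1 or slots_per_task_manager < 1:
        raise ValueError("both arguments must be at least 1")
    # Integer ceiling division without going through floats.
    return -(-highest_parallelism // slots_per_task_manager)


if __name__ == "__main__":
    # A job with parallelism 10 on TaskManagers offering 4 slots each.
    print(task_managers_needed(10, 4))
```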

For metrics and monitoring, you can check here: https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/metrics.html
and the related pages in the "Debugging and monitoring" section of the Flink documentation.

I hope this can help as a first step,
Kostas
