
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/05/10 09:23:49 UTC

[GitHub] [flink] morsapaes opened a new pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

morsapaes opened a new pull request #12057:
URL: https://github.com/apache/flink/pull/12057


   ## What is the purpose of the change
   
   This PR extends the "Flink Architecture" section of the docs with more information about Flink Master components and application execution (incl. the new Application Cluster mode).
   
   ## Brief change log
   
     - Modified flink-architecture.md
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [flink] morsapaes commented on a change in pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

morsapaes commented on a change in pull request #12057:
URL: https://github.com/apache/flink/pull/12057#discussion_r424955104



##########
File path: docs/concepts/flink-architecture.md
##########
@@ -129,4 +167,108 @@ two main benefits:
 
 <img src="{{ site.baseurl }}/fig/slot_sharing.svg" alt="TaskManagers with shared Task Slots" class="offset" width="80%" />
 
+## Flink Application Execution
+
+A _Flink Application_ is any user program that spawns one or multiple Flink
+jobs from its ``main()`` method. The execution of these jobs can happen in a
+local JVM (``LocalEnvironment``) or on a remote setup of clusters with multiple
+machines (``RemoteEnvironment``). For each program, the
+[``ExecutionEnvironment``]({{ site.baseurl }}/api/java/) provides methods to
+control the job execution (e.g. setting the parallelism) and to interact with
+the outside world (see [Anatomy of a Flink Program]({{ site.baseurl
+}}/dev/api_concepts.html#anatomy-of-a-flink-program)).
+
+The jobs of a Flink Application can either be submitted to a long-running
+[Flink Session Cluster]({{ site.baseurl
+}}/concepts/glossary.html#flink-session-cluster), a dedicated [Flink Job
+Cluster]({{ site.baseurl }}/concepts/glossary.html#flink-job-cluster) or a
+[Flink Application Cluster]({{ site.baseurl
+}}/concepts/glossary.html#flink-application-cluster). The difference between
+these options is mainly related to the cluster’s lifecycle and to resource
+isolation guarantees.
+
+### Flink Session Cluster
+
+* **Cluster Lifecycle**: in a Flink Session Cluster, the client connects to a
+  pre-existing, long-running cluster that can accept multiple job submissions.
+  Even after all jobs are finished, the cluster (and the Flink Master) will
+  keep running until the session is manually stopped. The lifetime of a Flink
+  Session Cluster is therefore not bound to the lifetime of any Flink Job.
+
+* **Resource Isolation**: TaskManager slots are allocated by the
+  ResourceManager on job submission and released once the job is finished.
+  Because all jobs are sharing the same cluster, there is some competition for
+  cluster resources — like network bandwidth in the submit-job phase. One
+  limitation of this shared setup is that if one TaskManager crashes, then all
+  jobs that have tasks running on this worker will fail; in a similar way, if
+  some fatal error occurs on the Flink Master, it will affect all jobs running
+  in the cluster.
+
+* **Other considerations**: having a pre-existing cluster saves a considerable
+  amount of time applying for resources and starting TaskManagers. This is
+  important in scenarios where the execution time of jobs is very short and a
+  high startup time would negatively impact the end-to-end user experience — as
+  is the case with interactive analysis of short queries, where it is desirable
+  that jobs can quickly perform computations using existing resources.
+
+<div class="alert alert-info"> <strong>Note:</strong> Formerly, a Flink Session Cluster was also known as a Flink Cluster in <i>session mode</i>. </div>
+
+### Flink Job Cluster
+
+* **Cluster Lifecycle**: in a Flink Job Cluster, the available cluster manager
+  (like YARN or Kubernetes) is used to spin up a cluster for each submitted job
+  and this cluster is available to that job only. Here, the client first
+  requests resources from the cluster manager to start the Flink Master and
+  submits the job to the Dispatcher running inside this process. TaskManagers
+  are then lazily allocated based on the resource requirements of the job. Once
+  the job is finished, the Flink Job Cluster is torn down.
+
+* **Resource Isolation**: a fatal error in the Flink Master only ever affects
+  one job in a Flink Job Cluster.
+
+* **Other considerations**: because the ResourceManager has to apply and wait
+  for external resource management components to start the TaskManager
+  processes and allocate resources, Flink Job Clusters are more suited to large
+  jobs that are long-running, have high-stability requirements and are not
+  sensitive to higher startup times.
+
+<div class="alert alert-info"> <strong>Note:</strong> Formerly, a Flink Job Cluster was also known as a Flink Cluster in <i>job (or per-job) mode</i>. </div>
+
+### Flink Application Cluster
+
+* **Cluster Lifecycle**: a Flink Application Cluster is a dedicated Flink
+  cluster that only executes jobs from one Flink Application and where the
+  ``main()`` method runs on the cluster rather than the client. The job
+  submission is a one-step process: you don’t need to start a Flink cluster
+  first and then submit a job to the existing cluster session; instead, you
+  package your application logic and dependencies into an executable job JAR and
+  the cluster entrypoint ([ApplicationClusterEntryPoint]({{ site.baseurl
+  }}/api/java/index.html?org/apache/flink/container/entrypoint/StandaloneJobClusterEntryPoint.html))
+  is responsible for calling the ``main()`` method to extract the JobGraph.
+  This allows you to deploy a Flink Application like any other application on
+  Kubernetes, for example. The lifetime of a Flink Application Cluster is
+  therefore bound to the lifetime of the Flink Application.
+
+* **Resource Isolation**: in a Flink Application Cluster, the ResourceManager
+  and Dispatcher are scoped to a single Flink Application, which provides a
+  better separation of concerns than the Flink Session Cluster.
+
+<div class="alert alert-info"> <strong>Note:</strong> A Flink Job Cluster can be seen as a “run-on-client” alternative to Flink Application Clusters. </div>
+
+{% top %}
+
+## Self-contained Flink Applications
+
+When you want to do something like event-driven applications, it doesn’t make
+sense that you have to think about and manage clusters. So, there are efforts
+in the community towards enabling _Flink-as-a-Library_ in the future.
+
+The idea is that deploying a Flink Application becomes as easy as starting a
+process: Flink would be as any other library which you add to your application
+and does not affect how you deploy it. If you want to deploy such an
+application, it simply starts a set of processes which connect to each other,
+figure out their roles (e.g. JobManager, TaskManager) and execute the
+application in a distributed, parallel way. If the application cannot keep up
+with the workload, you simply start some new processes to rescale.

Review comment:
       From what I understood in conversation with Stephan and Aljoscha, library mode is still "not there" yet. The idea here would mainly be to clear the myth that you always need a cluster to run Flink, and then the progress of this effort can be discussed in the "Operations" docs. 
   
   I do like your suggestion to hint at auto-scaling rather than implying that the processes need to be scaled manually and will rephrase some sentences!
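
The session-cluster bullets in the diff above (slots allocated on job submission, released when the job finishes, and a TaskManager crash failing every job with a task on that worker) can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is plain Java, not Flink code, and `SessionClusterSketch` and all of its methods are hypothetical names invented for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy model (NOT Flink API): slot bookkeeping in a shared, long-running cluster. */
final class SessionClusterSketch {
    // slotOwner[i] holds the name of the job using slot i, or null if the slot is free.
    private final String[] slotOwner;
    private final int slotsPerTaskManager;

    SessionClusterSketch(int taskManagers, int slotsPerTaskManager) {
        this.slotOwner = new String[taskManagers * slotsPerTaskManager];
        this.slotsPerTaskManager = slotsPerTaskManager;
    }

    /** Allocates slots at submission time; returns false if too few slots are free. */
    boolean submit(String job, int slots) {
        if (freeSlots() < slots) {
            return false;
        }
        for (int i = 0; i < slotOwner.length && slots > 0; i++) {
            if (slotOwner[i] == null) {
                slotOwner[i] = job;
                slots--;
            }
        }
        return true;
    }

    /** Releases the slots of a finished job; the cluster itself keeps running. */
    void finish(String job) {
        for (int i = 0; i < slotOwner.length; i++) {
            if (job.equals(slotOwner[i])) {
                slotOwner[i] = null;
            }
        }
    }

    int freeSlots() {
        int free = 0;
        for (String owner : slotOwner) {
            if (owner == null) {
                free++;
            }
        }
        return free;
    }

    /** Jobs with at least one slot on the given TaskManager, i.e. jobs hit by its crash. */
    Set<String> jobsAffectedByCrash(int taskManager) {
        Set<String> affected = new TreeSet<>();
        int start = taskManager * slotsPerTaskManager;
        for (int i = start; i < start + slotsPerTaskManager; i++) {
            if (slotOwner[i] != null) {
                affected.add(slotOwner[i]);
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        SessionClusterSketch cluster = new SessionClusterSketch(2, 2);
        cluster.submit("jobA", 3); // takes both slots of TaskManager 0 and one of TaskManager 1
        cluster.submit("jobB", 1); // takes the remaining slot of TaskManager 1
        System.out.println(cluster.jobsAffectedByCrash(1)); // prints [jobA, jobB]
        cluster.finish("jobA");
        System.out.println(cluster.freeSlots()); // prints 3
    }
}
```

The model deliberately ignores scheduling, slot sharing and recovery; it only shows why jobs in a shared cluster are not isolated from one another.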





[GitHub] [flink] morsapaes commented on a change in pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

morsapaes commented on a change in pull request #12057:
URL: https://github.com/apache/flink/pull/12057#discussion_r424946648



##########
File path: docs/concepts/flink-architecture.md
##########
@@ -61,13 +59,53 @@ frameworks like [YARN]({{ site.baseurl }}{% link ops/deployment/yarn_setup.md
 TaskManagers connect to Flink Masters, announcing themselves as available, and
 are assigned work.
 
-The *client* is not part of the runtime and program execution, but is used to
-prepare and send a dataflow to the Flink Master.  After that, the client can
-disconnect, or stay connected to receive progress reports. The client runs
-either as part of the Java/Scala program that triggers the execution, or in the
-command line process `./bin/flink run ...`.
+### Flink Master
+
+The _Flink Master_ coordinates the distributed execution of Flink Applications:
+it decides when to schedule the next task (or set of tasks), reacts to finished
+tasks or execution failures, coordinates checkpoints, coordinates recovery on
+failures, among others. This process consists of three different components:
+
+  * **ResourceManager** 
 
-<img src="{{ site.baseurl }}/fig/processes.svg" alt="The processes involved in executing a Flink dataflow" class="offset" width="80%" />
+    The _ResourceManager_ is responsible for resource de-/allocation and
+    provisioning in a Flink cluster — it manages TaskManager slots, Flink’s
+    smallest resource processing unit (see [Flink Workers](#flink-workers)).
+    Flink implements multiple ResourceManagers for different environments and
+    resource providers such as YARN, Mesos, Kubernetes and standalone
+    deployments. In a standalone setup, the ResourceManager can only distribute
+    the slots of available TaskManagers and cannot start new TaskManagers on
+    its own.  
+
+  * **Dispatcher** 
+
+    The _Dispatcher_ provides a REST interface to submit Flink applications for
+    execution and starts a new JobManager component for each submitted job. It
+    also runs the Flink WebUI to provide information about job executions.
+
+  * **JobManager** 
+
+    The _JobManager_ is responsible for managing the execution of a single
+    [JobGraph]({{ site.baseurl }}/concepts/glossary.html#logical-graph).
+    Multiple jobs can run simultaneously in a Flink cluster, each having its
+    own JobManager.
+
+There is always at least one Flink Master. A high-availability setup might have
+multiple Flink Masters, one of which is always the *leader*, and the others are
+*standby* (see [High Availability (HA)]({{ site.baseurl
+}}/ops/jobmanager_high_availability.html)).
+
+### Flink Workers
+
+The *TaskManagers* (also called *workers*) execute the tasks (or more
+specifically, the subtasks) of a dataflow, and buffer and exchange the data

Review comment:
       This was a carryover from the base that was already there. Agree and will remove.
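
The Dispatcher and JobManager bullets in the diff above say that the Dispatcher starts a new JobManager component for each submitted job and that each JobManager manages a single JobGraph. A minimal sketch of that relationship follows. It assumes nothing about Flink's real classes: `DispatcherSketch`, `JobManagerSketch` and their methods are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model (NOT Flink API): a dispatcher that starts one job manager per submitted job. */
final class DispatcherSketch {

    /** Hypothetical stand-in for the component that manages exactly one JobGraph. */
    static final class JobManagerSketch {
        final String jobGraphName;

        JobManagerSketch(String jobGraphName) {
            this.jobGraphName = jobGraphName;
        }
    }

    // One entry per running job, keyed by the (hypothetical) JobGraph name.
    private final Map<String, JobManagerSketch> jobManagers = new LinkedHashMap<>();

    /** Accepts a job and starts a dedicated job manager for it. */
    JobManagerSketch submit(String jobGraphName) {
        if (jobManagers.containsKey(jobGraphName)) {
            throw new IllegalStateException("already running: " + jobGraphName);
        }
        JobManagerSketch jobManager = new JobManagerSketch(jobGraphName);
        jobManagers.put(jobGraphName, jobManager);
        return jobManager;
    }

    /** Drops the job manager of a finished job; the dispatcher itself stays up. */
    void jobFinished(String jobGraphName) {
        jobManagers.remove(jobGraphName);
    }

    int runningJobManagers() {
        return jobManagers.size();
    }

    public static void main(String[] args) {
        DispatcherSketch dispatcher = new DispatcherSketch();
        JobManagerSketch first = dispatcher.submit("wordcount");
        JobManagerSketch second = dispatcher.submit("clickstream");
        System.out.println(first != second);                 // prints true
        System.out.println(dispatcher.runningJobManagers()); // prints 2
    }
}
```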





[GitHub] [flink] morsapaes commented on pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

morsapaes commented on pull request #12057:
URL: https://github.com/apache/flink/pull/12057#issuecomment-626298663


   Pinging @aljoscha and @alpinegizmo (who had this rework in mind as well).



[GitHub] [flink] flinkbot edited a comment on pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

flinkbot edited a comment on pull request #12057:
URL: https://github.com/apache/flink/pull/12057#issuecomment-626300700


   ## CI report:
   
   * 41c4429c8ec2db10b05f65da3317ed08f2732823 Travis: [FAILURE](https://travis-ci.com/github/flink-ci/flink/builds/165979160) Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1281) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>



[GitHub] [flink] flinkbot commented on pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

flinkbot commented on pull request #12057:
URL: https://github.com/apache/flink/pull/12057#issuecomment-626298922

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

## Automated Checks
Last check on commit fb921c4d5309beb15e5730dd16c1a0f8b111ab86 (Sun May 10 09:26:46 UTC 2020)

**Warnings:**
* Documentation files were touched, but no `.zh.md` files: Update Chinese documentation or file Jira ticket.

<sub>Mention the bot in a comment to re-run the automated checks.</sub>
## Review Progress

* ❓ 1. The [description] looks good.
* ❓ 2. There is [consensus] that the contribution should go into Flink.
* ❓ 3. Needs [attention] from.
* ❓ 4. The change fits into the overall [architecture].
* ❓ 5. Overall code [quality] is good.

Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required. <summary>Bot commands</summary>
The @flinkbot bot supports the following commands:

- `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
- `@flinkbot approve all` to approve all aspects
- `@flinkbot approve-until architecture` to approve everything until `architecture`
- `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
- `@flinkbot disapprove architecture` to remove an approval you gave earlier
</details>


[GitHub] [flink] flinkbot edited a comment on pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

flinkbot edited a comment on pull request #12057:
URL: https://github.com/apache/flink/pull/12057#issuecomment-626300700


   ## CI report:
   
   * fb921c4d5309beb15e5730dd16c1a0f8b111ab86 Travis: [FAILURE](https://travis-ci.com/github/flink-ci/flink/builds/164993430) Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=911) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>



[GitHub] [flink] aljoscha commented on a change in pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

aljoscha commented on a change in pull request #12057:
URL: https://github.com/apache/flink/pull/12057#discussion_r426176348



##########
File path: docs/concepts/flink-architecture.md
##########
@@ -129,4 +167,108 @@ two main benefits:
 
 <img src="{{ site.baseurl }}/fig/slot_sharing.svg" alt="TaskManagers with shared Task Slots" class="offset" width="80%" />
 
+## Flink Application Execution
+
+A _Flink Application_ is any user program that spawns one or multiple Flink
+jobs from its ``main()`` method. The execution of these jobs can happen in a
+local JVM (``LocalEnvironment``) or on a remote setup of clusters with multiple
+machines (``RemoteEnvironment``). For each program, the
+[``ExecutionEnvironment``]({{ site.baseurl }}/api/java/) provides methods to
+control the job execution (e.g. setting the parallelism) and to interact with
+the outside world (see [Anatomy of a Flink Program]({{ site.baseurl
+}}/dev/api_concepts.html#anatomy-of-a-flink-program)).
+
+The jobs of a Flink Application can either be submitted to a long-running
+[Flink Session Cluster]({{ site.baseurl
+}}/concepts/glossary.html#flink-session-cluster), a dedicated [Flink Job
+Cluster]({{ site.baseurl }}/concepts/glossary.html#flink-job-cluster) or a
+[Flink Application Cluster]({{ site.baseurl
+}}/concepts/glossary.html#flink-application-cluster). The difference between
+these options is mainly related to the cluster’s lifecycle and to resource
+isolation guarantees.
+
+### Flink Session Cluster
+
+* **Cluster Lifecycle**: in a Flink Session Cluster, the client connects to a
+  pre-existing, long-running cluster that can accept multiple job submissions.
+  Even after all jobs are finished, the cluster (and the Flink Master) will
+  keep running until the session is manually stopped. The lifetime of a Flink
+  Session Cluster is therefore not bound to the lifetime of any Flink Job.
+
+* **Resource Isolation**: TaskManager slots are allocated by the
+  ResourceManager on job submission and released once the job is finished.
+  Because all jobs are sharing the same cluster, there is some competition for
+  cluster resources — like network bandwidth in the submit-job phase. One
+  limitation of this shared setup is that if one TaskManager crashes, then all
+  jobs that have tasks running on this worker will fail; in a similar way, if
+  some fatal error occurs on the Flink Master, it will affect all jobs running
+  in the cluster.
+
+* **Other considerations**: having a pre-existing cluster saves a considerable
+  amount of time applying for resources and starting TaskManagers. This is
+  important in scenarios where the execution time of jobs is very short and a
+  high startup time would negatively impact the end-to-end user experience — as
+  is the case with interactive analysis of short queries, where it is desirable
+  that jobs can quickly perform computations using existing resources.
+
+<div class="alert alert-info"> <strong>Note:</strong> Formerly, a Flink Session Cluster was also known as a Flink Cluster in <i>session mode</i>. </div>
+
+### Flink Job Cluster
+
+* **Cluster Lifecycle**: in a Flink Job Cluster, the available cluster manager
+  (like YARN or Kubernetes) is used to spin up a cluster for each submitted job
+  and this cluster is available to that job only. Here, the client first
+  requests resources from the cluster manager to start the Flink Master and
+  submits the job to the Dispatcher running inside this process. TaskManagers
+  are then lazily allocated based on the resource requirements of the job. Once
+  the job is finished, the Flink Job Cluster is torn down.
+
+* **Resource Isolation**: a fatal error in the Flink Master only ever affects
+  one job in a Flink Job Cluster.
+
+* **Other considerations**: because the ResourceManager has to apply and wait
+  for external resource management components to start the TaskManager
+  processes and allocate resources, Flink Job Clusters are more suited to large
+  jobs that are long-running, have high-stability requirements and are not
+  sensitive to higher startup times.
+
+<div class="alert alert-info"> <strong>Note:</strong> Formerly, a Flink Job Cluster was also known as a Flink Cluster in <i>job (or per-job) mode</i>. </div>
+
+### Flink Application Cluster
+
+* **Cluster Lifecycle**: a Flink Application Cluster is a dedicated Flink
+  cluster that only executes jobs from one Flink Application and where the
+  ``main()`` method runs on the cluster rather than the client. The job
+  submission is a one-step process: you don’t need to start a Flink cluster
+  first and then submit a job to the existing cluster session; instead, you
+  package your application logic and dependencies into an executable job JAR and
+  the cluster entrypoint ([ApplicationClusterEntryPoint]({{ site.baseurl
+  }}/api/java/index.html?org/apache/flink/container/entrypoint/StandaloneJobClusterEntryPoint.html))
+  is responsible for calling the ``main()`` method to extract the JobGraph.
+  This allows you to deploy a Flink Application like any other application on
+  Kubernetes, for example. The lifetime of a Flink Application Cluster is
+  therefore bound to the lifetime of the Flink Application.

Review comment:
       @alpinegizmo I think this confusion arose because people tried forcing "application mode" for the thing that was "job mode" before we had our terminology/features more sorted.
   
   Doesn't the usage here more or less match your "thing that evolves over time and as it gets deployed comes to life as a sequence of one flink job after another"?





[GitHub] [flink] morsapaes commented on a change in pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

morsapaes commented on a change in pull request #12057:
URL: https://github.com/apache/flink/pull/12057#discussion_r424955104



##########
File path: docs/concepts/flink-architecture.md
##########
@@ -129,4 +167,108 @@ two main benefits:
 
 <img src="{{ site.baseurl }}/fig/slot_sharing.svg" alt="TaskManagers with shared Task Slots" class="offset" width="80%" />
 
+## Flink Application Execution
+
+A _Flink Application_ is any user program that spawns one or multiple Flink
+jobs from its ``main()`` method. The execution of these jobs can happen in a
+local JVM (``LocalEnvironment``) or on a remote setup of clusters with multiple
+machines (``RemoteEnvironment``). For each program, the
+[``ExecutionEnvironment``]({{ site.baseurl }}/api/java/) provides methods to
+control the job execution (e.g. setting the parallelism) and to interact with
+the outside world (see [Anatomy of a Flink Program]({{ site.baseurl
+}}/dev/api_concepts.html#anatomy-of-a-flink-program)).
+
+The jobs of a Flink Application can either be submitted to a long-running
+[Flink Session Cluster]({{ site.baseurl
+}}/concepts/glossary.html#flink-session-cluster), a dedicated [Flink Job
+Cluster]({{ site.baseurl }}/concepts/glossary.html#flink-job-cluster) or a
+[Flink Application Cluster]({{ site.baseurl
+}}/concepts/glossary.html#flink-application-cluster). The difference between
+these options is mainly related to the cluster’s lifecycle and to resource
+isolation guarantees.
+
+### Flink Session Cluster
+
+* **Cluster Lifecycle**: in a Flink Session Cluster, the client connects to a
+  pre-existing, long-running cluster that can accept multiple job submissions.
+  Even after all jobs are finished, the cluster (and the Flink Master) will
+  keep running until the session is manually stopped. The lifetime of a Flink
+  Session Cluster is therefore not bound to the lifetime of any Flink Job.
+
+* **Resource Isolation**: TaskManager slots are allocated by the
+  ResourceManager on job submission and released once the job is finished.
+  Because all jobs are sharing the same cluster, there is some competition for
+  cluster resources — like network bandwidth in the submit-job phase. One
+  limitation of this shared setup is that if one TaskManager crashes, then all
+  jobs that have tasks running on this worker will fail; in a similar way, if
+  some fatal error occurs on the Flink Master, it will affect all jobs running
+  in the cluster.
+
+* **Other considerations**: having a pre-existing cluster saves a considerable
+  amount of time applying for resources and starting TaskManagers. This is
+  important in scenarios where the execution time of jobs is very short and a
+  high startup time would negatively impact the end-to-end user experience — as
+  is the case with interactive analysis of short queries, where it is desirable
+  that jobs can quickly perform computations using existing resources.
+
+<div class="alert alert-info"> <strong>Note:</strong> Formerly, a Flink Session Cluster was also known as a Flink Cluster in <i>session mode</i>. </div>
+
+### Flink Job Cluster
+
+* **Cluster Lifecycle**: in a Flink Job Cluster, the available cluster manager
+  (like YARN or Kubernetes) is used to spin up a cluster for each submitted job
+  and this cluster is available to that job only. Here, the client first
+  requests resources from the cluster manager to start the Flink Master and
+  submits the job to the Dispatcher running inside this process. TaskManagers
+  are then lazily allocated based on the resource requirements of the job. Once
+  the job is finished, the Flink Job Cluster is torn down.
+
+* **Resource Isolation**: a fatal error in the Flink Master only ever affects
+  one job in a Flink Job Cluster.
+
+* **Other considerations**: because the ResourceManager has to apply and wait
+  for external resource management components to start the TaskManager
+  processes and allocate resources, Flink Job Clusters are more suited to large
+  jobs that are long-running, have high-stability requirements and are not
+  sensitive to higher startup times.
+
+<div class="alert alert-info"> <strong>Note:</strong> Formerly, a Flink Job Cluster was also known as a Flink Cluster in <i>job (or per-job) mode</i>. </div>
+
+### Flink Application Cluster
+
+* **Cluster Lifecycle**: a Flink Application Cluster is a dedicated Flink
+  cluster that only executes jobs from one Flink Application and where the
+  ``main()`` method runs on the cluster rather than the client. The job
+  submission is a one-step process: you don’t need to start a Flink cluster
+  first and then submit a job to the existing cluster session; instead, you
+  package your application logic and dependencies into a executable job JAR and
+  the cluster entrypoint ([ApplicationClusterEntryPoint]({{ site.baseurl
+  }}/api/java/index.html?org/apache/flink/container/entrypoint/StandaloneJobClusterEntryPoint.html))
+  is responsible for calling the ``main()`` method to extract the JobGraph.
+  This allows you to deploy a Flink Application like any other application on
+  Kubernetes, for example. The lifetime of a Flink Application Cluster is
+  therefore bound to the lifetime of the Flink Application.
+
+* **Resource Isolation**: in a Flink Application Cluster, the ResourceManager
+  and Dispatcher are scoped to a single Flink Application, which provides a
+  better separation of concerns than the Flink Session Cluster.
+
+<div class="alert alert-info"> <strong>Note:</strong> A Flink Job Cluster can be seen as a “run-on-client” alternative to Flink Application Clusters. </div>
+
+{% top %}
+
+## Self-contained Flink Applications
+
+When you want to do something like event-driven applications, it doesn’t make
+sense that you have to think about and manage clusters. So, there are efforts
+in the community towards enabling _Flink-as-a-Library_ in the future.
+
+The idea is that deploying a Flink Application becomes as easy as starting a
+process: Flink would be like any other library that you add to your application,
+and it would not affect how you deploy it. If you want to deploy such an
+application, it simply starts a set of processes which connect to each other,
+figure out their roles (e.g. JobManager, TaskManager) and execute the
+application in a distributed, parallel way. If the application cannot keep up
+with the workload, you simply start some new processes to rescale.

Review comment:
       From what I understood in conversation with Stephan and Aljoscha, library mode is not "there" yet. The main idea here is to dispel the myth that you always need a cluster to run Flink; progress on this effort can be discussed in the "Operations" docs. 
   
   I do like your suggestion to hint at auto-scaling rather than implying that the processes need to be scaled manually and will rephrase some sentences!





[GitHub] [flink] alpinegizmo commented on a change in pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

alpinegizmo commented on a change in pull request #12057:
URL: https://github.com/apache/flink/pull/12057#discussion_r426290315



##########
File path: docs/concepts/flink-architecture.md
##########
@@ -129,4 +167,108 @@ two main benefits:
 
 <img src="{{ site.baseurl }}/fig/slot_sharing.svg" alt="TaskManagers with shared Task Slots" class="offset" width="80%" />
 
+## Flink Application Execution
+
+A _Flink Application_ is any user program that spawns one or multiple Flink
+jobs from its ``main()`` method. The execution of these jobs can happen in a
+local JVM (``LocalEnvironment``) or on a remote setup of clusters with multiple
+machines (``RemoteEnvironment``). For each program, the
+[``ExecutionEnvironment``]({{ site.baseurl }}/api/java/) provides methods to
+control the job execution (e.g. setting the parallelism) and to interact with
+the outside world (see [Anatomy of a Flink Program]({{ site.baseurl
+}}/dev/api_concepts.html#anatomy-of-a-flink-program)).
+
+The jobs of a Flink Application can either be submitted to a long-running
+[Flink Session Cluster]({{ site.baseurl
+}}/concepts/glossary.html#flink-session-cluster), a dedicated [Flink Job
+Cluster]({{ site.baseurl }}/concepts/glossary.html#flink-job-cluster) or a
+[Flink Application Cluster]({{ site.baseurl
+}}/concepts/glossary.html#flink-application-cluster). The difference between
+these options is mainly related to the cluster’s lifecycle and to resource
+isolation guarantees.
+
+### Flink Session Cluster
+
+* **Cluster Lifecycle**: in a Flink Session Cluster, the client connects to a
+  pre-existing, long-running cluster that can accept multiple job submissions.
+  Even after all jobs are finished, the cluster (and the Flink Master) will
+  keep running until the session is manually stopped. The lifetime of a Flink
+  Session Cluster is therefore not bound to the lifetime of any Flink Job.
+
+* **Resource Isolation**: TaskManager slots are allocated by the
+  ResourceManager on job submission and released once the job is finished.
+  Because all jobs are sharing the same cluster, there is some competition for
+  cluster resources — like network bandwidth in the submit-job phase. One
+  limitation of this shared setup is that if one TaskManager crashes, then all
+  jobs that have tasks running on this worker will fail; in a similar way, if
+  some fatal error occurs on the Flink Master, it will affect all jobs running
+  in the cluster.
+
+* **Other considerations**: having a pre-existing cluster saves a considerable
+  amount of time applying for resources and starting TaskManagers. This is
+  important in scenarios where the execution time of jobs is very short and a
+  high startup time would negatively impact the end-to-end user experience — as
+  is the case with interactive analysis of short queries, where it is desirable
+  that jobs can quickly perform computations using existing resources.
+
+<div class="alert alert-info"> <strong>Note:</strong> Formerly, a Flink Session Cluster was also known as a Flink Cluster in <i>session mode</i>. </div>
+
+### Flink Job Cluster
+
+* **Cluster Lifecycle**: in a Flink Job Cluster, the available cluster manager
+  (like YARN or Kubernetes) is used to spin up a cluster for each submitted job
+  and this cluster is available to that job only. Here, the client first
+  requests resources from the cluster manager to start the Flink Master and
+  submits the job to the Dispatcher running inside this process. TaskManagers
+  are then lazily allocated based on the resource requirements of the job. Once
+  the job is finished, the Flink Job Cluster is torn down.
+
+* **Resource Isolation**: a fatal error in the Flink Master only ever affects
+  one job in a Flink Job Cluster.
+
+* **Other considerations**: because the ResourceManager has to apply and wait
+  for external resource management components to start the TaskManager
+  processes and allocate resources, Flink Job Clusters are more suited to large
+  jobs that are long-running, have high-stability requirements and are not
+  sensitive to higher startup times.
+
+<div class="alert alert-info"> <strong>Note:</strong> Formerly, a Flink Job Cluster was also known as a Flink Cluster in <i>job (or per-job) mode</i>. </div>
+
+### Flink Application Cluster
+
+* **Cluster Lifecycle**: a Flink Application Cluster is a dedicated Flink
+  cluster that only executes jobs from one Flink Application and where the
+  ``main()`` method runs on the cluster rather than the client. The job
+  submission is a one-step process: you don’t need to start a Flink cluster
+  first and then submit a job to the existing cluster session; instead, you
+  package your application logic and dependencies into an executable job JAR and
+  the cluster entrypoint ([ApplicationClusterEntryPoint]({{ site.baseurl
+  }}/api/java/index.html?org/apache/flink/container/entrypoint/StandaloneJobClusterEntryPoint.html))
+  is responsible for calling the ``main()`` method to extract the JobGraph.
+  This allows you to deploy a Flink Application like any other application on
+  Kubernetes, for example. The lifetime of a Flink Application Cluster is
+  therefore bound to the lifetime of the Flink Application.

Review comment:
       @aljoscha Not at all. I was trying to evoke a definition of application that's akin to what one gets with a vvP deployment, whereas the definition here in this section is speaking about an application as a single program that spawns multiple jobs on a one-time basis. 
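
To make the definition used in the quoted section concrete, here is a minimal, hypothetical sketch in plain Java of the three cluster lifecycles it describes. It is a toy model under stated assumptions, not Flink code: `ClusterLifecycleSketch`, `ToyCluster`, `Mode`, and every method name are invented for illustration and do not correspond to Flink classes or APIs.

```java
/** Toy model of the three cluster lifecycles; nothing here is a Flink API. */
final class ClusterLifecycleSketch {

    /** Hypothetical labels for the three options described in the quoted section. */
    enum Mode { SESSION, JOB, APPLICATION }

    static final class ToyCluster {
        private final Mode mode;
        private int submittedJobs;
        private int runningJobs;
        private boolean applicationDone; // only meaningful in APPLICATION mode
        private boolean running = true;

        ToyCluster(Mode mode) {
            this.mode = mode;
        }

        void submitJob() {
            if (!running) {
                throw new IllegalStateException("cluster has been torn down");
            }
            if (mode == Mode.JOB && submittedJobs == 1) {
                throw new IllegalStateException("a job cluster serves exactly one job");
            }
            submittedJobs++;
            runningJobs++;
        }

        void finishJob() {
            if (runningJobs == 0) {
                throw new IllegalStateException("no running job to finish");
            }
            runningJobs--;
            tearDownIfLifetimeEnded();
        }

        /** APPLICATION mode: main() has returned, so no further jobs will be spawned. */
        void markApplicationDone() {
            applicationDone = true;
            tearDownIfLifetimeEnded();
        }

        /** Explicit stop; in SESSION mode this is the only way the cluster ends. */
        void stopManually() {
            running = false;
        }

        boolean isRunning() {
            return running;
        }

        private void tearDownIfLifetimeEnded() {
            switch (mode) {
                case JOB:
                    // Lifetime bound to the single job.
                    if (submittedJobs == 1 && runningJobs == 0) {
                        running = false;
                    }
                    break;
                case APPLICATION:
                    // Lifetime bound to the application, which may spawn several jobs.
                    if (applicationDone && runningJobs == 0) {
                        running = false;
                    }
                    break;
                case SESSION:
                    // Lifetime not bound to any job.
                    break;
            }
        }
    }

    public static void main(String[] args) {
        ToyCluster session = new ToyCluster(Mode.SESSION);
        session.submitJob();
        session.finishJob();
        System.out.println("session cluster running: " + session.isRunning());

        ToyCluster job = new ToyCluster(Mode.JOB);
        job.submitJob();
        job.finishJob();
        System.out.println("job cluster running: " + job.isRunning());

        ToyCluster app = new ToyCluster(Mode.APPLICATION);
        app.submitJob();
        app.submitJob();
        app.markApplicationDone();
        app.finishJob();
        app.finishJob();
        System.out.println("application cluster running: " + app.isRunning());
    }
}
```

In this model, the session cluster is still running after its job finishes, while the job cluster and the application cluster have been torn down. The only difference between the latter two is that the application cluster accepts several jobs from one program before it ends.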





[GitHub] [flink] flinkbot edited a comment on pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

flinkbot edited a comment on pull request #12057:
URL: https://github.com/apache/flink/pull/12057#issuecomment-626298922

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

## Automated Checks
Last check on commit 41c4429c8ec2db10b05f65da3317ed08f2732823 (Fri Oct 16 10:54:29 UTC 2020)

**Warnings:**
* Documentation files were touched, but no `.zh.md` files: Update Chinese documentation or file Jira ticket.

<sub>Mention the bot in a comment to re-run the automated checks.</sub>
## Review Progress


[GitHub] [flink] flinkbot edited a comment on pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

flinkbot edited a comment on pull request #12057:
URL: https://github.com/apache/flink/pull/12057#issuecomment-626300700


   ## CI report:
   
   * 041714738531dd2a7247fa27e98c2a0202f497ac Travis: [FAILURE](https://travis-ci.com/github/flink-ci/flink/builds/165501981) Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1114) 
   * 41c4429c8ec2db10b05f65da3317ed08f2732823 Travis: [PENDING](https://travis-ci.com/github/flink-ci/flink/builds/165979160) Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1281) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>



[GitHub] [flink] aljoscha closed pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

aljoscha closed pull request #12057:
URL: https://github.com/apache/flink/pull/12057


   



[GitHub] [flink] flinkbot commented on pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

flinkbot commented on pull request #12057:
URL: https://github.com/apache/flink/pull/12057#issuecomment-626300700


   ## CI report:
   
   * fb921c4d5309beb15e5730dd16c1a0f8b111ab86 UNKNOWN
   



[GitHub] [flink] flinkbot edited a comment on pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

flinkbot edited a comment on pull request #12057:
URL: https://github.com/apache/flink/pull/12057#issuecomment-626300700


   ## CI report:
   
   * 41c4429c8ec2db10b05f65da3317ed08f2732823 Travis: [FAILURE](https://travis-ci.com/github/flink-ci/flink/builds/165979160) Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1281) 
   



[GitHub] [flink] morsapaes commented on a change in pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

morsapaes commented on a change in pull request #12057:
URL: https://github.com/apache/flink/pull/12057#discussion_r425146659



##########
File path: docs/concepts/flink-architecture.md
##########

Review comment:
       There is definitely room for confusion. In this case, I was trying to reduce that by being consistent with the definition in the Glossary: https://ci.apache.org/projects/flink/flink-docs-master/concepts/glossary.html#flink-application
   
   In the (near) future, it would probably be good to make this clearer by e.g. dropping the job cluster terminology and using just the new one. In the end:
   
   "The "per job mode" is a bit of a (rarer) special case of the application, pulling out dataflow graph assembly, but otherwise is the same. To avoid overcomplicating this early, we were thinking to not have this in the concepts, but only in the detailed ops docs."
   
   Users might be too familiar with job clusters/per-job mode for us to just wipe the term out of sight and replace it, though.





[GitHub] [flink] alpinegizmo commented on a change in pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

alpinegizmo commented on a change in pull request #12057:
URL: https://github.com/apache/flink/pull/12057#discussion_r425124863



##########
File path: docs/concepts/flink-architecture.md
##########

Review comment:
       I think we have a terminology problem here. I've heard some folks use the term "application" to talk about the thing that evolves over time and, as it gets deployed, comes to life as a sequence of one Flink job after another. I've also had it explained to me that "application cluster" is just updated terminology for "job cluster", and that it means the same thing. And then this section introduces yet another definition.





[GitHub] [flink] morsapaes commented on pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

morsapaes commented on pull request #12057:
URL: https://github.com/apache/flink/pull/12057#issuecomment-628530402


   Thanks for reviewing, @aljoscha and @alpinegizmo! It's weird that the links worked fine in the local build even with the duplicated reference to the base URL — I've corrected that and will also update the Documentation Style Guide.



[GitHub] [flink] flinkbot edited a comment on pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

flinkbot edited a comment on pull request #12057:
URL: https://github.com/apache/flink/pull/12057#issuecomment-626300700


   ## CI report:
   
   * 041714738531dd2a7247fa27e98c2a0202f497ac Travis: [FAILURE](https://travis-ci.com/github/flink-ci/flink/builds/165501981) Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=1114) 
   * 41c4429c8ec2db10b05f65da3317ed08f2732823 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------

[GitHub] [flink] aljoscha commented on pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

aljoscha commented on pull request #12057:
URL: https://github.com/apache/flink/pull/12057#issuecomment-633487066


   Thanks again, @morsapaes! I merged this.


----------------------------------------------------------------

[GitHub] [flink] flinkbot edited a comment on pull request #12057: [FLINK-16210][docs] Extending the "Flink Architecture" section.

Posted by GitBox <gi...@apache.org>.

flinkbot edited a comment on pull request #12057:
URL: https://github.com/apache/flink/pull/12057#issuecomment-626300700


   ## CI report:
   
   * fb921c4d5309beb15e5730dd16c1a0f8b111ab86 Travis: [PENDING](https://travis-ci.com/github/flink-ci/flink/builds/164993430) Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=911) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------