Posted to issues@flink.apache.org by StefanRRichter <gi...@git.apache.org> on 2017/02/03 11:03:47 UTC

[GitHub] flink pull request #3259: Documentation: Production readiness checklist

GitHub user StefanRRichter opened a pull request:

    https://github.com/apache/flink/pull/3259

    Documentation: Production readiness checklist

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StefanRRichter/flink DocuProductionReady

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3259.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3259
    
----
commit ecb13c817a420628bcbfd517e0addaccad4b61f2
Author: Stefan Richter <s....@data-artisans.com>
Date:   2017-02-03T11:01:44Z

    Documentation: Production readiness checklist

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #3259: Documentation: Production readiness checklist

Posted by uce <gi...@git.apache.org>.

Github user uce commented on the issue:

    https://github.com/apache/flink/pull/3259
  
    Looks good to me. Going to merge this...



[GitHub] flink pull request #3259: Documentation: Production readiness checklist

Posted by alpinegizmo <gi...@git.apache.org>.

Github user alpinegizmo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3259#discussion_r99375422
  
    --- Diff: docs/ops/production_ready.md ---
    @@ -0,0 +1,88 @@
    +---
    +title: "Production Readiness Checklist"
    +nav-parent_id: setup
    +nav-pos: 20
    +---
    +<!--
    +Licensed to the Apache Software Foundation (ASF) under one
    +or more contributor license agreements.  See the NOTICE file
    +distributed with this work for additional information
    +regarding copyright ownership.  The ASF licenses this file
    +to you under the Apache License, Version 2.0 (the
    +"License"); you may not use this file except in compliance
    +with the License.  You may obtain a copy of the License at
    +
    +  http://www.apache.org/licenses/LICENSE-2.0
    +
    +Unless required by applicable law or agreed to in writing,
    +software distributed under the License is distributed on an
    +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +KIND, either express or implied.  See the License for the
    +specific language governing permissions and limitations
    +under the License.
    +-->
    +
    +* ToC
    +{:toc}
    +
    +## Production Readiness Checklist
    +
    +Purpose of this production readiness checklist is to provide a condensed overview of configuration options that are
    +important and need **careful considerations** if you plan to bring your Flink job into **production**. For most of these options
    +Flink provides out-of-the-box defaults to make usage and adoption of Flink easier. For many users and scenarios, those
    +defaults are good starting points for development and completely sufficient for "one-shot" jobs. 
    +
    +However, once you are planning to bring a Flink appplication to production the requirements typically increase. For example,
    +you want your job to be (re-)scalable and to have a good upgrade story for your job and new Flink versions.
    +
    +In the following, we present a collection of configuration options that you should check before your job goes into production.
    +
    +### Set maximum parallelism for operators explicitly
    +
    +Maximum parallelism is a configuration parameter that is newly introduced in Flink 1.2 and has important implications
    +for the (re-)scalability of your Flink job. This parameter, which can be set on a per-job and/or per-operator granularity,
    +determines the maximum parallelism to which you can scale operators. It is important to understand that (as of now) there
    +is **now way to increase** this parameter after your job was initially started, except for restarting your job completely 
    --- End diff --
    
    no way
    
    after your job has been started, except



[GitHub] flink pull request #3259: Documentation: Production readiness checklist

Posted by alpinegizmo <gi...@git.apache.org>.

Github user alpinegizmo commented on a diff in the pull request:

https://github.com/apache/flink/pull/3259#discussion_r99376976

--- Diff: docs/ops/production_ready.md ---
@@ -0,0 +1,88 @@
+---
+title: "Production Readiness Checklist"
+nav-parent_id: setup
+nav-pos: 20
+---
+
+
+* ToC
+{:toc}
+
+## Production Readiness Checklist
+
+Purpose of this production readiness checklist is to provide a condensed overview of configuration options that are
+important and need **careful considerations** if you plan to bring your Flink job into **production**. For most of these options
+Flink provides out-of-the-box defaults to make usage and adoption of Flink easier. For many users and scenarios, those
+defaults are good starting points for development and completely sufficient for "one-shot" jobs.
+
+However, once you are planning to bring a Flink appplication to production the requirements typically increase. For example,
+you want your job to be (re-)scalable and to have a good upgrade story for your job and new Flink versions.
+
+In the following, we present a collection of configuration options that you should check before your job goes into production.
+
+### Set maximum parallelism for operators explicitly
+
+Maximum parallelism is a configuration parameter that is newly introduced in Flink 1.2 and has important implications
+for the (re-)scalability of your Flink job. This parameter, which can be set on a per-job and/or per-operator granularity,
+determines the maximum parallelism to which you can scale operators. It is important to understand that (as of now) there
+is **now way to increase** this parameter after your job was initially started, except for restarting your job completely
+from scratch (i.e. with a new state, and not from a previous checkpoint/savepoint). Even if Flink would provide some way
+to change maximum parallelism for existing savepoints in the future, you can already assume that for large states this is
+likely a long running operation that you want to avoid. At this point, you might wonder why not just to use a very high
+value as default for this parameter. The reason behind this is that high maximum parallelism can have some impact on your
+applications performance and even state sizes, because Flink has to maintain certain meta data for it's ability to rescale which
+can increase with the maximum parallelism. In general, you should chose a max parallelism that is high enough to fit your
+future needs in scalability, but keeping it as low as possible can give slightly better performance. In particular,
+a maximum parallelism higher that 128 will typically result in slightly bigger state snapshots from the keyed backends.
+
+Notice that maximum parallelism must fulfill the following conditions:
+
+`0 < parallelism <= max parallelism <= 2^15`
+
+You can set the maximum parallelism by `setMaxParallelism(int maxparallelism)`. By default, Flink will chose the maximum
+parallelism as a function of the parallelism when the job is first started:
+
+- `128` : for all parallelism <= 128.
+- `MIN(nextPowerOfTwo(parallelism + (parallelism / 2)), 2^15)` : for all parallelism > 128.
+
+### Set UUIDs for operators
+
+As mentioned in the documentation for [savepoints]({{ site.baseurl }}/setup/savepoints.html, users should set uids for
+operators. Those operator uids are important for Flink's mapping of operator states to operators which, in turn, is
+essential for savepoints. By default operator uids are generated by traversing the JobGraph and hashing certain operator
+properties. While this is comfortable from a user perspective, it is also very fragile to changes on the JobGraph (e.g.
+if you want to exchange an operator). To establish a stable mapping, we need stable operator uids provided by the user
+through `setUid(String uid)`.
+
+### Choice of state backend
+
+Currently, Flink has the limitation that it can only restore the state from a savepoint for the same state backend that
+took the savepoint. For example, this means that we can not take a savepoint with a memory state backend, then change
+the job to use RocksDB state backend and restore. While we are planning to make backends interoperable in the near
+future, they are not yet. This means you should carefully consider which backend you use for your job before going to
+production.
+
+In general, we recommend using RocksDB because this is currently the only state backend that supports large states (i.e.
+state that exceeds the available main memory) and asynchronous snapshots. From our experience, asynchronous snapshots are
+very important for large states because they do not block the operators and Flink can write the snapshots without stopping
+on the stream processing. However, RocksDB can have worse performance that e.g. the memory based state backends. If
--- End diff --

without stopping stream processing

worse performance than, for example, the memory-based state backends
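
For readers of this thread, the default maximum-parallelism rule and the validity condition quoted in the diff above can be sketched in plain Java. This is a minimal illustration of the rule as the checklist text states it, not Flink's own implementation: the class name `MaxParallelismSketch` and all method names are hypothetical, and `nextPowerOfTwo` is assumed to mean "smallest power of two greater than or equal to the argument".

```java
public final class MaxParallelismSketch {

    // Upper bound from the checklist text: max parallelism <= 2^15.
    public static final int UPPER_BOUND = 1 << 15;

    // Default from the checklist text for all parallelism <= 128.
    public static final int SMALL_JOB_DEFAULT = 128;

    private MaxParallelismSketch() {}

    /** Checks the condition 0 < parallelism <= max parallelism <= 2^15. */
    public static boolean isValid(int parallelism, int maxParallelism) {
        return 0 < parallelism
                && parallelism <= maxParallelism
                && maxParallelism <= UPPER_BOUND;
    }

    /** Assumption: the smallest power of two that is >= x, for x > 0. */
    public static int nextPowerOfTwo(int x) {
        int highest = Integer.highestOneBit(x);
        return highest == x ? x : highest << 1;
    }

    /** Default max parallelism as described in the checklist text. */
    public static int defaultMaxParallelism(int parallelism) {
        if (parallelism <= 0 || parallelism > UPPER_BOUND) {
            throw new IllegalArgumentException(
                    "parallelism must be in (0, 2^15], got " + parallelism);
        }
        if (parallelism <= SMALL_JOB_DEFAULT) {
            return SMALL_JOB_DEFAULT;
        }
        // MIN(nextPowerOfTwo(parallelism + (parallelism / 2)), 2^15)
        return Math.min(nextPowerOfTwo(parallelism + parallelism / 2), UPPER_BOUND);
    }

    public static void main(String[] args) {
        for (int p : new int[] {1, 128, 129, 200, 30000}) {
            System.out.println(p + " -> " + defaultMaxParallelism(p));
        }
    }
}
```

Under these assumptions, a job first started with parallelism 129 gets a default maximum of 256, and one started with parallelism 200 gets 512. This is why the checklist recommends setting the value explicitly: the default depends on the parallelism at first start and cannot be raised afterwards without discarding state.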


[GitHub] flink pull request #3259: Documentation: Production readiness checklist

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/3259



[GitHub] flink pull request #3259: Documentation: Production readiness checklist

Posted by alpinegizmo <gi...@git.apache.org>.

Github user alpinegizmo commented on a diff in the pull request:

https://github.com/apache/flink/pull/3259#discussion_r99375568

application's

metadata for its ability

[GitHub] flink pull request #3259: Documentation: Production readiness checklist

Posted by alpinegizmo <gi...@git.apache.org>.

Github user alpinegizmo commented on a diff in the pull request:

https://github.com/apache/flink/pull/3259#discussion_r99375767

choose

[GitHub] flink pull request #3259: Documentation: Production readiness checklist

Posted by alpinegizmo <gi...@git.apache.org>.

Github user alpinegizmo commented on a diff in the pull request:

https://github.com/apache/flink/pull/3259#discussion_r99376488

While this is comfortable from a user perspective, it is also very fragile, as changes to the JobGraph (e.g.
exchanging an operator) will result in new UUIDs.

[GitHub] flink pull request #3259: Documentation: Production readiness checklist

Posted by alpinegizmo <gi...@git.apache.org>.

Github user alpinegizmo commented on a diff in the pull request:

https://github.com/apache/flink/pull/3259#discussion_r99376685

to use a RocksDB