Posted to commits@flink.apache.org by tw...@apache.org on 2017/08/08 12:06:54 UTC

[5/9] flink git commit: [FLINK-7301] [docs] Rework state documentation

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/deployment/docker.md
----------------------------------------------------------------------
diff --git a/docs/ops/deployment/docker.md b/docs/ops/deployment/docker.md
new file mode 100644
index 0000000..4986f2a
--- /dev/null
+++ b/docs/ops/deployment/docker.md
@@ -0,0 +1,102 @@
+---
+title:  "Docker Setup"
+nav-title: Docker
+nav-parent_id: deployment
+nav-pos: 4
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+[Docker](https://www.docker.com) is a popular container runtime. There are
+official Docker images for Apache Flink available on Docker Hub which can be
+used directly or extended to better integrate into a production environment.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Official Docker Images
+
+The [official Docker repository](https://hub.docker.com/_/flink/) is
+hosted on Docker Hub and serves images of Flink version 1.2.1 and later.
+
+Images for each supported combination of Hadoop and Scala are available, and
+tag aliases are provided for convenience.
+
+For example, the following aliases can be used: *(`1.2.y` indicates the latest
+release of Flink 1.2)*
+
+* `flink:latest` →
+`flink:<latest-flink>-hadoop<latest-hadoop>-scala_<latest-scala>`
+* `flink:1.2` → `flink:1.2.y-hadoop27-scala_2.11`
+* `flink:1.2.1-scala_2.10` → `flink:1.2.1-hadoop27-scala_2.10`
+* `flink:1.2-hadoop26` → `flink:1.2.y-hadoop26-scala_2.11`
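+
+For example, to pull a specific image listed above:
+
+    docker pull flink:1.2.1-hadoop27-scala_2.11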
+
+<!-- NOTE: uncomment when docker-flink/docker-flink/issues/14 is resolved. -->
+<!--
+Additionally, images based on Alpine Linux are available. Reference them by
+appending `-alpine` to the tag. For the Alpine version of `flink:latest`, use
+`flink:alpine`.
+
+For example:
+
+* `flink:alpine`
+* `flink:1.2.1-alpine`
+* `flink:1.2-scala_2.10-alpine`
+-->
+
+**Note:** The docker images are provided as a community project by individuals
+on a best-effort basis. They are not official releases by the Apache Flink PMC.
+
+## Flink with Docker Compose
+
+[Docker Compose](https://docs.docker.com/compose/) is a convenient way to run a
+group of Docker containers locally.
+
+An [example config file](https://github.com/docker-flink/examples/blob/master/docker-compose.yml)
+is available on GitHub.
+
+### Usage
+
+* Launch a cluster in the foreground
+
+        docker-compose up
+
+* Launch a cluster in the background
+
+        docker-compose up -d
+
+* Scale the cluster up or down to *N* TaskManagers
+
+        docker-compose scale taskmanager=<N>
+
+When the cluster is running, you can visit the web UI at [http://localhost:8081](http://localhost:8081) and submit a job.
+
+To submit a job via the command line, you must copy the JAR to the Jobmanager
+container and submit the job from there.
+
+For example:
+
+{% raw %}
+    $ JOBMANAGER_CONTAINER=$(docker ps --filter name=jobmanager --format={{.ID}})
+    $ docker cp path/to/jar "$JOBMANAGER_CONTAINER":/job.jar
+    $ docker exec -t -i "$JOBMANAGER_CONTAINER" flink run /job.jar
+{% endraw %}
+
+{% top %}

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/deployment/gce_setup.md
----------------------------------------------------------------------
diff --git a/docs/ops/deployment/gce_setup.md b/docs/ops/deployment/gce_setup.md
new file mode 100644
index 0000000..2925737
--- /dev/null
+++ b/docs/ops/deployment/gce_setup.md
@@ -0,0 +1,93 @@
+---
+title:  "Google Compute Engine Setup"
+nav-title: Google Compute Engine
+nav-parent_id: deployment
+nav-pos: 6
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+
+This documentation provides instructions on how to set up Flink fully automatically with Hadoop 1 or Hadoop 2 on top of a [Google Compute Engine](https://cloud.google.com/compute/) cluster. This is made possible by Google's [bdutil](https://cloud.google.com/hadoop/bdutil), which starts a cluster and deploys Flink with Hadoop. To get started, just follow the steps below.
+
+* This will be replaced by the TOC
+{:toc}
+
+# Prerequisites
+
+## Install Google Cloud SDK
+
+Please follow the instructions on how to set up the [Google Cloud SDK](https://cloud.google.com/sdk/). In particular, make sure to authenticate with Google Cloud using the following command:
+
+    gcloud auth login
+
+## Install bdutil
+
+At the moment, there is no bdutil release that includes the Flink
+extension. However, you can get the latest version of bdutil with Flink support
+from [GitHub](https://github.com/GoogleCloudPlatform/bdutil):
+
+    git clone https://github.com/GoogleCloudPlatform/bdutil.git
+
+After you have downloaded the source, change into the newly created `bdutil` directory and continue with the next steps.
+
+# Deploying Flink on Google Compute Engine
+
+## Set up a bucket
+
+If you have not done so already, create a bucket for the bdutil config and staging files. A new bucket can be created with `gsutil`:
+
+    gsutil mb gs://<bucket_name>
+
+## Adapt the bdutil config
+
+To deploy Flink with bdutil, adapt at least the following variables in
+`bdutil_env.sh`:
+
+    CONFIGBUCKET="<bucket_name>"
+    PROJECT="<compute_engine_project_name>"
+    NUM_WORKERS=<number_of_workers>
+
+    # set this to 'n1-standard-2' if you're using the free trial
+    GCE_MACHINE_TYPE="<gce_machine_type>"
+
+    # for example: "europe-west1-d"
+    GCE_ZONE="<gce_zone>"
+
+## Adapt the Flink config
+
+bdutil's Flink extension handles the configuration for you. You may additionally adjust configuration variables in `extensions/flink/flink_env.sh`. If you want to make further configuration changes, please take a look at [configuring Flink](../config.html). You will have to restart Flink after changing its configuration, using `bin/stop-cluster` and `bin/start-cluster`.
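+
+For example, a restart after a configuration change could look like this (a sketch, assuming the default install location used in the example job below):
+
+    ./bdutil shell
+    cd /home/hadoop/flink-install
+    ./bin/stop-cluster
+    ./bin/start-cluster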
+
+## Bring up a cluster with Flink
+
+To bring up the Flink cluster on Google Compute Engine, execute:
+
+    ./bdutil -e extensions/flink/flink_env.sh deploy
+
+## Run a Flink example job
+
+    ./bdutil shell
+    cd /home/hadoop/flink-install/bin
+    ./flink run ../examples/batch/WordCount.jar gs://dataflow-samples/shakespeare/othello.txt gs://<bucket_name>/output
+
+## Shut down your cluster
+
+Shutting down a cluster is as simple as executing:
+
+    ./bdutil -e extensions/flink/flink_env.sh delete

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/deployment/index.md
----------------------------------------------------------------------
diff --git a/docs/ops/deployment/index.md b/docs/ops/deployment/index.md
new file mode 100644
index 0000000..e82299d
--- /dev/null
+++ b/docs/ops/deployment/index.md
@@ -0,0 +1,24 @@
+---
+title: "Clusters & Deployment"
+nav-id: deployment
+nav-parent_id: ops
+nav-pos: 1
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/deployment/kubernetes.md
----------------------------------------------------------------------
diff --git a/docs/ops/deployment/kubernetes.md b/docs/ops/deployment/kubernetes.md
new file mode 100644
index 0000000..0790a05
--- /dev/null
+++ b/docs/ops/deployment/kubernetes.md
@@ -0,0 +1,157 @@
+---
+title:  "Kubernetes Setup"
+nav-title: Kubernetes
+nav-parent_id: deployment
+nav-pos: 4
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+[Kubernetes](https://kubernetes.io) is a container orchestration system.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Simple Kubernetes Flink Cluster
+
+A basic Flink cluster deployment in Kubernetes has three components:
+
+* a Deployment for a single Jobmanager
+* a Deployment for a pool of Taskmanagers
+* a Service exposing the Jobmanager's RPC and UI ports
+
+### Launching the cluster
+
+Using the [resource definitions found below](#simple-kubernetes-flink-cluster-resources),
+launch the cluster with the `kubectl` command:
+
+    kubectl create -f jobmanager-deployment.yaml
+    kubectl create -f taskmanager-deployment.yaml
+    kubectl create -f jobmanager-service.yaml
+
+You can then access the Flink UI via `kubectl proxy`:
+
+1. Run `kubectl proxy` in a terminal
+2. Navigate to [http://localhost:8001/api/v1/proxy/namespaces/default/services/flink-jobmanager:8081](http://localhost:8001/api/v1/proxy/namespaces/default/services/flink-jobmanager:8081) in your browser
+
+### Deleting the cluster
+
+Again, use `kubectl` to delete the cluster:
+
+    kubectl delete -f jobmanager-deployment.yaml
+    kubectl delete -f taskmanager-deployment.yaml
+    kubectl delete -f jobmanager-service.yaml
+
+## Advanced Cluster Deployment
+
+An early version of a [Flink Helm chart](https://github.com/docker-flink/examples) is available on GitHub.
+
+## Appendix
+
+### Simple Kubernetes Flink cluster resources
+
+`jobmanager-deployment.yaml`
+{% highlight yaml %}
+apiVersion: extensions/v1beta1
+kind: Deployment
+metadata:
+  name: flink-jobmanager
+spec:
+  replicas: 1
+  template:
+    metadata:
+      labels:
+        app: flink
+        component: jobmanager
+    spec:
+      containers:
+      - name: jobmanager
+        image: flink:latest
+        args:
+        - jobmanager
+        ports:
+        - containerPort: 6123
+          name: rpc
+        - containerPort: 6124
+          name: blob
+        - containerPort: 6125
+          name: query
+        - containerPort: 8081
+          name: ui
+        env:
+        - name: JOB_MANAGER_RPC_ADDRESS
+          value: flink-jobmanager
+{% endhighlight %}
+
+`taskmanager-deployment.yaml`
+{% highlight yaml %}
+apiVersion: extensions/v1beta1
+kind: Deployment
+metadata:
+  name: flink-taskmanager
+spec:
+  replicas: 2
+  template:
+    metadata:
+      labels:
+        app: flink
+        component: taskmanager
+    spec:
+      containers:
+      - name: taskmanager
+        image: flink:latest
+        args:
+        - taskmanager
+        ports:
+        - containerPort: 6121
+          name: data
+        - containerPort: 6122
+          name: rpc
+        - containerPort: 6125
+          name: query
+        env:
+        - name: JOB_MANAGER_RPC_ADDRESS
+          value: flink-jobmanager
+{% endhighlight %}
+
+`jobmanager-service.yaml`
+{% highlight yaml %}
+apiVersion: v1
+kind: Service
+metadata:
+  name: flink-jobmanager
+spec:
+  ports:
+  - name: rpc
+    port: 6123
+  - name: blob
+    port: 6124
+  - name: query
+    port: 6125
+  - name: ui
+    port: 8081
+  selector:
+    app: flink
+    component: jobmanager
+{% endhighlight %}
+
+{% top %}

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/deployment/mapr_setup.md
----------------------------------------------------------------------
diff --git a/docs/ops/deployment/mapr_setup.md b/docs/ops/deployment/mapr_setup.md
new file mode 100644
index 0000000..7575bdc
--- /dev/null
+++ b/docs/ops/deployment/mapr_setup.md
@@ -0,0 +1,132 @@
+---
+title:  "MapR Setup"
+nav-title: MapR
+nav-parent_id: deployment
+nav-pos: 7
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+This documentation provides instructions on how to prepare Flink for YARN
+execution on a [MapR](https://mapr.com/) cluster.
+
+* This will be replaced by the TOC
+{:toc}
+
+## Running Flink on YARN with MapR
+
+The instructions below assume MapR version 5.2.0. They will guide you
+through submitting [Flink on YARN]({{ site.baseurl }}/ops/deployment/yarn_setup.html)
+jobs or sessions to a MapR cluster.
+
+### Building Flink for MapR
+
+In order to run Flink on MapR, Flink needs to be built with MapR's own
+Hadoop and Zookeeper distribution. Simply build Flink using Maven with
+the following command from the project root directory:
+
+```
+mvn clean install -DskipTests -Pvendor-repos,mapr \
+    -Dhadoop.version=2.7.0-mapr-1607 \
+    -Dzookeeper.version=3.4.5-mapr-1604
+```
+
+The `vendor-repos` build profile adds MapR's repository to the build so that
+MapR's Hadoop / Zookeeper dependencies can be fetched. The `mapr` build
+profile additionally resolves some dependency clashes between MapR and
+Flink, as well as ensuring that the native MapR libraries on the cluster
+nodes are used. Both profiles must be activated.
+
+By default the `mapr` profile builds with Hadoop / Zookeeper dependencies
+for MapR version 5.2.0, so you don't need to explicitly override
+the `hadoop.version` and `zookeeper.version` properties.
+For different MapR versions, simply override these properties to appropriate
+values. The corresponding Hadoop / Zookeeper distributions for each MapR version
+can be found in the MapR documentation, for example
+[here](http://maprdocs.mapr.com/home/DevelopmentGuide/MavenArtifacts.html).
+
+### Job Submission Client Setup
+
+The client submitting Flink jobs to MapR also needs to be prepared with the setup described below.
+
+Ensure that MapR's JAAS config file is picked up to avoid login failures:
+
+```
+export JVM_ARGS=-Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf
+```
+
+Make sure that the `yarn.nodemanager.resource.cpu-vcores` property is set in `yarn-site.xml`:
+
+~~~xml
+<!-- in /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/yarn-site.xml -->
+
+<configuration>
+...
+
+<property>
+    <name>yarn.nodemanager.resource.cpu-vcores</name>
+    <value>...</value>
+</property>
+
+...
+</configuration>
+~~~
+
+Also remember to set the `YARN_CONF_DIR` or `HADOOP_CONF_DIR` environment
+variables to the path where `yarn-site.xml` is located:
+
+```
+export YARN_CONF_DIR=/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/
+export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/
+```
+
+Make sure that the MapR native libraries are picked up in the classpath:
+
+```
+export FLINK_CLASSPATH=/opt/mapr/lib/*
+```
+
+If you'll be starting Flink on YARN sessions with `yarn-session.sh`, the
+following is also required:
+
+```
+export CC_CLASSPATH=/opt/mapr/lib/*
+```
+
+## Running Flink with a Secured MapR Cluster
+
+*Note: In Flink 1.2.0, Flink's Kerberos authentication for YARN execution has
+a bug that prevents it from working with MapR Security. Please upgrade to later Flink
+versions in order to use Flink with a secured MapR cluster. For more details,
+please see [FLINK-5949](https://issues.apache.org/jira/browse/FLINK-5949).*
+
+Flink's [Kerberos authentication]({{ site.baseurl }}/ops/security-kerberos.html) is independent of
+[MapR's Security authentication](http://maprdocs.mapr.com/home/SecurityGuide/Configuring-MapR-Security.html).
+With the above build procedures and environment variable setups, Flink
+does not require any additional configuration to work with MapR Security.
+
+Users simply need to log in using MapR's `maprlogin` authentication
+utility (see the example below). Users that haven't acquired MapR login credentials will not be
+able to submit Flink jobs and will see an error like:
+
+```
+java.lang.Exception: unable to establish the security context
+Caused by: o.a.f.r.security.modules.SecurityModule$SecurityInstallException: Unable to set the Hadoop login user
+Caused by: java.io.IOException: failure to login: Unable to obtain MapR credentials
+```
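+
+For example, a ticket can be acquired before submission (a sketch; the exact `maprlogin` invocation depends on your cluster's security setup):
+
+```
+# obtain a MapR ticket for the submitting user
+maprlogin password
+
+# then submit jobs or sessions as usual, e.g.
+./bin/yarn-session.sh -n 2 -jm 1024 -tm 2048
+```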

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/deployment/mesos.md
----------------------------------------------------------------------
diff --git a/docs/ops/deployment/mesos.md b/docs/ops/deployment/mesos.md
new file mode 100644
index 0000000..2fa340d
--- /dev/null
+++ b/docs/ops/deployment/mesos.md
@@ -0,0 +1,269 @@
+---
+title:  "Mesos Setup"
+nav-title: Mesos
+nav-parent_id: deployment
+nav-pos: 3
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Background
+
+The Mesos implementation consists of two components: The Application Master and
+the Worker. The workers are simple TaskManagers which are parameterized by the environment
+set up by the application master. The most sophisticated component of the Mesos
+implementation is the application master. The application master currently hosts
+the following components:
+
+### Mesos Scheduler
+
+The scheduler is responsible for registering the framework with Mesos,
+requesting resources, and launching worker nodes. The scheduler continuously
+needs to report back to Mesos to ensure the framework is in a healthy state. To
+verify the health of the cluster, the scheduler monitors the spawned workers and
+marks them as failed and restarts them if necessary.
+
+Flink's Mesos scheduler itself is currently not highly available. However, it
+persists all necessary information about its state (e.g. configuration, list of
+workers) in Zookeeper. In case of a failure, it relies on an external
+system to bring up a new scheduler. The scheduler will then register with Mesos
+again and go through the reconciliation phase. In the reconciliation phase, the
+scheduler receives a list of running worker nodes. It matches these against the
+recovered information from Zookeeper and makes sure to bring the cluster back to
+the state before the failure.
+
+### Artifact Server
+
+The artifact server is responsible for providing resources to the worker
+nodes. The resources can be anything from the Flink binaries to shared secrets
+or configuration files. For instance, in non-containerized environments, the
+artifact server will provide the Flink binaries. What files will be served
+depends on the configuration overlay used.
+
+### Flink's JobManager and Web Interface
+
+The Mesos scheduler currently resides with the JobManager but will be started
+independently of the JobManager in future versions (see
+[FLIP-6](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077)). The
+proposed changes will also add a Dispatcher component, which will be the central
+point for job submission and monitoring.
+
+### Startup script and configuration overlays
+
+The startup script provides a way to configure and start the application
+master. All further configuration is then inherited by the worker nodes. This
+is achieved using configuration overlays. Configuration overlays provide a way
+to infer configuration from environment variables and config files which are
+shipped to the worker nodes.
+
+
+## DC/OS
+
+This section refers to [DC/OS](https://dcos.io) which is a Mesos distribution
+with a sophisticated application management layer. It comes pre-installed with
+Marathon, a service to supervise applications and maintain their state in case
+of failures.
+
+If you don't have a running DC/OS cluster, please follow the
+[instructions on how to install DC/OS on the official website](https://dcos.io/install/).
+
+Once you have a DC/OS cluster, you may install Flink through the DC/OS
+Universe. In the search prompt, just search for Flink. Alternatively, you can use the DC/OS CLI:
+
+    dcos package install flink
+
+Further information can be found in the
+[DC/OS examples documentation](https://github.com/dcos/examples/tree/master/1.8/flink).
+
+
+## Mesos without DC/OS
+
+You can also run Mesos without DC/OS.
+
+### Installing Mesos
+
+Please follow the [instructions on how to setup Mesos on the official website](http://mesos.apache.org/documentation/latest/getting-started/).
+
+After installation you have to configure the set of master and agent nodes by creating the files `MESOS_HOME/etc/mesos/masters` and `MESOS_HOME/etc/mesos/slaves`.
+These files contain one hostname per line, on which the respective component will be started (assuming SSH access to these nodes).
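+
+For example, `MESOS_HOME/etc/mesos/masters` could contain (hostnames are placeholders):
+
+    master-node.example.org
+
+and `MESOS_HOME/etc/mesos/slaves`:
+
+    agent-node-1.example.org
+    agent-node-2.example.org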
+
+Next you have to create `MESOS_HOME/etc/mesos/mesos-master-env.sh` or use the template found in the same directory.
+In this file, you have to define
+
+    export MESOS_work_dir=WORK_DIRECTORY
+
+and it is recommended to uncomment
+
+    export MESOS_log_dir=LOGGING_DIRECTORY
+
+
+In order to configure the Mesos agents, you have to create `MESOS_HOME/etc/mesos/mesos-agent-env.sh` or use the template found in the same directory.
+You have to configure
+
+    export MESOS_master=MASTER_HOSTNAME:MASTER_PORT
+
+and uncomment
+
+    export MESOS_log_dir=LOGGING_DIRECTORY
+    export MESOS_work_dir=WORK_DIRECTORY
+
+#### Mesos Library
+
+In order to run Java applications with Mesos you have to export `MESOS_NATIVE_JAVA_LIBRARY=MESOS_HOME/lib/libmesos.so` on Linux.
+Under Mac OS X you have to export `MESOS_NATIVE_JAVA_LIBRARY=MESOS_HOME/lib/libmesos.dylib`.
+
+#### Deploying Mesos
+
+In order to start your Mesos cluster, use the deployment script `MESOS_HOME/sbin/mesos-start-cluster.sh`.
+In order to stop your Mesos cluster, use the deployment script `MESOS_HOME/sbin/mesos-stop-cluster.sh`.
+More information about the deployment scripts can be found [here](http://mesos.apache.org/documentation/latest/deploy-scripts/).
+
+### Installing Marathon
+
+Optionally, you may also [install Marathon](https://mesosphere.github.io/marathon/docs/) which will be necessary to run Flink in high availability (HA) mode.
+
+### Pre-installing Flink vs Docker/Mesos containers
+
+You may install Flink on all of your Mesos Master and Agent nodes.
+You can also pull the binaries from the Flink web site during deployment and apply your custom configuration before launching the application master.
+A more convenient and easier to maintain approach is to use Docker containers to manage the Flink binaries and configuration.
+
+This is controlled via the following configuration entries:
+
+    mesos.resourcemanager.tasks.container.type: mesos _or_ docker
+
+If set to 'docker', specify the image name:
+
+    mesos.resourcemanager.tasks.container.image.name: image_name
+
+
+### Standalone
+
+In the `/bin` directory of the Flink distribution, you find two startup scripts
+which manage the Flink processes in a Mesos cluster:
+
+1. `mesos-appmaster.sh`
+   This starts the Mesos application master which will register the Mesos scheduler.
+   It is also responsible for starting up the worker nodes.
+
+2. `mesos-taskmanager.sh`
+   The entry point for the Mesos worker processes.
+   You don't need to explicitly execute this script.
+   It is automatically launched by the Mesos worker node to bring up a new TaskManager.
+
+In order to run the `mesos-appmaster.sh` script you have to define `mesos.master` in the `flink-conf.yaml` or pass it via `-Dmesos.master=...` to the Java process.
+Additionally, you should define the number of task managers which are started by Mesos via `mesos.initial-tasks`.
+This value can also be defined in the `flink-conf.yaml` or passed as a Java property.
+
+When you execute `mesos-appmaster.sh`, it will create a job manager on the machine where you executed the script.
+In contrast to that, the task managers will be run as Mesos tasks in the Mesos cluster.
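+
+For example, the corresponding entries in `conf/flink-conf.yaml` could look like this (values are placeholders):
+
+    mesos.master: master.example.org:5050
+    mesos.initial-tasks: 2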
+
+#### General configuration
+
+It is possible to completely parameterize a Mesos application through Java properties passed to the Mesos application master.
+This also allows you to specify general Flink configuration parameters.
+For example:
+
+    bin/mesos-appmaster.sh \
+        -Dmesos.master=master.foobar.org:5050 \
+        -Djobmanager.heap.mb=1024 \
+        -Djobmanager.rpc.port=6123 \
+        -Djobmanager.web.port=8081 \
+        -Dmesos.initial-tasks=10 \
+        -Dmesos.resourcemanager.tasks.mem=4096 \
+        -Dtaskmanager.heap.mb=3500 \
+        -Dtaskmanager.numberOfTaskSlots=2 \
+        -Dparallelism.default=10
+
+
+### High Availability
+
+You will need to run a service like Marathon or Apache Aurora which takes care of restarting the Flink master process in case of node or process failures.
+In addition, Zookeeper needs to be configured as described in the [High Availability section of the Flink docs]({{ site.baseurl }}/ops/jobmanager_high_availability.html).
+
+For the reconciliation of tasks to work correctly, please also set `high-availability.zookeeper.path.mesos-workers` to a valid Zookeeper path.
+
+#### Marathon
+
+Marathon needs to be set up to launch the `bin/mesos-appmaster.sh` script.
+In particular, it should also adjust any configuration parameters for the Flink cluster.
+
+Here is an example configuration for Marathon:
+
+    {
+        "id": "flink",
+        "cmd": "$FLINK_HOME/bin/mesos-appmaster.sh -Djobmanager.heap.mb=1024 -Djobmanager.rpc.port=6123 -Djobmanager.web.port=8081 -Dmesos.initial-tasks=1 -Dmesos.resourcemanager.tasks.mem=1024 -Dtaskmanager.heap.mb=1024 -Dtaskmanager.numberOfTaskSlots=2 -Dparallelism.default=2 -Dmesos.resourcemanager.tasks.cpus=1",
+        "cpus": 1.0,
+        "mem": 1024
+    }
+
+When running Flink with Marathon, the whole Flink cluster including the job manager will be run as Mesos tasks in the Mesos cluster.
+
+### Configuration parameters
+
+`mesos.initial-tasks`: The initial workers to bring up when the master starts (**DEFAULT**: The number of workers specified at cluster startup).
+
+`mesos.constraints.hard.hostattribute`: Constraints for task placement on mesos based on agent attributes (**DEFAULT**: None).
+Takes a comma-separated list of key:value pairs corresponding to the attributes exposed by the target
+mesos agents.  Example: `az:eu-west-1a,series:t2`
+
+`mesos.maximum-failed-tasks`: The maximum number of failed workers before the cluster fails (**DEFAULT**: Number of initial workers).
+May be set to -1 to disable this feature.
+
+`mesos.master`: The Mesos master URL. The value should be in one of the following forms:
+
+* `host:port`
+* `zk://host1:port1,host2:port2,.../path`
+* `zk://username:password@host1:port1,host2:port2,.../path`
+* `file:///path/to/file`
+
+`mesos.failover-timeout`: The failover timeout in seconds for the Mesos scheduler, after which running tasks are automatically shut down (**DEFAULT:** 600).
+
+`mesos.resourcemanager.artifactserver.port`: The config parameter defining the Mesos artifact server port to use. Setting the port to 0 will let the OS choose an available port.
+
+`mesos.resourcemanager.framework.name`: Mesos framework name (**DEFAULT:** Flink)
+
+`mesos.resourcemanager.framework.role`: Mesos framework role definition (**DEFAULT:** *)
+
+`high-availability.zookeeper.path.mesos-workers`: The ZooKeeper root path for persisting the Mesos worker information.
+
+`mesos.resourcemanager.framework.principal`: Mesos framework principal (**NO DEFAULT**)
+
+`mesos.resourcemanager.framework.secret`: Mesos framework secret (**NO DEFAULT**)
+
+`mesos.resourcemanager.framework.user`: Mesos framework user (**DEFAULT:** "")
+
+`mesos.resourcemanager.artifactserver.ssl.enabled`: Enables SSL for the Flink artifact server (**DEFAULT**: true). Note that `security.ssl.enabled` also needs to be set to `true` to enable encryption.
+
+`mesos.resourcemanager.tasks.mem`: Memory to assign to the Mesos workers in MB (**DEFAULT**: 1024)
+
+`mesos.resourcemanager.tasks.cpus`: CPUs to assign to the Mesos workers (**DEFAULT**: 0.0)
+
+`mesos.resourcemanager.tasks.container.type`: Type of the containerization used: "mesos" or "docker" (**DEFAULT**: mesos).
+
+`mesos.resourcemanager.tasks.container.image.name`: Image name to use for the container (**NO DEFAULT**)
+
+`mesos.resourcemanager.tasks.container.volumes`: A comma-separated list of [host_path:]container_path[:RO|RW]. This allows for mounting additional volumes into your container. (**NO DEFAULT**)
+
+`mesos.resourcemanager.tasks.hostname`: Optional value to define the TaskManager's hostname. The pattern `_TASK_` is replaced by the actual id of the Mesos task. This can be used to configure the TaskManager to use Mesos DNS (e.g. `_TASK_.flink-service.mesos`) for name lookups. (**NO DEFAULT**)
+
+`mesos.resourcemanager.tasks.bootstrap-cmd`: A command which is executed before the TaskManager is started (**NO DEFAULT**).

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/deployment/yarn_setup.md
----------------------------------------------------------------------
diff --git a/docs/ops/deployment/yarn_setup.md b/docs/ops/deployment/yarn_setup.md
new file mode 100644
index 0000000..8c435f7
--- /dev/null
+++ b/docs/ops/deployment/yarn_setup.md
@@ -0,0 +1,338 @@
+---
+title:  "YARN Setup"
+nav-title: YARN
+nav-parent_id: deployment
+nav-pos: 2
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Quickstart
+
+### Start a long-running Flink cluster on YARN
+
+Start a YARN session with 4 Task Managers (each with 4 GB of heap space):
+
+~~~bash
+# get the hadoop2 package from the Flink download page at
+# {{ site.download_url }}
+curl -O <flink_hadoop2_download_url>
+tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
+cd flink-{{ site.version }}/
+./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096
+~~~
+
+Specify the `-s` flag for the number of processing slots per Task Manager. We recommend setting the number of slots to the number of processors per machine.
+
+Once the session has been started, you can submit jobs to the cluster using the `./bin/flink` tool.
+
+### Run a Flink job on YARN
+
+~~~bash
+# get the hadoop2 package from the Flink download page at
+# {{ site.download_url }}
+curl -O <flink_hadoop2_download_url>
+tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
+cd flink-{{ site.version }}/
+./bin/flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096 ./examples/batch/WordCount.jar
+~~~
+
+## Flink YARN Session
+
+Apache [Hadoop YARN](http://hadoop.apache.org/) is a cluster resource management framework. It allows running various distributed applications on top of a cluster. Flink runs on YARN next to other applications. Users do not have to set up or install anything if there is already a YARN setup.
+
+**Requirements**
+
+- at least Apache Hadoop 2.2
+- HDFS (Hadoop Distributed File System) (or another distributed file system supported by Hadoop)
+
+If you have trouble using the Flink YARN client, have a look in the [FAQ section](http://flink.apache.org/faq.html#yarn-deployment).
+
+### Start Flink Session
+
+Follow these instructions to learn how to launch a Flink Session within your YARN cluster.
+
+A session will start all required Flink services (JobManager and TaskManagers) so that you can submit programs to the cluster. Note that you can run multiple programs per session.
+
+#### Download Flink
+
+Download a Flink package for Hadoop >= 2 from the [download page]({{ site.download_url }}). It contains the required files.
+
+Extract the package using:
+
+~~~bash
+tar xvzf flink-{{ site.version }}-bin-hadoop2.tgz
+cd flink-{{ site.version }}/
+~~~
+
+#### Start a Session
+
+Use the following command to start a session
+
+~~~bash
+./bin/yarn-session.sh
+~~~
+
+This command will show you the following overview:
+
+~~~bash
+Usage:
+   Required
+     -n,--container <arg>   Number of YARN container to allocate (=Number of Task Managers)
+   Optional
+     -D <arg>                        Dynamic properties
+     -d,--detached                   Start detached
+     -jm,--jobManagerMemory <arg>    Memory for JobManager Container [in MB]
+     -nm,--name                      Set a custom name for the application on YARN
+     -q,--query                      Display available YARN resources (memory, cores)
+     -qu,--queue <arg>               Specify YARN queue.
+     -s,--slots <arg>                Number of slots per TaskManager
+     -tm,--taskManagerMemory <arg>   Memory per TaskManager Container [in MB]
+     -z,--zookeeperNamespace <arg>   Namespace to create the Zookeeper sub-paths for HA mode
+~~~
+
+Please note that the Client requires the `YARN_CONF_DIR` or `HADOOP_CONF_DIR` environment variable to be set to read the YARN and HDFS configuration.
+
+**Example:** Issue the following command to allocate 10 Task Managers, with 8 GB of memory and 32 processing slots each:
+
+~~~bash
+./bin/yarn-session.sh -n 10 -tm 8192 -s 32
+~~~
+
+The system will use the configuration in `conf/flink-conf.yaml`. Please follow our [configuration guide]({{ site.baseurl }}/ops/config.html) if you want to change something.
+
+Flink on YARN will overwrite the following configuration parameters: `jobmanager.rpc.address` (because the JobManager may be allocated to different machines), `taskmanager.tmp.dirs` (we are using the tmp directories given by YARN) and `parallelism.default` if the number of slots has been specified.
+
+If you don't want to change the configuration file to set configuration parameters, there is the option to pass dynamic properties via the `-D` flag. So you can pass parameters this way: `-Dfs.overwrite-files=true -Dtaskmanager.network.memory.min=536346624`.
+
+The example invocation starts 11 containers (even though only 10 containers were requested), since there is one additional container for the ApplicationMaster and Job Manager.
+
+Once Flink is deployed in your YARN cluster, it will show you the connection details of the Job Manager.
+
+Stop the YARN session by stopping the unix process (using CTRL+C) or by entering 'stop' into the client.
+
+Flink on YARN will only start all requested containers if enough resources are available on the cluster. Most YARN schedulers account for the requested memory of the containers,
+some also account for the number of vcores. By default, the number of vcores is equal to the processing slots (`-s`) argument. The `yarn.containers.vcores` option allows overwriting the
+number of vcores with a custom value.
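+
+For example, to request 4 vcores per container regardless of the number of slots, you could set the following in `conf/flink-conf.yaml` (the value is a placeholder):
+
+~~~
+yarn.containers.vcores: 4
+~~~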
+
+#### Detached YARN Session
+
+If you do not want to keep the Flink YARN client running all the time, it's also possible to start a *detached* YARN session.
+The parameter for that is called `-d` or `--detached`.
+
+In that case, the Flink YARN client will only submit Flink to the cluster and then close itself.
+Note that in this case it's not possible to stop the YARN session using Flink.
+
+Use the YARN utilities (`yarn application -kill <appId>`) to stop the YARN session.
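+
+For example, the following starts a detached session and stops it later via YARN (the application id is a placeholder):
+
+~~~bash
+./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096 -d
+
+# ... later, once the session is no longer needed:
+yarn application -kill application_1463870264508_0029
+~~~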
+
+#### Attach to an existing Session
+
+Use the following command to start a session
+
+~~~bash
+./bin/yarn-session.sh
+~~~
+
+This command will show you the following overview:
+
+~~~bash
+Usage:
+   Required
+     -id,--applicationId <yarnAppId> YARN application Id
+~~~
+
+As already mentioned, the `YARN_CONF_DIR` or `HADOOP_CONF_DIR` environment variable must be set to read the YARN and HDFS configuration.
+
+**Example:** Issue the following command to attach to the running Flink YARN session `application_1463870264508_0029`:
+
+~~~bash
+./bin/yarn-session.sh -id application_1463870264508_0029
+~~~
+
+Attaching to a running session uses the YARN ResourceManager to determine the JobManager RPC port.
+
+Stop the YARN session by stopping the unix process (using CTRL+C) or by entering 'stop' into the client.
+
+### Submit Job to Flink
+
+Use the following command to submit a Flink program to the YARN cluster:
+
+~~~bash
+./bin/flink
+~~~
+
+Please refer to the documentation of the [command-line client]({{ site.baseurl }}/ops/cli.html).
+
+The command will show you a help menu like this:
+
+~~~bash
+[...]
+Action "run" compiles and runs a program.
+
+  Syntax: run [OPTIONS] <jar-file> <arguments>
+  "run" action arguments:
+     -c,--class <classname>           Class with the program entry point ("main"
+                                      method or "getPlan()" method. Only needed
+                                      if the JAR file does not specify the class
+                                      in its manifest.
+     -m,--jobmanager <host:port>      Address of the JobManager (master) to
+                                      which to connect. Use this flag to connect
+                                      to a different JobManager than the one
+                                      specified in the configuration.
+     -p,--parallelism <parallelism>   The parallelism with which to run the
+                                      program. Optional flag to override the
+                                      default value specified in the
+                                      configuration
+~~~
+
+Use the *run* action to submit a job to YARN. The client is able to determine the address of the JobManager. In the rare event of a problem, you can also pass the JobManager address using the `-m` argument. The JobManager address is visible in the YARN console.
+
+**Example**
+
+~~~bash
+wget -O LICENSE-2.0.txt http://www.apache.org/licenses/LICENSE-2.0.txt
+hadoop fs -copyFromLocal LICENSE-2.0.txt hdfs:/// ...
+./bin/flink run ./examples/batch/WordCount.jar \
+        hdfs:///..../LICENSE-2.0.txt hdfs:///.../wordcount-result.txt
+~~~
+
+If there is the following error, make sure that all TaskManagers started:
+
+~~~bash
+Exception in thread "main" org.apache.flink.compiler.CompilerException:
+    Available instances could not be determined from job manager: Connection timed out.
+~~~
+
+You can check the number of TaskManagers in the JobManager web interface. The address of this interface is printed in the YARN session console.
+
+If the TaskManagers do not show up after a minute, you should investigate the issue using the log files.
+
+
+## Run a single Flink job on YARN
+
+The documentation above describes how to start a Flink cluster within a Hadoop YARN environment. It is also possible to launch Flink within YARN only for executing a single job.
+
+Please note that the client then expects the `-yn` value to be set (number of TaskManagers).
+
+***Example:***
+
+~~~bash
+./bin/flink run -m yarn-cluster -yn 2 ./examples/batch/WordCount.jar
+~~~
+
+The command line options of the YARN session are also available with the `./bin/flink` tool. They are prefixed with a `y` or `yarn` (for the long argument options).
+
+Note: You can use a different configuration directory per job by setting the environment variable `FLINK_CONF_DIR`. To use this, copy the `conf` directory from the Flink distribution and modify, for example, the logging settings on a per-job basis.
+
+Note: It is possible to combine `-m yarn-cluster` with a detached YARN submission (`-yd`) to "fire and forget" a Flink job to the YARN cluster. In this case, your application will not get any accumulator results or exceptions from the ExecutionEnvironment.execute() call!
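+
+For example, a detached per-job submission with a custom configuration directory could look like this (paths are placeholders):
+
+~~~bash
+export FLINK_CONF_DIR=/path/to/per-job-conf
+./bin/flink run -m yarn-cluster -yn 2 -yd ./examples/batch/WordCount.jar
+~~~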
+
+### User jars & Classpath
+
+By default Flink will include the user jars in the system classpath when running a single job. This behavior can be controlled with the `yarn.per-job-cluster.include-user-jar` parameter.
+
+When setting this to `DISABLED` Flink will include the jar in the user classpath instead.
+
+The user-jars position in the class path can be controlled by setting the parameter to one of the following:
+
+- `ORDER`: (default) Adds the jar to the system class path based on the lexicographic order.
+- `FIRST`: Adds the jar to the beginning of the system class path.
+- `LAST`: Adds the jar to the end of the system class path.
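+
+For example, to place the user jar at the beginning of the system class path, set the following in `conf/flink-conf.yaml`:
+
+~~~
+yarn.per-job-cluster.include-user-jar: FIRST
+~~~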
+
+## Recovery behavior of Flink on YARN
+
+Flink's YARN client has the following configuration parameters to control how to behave in case of container failures. These parameters can be set either from the `conf/flink-conf.yaml` or when starting the YARN session, using `-D` parameters.
+
+- `yarn.reallocate-failed`: This parameter controls whether Flink should reallocate failed TaskManager containers. Default: true
+- `yarn.maximum-failed-containers`: The maximum number of failed containers the ApplicationMaster accepts until it fails the YARN session. Default: The number of initially requested TaskManagers (`-n`).
+- `yarn.application-attempts`: The number of ApplicationMaster (+ its TaskManager containers) attempts. If this value is set to 1 (default), the entire YARN session will fail when the Application master fails. Higher values specify the number of restarts of the ApplicationMaster by YARN.
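+
+For example, the corresponding entries in `conf/flink-conf.yaml` might look like this (values are placeholders):
+
+~~~
+yarn.reallocate-failed: true
+yarn.maximum-failed-containers: 10
+yarn.application-attempts: 2
+~~~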
+
+## Debugging a failed YARN session
+
+There are many reasons why a Flink YARN session deployment can fail: a misconfigured Hadoop setup (HDFS permissions, YARN configuration), version incompatibilities (running Flink with vanilla Hadoop dependencies on Cloudera Hadoop), or other errors.
+
+### Log Files
+
+In cases where the Flink YARN session fails during the deployment itself, users have to rely on the logging capabilities of Hadoop YARN. The most useful feature for that is the [YARN log aggregation](http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/).
+To enable it, users have to set the `yarn.log-aggregation-enable` property to `true` in the `yarn-site.xml` file.
+Once that is enabled, users can use the following command to retrieve all log files of a (failed) YARN session.
+
+~~~
+yarn logs -applicationId <application ID>
+~~~
+
+Note that it takes a few seconds after the session has finished until the logs show up.
+
+### YARN Client console & Web interfaces
+
+The Flink YARN client also prints error messages in the terminal if errors occur during runtime (for example if a TaskManager stops working after some time).
+
+In addition to that, there is the YARN Resource Manager web interface (by default on port 8088). The port of the Resource Manager web interface is determined by the `yarn.resourcemanager.webapp.address` configuration value.
+
+It allows accessing log files for running YARN applications and shows diagnostics for failed apps.
+
+## Build YARN client for a specific Hadoop version
+
+Users using Hadoop distributions from companies like Hortonworks, Cloudera or MapR might have to build Flink against their specific versions of Hadoop (HDFS) and YARN. Please read the [build instructions]({{ site.baseurl }}/start/building.html) for more details.
+
+## Running Flink on YARN behind Firewalls
+
+Some YARN clusters use firewalls for controlling the network traffic between the cluster and the rest of the network.
+In those setups, Flink jobs can only be submitted to a YARN session from within the cluster's network (behind the firewall).
+If this is not feasible for production use, Flink allows configuring a port range for all relevant services. With these
+ranges configured, users can also submit jobs to Flink across the firewall.
+
+Currently, two services are needed to submit a job:
+
+ * The JobManager (ApplicationMaster in YARN)
+ * The BlobServer running within the JobManager.
+
+When submitting a job to Flink, the BlobServer will distribute the jars with the user code to all worker nodes (TaskManagers).
+The JobManager receives the job itself and triggers the execution.
+
+The two configuration parameters for specifying the ports are the following:
+
+ * `yarn.application-master.port`
+ * `blob.server.port`
+
+These two configuration options accept single ports (for example: "50010"), ranges ("50000-50025"), or a combination of
+both ("50010,50011,50020-50025,50050-50075").
+
+(Hadoop uses a similar mechanism; there the configuration parameter is called `yarn.app.mapreduce.am.job.client.port-range`.)
+
+## Background / Internals
+
+This section briefly describes how Flink and YARN interact.
+
+<img src="{{ site.baseurl }}/fig/FlinkOnYarn.svg" class="img-responsive">
+
+The YARN client needs to access the Hadoop configuration to connect to the YARN resource manager and to HDFS. It determines the Hadoop configuration using the following strategy:
+
+* Test if `YARN_CONF_DIR`, `HADOOP_CONF_DIR` or `HADOOP_CONF_PATH` are set (in that order). If one of these variables is set, it is used to read the configuration.
+* If the above strategy fails (this should not be the case in a correct YARN setup), the client uses the `HADOOP_HOME` environment variable. If it is set, the client tries to access `$HADOOP_HOME/etc/hadoop` (Hadoop 2) and `$HADOOP_HOME/conf` (Hadoop 1).
+
+When starting a new Flink YARN session, the client first checks if the requested resources (containers and memory) are available. After that, it uploads a jar that contains Flink and the configuration to HDFS (step 1).
+
+The next step of the client is to request (step 2) a YARN container to start the *ApplicationMaster* (step 3). Since the client registered the configuration and jar-file as a resource for the container, the NodeManager of YARN running on that particular machine will take care of preparing the container (e.g. downloading the files). Once that has finished, the *ApplicationMaster* (AM) is started.
+
+The *JobManager* and AM are running in the same container. Once they successfully started, the AM knows the address of the JobManager (its own host). It is generating a new Flink configuration file for the TaskManagers (so that they can connect to the JobManager). The file is also uploaded to HDFS. Additionally, the *AM* container is also serving Flink's web interface. All ports the YARN code is allocating are *ephemeral ports*. This allows users to execute multiple Flink YARN sessions in parallel.
+
+After that, the AM starts allocating the containers for Flink's TaskManagers, which will download the jar file and the modified configuration from HDFS. Once these steps are completed, Flink is set up and ready to accept jobs.

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/index.md
----------------------------------------------------------------------
diff --git a/docs/ops/index.md b/docs/ops/index.md
new file mode 100644
index 0000000..a2e33ad
--- /dev/null
+++ b/docs/ops/index.md
@@ -0,0 +1,25 @@
+---
+title: "Deployment & Operations"
+nav-id: ops
+nav-title: '<i class="fa fa-sliders title maindish" aria-hidden="true"></i> Deployment & Operations'
+nav-parent_id: root
+nav-pos: 6
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/jobmanager_high_availability.md
----------------------------------------------------------------------
diff --git a/docs/ops/jobmanager_high_availability.md b/docs/ops/jobmanager_high_availability.md
new file mode 100644
index 0000000..7dd7d4c
--- /dev/null
+++ b/docs/ops/jobmanager_high_availability.md
@@ -0,0 +1,239 @@
+---
+title: "JobManager High Availability (HA)"
+nav-title: High Availability (HA)
+nav-parent_id: ops
+nav-pos: 2
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+The JobManager coordinates every Flink deployment. It is responsible for both *scheduling* and *resource management*.
+
+By default, there is a single JobManager instance per Flink cluster. This creates a *single point of failure* (SPOF): if the JobManager crashes, no new programs can be submitted and running programs fail.
+
+With JobManager High Availability, you can recover from JobManager failures and thereby eliminate the *SPOF*. You can configure high availability for both **standalone** and **YARN clusters**.
+
+* Toc
+{:toc}
+
+## Standalone Cluster High Availability
+
+The general idea of JobManager high availability for standalone clusters is that there is a **single leading JobManager** at any time and **multiple standby JobManagers** to take over leadership in case the leader fails. This guarantees that there is **no single point of failure** and programs can make progress as soon as a standby JobManager has taken leadership. There is no explicit distinction between standby and master JobManager instances. Each JobManager can take the role of master or standby.
+
+As an example, consider the following setup with three JobManager instances:
+
+<img src="{{ site.baseurl }}/fig/jobmanager_ha_overview.png" class="center" />
+
+### Configuration
+
+To enable JobManager High Availability you have to set the **high-availability mode** to *zookeeper*, configure a **ZooKeeper quorum** and set up a **masters file** with all JobManager hosts and their web UI ports.
+
+Flink leverages **[ZooKeeper](http://zookeeper.apache.org)** for *distributed coordination* between all running JobManager instances. ZooKeeper is a separate service from Flink, which provides highly reliable distributed coordination via leader election and light-weight consistent state storage. Check out [ZooKeeper's Getting Started Guide](http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html) for more information about ZooKeeper. Flink includes scripts to [bootstrap a simple ZooKeeper](#bootstrap-zookeeper) installation.
+
+#### Masters File (masters)
+
+In order to start an HA-cluster configure the *masters* file in `conf/masters`:
+
+- **masters file**: The *masters file* contains all hosts, on which JobManagers are started, and the ports to which the web user interface binds.
+
+  <pre>
+jobManagerAddress1:webUIPort1
+[...]
+jobManagerAddressX:webUIPortX
+  </pre>
+
+By default, the job manager will pick a *random port* for inter-process communication. You can change this via the **`high-availability.jobmanager.port`** key. This key accepts single ports (e.g. `50010`), ranges (`50000-50025`), or a combination of both (`50010,50011,50020-50025,50050-50075`).
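+
+For example, to restrict the JobManager to a specific port range (the range is a placeholder):
+
+<pre>high-availability.jobmanager.port: 50000-50025</pre>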
+
+#### Config File (flink-conf.yaml)
+
+In order to start an HA-cluster add the following configuration keys to `conf/flink-conf.yaml`:
+
+- **high-availability mode** (required): The *high-availability mode* has to be set in `conf/flink-conf.yaml` to *zookeeper* in order to enable high availability mode.
+
+  <pre>high-availability: zookeeper</pre>
+
+- **ZooKeeper quorum** (required): A *ZooKeeper quorum* is a replicated group of ZooKeeper servers, which provide the distributed coordination service.
+
+  <pre>high-availability.zookeeper.quorum: address1:2181[,...],addressX:2181</pre>
+
+  Each *addressX:port* refers to a ZooKeeper server, which is reachable by Flink at the given address and port.
+
+- **ZooKeeper root** (recommended): The *root ZooKeeper node*, under which all cluster nodes are placed.
+
+  <pre>high-availability.zookeeper.path.root: /flink</pre>
+
+- **ZooKeeper cluster-id** (recommended): The *cluster-id ZooKeeper node*, under which all required coordination data for a cluster is placed.
+
+  <pre>high-availability.zookeeper.path.cluster-id: /default_ns # important: customize per cluster</pre>
+
+  **Important**: You should not set this value manually when running a YARN
+  cluster, a per-job YARN session, or on another cluster manager. In those
+  cases a cluster-id is automatically being generated based on the application
+  id. Manually setting a cluster-id overrides this behaviour in YARN.
+  Specifying a cluster-id with the -z CLI option, in turn, overrides manual
+  configuration. If you are running multiple Flink HA clusters on bare metal,
+  you have to manually configure separate cluster-ids for each cluster.
+
+- **Storage directory** (required): JobManager metadata is persisted in the file system *storageDir* and only a pointer to this state is stored in ZooKeeper.
+
+    <pre>
+high-availability.zookeeper.storageDir: hdfs:///flink/recovery
+    </pre>
+
+    The `storageDir` stores all metadata needed to recover from a JobManager failure.
+
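+As a sketch of the CLI override mentioned above (the job jar name is just a placeholder, and the value should match your cluster-id), a cluster-id can be passed with the `-z` option when submitting a job against a specific HA cluster:
+
+<pre>
+$ bin/flink run -z /cluster_one myJob.jar
+</pre>
+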
+After configuring the masters and the ZooKeeper quorum, you can use the provided cluster startup scripts as usual. They will start an HA-cluster. Keep in mind that the **ZooKeeper quorum has to be running** when you call the scripts and make sure to **configure a separate ZooKeeper root path** for each HA cluster you are starting.
+
+#### Example: Standalone Cluster with 2 JobManagers
+
+1. **Configure high availability mode and ZooKeeper quorum** in `conf/flink-conf.yaml`:
+
+   <pre>
+high-availability: zookeeper
+high-availability.zookeeper.quorum: localhost:2181
+high-availability.zookeeper.path.root: /flink
+high-availability.zookeeper.path.cluster-id: /cluster_one # important: customize per cluster
+high-availability.zookeeper.storageDir: hdfs:///flink/recovery</pre>
+
+2. **Configure masters** in `conf/masters`:
+
+   <pre>
+localhost:8081
+localhost:8082</pre>
+
+3. **Configure ZooKeeper server** in `conf/zoo.cfg` (currently it's only possible to run a single ZooKeeper server per machine):
+
+   <pre>server.0=localhost:2888:3888</pre>
+
+4. **Start ZooKeeper quorum**:
+
+   <pre>
+$ bin/start-zookeeper-quorum.sh
+Starting zookeeper daemon on host localhost.</pre>
+
+5. **Start an HA-cluster**:
+
+   <pre>
+$ bin/start-cluster.sh
+Starting HA cluster with 2 masters and 1 peers in ZooKeeper quorum.
+Starting jobmanager daemon on host localhost.
+Starting jobmanager daemon on host localhost.
+Starting taskmanager daemon on host localhost.</pre>
+
+6. **Stop ZooKeeper quorum and cluster**:
+
+   <pre>
+$ bin/stop-cluster.sh
+Stopping taskmanager daemon (pid: 7647) on localhost.
+Stopping jobmanager daemon (pid: 7495) on host localhost.
+Stopping jobmanager daemon (pid: 7349) on host localhost.
+$ bin/stop-zookeeper-quorum.sh
+Stopping zookeeper daemon (pid: 7101) on host localhost.</pre>
+
+## YARN Cluster High Availability
+
+When running a highly available YARN cluster, **we don't run multiple JobManager (ApplicationMaster) instances**, but only one, which is restarted by YARN on failures. The exact behaviour depends on the specific YARN version you are using.
+
+### Configuration
+
+#### Maximum Application Master Attempts (yarn-site.xml)
+
+You have to configure the maximum number of attempts for the application masters for **your** YARN setup in `yarn-site.xml`:
+
+{% highlight xml %}
+<property>
+  <name>yarn.resourcemanager.am.max-attempts</name>
+  <value>4</value>
+  <description>
+    The maximum number of application master execution attempts.
+  </description>
+</property>
+{% endhighlight %}
+
+The default for current YARN versions is 2 (meaning a single JobManager failure is tolerated).
+
+#### Application Attempts (flink-conf.yaml)
+
+In addition to the HA configuration ([see above](#configuration)), you have to configure the maximum attempts in `conf/flink-conf.yaml`:
+
+<pre>yarn.application-attempts: 10</pre>
+
+This means that the application can be restarted 10 times before YARN fails the application. It is important to note that `yarn.resourcemanager.am.max-attempts` is an upper bound for the application restarts. Therefore, the number of application attempts set within Flink cannot exceed the YARN cluster setting with which YARN was started.
+
+#### Container Shutdown Behaviour
+
+- **YARN 2.3.0 < version < 2.4.0**. All containers are restarted if the application master fails.
+- **YARN 2.4.0 < version < 2.6.0**. TaskManager containers are kept alive across application master failures. This has the advantage that the startup time is faster and that the user does not have to wait for the container resources to be obtained again.
+- **YARN 2.6.0 <= version**: Sets the attempt failure validity interval to Flink's Akka timeout value. The attempt failure validity interval means that an application is only killed after the system has seen the maximum number of application attempts during one such interval. This prevents a long-running job from depleting its application attempts.
+
+<p style="border-radius: 5px; padding: 5px" class="bg-danger"><b>Note</b>: Hadoop YARN 2.4.0 has a major bug (fixed in 2.5.0) preventing container restarts from a restarted Application Master/Job Manager container. See <a href="https://issues.apache.org/jira/browse/FLINK-4142">FLINK-4142</a> for details. We recommend using at least Hadoop 2.5.0 for high availability setups on YARN.</p>
+
+#### Example: Highly Available YARN Session
+
+1. **Configure HA mode and ZooKeeper quorum** in `conf/flink-conf.yaml`:
+
+   <pre>
+high-availability: zookeeper
+high-availability.zookeeper.quorum: localhost:2181
+high-availability.zookeeper.storageDir: hdfs:///flink/recovery
+high-availability.zookeeper.path.root: /flink
+yarn.application-attempts: 10</pre>
+
+2. **Configure ZooKeeper server** in `conf/zoo.cfg` (currently it's only possible to run a single ZooKeeper server per machine):
+
+   <pre>server.0=localhost:2888:3888</pre>
+
+3. **Start ZooKeeper quorum**:
+
+   <pre>
+$ bin/start-zookeeper-quorum.sh
+Starting zookeeper daemon on host localhost.</pre>
+
+4. **Start an HA-cluster**:
+
+   <pre>
+$ bin/yarn-session.sh -n 2</pre>
+
+## Configuring for Zookeeper Security
+
+If ZooKeeper is running in secure mode with Kerberos, you can override the following configurations in `flink-conf.yaml` as necessary:
+
+<pre>
+zookeeper.sasl.service-name: zookeeper     # default is "zookeeper". If the ZooKeeper quorum is configured
+                                           # with a different service name then it can be supplied here.
+zookeeper.sasl.login-context-name: Client  # default is "Client". The value needs to match one of the values
+                                           # configured in "security.kerberos.login.contexts".
+</pre>
+
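+For example (a sketch; the exact contexts depend on your Kerberos setup), the matching login contexts might be configured as:
+
+<pre>
+security.kerberos.login.contexts: Client,KafkaClient
+</pre>
+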
+For more information on Flink configuration for Kerberos security, please see [here]({{ site.baseurl}}/ops/config.html).
+You can also find [here]({{ site.baseurl}}/ops/security-kerberos.html) further details on how Flink internally sets up Kerberos-based security.
+
+## Bootstrap ZooKeeper
+
+If you don't have a running ZooKeeper installation, you can use the helper scripts that ship with Flink.
+
+There is a ZooKeeper configuration template in `conf/zoo.cfg`. You can configure the hosts to run ZooKeeper on with the `server.X` entries, where X is a unique ID of each server:
+
+<pre>
+server.X=addressX:peerPort:leaderPort
+[...]
+server.Y=addressY:peerPort:leaderPort
+</pre>
+
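+For example, a three-node quorum could be configured as follows (the hostnames are placeholders):
+
+<pre>
+server.1=zkhost1:2888:3888
+server.2=zkhost2:2888:3888
+server.3=zkhost3:2888:3888
+</pre>
+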
+The script `bin/start-zookeeper-quorum.sh` will start a ZooKeeper server on each of the configured hosts. The servers are started via a Flink wrapper, which reads the configuration from `conf/zoo.cfg` and sets some required configuration values for convenience. In production setups, it is recommended to manage your own ZooKeeper installation.

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/production_ready.md
----------------------------------------------------------------------
diff --git a/docs/ops/production_ready.md b/docs/ops/production_ready.md
index 2cce8d0..c58ce5b 100644
--- a/docs/ops/production_ready.md
+++ b/docs/ops/production_ready.md
@@ -1,7 +1,7 @@
 ---
 title: "Production Readiness Checklist"
-nav-parent_id: setup
-nav-pos: 20
+nav-parent_id: ops
+nav-pos: 5
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
@@ -64,7 +64,7 @@ parallelism as a function of the parallelism when the job is first started:
 
 ### Set UUIDs for operators
 
-As mentioned in the documentation for [savepoints]({{ site.baseurl }}/setup/savepoints.html), users should set uids for
+As mentioned in the documentation for [savepoints]({{ site.baseurl }}/ops/state/savepoints.html), users should set uids for
 operators. Those operator uids are important for Flink's mapping of operator states to operators which, in turn, is 
 essential for savepoints. By default operator uids are generated by traversing the JobGraph and hashing certain operator 
 properties. While this is comfortable from a user perspective, it is also very fragile, as changes to the JobGraph (e.g.

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/security-kerberos.md
----------------------------------------------------------------------
diff --git a/docs/ops/security-kerberos.md b/docs/ops/security-kerberos.md
index 3e5cad9..eac72f1 100644
--- a/docs/ops/security-kerberos.md
+++ b/docs/ops/security-kerberos.md
@@ -1,6 +1,6 @@
 ---
 title:  "Kerberos Authentication Setup and Configuration"
-nav-parent_id: setup
+nav-parent_id: ops
 nav-pos: 10
 nav-title: Kerberos
 ---
@@ -83,7 +83,7 @@ Here is some information specific to each deployment mode.
 
 Steps to run a secure Flink cluster in standalone/cluster mode:
 
-1. Add security-related configuration options to the Flink configuration file (on all cluster nodes) (see [here]({{site.baseurl}}/setup/config.html#kerberos-based-security)).
+1. Add security-related configuration options to the Flink configuration file (on all cluster nodes) (see [here](config.html#kerberos-based-security)).
 2. Ensure that the keytab file exists at the path indicated by `security.kerberos.login.keytab` on all cluster nodes.
 3. Deploy Flink cluster as normal.
 
@@ -91,7 +91,7 @@ Steps to run a secure Flink cluster in standalone/cluster mode:
 
 Steps to run a secure Flink cluster in YARN/Mesos mode:
 
-1. Add security-related configuration options to the Flink configuration file on the client (see [here]({{site.baseurl}}/setup/config.html#kerberos-based-security)).
+1. Add security-related configuration options to the Flink configuration file on the client (see [here](config.html#kerberos-based-security)).
 2. Ensure that the keytab file exists at the path as indicated by `security.kerberos.login.keytab` on the client node.
 3. Deploy Flink cluster as normal.
 
@@ -107,7 +107,7 @@ The main drawback is that the cluster is necessarily short-lived since the gener
 
 Steps to run a secure Flink cluster using `kinit`:
 
-1. Add security-related configuration options to the Flink configuration file on the client (see [here]({{site.baseurl}}/setup/config.html#kerberos-based-security)).
+1. Add security-related configuration options to the Flink configuration file on the client (see [here](config.html#kerberos-based-security)).
 2. Login using the `kinit` command.
 3. Deploy Flink cluster as normal.
 

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/security-ssl.md
----------------------------------------------------------------------
diff --git a/docs/ops/security-ssl.md b/docs/ops/security-ssl.md
new file mode 100644
index 0000000..7c7268a
--- /dev/null
+++ b/docs/ops/security-ssl.md
@@ -0,0 +1,144 @@
+---
+title: "SSL Setup"
+nav-parent_id: ops
+nav-pos: 10
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+This page provides instructions on how to enable SSL for the network communication between different Flink components.
+
+## SSL Configuration
+
+SSL can be enabled for all network communication between Flink components. SSL keystores and a truststore have to be deployed on each Flink node and configured in `conf/flink-conf.yaml` using keys in the `security.ssl.*` namespace (please see the [configuration page](config.html) for details). SSL can be selectively enabled or disabled for different transports using the following flags; an example follows the list. These flags are only applicable when `security.ssl.enabled` is set to `true`.
+
+* **taskmanager.data.ssl.enabled**: SSL flag for data communication between TaskManagers
+* **blob.service.ssl.enabled**: SSL flag for blob service client/server communication
+* **akka.ssl.enabled**: SSL flag for the Akka-based control connection between the Flink client, JobManager and TaskManager
+* **jobmanager.web.ssl.enabled**: Flag to enable https access to the JobManager's web frontend
+
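+As a minimal sketch (in addition to the keystore and truststore settings described below), enabling SSL globally while leaving the data communication between TaskManagers unencrypted could look like this:
+
+~~~
+security.ssl.enabled: true
+taskmanager.data.ssl.enabled: false
+~~~
+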
+## Deploying Keystores and Truststores
+
+You need to have a Java keystore generated and copied to each node in the Flink cluster. The common name or subject alternative names in the certificate should match the node's hostname and IP address. Keystores and truststores can be generated using the [keytool utility](https://docs.oracle.com/javase/8/docs/technotes/tools/unix/keytool.html). All Flink components should have read access to the keystore and truststore files.
+
+### Example: Creating a self-signed CA and keystores for a 2-node cluster
+
+Execute the following keytool commands to create a truststore with a self-signed CA:
+
+~~~
+keytool -genkeypair -alias ca -keystore ca.keystore -dname "CN=Sample CA" -storepass password -keypass password -keyalg RSA -ext bc=ca:true
+keytool -keystore ca.keystore -storepass password -alias ca -exportcert > ca.cer
+keytool -importcert -keystore ca.truststore -alias ca -storepass password -noprompt -file ca.cer
+~~~
+
+Now create keystores for each node with certificates signed by the above CA. Let node1.company.org and node2.company.org be the hostnames with IPs 192.168.1.1 and 192.168.1.2 respectively:
+
+#### Node 1
+~~~
+keytool -genkeypair -alias node1 -keystore node1.keystore -dname "CN=node1.company.org" -ext SAN=dns:node1.company.org,ip:192.168.1.1 -storepass password -keypass password -keyalg RSA
+keytool -certreq -keystore node1.keystore -storepass password -alias node1 -file node1.csr
+keytool -gencert -keystore ca.keystore -storepass password -alias ca -ext SAN=dns:node1.company.org,ip:192.168.1.1 -infile node1.csr -outfile node1.cer
+keytool -importcert -keystore node1.keystore -storepass password -file ca.cer -alias ca -noprompt
+keytool -importcert -keystore node1.keystore -storepass password -file node1.cer -alias node1 -noprompt
+~~~
+
+#### Node 2
+~~~
+keytool -genkeypair -alias node2 -keystore node2.keystore -dname "CN=node2.company.org" -ext SAN=dns:node2.company.org,ip:192.168.1.2 -storepass password -keypass password -keyalg RSA
+keytool -certreq -keystore node2.keystore -storepass password -alias node2 -file node2.csr
+keytool -gencert -keystore ca.keystore -storepass password -alias ca -ext SAN=dns:node2.company.org,ip:192.168.1.2 -infile node2.csr -outfile node2.cer
+keytool -importcert -keystore node2.keystore -storepass password -file ca.cer -alias ca -noprompt
+keytool -importcert -keystore node2.keystore -storepass password -file node2.cer -alias node2 -noprompt
+~~~
+
+## Standalone Deployment
+Configure each node in the standalone cluster to pick up the keystore and truststore files present in the local file system.
+
+### Example: 2-node cluster
+
+* Generate 2 keystores, one for each node, and copy them to the filesystem on the respective node. Also copy the public key of the CA (which was used to sign the certificates in the keystore) as a Java truststore to both nodes
+* Configure `conf/flink-conf.yaml` to pick up these files
+
+#### Node 1
+~~~
+security.ssl.enabled: true
+security.ssl.keystore: /usr/local/node1.keystore
+security.ssl.keystore-password: abc123
+security.ssl.key-password: abc123
+security.ssl.truststore: /usr/local/ca.truststore
+security.ssl.truststore-password: abc123
+~~~
+
+#### Node 2
+~~~
+security.ssl.enabled: true
+security.ssl.keystore: /usr/local/node2.keystore
+security.ssl.keystore-password: abc123
+security.ssl.key-password: abc123
+security.ssl.truststore: /usr/local/ca.truststore
+security.ssl.truststore-password: abc123
+~~~
+
+* Restart the Flink components to enable SSL for all of Flink's internal communication
+* Verify by accessing the JobManager's UI using the https URL. The TaskManager's path in the UI should show akka.ssl.tcp:// as the protocol
+* The blob server and the TaskManagers' data communication can be verified from the log files
+
+## YARN Deployment
+The keystores and truststore can be deployed in a YARN setup in multiple ways depending on the cluster setup. The following are two ways to achieve this:
+
+### 1. Deploy keystores before starting the YARN session
+The keystores and truststore should be generated and deployed on all nodes in the YARN setup where Flink components can potentially be executed. The same Flink config file from the Flink YARN client is used for all Flink components running in the YARN cluster. Therefore we need to ensure that the keystore is deployed and accessible under the same file path on all YARN nodes.
+
+#### Example config
+~~~
+security.ssl.enabled: true
+security.ssl.keystore: /usr/local/node.keystore
+security.ssl.keystore-password: abc123
+security.ssl.key-password: abc123
+security.ssl.truststore: /usr/local/ca.truststore
+security.ssl.truststore-password: abc123
+~~~
+
+Now you can start the YARN session from the CLI as you normally would.
+
+### 2. Use YARN cli to deploy the keystores and truststore
+We can use the YARN client's ship files option (`-yt`) to distribute the keystores and truststore. Since the same keystore will be deployed on all nodes, we need to ensure that a single certificate in the keystore can be served for all nodes. This can be done by either using the Subject Alternative Name (SAN) extension in the certificate and setting it to cover all nodes (hostnames and IP addresses) in the cluster, or by using wildcard subdomain names (if the cluster is set up accordingly).
+
+#### Example
+* Supply the following parameters to the keytool command when generating the keystore: `-ext SAN=dns:node1.company.org,ip:192.168.1.1,dns:node2.company.org,ip:192.168.1.2`
+* Copy the keystore and the CA's truststore into a local directory (relative to the CLI's working directory), say `deploy-keys/`
+* Update the configuration to pick up the files from a relative path
+
+~~~
+security.ssl.enabled: true
+security.ssl.keystore: deploy-keys/node.keystore
+security.ssl.keystore-password: password
+security.ssl.key-password: password
+security.ssl.truststore: deploy-keys/ca.truststore
+security.ssl.truststore-password: password
+~~~
+
+* Start the YARN session using the -yt parameter
+
+~~~
+flink run -m yarn-cluster -yt deploy-keys/ TestJob.jar
+~~~
+
+When deployed using YARN, Flink's web dashboard is accessible through the YARN proxy's Tracking URL. To ensure that the YARN proxy is able to access Flink's https URL, you need to configure the YARN proxy to accept Flink's SSL certificates. Add the custom CA certificate into Java's default truststore on the YARN proxy node.
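+
+As a sketch (the cacerts location and the default password `changeit` depend on your Java installation, and the alias is arbitrary), this can be done with keytool on the YARN proxy node:
+
+~~~
+keytool -importcert -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit -alias flink-ca -noprompt -file ca.cer
+~~~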
+

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/state/checkpoints.md
----------------------------------------------------------------------
diff --git a/docs/ops/state/checkpoints.md b/docs/ops/state/checkpoints.md
new file mode 100644
index 0000000..4f2a9da
--- /dev/null
+++ b/docs/ops/state/checkpoints.md
@@ -0,0 +1,101 @@
+---
+title: "Checkpoints"
+nav-parent_id: ops_state
+nav-pos: 7
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+
+* toc
+{:toc}
+
+## Overview
+
+Checkpoints make state in Flink fault tolerant by allowing state and the
+corresponding stream positions to be recovered, thereby giving the application
+the same semantics as a failure-free execution.
+
+See [Checkpointing](../../dev/stream/state/checkpointing.html) for how to enable and
+configure checkpoints for your program.
+
+## Externalized Checkpoints
+
+Checkpoints are by default not persisted externally and are only used to
+resume a job from failures. They are deleted when a program is cancelled.
+You can, however, configure periodic checkpoints to be persisted externally
+similarly to [savepoints](savepoints.html). These *externalized checkpoints*
+write their meta data out to persistent storage and are *not* automatically
+cleaned up when the job fails. This way, you will have a checkpoint around
+to resume from if your job fails.
+
+```java
+CheckpointConfig config = env.getCheckpointConfig();
+config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
+```
+
+The `ExternalizedCheckpointCleanup` mode configures what happens with externalized checkpoints when you cancel the job:
+
+- **`ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION`**: Retain the externalized checkpoint when the job is cancelled. Note that you have to manually clean up the checkpoint state after cancellation in this case.
+
+- **`ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION`**: Delete the externalized checkpoint when the job is cancelled. The checkpoint state will only be available if the job fails.
+
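+Putting this together, a minimal sketch of a job setup that enables periodic checkpointing and retains externalized checkpoints on cancellation (the 60 second interval is only an example) could look as follows:
+
+```java
+// assumes the usual imports, e.g. StreamExecutionEnvironment and
+// CheckpointConfig.ExternalizedCheckpointCleanup
+StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
+
+// trigger a checkpoint every 60 seconds
+env.enableCheckpointing(60000);
+
+// retain the externalized checkpoint if the job is cancelled
+env.getCheckpointConfig().enableExternalizedCheckpoints(
+        ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
+```
+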
+### Directory Structure
+
+Similarly to [savepoints](savepoints.html), an externalized checkpoint consists
+of a meta data file and, depending on the state back-end, some additional data
+files. The **target directory** for the externalized checkpoint's meta data is
+determined from the configuration key `state.checkpoints.dir` which, currently,
+can only be set via the configuration files.
+
+```
+state.checkpoints.dir: hdfs:///checkpoints/
+```
+
+This directory will then contain the checkpoint meta data required to restore
+the checkpoint. For the `MemoryStateBackend`, this meta data file will be
+self-contained and no further files are needed.
+
+`FsStateBackend` and `RocksDBStateBackend` write separate data files
+and only write the paths to these files into the meta data file. These data
+files are stored at the path given to the state back-end during construction.
+
+```java
+env.setStateBackend(new RocksDBStateBackend("hdfs:///checkpoints-data/"));
+```
+
+### Difference to Savepoints
+
+Externalized checkpoints have a few differences from [savepoints](savepoints.html). They
+
+- use a state backend-specific (low-level) data format,
+- may be incremental,
+- do not support Flink-specific features like rescaling.
+
+### Resuming from an externalized checkpoint
+
+A job may be resumed from an externalized checkpoint just as from a savepoint
+by using the checkpoint's meta data file instead (see the
+[savepoint restore guide](../cli.html#restore-a-savepoint)). Note that if the
+meta data file is not self-contained, the jobmanager needs to have access to
+the data files it refers to (see [Directory Structure](#directory-structure)
+above).
+
+```sh
+$ bin/flink run -s :checkpointMetaDataPath [:runArgs]
+```

http://git-wip-us.apache.org/repos/asf/flink/blob/47070674/docs/ops/state/index.md
----------------------------------------------------------------------
diff --git a/docs/ops/state/index.md b/docs/ops/state/index.md
new file mode 100644
index 0000000..8725f87
--- /dev/null
+++ b/docs/ops/state/index.md
@@ -0,0 +1,24 @@
+---
+nav-title: 'State & Fault Tolerance'
+nav-id: ops_state
+nav-parent_id: ops
+nav-pos: 3
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->