Posted to commits@flink.apache.org by se...@apache.org on 2017/05/06 17:47:53 UTC

[09/12] flink git commit: [FLINK-6443] [docs] Add more links to concepts docs

[FLINK-6443] [docs] Add more links to concepts docs

This closes #3822


Project: http://git-wip-us.apache.org/repos/asf/flink/repo
Commit: http://git-wip-us.apache.org/repos/asf/flink/commit/b01d737a
Tree: http://git-wip-us.apache.org/repos/asf/flink/tree/b01d737a
Diff: http://git-wip-us.apache.org/repos/asf/flink/diff/b01d737a

Branch: refs/heads/master
Commit: b01d737ae1452dbfafd4696ff14d52dce5b60efd
Parents: 6c48f9b
Author: David Anderson <da...@alpinegizmo.com>
Authored: Thu May 4 11:27:33 2017 +0200
Committer: Stephan Ewen <se...@apache.org>
Committed: Sat May 6 19:41:53 2017 +0200

----------------------------------------------------------------------
 docs/concepts/programming-model.md | 22 +++++++++++++++++-----
 docs/concepts/runtime.md           | 16 ++++++++--------
 2 files changed, 25 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/flink/blob/b01d737a/docs/concepts/programming-model.md
----------------------------------------------------------------------
diff --git a/docs/concepts/programming-model.md b/docs/concepts/programming-model.md
index 3d2aebb..d83cf00 100644
--- a/docs/concepts/programming-model.md
+++ b/docs/concepts/programming-model.md
@@ -48,7 +48,7 @@ Flink offers different levels of abstraction to develop streaming/batch applicat
     for certain operations only. The *DataSet API* offers additional primitives on bounded data sets, like loops/iterations.
 
   - The **Table API** is a declarative DSL centered around *tables*, which may be dynamically changing tables (when representing streams).
-    The Table API follows the (extended) relational model: Tables have a schema attached (similar to tables in relational databases)
+    The [Table API](../dev/table_api.html) follows the (extended) relational model: Tables have a schema attached (similar to tables in relational databases)
     and the API offers comparable operations, such as select, project, join, group-by, aggregate, etc.
     Table API programs declaratively define *what logical operation should be done* rather than specifying exactly
    *how the code for the operation looks*. Though the Table API is extensible by various types of user-defined
@@ -60,7 +60,7 @@ Flink offers different levels of abstraction to develop streaming/batch applicat
 
   - The highest level abstraction offered by Flink is **SQL**. This abstraction is similar to the *Table API* both in semantics and
     expressiveness, but represents programs as SQL query expressions.
-    The SQL abstraction closely interacts with the Table API, and SQL queries can be executed over tables defined in the *Table API*.
+    The [SQL](../dev/table_api.html#sql) abstraction closely interacts with the Table API, and SQL queries can be executed over tables defined in the *Table API*.
 
 
 ## Programs and Dataflows
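
To make the two declarative layers above concrete, here is a minimal Java sketch of the same aggregation written once against the Table API and once as a SQL query. It assumes the flink-table API of this era (where the query method is tEnv.sql(); later versions renamed it) and a table "Orders" with fields user and amount registered beforehand; all names are illustrative.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;
    import org.apache.flink.table.api.java.StreamTableEnvironment;

    public class DeclarativeLayers {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

            // Table API: declare *what* should be computed, not *how*
            Table totals = tEnv.scan("Orders")   // assumes "Orders" was registered beforehand
                .groupBy("user")
                .select("user, amount.sum as total");

            // SQL: the same logical operation over the same registered table
            Table totalsSql = tEnv.sql("SELECT user, SUM(amount) AS total FROM Orders GROUP BY user");
        }
    }
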
@@ -81,6 +81,9 @@ arbitrary **directed acyclic graphs** *(DAGs)*. Although special forms of cycles
 Often there is a one-to-one correspondence between the transformations in the programs and the operators
 in the dataflow. Sometimes, however, one transformation may consist of multiple transformation operators.
 
+Sources and sinks are documented in the [streaming connectors](../dev/connectors/index.html) and [batch connectors](../dev/batch/connectors.html) docs.
+Transformations are documented in [DataStream transformations](../dev/datastream_api.html#datastream-transformations) and [DataSet transformations](../dev/batch/dataset_transformations.html).
+
 {% top %}
 
 ## Parallel Dataflows
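
The source -> transformation -> sink structure described in the dataflow section above maps directly onto a minimal DataStream program. A short Java sketch (the host and port are placeholders):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SimpleDataflow {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            DataStream<String> lines = env.socketTextStream("localhost", 9999); // source
            DataStream<String> upper = lines.map(line -> line.toUpperCase());   // transformation
            upper.print();                                                      // sink

            env.execute("source -> transformation -> sink");
        }
    }
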
@@ -112,6 +115,8 @@ Streams can transport data between two operators in a *one-to-one* (or *forwardi
     is preserved, but the parallelism does introduce non-determinism regarding the order in
     which the aggregated results for different keys arrive at the sink.
 
+Details about configuring and controlling parallelism can be found in the docs on [parallel execution](../dev/parallel.html).
+
 {% top %}
 
 ## Windows
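
For the parallel dataflows section above, a short Java sketch of how parallelism is set, both as a job-wide default and per operator (the values are illustrative):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ParallelismExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4); // default parallelism for all operators of this job

            env.socketTextStream("localhost", 9999)   // this source runs at parallelism 1
                .map(line -> line.toUpperCase())
                .setParallelism(8)                    // per-operator override
                .print()
                .setParallelism(1);                   // e.g. a sink that must be sequential

            env.execute("parallelism example");
        }
    }
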
@@ -128,6 +133,7 @@ One typically distinguishes different types of windows, such as *tumbling window
 <img src="../fig/windows.svg" alt="Time- and Count Windows" class="offset" width="80%" />
 
 More window examples can be found in this [blog post](https://flink.apache.org/news/2015/12/04/Introducing-windows.html).
+More details are in the [window docs](../dev/windows.html).
 
 {% top %}
 
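A minimal Java sketch of a tumbling time window over a keyed stream, as described in the windows section above (the field positions and window size are illustrative):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class TumblingWindowExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3))
                .keyBy(0)                        // key by the first tuple field
                .timeWindow(Time.seconds(5))     // tumbling 5-second windows
                .sum(1)                          // aggregate the second field per key and window
                .print();

            env.execute("tumbling time window");
        }
    }
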
@@ -165,6 +171,8 @@ This alignment also allows Flink to redistribute the state and adjust the stream
 
 <img src="../fig/state_partitioning.svg" alt="State and Partitioning" class="offset" width="50%" />
 
+For more information, see the documentation on [working with state](../dev/stream/state.html).
+
 {% top %}
 
 ## Checkpoints for Fault Tolerance
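
A short Java sketch of the keyed state described above: a RichFlatMapFunction holding a per-key counter in Flink-managed ValueState (the function and state names are illustrative). It would be applied to a keyed stream, e.g. stream.keyBy(...).flatMap(new CountPerKey()).

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    // Counts events per key using Flink-managed keyed state.
    public class CountPerKey extends RichFlatMapFunction<String, Long> {

        private transient ValueState<Long> count;

        @Override
        public void open(Configuration parameters) {
            count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
        }

        @Override
        public void flatMap(String value, Collector<Long> out) throws Exception {
            Long current = count.value();        // scoped to the current key
            long next = (current == null) ? 1L : current + 1L;
            count.update(next);                  // snapshotted as part of checkpoints
            out.collect(next);
        }
    }
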
@@ -178,17 +186,21 @@ point of the checkpoint.
 The checkpoint interval is a means of trading off the overhead of fault tolerance during execution with the recovery time (the number
 of events that need to be replayed).
 
-More details on checkpoints and fault tolerance are in the [fault tolerance docs]({{ site.baseurl }}/internals/stream_checkpointing.html).
+The description of the [fault tolerance internals]({{ site.baseurl }}/internals/stream_checkpointing.html) provides
+more information about how Flink manages checkpoints and related topics.
+Details about enabling and configuring checkpointing are in the [checkpointing API docs](../dev/stream/checkpointing.html).
+
 
 {% top %}
 
 ## Batch on Streaming
 
-Flink executes batch programs as a special case of streaming programs, where the streams are bounded (finite number of elements).
+Flink executes [batch programs](../dev/batch/index.html) as a special case of streaming programs, where the streams are bounded (finite number of elements).
 A *DataSet* is treated internally as a stream of data. The concepts above thus apply to batch programs in the
 same way as they apply to streaming programs, with minor exceptions:
 
-  - Programs in the DataSet API do not use checkpoints. Recovery happens by fully replaying the streams.
+  - [Fault tolerance for batch programs](../dev/batch/fault_tolerance.html) does not use checkpointing.
+    Recovery happens by fully replaying the streams.
     That is possible, because inputs are bounded. This pushes the cost more towards the recovery,
     but makes the regular processing cheaper, because it avoids checkpoints.
 
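A minimal Java sketch of enabling checkpointing, as covered by the checkpointing API docs linked above (the interval is illustrative):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointConfigExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // take a checkpoint every 10 seconds; the interval trades
            // checkpointing overhead against replay time on recovery
            env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

            // ... build and execute the program as usual ...
        }
    }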

http://git-wip-us.apache.org/repos/asf/flink/blob/b01d737a/docs/concepts/runtime.md
----------------------------------------------------------------------
diff --git a/docs/concepts/runtime.md b/docs/concepts/runtime.md
index 0d4e017..c598b12 100644
--- a/docs/concepts/runtime.md
+++ b/docs/concepts/runtime.md
@@ -31,7 +31,7 @@ under the License.
 For distributed execution, Flink *chains* operator subtasks together into *tasks*. Each task is executed by one thread.
 Chaining operators together into tasks is a useful optimization: it reduces the overhead of thread-to-thread
 handover and buffering, and increases overall throughput while decreasing latency.
-The chaining behavior can be configured in the APIs.
+The chaining behavior can be configured; see the [chaining docs](../dev/datastream_api.html#task-chaining-and-resource-groups) for details.
 
 The sample dataflow in the figure below is executed with five subtasks, and hence with five parallel threads.
 
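A short Java sketch of the per-operator chaining controls mentioned above (the operators themselves are placeholders):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ChainingExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // env.disableOperatorChaining();   // would turn chaining off for the whole job

            env.socketTextStream("localhost", 9999)
                .map(line -> line.trim())
                .startNewChain()                // start a new chain at this operator
                .filter(line -> !line.isEmpty())
                .disableChaining()              // keep this operator out of any chain
                .print();

            env.execute("chaining control");
        }
    }
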
@@ -54,9 +54,9 @@ The Flink runtime consists of two types of processes:
 
     There must always be at least one TaskManager.
 
-The JobManagers and TaskManagers can be started in various ways: directly on the machines, in
-containers, or managed by resource frameworks like YARN. TaskManagers connect to JobManagers, announcing
-themselves as available, and are assigned work.
+The JobManagers and TaskManagers can be started in various ways: directly on the machines as a [standalone cluster](../setup/cluster_setup.html), in
+containers, or managed by resource frameworks like [YARN](../setup/yarn_setup.html) or [Mesos](../setup/mesos.html).
+TaskManagers connect to JobManagers, announcing themselves as available, and are assigned work.
 
 The **client** is not part of the runtime and program execution, but is used to prepare and send a dataflow to the JobManager.
 After that, the client can disconnect, or stay connected to receive progress reports. The client runs either as part of the
@@ -98,7 +98,7 @@ job. Allowing this *slot sharing* has two main benefits:
 
 <img src="../fig/slot_sharing.svg" alt="TaskManagers with shared Task Slots" class="offset" width="80%" />
 
-The APIs also include a *resource group* mechanism which can be used to prevent undesirable slot sharing. 
+The APIs also include a *[resource group](../dev/datastream_api.html#task-chaining-and-resource-groups)* mechanism which can be used to prevent undesirable slot sharing. 
 
 As a rule-of-thumb, a good default number of task slots would be the number of CPU cores.
 With hyper-threading, each slot then takes 2 or more hardware thread contexts.
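
A minimal Java sketch of assigning an operator to a named slot sharing group (the group name and pipeline are illustrative). Operators in different groups cannot share a slot, and downstream operators inherit their input's group unless overridden:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotSharingGroupExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.socketTextStream("localhost", 9999)
                .map(line -> line.toUpperCase())
                .slotSharingGroup("preprocessing")  // isolate this stage in its own slots
                .print();                           // the sink inherits the group

            env.execute("slot sharing group example");
        }
    }
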
@@ -107,7 +107,7 @@ With hyper-threading, each slot then takes 2 or more hardware thread contexts.
 
 ## State Backends
 
-The exact data structures in which the key/values indexes are stored depends on the chosen **state backend**. One state backend
+The exact data structures in which the key/value indexes are stored depend on the chosen [state backend](../ops/state_backends.html). One state backend
 stores data in an in-memory hash map, another state backend uses [RocksDB](http://rocksdb.org) as the key/value store.
 In addition to defining the data structure that holds the state, the state backends also implement the logic to
 take a point-in-time snapshot of the key/value state and store that snapshot as part of a checkpoint.
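
A short Java sketch of selecting a state backend, assuming the filesystem-backed FsStateBackend described in the state backends docs; the checkpoint URI is a placeholder:

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StateBackendExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // keep working state on the heap, write snapshots to a file system
            env.setStateBackend(new FsStateBackend("hdfs://namenode:40010/flink/checkpoints"));
        }
    }
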
@@ -120,8 +120,8 @@ take a point-in-time snapshot of the key/value state and store that snapshot as
 
 Programs written in the DataStream API can resume execution from a **savepoint**. Savepoints allow updating both your programs and your Flink cluster without losing any state.
 
-Savepoints are **manually triggered checkpoints**, which take a snapshot of the program and write it out to a state backend. They rely on the regular checkpointing mechanism for this. During execution programs are periodically snapshotted on the worker nodes and produce checkpoints. For recovery only the last completed checkpoint is needed and older checkpoints can be safely discarded as soon as a new one is completed.
+[Savepoints](../setup/savepoints.html) are **manually triggered checkpoints**, which take a snapshot of the program and write it out to a state backend. They rely on the regular checkpointing mechanism for this. During execution, programs are periodically snapshotted on the worker nodes, producing checkpoints. For recovery, only the last completed checkpoint is needed, and older checkpoints can be safely discarded as soon as a new one is completed.
 
-Savepoints are similar to these periodic checkpoints except that they are **triggered by the user** and **don't automatically expire** when newer checkpoints are completed.
+Savepoints are similar to these periodic checkpoints except that they are **triggered by the user** and **don't automatically expire** when newer checkpoints are completed. Savepoints can be created from the [command line](../setup/cli.html#savepoints) or when cancelling a job via the [REST API](../monitoring/rest_api.html#cancel-job-with-savepoint).
 
 {% top %}