Posted to commits@flink.apache.org by se...@apache.org on 2019/08/06 07:52:05 UTC

[flink] 02/02: [FLINK-12928][docs] Remove old Flink ML docs

This is an automated email from the ASF dual-hosted git repository.

sewen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git

commit 2ec645a5bfd3cfadaf0057412401e91da0b21873
Author: Seth Wiesman <sj...@gmail.com>
AuthorDate: Fri Jun 21 10:12:47 2019 -0500

    [FLINK-12928][docs] Remove old Flink ML docs
    
    This closes #8827
---
 docs/dev/libs/ml/als.md                            | 177 --------
 docs/dev/libs/ml/als.zh.md                         | 177 --------
 docs/dev/libs/ml/contribution_guide.md             | 108 -----
 docs/dev/libs/ml/contribution_guide.zh.md          | 108 -----
 docs/dev/libs/ml/cross_validation.md               | 173 --------
 docs/dev/libs/ml/cross_validation.zh.md            | 173 --------
 docs/dev/libs/ml/distance_metrics.md               | 109 -----
 docs/dev/libs/ml/distance_metrics.zh.md            | 109 -----
 docs/dev/libs/ml/index.md                          | 150 -------
 docs/dev/libs/ml/index.zh.md                       | 150 -------
 docs/dev/libs/ml/knn.md                            | 146 -------
 docs/dev/libs/ml/knn.zh.md                         | 146 -------
 docs/dev/libs/ml/min_max_scaler.md                 | 114 ------
 docs/dev/libs/ml/min_max_scaler.zh.md              | 114 ------
 docs/dev/libs/ml/multiple_linear_regression.md     | 154 -------
 docs/dev/libs/ml/multiple_linear_regression.zh.md  | 154 -------
 docs/dev/libs/ml/optimization.md                   | 421 --------------------
 docs/dev/libs/ml/optimization.zh.md                | 421 --------------------
 docs/dev/libs/ml/pipelines.md                      | 443 ---------------------
 docs/dev/libs/ml/pipelines.zh.md                   | 443 ---------------------
 docs/dev/libs/ml/polynomial_features.md            | 110 -----
 docs/dev/libs/ml/polynomial_features.zh.md         | 110 -----
 docs/dev/libs/ml/quickstart.md                     | 262 ------------
 docs/dev/libs/ml/quickstart.zh.md                  | 262 ------------
 docs/dev/libs/ml/sos.md                            | 122 ------
 docs/dev/libs/ml/sos.zh.md                         | 122 ------
 docs/dev/libs/ml/standard_scaler.md                | 115 ------
 docs/dev/libs/ml/standard_scaler.zh.md             | 115 ------
 docs/dev/libs/ml/svm.md                            | 222 -----------
 docs/dev/libs/ml/svm.zh.md                         | 222 -----------
 docs/internals/components.md                       |   3 +-
 docs/internals/components.zh.md                    |   3 +-
 docs/redirects/{ml.md => als.md}                   |   6 +-
 docs/redirects/{ml.md => contribution_guide.md}    |   6 +-
 docs/redirects/{ml.md => cross_validation.md}      |   6 +-
 docs/redirects/{ml.md => distance_metrics.md}      |   6 +-
 docs/redirects/{ml.md => flinkml_quickstart.md}    |   6 +-
 docs/redirects/{ml.md => knn.md}                   |   6 +-
 docs/redirects/{ml.md => min_max_scaler.md}        |   6 +-
 docs/redirects/ml.md                               |   4 +-
 .../{ml.md => multiple_linear_regression.md}       |   6 +-
 docs/redirects/{ml.md => optimization.md}          |   6 +-
 docs/redirects/{ml.md => pipelines.md}             |   6 +-
 docs/redirects/{ml.md => polynomial_features.md}   |   6 +-
 docs/redirects/{ml.md => sos.md}                   |   6 +-
 docs/redirects/{ml.md => standard_scaler.md}       |   6 +-
 docs/redirects/{ml.md => svm.md}                   |   6 +-
 47 files changed, 46 insertions(+), 5700 deletions(-)

diff --git a/docs/dev/libs/ml/als.md b/docs/dev/libs/ml/als.md
deleted file mode 100644
index 87c80f8..0000000
--- a/docs/dev/libs/ml/als.md
+++ /dev/null
@@ -1,177 +0,0 @@
----
-mathjax: include
-title: Alternating Least Squares
-nav-title: ALS
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
-The alternating least squares (ALS) algorithm factorizes a given matrix $R$ into two factors $U$ and $V$ such that $R \approx U^TV$.
-The unknown row dimension is given as a parameter to the algorithm and is called the number of latent factors.
-Since matrix factorization can be used in the context of recommendation, the matrices $U$ and $V$ can be called user and item matrix, respectively.
-The $i$th column of the user matrix is denoted by $u_i$ and the $i$th column of the item matrix is $v_i$.
-The matrix $R$ can be called the ratings matrix with $$(R)_{i,j} = r_{i,j}$$.
-
-In order to find the user and item matrix, the following problem is solved:
-
-$$\arg\min_{U,V} \sum_{\{i,j\mid r_{i,j} \not= 0\}} \left(r_{i,j} - u_{i}^Tv_{j}\right)^2 +
-\lambda \left(\sum_{i} n_{u_i} \left\lVert u_i \right\rVert^2 + \sum_{j} n_{v_j} \left\lVert v_j \right\rVert^2 \right)$$
-
-with $\lambda$ being the regularization factor, $$n_{u_i}$$ being the number of items the user $i$ has rated and $$n_{v_j}$$ being the number of times the item $j$ has been rated.
-This regularization scheme to avoid overfitting is called weighted-$\lambda$-regularization.
-Details can be found in the work of [Zhou et al.](http://dx.doi.org/10.1007/978-3-540-68880-8_32).
-
-By fixing one of the matrices $U$ or $V$, we obtain a quadratic form which can be solved directly.
-The solution of the modified problem is guaranteed to monotonically decrease the overall cost function.
-By applying this step alternately to the matrices $U$ and $V$, we can iteratively improve the matrix factorization.
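-
-To make the alternating step concrete, the following is a sketch of the standard closed-form update for a single user vector $u_i$ when $V$ is held fixed, as it follows from the weighted-$\lambda$-regularization objective above (see Zhou et al. for the derivation):
-
-$$u_i = \left(\sum_{\{j \mid r_{i,j} \not= 0\}} v_j v_j^T + \lambda n_{u_i} I \right)^{-1} \sum_{\{j \mid r_{i,j} \not= 0\}} r_{i,j} v_j$$
-
-The analogous update applies to each item vector $v_j$ when $U$ is held fixed.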
-
-The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where $i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$.
-
-## Operations
-
-`ALS` is a `Predictor`.
-As such, it supports the `fit` and `predict` operations.
-
-### Fit
-
-ALS is trained on the sparse representation of the rating matrix:
-
-* `fit: DataSet[(Int, Int, Double)] => Unit`
-
-### Predict
-
-ALS predicts for each tuple of row and column index the rating:
-
-* `predict: DataSet[(Int, Int)] => DataSet[(Int, Int, Double)]`
-
-## Parameters
-
-The alternating least squares implementation can be controlled by the following parameters:
-
-   <table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Parameters</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-
-    <tbody>
-      <tr>
-        <td><strong>NumFactors</strong></td>
-        <td>
-          <p>
-            The number of latent factors to use for the underlying model.
-            It is equivalent to the dimension of the calculated user and item vectors.
-            (Default value: <strong>10</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Lambda</strong></td>
-        <td>
-          <p>
-            Regularization factor. Tune this value in order to avoid overfitting or poor performance due to strong generalization.
-            (Default value: <strong>1</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Iterations</strong></td>
-        <td>
-          <p>
-            The maximum number of iterations.
-            (Default value: <strong>10</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Blocks</strong></td>
-        <td>
-          <p>
-            The number of blocks into which the user and item matrix are grouped.
-            The fewer blocks one uses, the less data is sent redundantly.
-            However, bigger blocks entail bigger update messages which have to be stored on the heap.
-            If the algorithm fails because of an OutOfMemoryException, then try to increase the number of blocks.
-            (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Seed</strong></td>
-        <td>
-          <p>
-            Random seed used to generate the initial item matrix for the algorithm.
-            (Default value: <strong>0</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>TemporaryPath</strong></td>
-        <td>
-          <p>
-            Path to a temporary directory into which intermediate results are stored.
-            If this value is set, then the algorithm is split into two preprocessing steps, the ALS iteration and a post-processing step which calculates a last ALS half-step.
-            The preprocessing steps calculate the <code>OutBlockInformation</code> and <code>InBlockInformation</code> for the given rating matrix.
-            The results of the individual steps are stored in the specified directory.
-            By splitting the algorithm into multiple smaller steps, Flink does not have to split the available memory amongst too many operators.
-            This allows the system to process bigger individual messages and improves the overall performance.
-            (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
-## Examples
-
-{% highlight scala %}
-// Read input data set from a csv file
-val inputDS: DataSet[(Int, Int, Double)] = env.readCsvFile[(Int, Int, Double)](
-  pathToTrainingFile)
-
-// Setup the ALS learner
-val als = ALS()
-  .setIterations(10)
-  .setNumFactors(10)
-  .setBlocks(100)
-  .setTemporaryPath("hdfs://tempPath")
-
-// Set the other parameters via a parameter map
-val parameters = ParameterMap()
-  .add(ALS.Lambda, 0.9)
-  .add(ALS.Seed, 42L)
-
-// Calculate the factorization
-als.fit(inputDS, parameters)
-
-// Read the testing data set from a csv file
-val testingDS: DataSet[(Int, Int)] = env.readCsvFile[(Int, Int)](pathToData)
-
-// Calculate the ratings according to the matrix factorization
-val predictedRatings = als.predict(testingDS)
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/dev/libs/ml/als.zh.md b/docs/dev/libs/ml/als.zh.md
deleted file mode 100644
index 87c80f8..0000000
--- a/docs/dev/libs/ml/als.zh.md
+++ /dev/null
@@ -1,177 +0,0 @@
----
-mathjax: include
-title: Alternating Least Squares
-nav-title: ALS
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
-The alternating least squares (ALS) algorithm factorizes a given matrix $R$ into two factors $U$ and $V$ such that $R \approx U^TV$.
-The unknown row dimension is given as a parameter to the algorithm and is called the number of latent factors.
-Since matrix factorization can be used in the context of recommendation, the matrices $U$ and $V$ can be called user and item matrix, respectively.
-The $i$th column of the user matrix is denoted by $u_i$ and the $i$th column of the item matrix is $v_i$.
-The matrix $R$ can be called the ratings matrix with $$(R)_{i,j} = r_{i,j}$$.
-
-In order to find the user and item matrix, the following problem is solved:
-
-$$\arg\min_{U,V} \sum_{\{i,j\mid r_{i,j} \not= 0\}} \left(r_{i,j} - u_{i}^Tv_{j}\right)^2 +
-\lambda \left(\sum_{i} n_{u_i} \left\lVert u_i \right\rVert^2 + \sum_{j} n_{v_j} \left\lVert v_j \right\rVert^2 \right)$$
-
-with $\lambda$ being the regularization factor, $$n_{u_i}$$ being the number of items the user $i$ has rated and $$n_{v_j}$$ being the number of times the item $j$ has been rated.
-This regularization scheme to avoid overfitting is called weighted-$\lambda$-regularization.
-Details can be found in the work of [Zhou et al.](http://dx.doi.org/10.1007/978-3-540-68880-8_32).
-
-By fixing one of the matrices $U$ or $V$, we obtain a quadratic form which can be solved directly.
-The solution of the modified problem is guaranteed to monotonically decrease the overall cost function.
-By applying this step alternately to the matrices $U$ and $V$, we can iteratively improve the matrix factorization.
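-
-To make the alternating step concrete, the following is a sketch of the standard closed-form update for a single user vector $u_i$ when $V$ is held fixed, as it follows from the weighted-$\lambda$-regularization objective above (see Zhou et al. for the derivation):
-
-$$u_i = \left(\sum_{\{j \mid r_{i,j} \not= 0\}} v_j v_j^T + \lambda n_{u_i} I \right)^{-1} \sum_{\{j \mid r_{i,j} \not= 0\}} r_{i,j} v_j$$
-
-The analogous update applies to each item vector $v_j$ when $U$ is held fixed.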
-
-The matrix $R$ is given in its sparse representation as a tuple of $(i, j, r)$ where $i$ denotes the row index, $j$ the column index and $r$ is the matrix value at position $(i,j)$.
-
-## Operations
-
-`ALS` is a `Predictor`.
-As such, it supports the `fit` and `predict` operations.
-
-### Fit
-
-ALS is trained on the sparse representation of the rating matrix:
-
-* `fit: DataSet[(Int, Int, Double)] => Unit`
-
-### Predict
-
-ALS predicts for each tuple of row and column index the rating:
-
-* `predict: DataSet[(Int, Int)] => DataSet[(Int, Int, Double)]`
-
-## Parameters
-
-The alternating least squares implementation can be controlled by the following parameters:
-
-   <table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Parameters</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-
-    <tbody>
-      <tr>
-        <td><strong>NumFactors</strong></td>
-        <td>
-          <p>
-            The number of latent factors to use for the underlying model.
-            It is equivalent to the dimension of the calculated user and item vectors.
-            (Default value: <strong>10</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Lambda</strong></td>
-        <td>
-          <p>
-            Regularization factor. Tune this value in order to avoid overfitting or poor performance due to strong generalization.
-            (Default value: <strong>1</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Iterations</strong></td>
-        <td>
-          <p>
-            The maximum number of iterations.
-            (Default value: <strong>10</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Blocks</strong></td>
-        <td>
-          <p>
-            The number of blocks into which the user and item matrix are grouped.
-            The fewer blocks one uses, the less data is sent redundantly.
-            However, bigger blocks entail bigger update messages which have to be stored on the heap.
-            If the algorithm fails because of an OutOfMemoryException, then try to increase the number of blocks.
-            (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Seed</strong></td>
-        <td>
-          <p>
-            Random seed used to generate the initial item matrix for the algorithm.
-            (Default value: <strong>0</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>TemporaryPath</strong></td>
-        <td>
-          <p>
-            Path to a temporary directory into which intermediate results are stored.
-            If this value is set, then the algorithm is split into two preprocessing steps, the ALS iteration and a post-processing step which calculates a last ALS half-step.
-            The preprocessing steps calculate the <code>OutBlockInformation</code> and <code>InBlockInformation</code> for the given rating matrix.
-            The results of the individual steps are stored in the specified directory.
-            By splitting the algorithm into multiple smaller steps, Flink does not have to split the available memory amongst too many operators.
-            This allows the system to process bigger individual messages and improves the overall performance.
-            (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
-## Examples
-
-{% highlight scala %}
-// Read input data set from a csv file
-val inputDS: DataSet[(Int, Int, Double)] = env.readCsvFile[(Int, Int, Double)](
-  pathToTrainingFile)
-
-// Setup the ALS learner
-val als = ALS()
-  .setIterations(10)
-  .setNumFactors(10)
-  .setBlocks(100)
-  .setTemporaryPath("hdfs://tempPath")
-
-// Set the other parameters via a parameter map
-val parameters = ParameterMap()
-  .add(ALS.Lambda, 0.9)
-  .add(ALS.Seed, 42L)
-
-// Calculate the factorization
-als.fit(inputDS, parameters)
-
-// Read the testing data set from a csv file
-val testingDS: DataSet[(Int, Int)] = env.readCsvFile[(Int, Int)](pathToData)
-
-// Calculate the ratings according to the matrix factorization
-val predictedRatings = als.predict(testingDS)
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/dev/libs/ml/contribution_guide.md b/docs/dev/libs/ml/contribution_guide.md
deleted file mode 100644
index cad39b7..0000000
--- a/docs/dev/libs/ml/contribution_guide.md
+++ /dev/null
@@ -1,108 +0,0 @@
----
-mathjax: include
-title: How to Contribute
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-The Flink community highly appreciates all sorts of contributions to FlinkML.
-FlinkML offers people interested in machine learning the opportunity to work on a highly active open source project which makes scalable ML a reality.
-The following document describes how to contribute to FlinkML.
-
-* This will be replaced by the TOC
-{:toc}
-
-## Getting Started
-
-In order to get started first read Flink's [contribution guide](https://flink.apache.org/contributing/how-to-contribute.html).
-Everything from this guide also applies to FlinkML.
-
-## Pick a Topic
-
-If you are looking for some new ideas you should first look into our [roadmap](https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap), then you should check out the list of [unresolved issues on JIRA](https://issues.apache.org/jira/issues/?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC).
-Once you decide to contribute to one of these issues, you should take ownership of it and track your progress with this issue.
-That way, the other contributors know the state of the different issues and redundant work is avoided.
-
-If you already know what you want to contribute to FlinkML all the better.
-It is still advisable to create a JIRA issue for your idea to tell the Flink community what you want to do, though.
-
-## Testing
-
-New contributions should come with tests to verify the correct behavior of the algorithm.
-The tests help to maintain the algorithm's correctness throughout code changes, e.g. refactorings.
-
-We distinguish between unit tests, which are executed during Maven's test phase, and integration tests, which are executed during Maven's verify phase.
-Maven automatically makes this distinction by using the following naming rules:
-All test cases whose class name ends with a suffix fulfilling the regular expression `(IT|Integration)(Test|Suite|Case)` are considered integration tests.
-The rest are considered unit tests and should only test behavior which is local to the component under test.
-
-An integration test is a test which requires the full Flink system to be started.
-In order to do that properly, all integration test cases have to mix in the trait `FlinkTestBase`.
-This trait will set the right `ExecutionEnvironment` so that the test will be executed on Flink's `MiniCluster`.
-Thus, an integration test could look like the following:
-
-{% highlight scala %}
-class ExampleITSuite extends FlatSpec with FlinkTestBase {
-  behavior of "An example algorithm"
-
-  it should "do something" in {
-    ...
-  }
-}
-{% endhighlight %}
-
-The test style does not have to be `FlatSpec` but can be any other ScalaTest `Suite` subclass.
-See [ScalaTest testing styles](http://scalatest.org/user_guide/selecting_a_style) for more information.
-
-## Documentation
-
-When contributing new algorithms, it is required to add code comments describing the way the algorithm works and its parameters with which the user can control its behavior.
-Additionally, we would like to encourage contributors to add this information to the online documentation.
-The online documentation for FlinkML's components can be found in the directory `docs/libs/ml`.
-
-Every new algorithm is described by a single markdown file.
-This file should contain at least the following points:
-
-1. What does the algorithm do
-2. How does the algorithm work (or reference to description)
-3. Parameter description with default values
-4. Code snippet showing how the algorithm is used
-
-In order to use latex syntax in the markdown file, you have to include `mathjax: include` in the YAML front matter.
-
-{% highlight java %}
----
-mathjax: include
-htmlTitle: FlinkML - Example title
-title: <a href="../ml">FlinkML</a> - Example title
----
-{% endhighlight %}
-
-In order to use displayed mathematics, you have to put your latex code in `$$ ... $$`.
-For in-line mathematics, use `$ ... $`.
-Additionally some predefined latex commands are included into the scope of your markdown file.
-See `docs/_include/latex_commands.html` for the complete list of predefined latex commands.
-
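-For illustration, a snippet of markdown using both forms could look like the following (the formula itself is only an example):
-
-{% highlight latex %}
-The model computes the displayed equation
-
-$$ y = \mathbf{w}^T \mathbf{x} + b $$
-
-where the in-line symbol $w_i$ denotes the $i$-th weight.
-{% endhighlight %}
-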
-## Contributing
-
-Once you have implemented the algorithm with adequate test coverage and added documentation, you are ready to open a pull request.
-Details of how to open a pull request can be found [here](https://flink.apache.org/contributing/how-to-contribute.html).
-
-{% top %}
diff --git a/docs/dev/libs/ml/contribution_guide.zh.md b/docs/dev/libs/ml/contribution_guide.zh.md
deleted file mode 100644
index cad39b7..0000000
--- a/docs/dev/libs/ml/contribution_guide.zh.md
+++ /dev/null
@@ -1,108 +0,0 @@
----
-mathjax: include
-title: How to Contribute
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-The Flink community highly appreciates all sorts of contributions to FlinkML.
-FlinkML offers people interested in machine learning the opportunity to work on a highly active open source project which makes scalable ML a reality.
-The following document describes how to contribute to FlinkML.
-
-* This will be replaced by the TOC
-{:toc}
-
-## Getting Started
-
-In order to get started first read Flink's [contribution guide](https://flink.apache.org/contributing/how-to-contribute.html).
-Everything from this guide also applies to FlinkML.
-
-## Pick a Topic
-
-If you are looking for some new ideas you should first look into our [roadmap](https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap), then you should check out the list of [unresolved issues on JIRA](https://issues.apache.org/jira/issues/?jql=component%20%3D%20%22Machine%20Learning%20Library%22%20AND%20project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC).
-Once you decide to contribute to one of these issues, you should take ownership of it and track your progress with this issue.
-That way, the other contributors know the state of the different issues and redundant work is avoided.
-
-If you already know what you want to contribute to FlinkML all the better.
-It is still advisable to create a JIRA issue for your idea to tell the Flink community what you want to do, though.
-
-## Testing
-
-New contributions should come with tests to verify the correct behavior of the algorithm.
-The tests help to maintain the algorithm's correctness throughout code changes, e.g. refactorings.
-
-We distinguish between unit tests, which are executed during Maven's test phase, and integration tests, which are executed during Maven's verify phase.
-Maven automatically makes this distinction by using the following naming rules:
-All test cases whose class name ends with a suffix fulfilling the regular expression `(IT|Integration)(Test|Suite|Case)` are considered integration tests.
-The rest are considered unit tests and should only test behavior which is local to the component under test.
-
-An integration test is a test which requires the full Flink system to be started.
-In order to do that properly, all integration test cases have to mix in the trait `FlinkTestBase`.
-This trait will set the right `ExecutionEnvironment` so that the test will be executed on Flink's `MiniCluster`.
-Thus, an integration test could look like the following:
-
-{% highlight scala %}
-class ExampleITSuite extends FlatSpec with FlinkTestBase {
-  behavior of "An example algorithm"
-
-  it should "do something" in {
-    ...
-  }
-}
-{% endhighlight %}
-
-The test style does not have to be `FlatSpec` but can be any other ScalaTest `Suite` subclass.
-See [ScalaTest testing styles](http://scalatest.org/user_guide/selecting_a_style) for more information.
-
-## Documentation
-
-When contributing new algorithms, it is required to add code comments describing the way the algorithm works and its parameters with which the user can control its behavior.
-Additionally, we would like to encourage contributors to add this information to the online documentation.
-The online documentation for FlinkML's components can be found in the directory `docs/libs/ml`.
-
-Every new algorithm is described by a single markdown file.
-This file should contain at least the following points:
-
-1. What does the algorithm do
-2. How does the algorithm work (or reference to description)
-3. Parameter description with default values
-4. Code snippet showing how the algorithm is used
-
-In order to use latex syntax in the markdown file, you have to include `mathjax: include` in the YAML front matter.
-
-{% highlight java %}
----
-mathjax: include
-htmlTitle: FlinkML - Example title
-title: <a href="../ml">FlinkML</a> - Example title
----
-{% endhighlight %}
-
-In order to use displayed mathematics, you have to put your latex code in `$$ ... $$`.
-For in-line mathematics, use `$ ... $`.
-Additionally some predefined latex commands are included into the scope of your markdown file.
-See `docs/_include/latex_commands.html` for the complete list of predefined latex commands.
-
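-For illustration, a snippet of markdown using both forms could look like the following (the formula itself is only an example):
-
-{% highlight latex %}
-The model computes the displayed equation
-
-$$ y = \mathbf{w}^T \mathbf{x} + b $$
-
-where the in-line symbol $w_i$ denotes the $i$-th weight.
-{% endhighlight %}
-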
-## Contributing
-
-Once you have implemented the algorithm with adequate test coverage and added documentation, you are ready to open a pull request.
-Details of how to open a pull request can be found [here](https://flink.apache.org/contributing/how-to-contribute.html).
-
-{% top %}
diff --git a/docs/dev/libs/ml/cross_validation.md b/docs/dev/libs/ml/cross_validation.md
deleted file mode 100644
index 01b01f9..0000000
--- a/docs/dev/libs/ml/cross_validation.md
+++ /dev/null
@@ -1,173 +0,0 @@
----
-mathjax: include
-title: Cross Validation
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
- A prevalent problem when utilizing machine learning algorithms is *overfitting*, or when an algorithm "memorizes" the training data but does a poor job extrapolating to out-of-sample cases. A common method for dealing with the overfitting problem is to hold back some subset of data from the original training algorithm and then measure the fitted algorithm's performance on this hold-out set. This is commonly known as *cross validation*. A model is trained on one subset of data and then *validated* on the held-out subset of data.
-
-## Cross Validation Strategies
-
-There are several strategies for holding out data. FlinkML has convenience methods for
-- Train-Test Splits
-- Train-Test-Holdout Splits
-- K-Fold Splits
-- Multi-Random Splits
-
-### Train-Test Splits
-
-The simplest method of splitting is the `trainTestSplit`. This split takes a DataSet and a parameter *fraction*.  The *fraction* indicates the portion of the DataSet that should be allocated to the training set. This split also takes two additional optional parameters, *precise* and *seed*.  
-
-By default, the split is done by randomly deciding whether or not an observation is assigned to the training DataSet with probability = *fraction*. When *precise* is `true`, however, additional steps are taken to ensure the size of the training set is as close as possible to the length of the DataSet $\cdot$ *fraction*.
-
-The method returns a new `TrainTestDataSet` object which has a `.training` attribute containing the training DataSet and a `.testing` attribute containing the testing DataSet.
-
-
-### Train-Test-Holdout Splits
-
-In some cases, algorithms have been known to 'learn' the testing set. To combat this issue, a train-test-holdout strategy introduces a secondary holdout set, aptly called the *holdout* set.
-
-Traditionally, training and testing would be done as normal, and then a final test of the algorithm on the holdout set would be performed. Ideally, prediction errors/model scores in the holdout set would not be significantly different from those observed in the testing set.
-
-In a train-test-holdout strategy we sacrifice the sample size of the initial fitting algorithm for increased confidence that our model is not over-fit.
-
-When using the `trainTestHoldout` splitter, the *fraction* `Double` is replaced by a *fraction* array of length three. The first element corresponds to the portion to be used for training, the second for testing, and the third for holdout. The weights of this array are *relative*, e.g. an array `Array(3.0, 2.0, 1.0)` would result in approximately 50% of the observations being in the training set, 33% in the testing set, and 17% in the holdout set.
-
-### K-Fold Splits
-
-In a *k-fold* strategy, the DataSet is split into *k* equal subsets. Then for each of the *k* subsets, a `TrainTestDataSet` is created where the subset is the `.training` DataSet, and the remaining subsets are the `.testing` set.
-
-For each training set, an algorithm is trained and then evaluated based on its predictions for the associated testing set. When an algorithm has consistent scores (e.g. prediction errors) across the held-out datasets, we can have some confidence that our approach (e.g. choice of algorithm / algorithm parameters / number of iterations) is robust against overfitting.
-
-<a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation">K-Fold Cross Validation</a>
-
-### Multi-Random Splits
-
-The *multi-random* strategy can be thought of as a more general form of the *train-test-holdout* strategy. In fact, `.trainTestHoldoutSplit` is a simple wrapper for `multiRandomSplit` which also packages the datasets into a `trainTestHoldoutDataSet` object.
-
-The first major difference is that `multiRandomSplit` takes an array of fractions of any length, e.g. one can create multiple holdout sets. Alternatively, one could think of `kFoldSplit` as a wrapper for `multiRandomSplit` (which it is), the difference being that `kFoldSplit` creates subsets of approximately equal size, whereas `multiRandomSplit` will create subsets of any size.
-
-The second major difference is that `multiRandomSplit` returns an array of DataSets, equal in size and proportion to the *fraction array* that it was passed as an argument.
-
-## Parameters
-
-The various `Splitter` methods share many parameters.
-
- <table class="table table-bordered">
-  <thead>
-    <tr>
-      <th class="text-left" style="width: 20%">Parameter</th>
-      <th class="text-center">Type</th>
-      <th class="text-center">Description</th>
-      <th class="text-right">Used by Method</th>
-    </tr>
-  </thead>
-
-  <tbody>
-    <tr>
-      <td><code>input</code></td>
-      <td><code>DataSet[Any]</code></td>
-      <td>DataSet to be split.</td>
-      <td>
-      <code>randomSplit</code><br>
-      <code>multiRandomSplit</code><br>
-      <code>kFoldSplit</code><br>
-      <code>trainTestSplit</code><br>
-      <code>trainTestHoldoutSplit</code>
-      </td>
-    </tr>
-    <tr>
-      <td><code>seed</code></td>
-      <td><code>Long</code></td>
-      <td>
-        <p>
-          Used for seeding the random number generator which sorts DataSets into other DataSets.
-        </p>
-      </td>
-      <td>
-      <code>randomSplit</code><br>
-      <code>multiRandomSplit</code><br>
-      <code>kFoldSplit</code><br>
-      <code>trainTestSplit</code><br>
-      <code>trainTestHoldoutSplit</code>
-      </td>
-    </tr>
-    <tr>
-      <td><code>precise</code></td>
-      <td><code>Boolean</code></td>
-      <td>When true, make additional effort to make DataSets as close to the prescribed proportions as possible.</td>
-      <td>
-      <code>randomSplit</code><br>
-      <code>trainTestSplit</code>
-      </td>
-    </tr>
-    <tr>
-      <td><code>fraction</code></td>
-      <td><code>Double</code></td>
-      <td>The portion of the `input` to assign to the first or <code>.training</code> DataSet. Must be in the range (0,1)</td>
-      <td><code>randomSplit</code><br>
-        <code>trainTestSplit</code>
-      </td>
-    </tr>
-    <tr>
-      <td><code>fracArray</code></td>
-      <td><code>Array[Double]</code></td>
-      <td>An array that prescribes the proportions of the output datasets (proportions need not sum to 1 or be within the range (0,1))</td>
-      <td>
-      <code>multiRandomSplit</code><br>
-      <code>trainTestHoldoutSplit</code>
-      </td>
-    </tr>
-    <tr>
-      <td><code>kFolds</code></td>
-      <td><code>Int</code></td>
-      <td>The number of subsets to break the <code>input</code> DataSet into.</td>
-      <td><code>kFoldSplit</code></td>
-      </tr>
-
-  </tbody>
-</table>
-
-## Examples
-
-{% highlight scala %}
-// An input dataset - it does not have to be of type LabeledVector
-val data: DataSet[LabeledVector] = ...
-
-// A Simple Train-Test-Split
-val dataTrainTest: TrainTestDataSet = Splitter.trainTestSplit(data, 0.6, true)
-
-// Create a train test holdout DataSet
-val dataTrainTestHO: trainTestHoldoutDataSet = Splitter.trainTestHoldoutSplit(data, Array(6.0, 3.0, 1.0))
-
-// Create an Array of K TrainTestDataSets
-val dataKFolded: Array[TrainTestDataSet] =  Splitter.kFoldSplit(data, 10)
-
-// create an array of 5 datasets
-val dataMultiRandom: Array[DataSet[T]] = Splitter.multiRandomSplit(data, Array(0.5, 0.1, 0.1, 0.1, 0.1))
-{% endhighlight %}
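-
-Building on the snippet above, each split object exposes its subsets through the attributes described earlier, for example:
-
-{% highlight scala %}
-// Access the two subsets produced by the simple train-test split
-val trainingData: DataSet[LabeledVector] = dataTrainTest.training
-val testingData: DataSet[LabeledVector] = dataTrainTest.testing
-{% endhighlight %}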
-
-{% top %}
diff --git a/docs/dev/libs/ml/cross_validation.zh.md b/docs/dev/libs/ml/cross_validation.zh.md
deleted file mode 100644
index 82bc7e3..0000000
--- a/docs/dev/libs/ml/cross_validation.zh.md
+++ /dev/null
@@ -1,173 +0,0 @@
----
-mathjax: include
-title: 交叉验证
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
- A prevalent problem when utilizing machine learning algorithms is *overfitting*, or when an algorithm "memorizes" the training data but does a poor job extrapolating to out-of-sample cases. A common method for dealing with the overfitting problem is to hold back some subset of data from the original training algorithm and then measure the fitted algorithm's performance on this hold-out set. This is commonly known as *cross validation*. A model is trained on one subset of data and then *validated* on the held-out subset of data.
-
-## Cross Validation Strategies
-
-There are several strategies for holding out data. FlinkML has convenience methods for
-- Train-Test Splits
-- Train-Test-Holdout Splits
-- K-Fold Splits
-- Multi-Random Splits
-
-### Train-Test Splits
-
-The simplest method of splitting is the `trainTestSplit`. This split takes a DataSet and a parameter *fraction*.  The *fraction* indicates the portion of the DataSet that should be allocated to the training set. This split also takes two additional optional parameters, *precise* and *seed*.  
-
-By default, the split is done by randomly deciding whether or not an observation is assigned to the training DataSet with probability = *fraction*. When *precise* is `true`, however, additional steps are taken to ensure the size of the training set is as close as possible to the length of the DataSet $\cdot$ *fraction*.
-
-The method returns a new `TrainTestDataSet` object which has a `.training` attribute containing the training DataSet and a `.testing` attribute containing the testing DataSet.
-
-
-### Train-Test-Holdout Splits
-
-In some cases, algorithms have been known to 'learn' the testing set. To combat this issue, a train-test-holdout strategy introduces a secondary holdout set, aptly called the *holdout* set.
-
-Traditionally, training and testing would be done as normal, and then a final test of the algorithm on the holdout set would be performed. Ideally, prediction errors/model scores in the holdout set would not be significantly different from those observed in the testing set.
-
-In a train-test-holdout strategy we sacrifice the sample size of the initial fitting algorithm for increased confidence that our model is not over-fit.
-
-When using the `trainTestHoldout` splitter, the *fraction* `Double` is replaced by a *fraction* array of length three. The first element corresponds to the portion to be used for training, the second for testing, and the third for holdout. The weights of this array are *relative*, e.g. an array `Array(3.0, 2.0, 1.0)` would result in approximately 50% of the observations being in the training set, 33% in the testing set, and 17% in the holdout set.
-
-### K-Fold Splits
-
-In a *k-fold* strategy, the DataSet is split into *k* equal subsets. Then for each of the *k* subsets, a `TrainTestDataSet` is created where the subset is the `.training` DataSet, and the remaining subsets are the `.testing` set.
-
-For each training set, an algorithm is trained and then evaluated based on its predictions for the associated testing set. When an algorithm has consistent scores (e.g. prediction errors) across the held-out datasets, we can have some confidence that our approach (e.g. choice of algorithm / algorithm parameters / number of iterations) is robust against overfitting.
-
-<a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation">K-Fold Cross Validation</a>
-
-### Multi-Random Splits
-
-The *multi-random* strategy can be thought of as a more general form of the *train-test-holdout* strategy. In fact, `.trainTestHoldoutSplit` is a simple wrapper for `multiRandomSplit` which also packages the datasets into a `trainTestHoldoutDataSet` object.
-
-The first major difference is that `multiRandomSplit` takes an array of fractions of any length, e.g. one can create multiple holdout sets. Alternatively, one could think of `kFoldSplit` as a wrapper for `multiRandomSplit` (which it is), the difference being that `kFoldSplit` creates subsets of approximately equal size, whereas `multiRandomSplit` will create subsets of any size.
-
-The second major difference is that `multiRandomSplit` returns an array of DataSets, equal in size and proportion to the *fraction array* that it was passed as an argument.
-
-## Parameters
-
-The various `Splitter` methods share many parameters.
-
- <table class="table table-bordered">
-  <thead>
-    <tr>
-      <th class="text-left" style="width: 20%">Parameter</th>
-      <th class="text-center">Type</th>
-      <th class="text-center">Description</th>
-      <th class="text-right">Used by Method</th>
-    </tr>
-  </thead>
-
-  <tbody>
-    <tr>
-      <td><code>input</code></td>
-      <td><code>DataSet[Any]</code></td>
-      <td>DataSet to be split.</td>
-      <td>
-      <code>randomSplit</code><br>
-      <code>multiRandomSplit</code><br>
-      <code>kFoldSplit</code><br>
-      <code>trainTestSplit</code><br>
-      <code>trainTestHoldoutSplit</code>
-      </td>
-    </tr>
-    <tr>
-      <td><code>seed</code></td>
-      <td><code>Long</code></td>
-      <td>
-        <p>
-          Used for seeding the random number generator which sorts DataSets into other DataSets.
-        </p>
-      </td>
-      <td>
-      <code>randomSplit</code><br>
-      <code>multiRandomSplit</code><br>
-      <code>kFoldSplit</code><br>
-      <code>trainTestSplit</code><br>
-      <code>trainTestHoldoutSplit</code>
-      </td>
-    </tr>
-    <tr>
-      <td><code>precise</code></td>
-      <td><code>Boolean</code></td>
-      <td>When true, make additional effort to make DataSets as close to the prescribed proportions as possible.</td>
-      <td>
-      <code>randomSplit</code><br>
-      <code>trainTestSplit</code>
-      </td>
-    </tr>
-    <tr>
-      <td><code>fraction</code></td>
-      <td><code>Double</code></td>
-      <td>The portion of the `input` to assign to the first or <code>.training</code> DataSet. Must be in the range (0,1)</td>
-      <td><code>randomSplit</code><br>
-        <code>trainTestSplit</code>
-      </td>
-    </tr>
-    <tr>
-      <td><code>fracArray</code></td>
-      <td><code>Array[Double]</code></td>
-      <td>An array that prescribes the proportions of the output datasets (proportions need not sum to 1 or be within the range (0,1))</td>
-      <td>
-      <code>multiRandomSplit</code><br>
-      <code>trainTestHoldoutSplit</code>
-      </td>
-    </tr>
-    <tr>
-      <td><code>kFolds</code></td>
-      <td><code>Int</code></td>
-      <td>The number of subsets to break the <code>input</code> DataSet into.</td>
-      <td><code>kFoldSplit</code></td>
-      </tr>
-
-  </tbody>
-</table>
-
-## Examples
-
-{% highlight scala %}
-// An input dataset - it does not have to be of type LabeledVector
-val data: DataSet[LabeledVector] = ...
-
-// A Simple Train-Test-Split
-val dataTrainTest: TrainTestDataSet = Splitter.trainTestSplit(data, 0.6, true)
-
-// Create a train test holdout DataSet
-val dataTrainTestHO: trainTestHoldoutDataSet = Splitter.trainTestHoldoutSplit(data, Array(6.0, 3.0, 1.0))
-
-// Create an Array of K TrainTestDataSets
-val dataKFolded: Array[TrainTestDataSet] =  Splitter.kFoldSplit(data, 10)
-
-// create an array of 5 datasets
-val dataMultiRandom: Array[DataSet[T]] = Splitter.multiRandomSplit(data, Array(0.5, 0.1, 0.1, 0.1, 0.1))
-{% endhighlight %}
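-
-Building on the snippet above, each split object exposes its subsets through the attributes described earlier, for example:
-
-{% highlight scala %}
-// Access the two subsets produced by the simple train-test split
-val trainingData: DataSet[LabeledVector] = dataTrainTest.training
-val testingData: DataSet[LabeledVector] = dataTrainTest.testing
-{% endhighlight %}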
-
-{% top %}
diff --git a/docs/dev/libs/ml/distance_metrics.md b/docs/dev/libs/ml/distance_metrics.md
deleted file mode 100644
index 3119479..0000000
--- a/docs/dev/libs/ml/distance_metrics.md
+++ /dev/null
@@ -1,109 +0,0 @@
----
-mathjax: include
-title: Distance Metrics
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
-Different metrics of distance are convenient for different types of analysis. Flink ML provides
-built-in implementations for many standard distance metrics. You can create custom
-distance metrics by implementing the `DistanceMetric` trait.
-
-## Built-in Implementations
-
-Currently, FlinkML supports the following metrics:
-
-<table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Metric</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-
-    <tbody>
-      <tr>
-        <td><strong>Euclidean Distance</strong></td>
-        <td>
-          $$d(\x, \y) = \sqrt{\sum_{i=1}^n \left(x_i - y_i \right)^2}$$
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Squared Euclidean Distance</strong></td>
-        <td>
-          $$d(\x, \y) = \sum_{i=1}^n \left(x_i - y_i \right)^2$$
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Cosine Similarity</strong></td>
-        <td>
-          $$d(\x, \y) = 1 - \frac{\x^T \y}{\Vert \x \Vert \Vert \y \Vert}$$
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Chebyshev Distance</strong></td>
-        <td>
-          $$d(\x, \y) = \max_{i}\left(\left \vert x_i - y_i \right\vert \right)$$
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Manhattan Distance</strong></td>
-        <td>
-          $$d(\x, \y) = \sum_{i=1}^n \left\vert x_i - y_i \right\vert$$
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Minkowski Distance</strong></td>
-        <td>
-          $$d(\x, \y) = \left( \sum_{i=1}^{n} \left( x_i - y_i \right)^p \right)^{\rfrac{1}{p}}$$
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Tanimoto Distance</strong></td>
-        <td>
-          $$d(\x, \y) = 1 - \frac{\x^T\y}{\Vert \x \Vert^2 + \Vert \y \Vert^2 - \x^T\y}$$
-          with $\x$ and $\y$ being bit-vectors
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
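-A minimal usage sketch, assuming the built-in `EuclideanDistanceMetric` class and the `DenseVector` type from `org.apache.flink.ml.math`:
-
-{% highlight scala %}
-import org.apache.flink.ml.math.DenseVector
-import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
-
-// Create the metric and two vectors to compare
-val metric = EuclideanDistanceMetric()
-val a = DenseVector(1.0, 2.0, 3.0)
-val b = DenseVector(4.0, 6.0, 3.0)
-
-// sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = 5.0
-val dist: Double = metric.distance(a, b)
-{% endhighlight %}
-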
-## Custom Implementation
-
-You can create your own distance metric by implementing the `DistanceMetric` trait.
-
-{% highlight scala %}
-class MyDistance extends DistanceMetric {
-  override def distance(a: Vector, b: Vector) = ... // your implementation for distance metric
-}
-
-object MyDistance {
-  def apply() = new MyDistance()
-}
-
-val myMetric = MyDistance()
-{% endhighlight %}
-
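-As an illustrative sketch of a concrete implementation (a hand-written Manhattan-style metric; the class name is only an example):
-
-{% highlight scala %}
-import org.apache.flink.ml.math.Vector
-import org.apache.flink.ml.metrics.distances.DistanceMetric
-
-// d(x, y) = sum_i |x_i - y_i|
-class MyManhattanDistance extends DistanceMetric {
-  override def distance(a: Vector, b: Vector): Double = {
-    require(a.size == b.size, "Vectors must have the same dimension")
-    var sum = 0.0
-    var i = 0
-    while (i < a.size) {
-      sum += math.abs(a(i) - b(i))
-      i += 1
-    }
-    sum
-  }
-}
-{% endhighlight %}
-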
-{% top %}
diff --git a/docs/dev/libs/ml/distance_metrics.zh.md b/docs/dev/libs/ml/distance_metrics.zh.md
deleted file mode 100644
index d4b45cd..0000000
--- a/docs/dev/libs/ml/distance_metrics.zh.md
+++ /dev/null
@@ -1,109 +0,0 @@
----
-mathjax: include
-title: 距离指标
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
-Different metrics of distance are convenient for different types of analysis. Flink ML provides
-built-in implementations for many standard distance metrics. You can create custom
-distance metrics by implementing the `DistanceMetric` trait.
-
-## Built-in Implementations
-
-Currently, FlinkML supports the following metrics:
-
-<table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Metric</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-
-    <tbody>
-      <tr>
-        <td><strong>Euclidean Distance</strong></td>
-        <td>
-          $$d(\x, \y) = \sqrt{\sum_{i=1}^n \left(x_i - y_i \right)^2}$$
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Squared Euclidean Distance</strong></td>
-        <td>
-          $$d(\x, \y) = \sum_{i=1}^n \left(x_i - y_i \right)^2$$
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Cosine Similarity</strong></td>
-        <td>
-          $$d(\x, \y) = 1 - \frac{\x^T \y}{\Vert \x \Vert \Vert \y \Vert}$$
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Chebyshev Distance</strong></td>
-        <td>
-          $$d(\x, \y) = \max_{i}\left(\left \vert x_i - y_i \right\vert \right)$$
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Manhattan Distance</strong></td>
-        <td>
-          $$d(\x, \y) = \sum_{i=1}^n \left\vert x_i - y_i \right\vert$$
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Minkowski Distance</strong></td>
-        <td>
-          $$d(\x, \y) = \left( \sum_{i=1}^{n} \left( x_i - y_i \right)^p \right)^{\rfrac{1}{p}}$$
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Tanimoto Distance</strong></td>
-        <td>
-          $$d(\x, \y) = 1 - \frac{\x^T\y}{\Vert \x \Vert^2 + \Vert \y \Vert^2 - \x^T\y}$$
-          with $\x$ and $\y$ being bit-vectors
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
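-A minimal usage sketch, assuming the built-in `EuclideanDistanceMetric` class and the `DenseVector` type from `org.apache.flink.ml.math`:
-
-{% highlight scala %}
-import org.apache.flink.ml.math.DenseVector
-import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
-
-// Create the metric and two vectors to compare
-val metric = EuclideanDistanceMetric()
-val a = DenseVector(1.0, 2.0, 3.0)
-val b = DenseVector(4.0, 6.0, 3.0)
-
-// sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = 5.0
-val dist: Double = metric.distance(a, b)
-{% endhighlight %}
-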
-## Custom Implementation
-
-You can create your own distance metric by implementing the `DistanceMetric` trait.
-
-{% highlight scala %}
-class MyDistance extends DistanceMetric {
-  override def distance(a: Vector, b: Vector) = ... // your implementation for distance metric
-}
-
-object MyDistance {
-  def apply() = new MyDistance()
-}
-
-val myMetric = MyDistance()
-{% endhighlight %}
-
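-As an illustrative sketch of a concrete implementation (a hand-written Manhattan-style metric; the class name is only an example):
-
-{% highlight scala %}
-import org.apache.flink.ml.math.Vector
-import org.apache.flink.ml.metrics.distances.DistanceMetric
-
-// d(x, y) = sum_i |x_i - y_i|
-class MyManhattanDistance extends DistanceMetric {
-  override def distance(a: Vector, b: Vector): Double = {
-    require(a.size == b.size, "Vectors must have the same dimension")
-    var sum = 0.0
-    var i = 0
-    while (i < a.size) {
-      sum += math.abs(a(i) - b(i))
-      i += 1
-    }
-    sum
-  }
-}
-{% endhighlight %}
-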
-{% top %}
diff --git a/docs/dev/libs/ml/index.md b/docs/dev/libs/ml/index.md
deleted file mode 100644
index a623a83..0000000
--- a/docs/dev/libs/ml/index.md
+++ /dev/null
@@ -1,150 +0,0 @@
----
-title: "FlinkML - Machine Learning for Flink"
-nav-id: ml
-nav-show_overview: true
-nav-title: Machine Learning
-nav-parent_id: libs
-nav-pos: 4
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-FlinkML is the Machine Learning (ML) library for Flink. It is a new effort in the Flink community,
-with a growing list of algorithms and contributors. With FlinkML we aim to provide
-scalable ML algorithms, an intuitive API, and tools that help minimize glue code in end-to-end ML
-systems. You can see more details about our goals and where the library is headed in our [vision
-and roadmap here](https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap).
-
-* This will be replaced by the TOC
-{:toc}
-
-## Supported Algorithms
-
-FlinkML currently supports the following algorithms:
-
-### Supervised Learning
-
-* [SVM using Communication efficient distributed dual coordinate ascent (CoCoA)](svm.html)
-* [Multiple linear regression](multiple_linear_regression.html)
-* [Optimization Framework](optimization.html)
-
-### Unsupervised Learning
-
-* [k-Nearest neighbors join](knn.html)
-
-### Data Preprocessing
-
-* [Polynomial Features](polynomial_features.html)
-* [Standard Scaler](standard_scaler.html)
-* [MinMax Scaler](min_max_scaler.html)
-
-### Recommendation
-
-* [Alternating Least Squares (ALS)](als.html)
-
-### Outlier selection
-
-* [Stochastic Outlier Selection (SOS)](sos.html)
-
-### Utilities
-
-* [Distance Metrics](distance_metrics.html)
-* [Cross Validation](cross_validation.html)
-
-## Getting Started
-
-You can check out our [quickstart guide](quickstart.html) for a comprehensive getting started
-example.
-
-If you want to jump right in, you have to [set up a Flink program]({{ site.baseurl }}/dev/projectsetup/dependencies.html).
-Next, you have to add the FlinkML dependency to the `pom.xml` of your project.
-
-{% highlight xml %}
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-ml{{ site.scala_version_suffix }}</artifactId>
-  <version>{{site.version }}</version>
-</dependency>
-{% endhighlight %}
-
-Note that FlinkML is currently not part of the binary distribution.
-See linking with it for cluster execution [here]({{site.baseurl}}/dev/projectsetup/dependencies.html).
-
-Now you can start solving your analysis task.
-The following code snippet shows how easy it is to train a multiple linear regression model.
-
-{% highlight scala %}
-// LabeledVector is a feature vector with a label (class or real value)
-val trainingData: DataSet[LabeledVector] = ...
-val testingData: DataSet[Vector] = ...
-
-// Alternatively, a Splitter is used to break up a DataSet into training and testing data.
-val dataSet: DataSet[LabeledVector] = ...
-val trainTestData: DataSet[TrainTestDataSet] = Splitter.trainTestSplit(dataSet)
-val trainingData: DataSet[LabeledVector] = trainTestData.training
-val testingData: DataSet[Vector] = trainTestData.testing.map(lv => lv.vector)
-
-val mlr = MultipleLinearRegression()
-  .setStepsize(1.0)
-  .setIterations(100)
-  .setConvergenceThreshold(0.001)
-
-mlr.fit(trainingData)
-
-// The fitted model can now be used to make predictions
-val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
-{% endhighlight %}
-
-## Pipelines
-
-A key concept of FlinkML is its [scikit-learn](http://scikit-learn.org) inspired pipelining mechanism.
-It allows you to quickly build the kind of complex data analysis pipelines that appear in every data scientist's daily work.
-An in-depth description of FlinkML's pipelines and their internal workings can be found [here](pipelines.html).
-
-The following example code shows how easy it is to set up an analysis pipeline with FlinkML.
-
-{% highlight scala %}
-val trainingData: DataSet[LabeledVector] = ...
-val testingData: DataSet[Vector] = ...
-
-val scaler = StandardScaler()
-val polyFeatures = PolynomialFeatures().setDegree(3)
-val mlr = MultipleLinearRegression()
-
-// Construct pipeline of standard scaler, polynomial features and multiple linear regression
-val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
-
-// Train pipeline
-pipeline.fit(trainingData)
-
-// Calculate predictions
-val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)
-{% endhighlight %}
-
-One can chain a `Transformer` to another `Transformer` or a set of chained `Transformers` by calling the method `chainTransformer`.
-If one wants to chain a `Predictor` to a `Transformer` or a set of chained `Transformers`, one has to call the method `chainPredictor`.
-
-
-## How to contribute
-
-The Flink community welcomes all contributors who want to get involved in the development of Flink and its libraries.
-In order to get quickly started with contributing to FlinkML, please read our official
-[contribution guide]({{site.baseurl}}/dev/libs/ml/contribution_guide.html).
-
-{% top %}
diff --git a/docs/dev/libs/ml/index.zh.md b/docs/dev/libs/ml/index.zh.md
deleted file mode 100644
index 9ee7ff5..0000000
--- a/docs/dev/libs/ml/index.zh.md
+++ /dev/null
@@ -1,150 +0,0 @@
----
-title: "FlinkML - Machine Learning for Flink"
-nav-id: ml
-nav-show_overview: true
-nav-title: 机器学习
-nav-parent_id: libs
-nav-pos: 4
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-FlinkML is the Machine Learning (ML) library for Flink. It is a new effort in the Flink community,
-with a growing list of algorithms and contributors. With FlinkML we aim to provide
-scalable ML algorithms, an intuitive API, and tools that help minimize glue code in end-to-end ML
-systems. You can see more details about our goals and where the library is headed in our [vision
-and roadmap here](https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap).
-
-* This will be replaced by the TOC
-{:toc}
-
-## Supported Algorithms
-
-FlinkML currently supports the following algorithms:
-
-### Supervised Learning
-
-* [SVM using Communication efficient distributed dual coordinate ascent (CoCoA)](svm.html)
-* [Multiple linear regression](multiple_linear_regression.html)
-* [Optimization Framework](optimization.html)
-
-### Unsupervised Learning
-
-* [k-Nearest neighbors join](knn.html)
-
-### Data Preprocessing
-
-* [Polynomial Features](polynomial_features.html)
-* [Standard Scaler](standard_scaler.html)
-* [MinMax Scaler](min_max_scaler.html)
-
-### Recommendation
-
-* [Alternating Least Squares (ALS)](als.html)
-
-### Outlier Selection
-
-* [Stochastic Outlier Selection (SOS)](sos.html)
-
-### Utilities
-
-* [Distance Metrics](distance_metrics.html)
-* [Cross Validation](cross_validation.html)
-
-## Getting Started
-
-You can check out our [quickstart guide](quickstart.html) for a comprehensive getting started
-example.
-
-If you want to jump right in, you have to [set up a Flink program]({{ site.baseurl }}/dev/projectsetup/dependencies.html).
-Next, you have to add the FlinkML dependency to the `pom.xml` of your project.
-
-{% highlight xml %}
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-ml{{ site.scala_version_suffix }}</artifactId>
-  <version>{{site.version }}</version>
-</dependency>
-{% endhighlight %}
-
-Note that FlinkML is currently not part of the binary distribution.
-See [here]({{site.baseurl}}/dev/projectsetup/dependencies.html) for how to link with it for cluster execution.
-
-Now you can start solving your analysis task.
-The following code snippet shows how easy it is to train a multiple linear regression model.
-
-{% highlight scala %}
-// LabeledVector is a feature vector with a label (class or real value)
-val trainingData: DataSet[LabeledVector] = ...
-val testingData: DataSet[Vector] = ...
-
-// Alternatively, a Splitter is used to break up a DataSet into training and testing data.
-val dataSet: DataSet[LabeledVector] = ...
-val trainTestData: DataSet[TrainTestDataSet] = Splitter.trainTestSplit(dataSet)
-val trainingData: DataSet[LabeledVector] = trainTestData.training
-val testingData: DataSet[Vector] = trainTestData.testing.map(lv => lv.vector)
-
-val mlr = MultipleLinearRegression()
-  .setStepsize(1.0)
-  .setIterations(100)
-  .setConvergenceThreshold(0.001)
-
-mlr.fit(trainingData)
-
-// The fitted model can now be used to make predictions
-val predictions: DataSet[LabeledVector] = mlr.predict(testingData)
-{% endhighlight %}
-
-## Pipelines
-
-A key concept of FlinkML is its [scikit-learn](http://scikit-learn.org) inspired pipelining mechanism.
-It allows you to quickly build the kind of complex data analysis pipelines that appear in every data scientist's daily work.
-An in-depth description of FlinkML's pipelines and their internal workings can be found [here](pipelines.html).
-
-The following example code shows how easy it is to set up an analysis pipeline with FlinkML.
-
-{% highlight scala %}
-val trainingData: DataSet[LabeledVector] = ...
-val testingData: DataSet[Vector] = ...
-
-val scaler = StandardScaler()
-val polyFeatures = PolynomialFeatures().setDegree(3)
-val mlr = MultipleLinearRegression()
-
-// Construct pipeline of standard scaler, polynomial features and multiple linear regression
-val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
-
-// Train pipeline
-pipeline.fit(trainingData)
-
-// Calculate predictions
-val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)
-{% endhighlight %}
-
-One can chain a `Transformer` to another `Transformer` or a set of chained `Transformers` by calling the method `chainTransformer`.
-If one wants to chain a `Predictor` to a `Transformer` or a set of chained `Transformers`, one has to call the method `chainPredictor`.
-
-
-## How to contribute
-
-The Flink community welcomes all contributors who want to get involved in the development of Flink and its libraries.
-In order to get quickly started with contributing to FlinkML, please read our official
-[contribution guide]({{site.baseurl}}/dev/libs/ml/contribution_guide.html).
-
-{% top %}
diff --git a/docs/dev/libs/ml/knn.md b/docs/dev/libs/ml/knn.md
deleted file mode 100644
index 43f8d13..0000000
--- a/docs/dev/libs/ml/knn.md
+++ /dev/null
@@ -1,146 +0,0 @@
----
-mathjax: include
-title: k-Nearest Neighbors Join
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-Implements an exact k-nearest neighbors join algorithm.  Given a training set $A$ and a testing set $B$, the algorithm returns
-
-$$
-KNNJ(A, B, k) = \{ \left( b, KNN(b, A, k) \right) \text{ where } b \in B \text{ and } KNN(b, A, k) \text{ are the k-nearest points to }b\text{ in }A \}
-$$
-
-The brute-force approach is to compute the distance between every training and testing point. To reduce the cost of computing the distance between every training point and every testing point, a quadtree can be used. The quadtree scales well in the number of training points, though poorly in the spatial dimension. The algorithm automatically chooses whether or not to use the quadtree, though the user can override that decision by setting a parameter that forces the quadtree to be used or not used.
-
-## Operations
-
-`KNN` is a `Predictor`.
-As such, it supports the `fit` and `predict` operation.
-
-### Fit
-
-KNN is trained by a given set of `Vector`:
-
-* `fit[T <: Vector]: DataSet[T] => Unit`
-
-### Predict
-
-For all subtypes of FlinkML's `Vector`, KNN computes the K-nearest training points for each test point:
-
-* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])]`, where the `(T, Array[Vector])` tuple
-  corresponds to (test point, K-nearest training points)
-
-## Parameters
-
-The KNN implementation can be controlled by the following parameters:
-
-   <table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Parameters</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-
-    <tbody>
-      <tr>
-        <td><strong>K</strong></td>
-        <td>
-          <p>
-            Defines the number of nearest-neighbors to search for. That is, for each test point, the algorithm finds the K-nearest neighbors in the training set
-            (Default value: <strong>5</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>DistanceMetric</strong></td>
-        <td>
-          <p>
-            Sets the distance metric we use to calculate the distance between two points. If no metric is specified, then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
-            (Default value: <strong>EuclideanDistanceMetric</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Blocks</strong></td>
-        <td>
-          <p>
-            Sets the number of blocks into which the input data will be split. This number should be set
-            at least to the degree of parallelism. If no value is specified, then the parallelism of the
-            input [[DataSet]] is used as the number of blocks.
-            (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>UseQuadTree</strong></td>
-        <td>
-          <p>
-            A boolean variable that determines whether or not to use a quadtree to partition the training set, which can potentially simplify the KNN search. If no value is specified, the code will automatically decide whether or not to use a quadtree. Use of a quadtree scales well with the number of training and testing points, though poorly with the dimension.
-            (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>SizeHint</strong></td>
-        <td>
-          <p>Specifies whether the training set or the test set is the smaller one, in order to optimize the cross product operation needed for the KNN search. Set this to `CrossHint.FIRST_IS_SMALL` if the training set is small and to `CrossHint.SECOND_IS_SMALL` if the test set is small.
-             (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
-## Examples
-
-{% highlight scala %}
-import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
-import org.apache.flink.api.scala._
-import org.apache.flink.ml.nn.KNN
-import org.apache.flink.ml.math.Vector
-import org.apache.flink.ml.metrics.distances.SquaredEuclideanDistanceMetric
-
-val env = ExecutionEnvironment.getExecutionEnvironment
-
-// prepare data
-val trainingSet: DataSet[Vector] = ...
-val testingSet: DataSet[Vector] = ...
-
-val knn = KNN()
-  .setK(3)
-  .setBlocks(10)
-  .setDistanceMetric(SquaredEuclideanDistanceMetric())
-  .setUseQuadTree(false)
-  .setSizeHint(CrossHint.SECOND_IS_SMALL)
-
-// run knn join
-knn.fit(trainingSet)
-val result = knn.predict(testingSet).collect()
-{% endhighlight %}
-
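-The collected `result` is a `Seq[(Vector, Array[Vector])]`, pairing each test point with its K-nearest training points.
-As a minimal sketch (not part of the original example), the pairs can be inspected like this:
-
-{% highlight scala %}
-// Print every test point together with its K-nearest training points.
-// Assumes the `result` value from the example above; the output format is illustrative only.
-result.foreach { case (testPoint, neighbors) =>
-  println(s"$testPoint -> ${neighbors.mkString(", ")}")
-}
-{% endhighlight %}
-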
-For more details on computing KNN with and without a quadtree, see this presentation: [http://danielblazevski.github.io/](http://danielblazevski.github.io/)
-
-{% top %}
diff --git a/docs/dev/libs/ml/knn.zh.md b/docs/dev/libs/ml/knn.zh.md
deleted file mode 100644
index 43f8d13..0000000
--- a/docs/dev/libs/ml/knn.zh.md
+++ /dev/null
@@ -1,146 +0,0 @@
----
-mathjax: include
-title: k-Nearest Neighbors Join
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-Implements an exact k-nearest neighbors join algorithm.  Given a training set $A$ and a testing set $B$, the algorithm returns
-
-$$
-KNNJ(A, B, k) = \{ \left( b, KNN(b, A, k) \right) \text{ where } b \in B \text{ and } KNN(b, A, k) \text{ are the k-nearest points to }b\text{ in }A \}
-$$
-
-The brute-force approach is to compute the distance between every training and testing point. To reduce the cost of computing the distance between every training point and every testing point, a quadtree can be used. The quadtree scales well in the number of training points, though poorly in the spatial dimension. The algorithm automatically chooses whether or not to use the quadtree, though the user can override that decision by setting a parameter that forces the quadtree to be used or not used.
-
-## Operations
-
-`KNN` is a `Predictor`.
-As such, it supports the `fit` and `predict` operation.
-
-### Fit
-
-KNN is trained by a given set of `Vector`:
-
-* `fit[T <: Vector]: DataSet[T] => Unit`
-
-### Predict
-
-For all subtypes of FlinkML's `Vector`, KNN computes the K-nearest training points for each test point:
-
-* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Array[Vector])]`, where the `(T, Array[Vector])` tuple
-  corresponds to (test point, K-nearest training points)
-
-## Parameters
-
-The KNN implementation can be controlled by the following parameters:
-
-   <table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Parameters</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-
-    <tbody>
-      <tr>
-        <td><strong>K</strong></td>
-        <td>
-          <p>
-            Defines the number of nearest-neighbors to search for. That is, for each test point, the algorithm finds the K-nearest neighbors in the training set
-            (Default value: <strong>5</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>DistanceMetric</strong></td>
-        <td>
-          <p>
-            Sets the distance metric we use to calculate the distance between two points. If no metric is specified, then [[org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric]] is used.
-            (Default value: <strong>EuclideanDistanceMetric</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Blocks</strong></td>
-        <td>
-          <p>
-            Sets the number of blocks into which the input data will be split. This number should be set
-            at least to the degree of parallelism. If no value is specified, then the parallelism of the
-            input [[DataSet]] is used as the number of blocks.
-            (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>UseQuadTree</strong></td>
-        <td>
-          <p>
-            A boolean variable that determines whether or not to use a quadtree to partition the training set, which can potentially simplify the KNN search. If no value is specified, the code will automatically decide whether or not to use a quadtree. Use of a quadtree scales well with the number of training and testing points, though poorly with the dimension.
-            (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>SizeHint</strong></td>
-        <td>
-          <p>Specifies whether the training set or the test set is the smaller one, in order to optimize the cross product operation needed for the KNN search. Set this to `CrossHint.FIRST_IS_SMALL` if the training set is small and to `CrossHint.SECOND_IS_SMALL` if the test set is small.
-             (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
-## Examples
-
-{% highlight scala %}
-import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
-import org.apache.flink.api.scala._
-import org.apache.flink.ml.nn.KNN
-import org.apache.flink.ml.math.Vector
-import org.apache.flink.ml.metrics.distances.SquaredEuclideanDistanceMetric
-
-val env = ExecutionEnvironment.getExecutionEnvironment
-
-// prepare data
-val trainingSet: DataSet[Vector] = ...
-val testingSet: DataSet[Vector] = ...
-
-val knn = KNN()
-  .setK(3)
-  .setBlocks(10)
-  .setDistanceMetric(SquaredEuclideanDistanceMetric())
-  .setUseQuadTree(false)
-  .setSizeHint(CrossHint.SECOND_IS_SMALL)
-
-// run knn join
-knn.fit(trainingSet)
-val result = knn.predict(testingSet).collect()
-{% endhighlight %}
-
-For more details on computing KNN with and without a quadtree, see this presentation: [http://danielblazevski.github.io/](http://danielblazevski.github.io/)
-
-{% top %}
diff --git a/docs/dev/libs/ml/min_max_scaler.md b/docs/dev/libs/ml/min_max_scaler.md
deleted file mode 100644
index c44a875..0000000
--- a/docs/dev/libs/ml/min_max_scaler.md
+++ /dev/null
@@ -1,114 +0,0 @@
----
-mathjax: include
-title: MinMax Scaler
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
- The MinMax scaler scales the given data set so that all values lie within a user-specified range [min,max].
- If the user does not provide specific minimum and maximum values for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval.
- Given a set of input data $x_1, x_2,... x_n$, with minimum value:
-
- $$x_{min} = min({x_1, x_2,..., x_n})$$
-
- and maximum value:
-
- $$x_{max} = max({x_1, x_2,..., x_n})$$
-
-The scaled data set $z_1, z_2,...,z_n$ will be:
-
- $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$
-
-where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale.
-
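-As a small worked example (values chosen purely for illustration): for input data $1, 3, 5$ and the default range $[0, 1]$, we have $x_{min} = 1$ and $x_{max} = 5$, so the scaled values are
-
- $$z_{1}= \frac{1 - 1}{5 - 1}\left(1 - 0\right) + 0 = 0, \quad z_{2}= \frac{3 - 1}{5 - 1}\left(1 - 0\right) + 0 = 0.5, \quad z_{3}= \frac{5 - 1}{5 - 1}\left(1 - 0\right) + 0 = 1$$
-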
-## Operations
-
-`MinMaxScaler` is a `Transformer`.
-As such, it supports the `fit` and `transform` operation.
-
-### Fit
-
-MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:
-
-* `fit[T <: Vector]: DataSet[T] => Unit`
-* `fit: DataSet[LabeledVector] => Unit`
-
-### Transform
-
-MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
-
-* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
-* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
-
-## Parameters
-
-The MinMax scaler implementation can be controlled by the following two parameters:
-
- <table class="table table-bordered">
-  <thead>
-    <tr>
-      <th class="text-left" style="width: 20%">Parameters</th>
-      <th class="text-center">Description</th>
-    </tr>
-  </thead>
-
-  <tbody>
-    <tr>
-      <td><strong>Min</strong></td>
-      <td>
-        <p>
-          The minimum value of the range for the scaled data set. (Default value: <strong>0.0</strong>)
-        </p>
-      </td>
-    </tr>
-    <tr>
-      <td><strong>Max</strong></td>
-      <td>
-        <p>
-          The maximum value of the range for the scaled data set. (Default value: <strong>1.0</strong>)
-        </p>
-      </td>
-    </tr>
-  </tbody>
-</table>
-
-## Examples
-
-{% highlight scala %}
-// Create MinMax scaler transformer
-val minMaxscaler = MinMaxScaler()
-  .setMin(-1.0)
-
-// Obtain data set to be scaled
-val dataSet: DataSet[Vector] = ...
-
-// Learn the minimum and maximum values of the training data
-minMaxscaler.fit(dataSet)
-
-// Scale the provided data set to have min=-1.0 and max=1.0
-val scaledDS = minMaxscaler.transform(dataSet)
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/dev/libs/ml/min_max_scaler.zh.md b/docs/dev/libs/ml/min_max_scaler.zh.md
deleted file mode 100644
index c44a875..0000000
--- a/docs/dev/libs/ml/min_max_scaler.zh.md
+++ /dev/null
@@ -1,114 +0,0 @@
----
-mathjax: include
-title: MinMax Scaler
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
- The MinMax scaler scales the given data set so that all values lie within a user-specified range [min,max].
- If the user does not provide specific minimum and maximum values for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval.
- Given a set of input data $x_1, x_2,... x_n$, with minimum value:
-
- $$x_{min} = min({x_1, x_2,..., x_n})$$
-
- and maximum value:
-
- $$x_{max} = max({x_1, x_2,..., x_n})$$
-
-The scaled data set $z_1, z_2,...,z_n$ will be:
-
- $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$
-
-where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale.
-
-## Operations
-
-`MinMaxScaler` is a `Transformer`.
-As such, it supports the `fit` and `transform` operation.
-
-### Fit
-
-MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:
-
-* `fit[T <: Vector]: DataSet[T] => Unit`
-* `fit: DataSet[LabeledVector] => Unit`
-
-### Transform
-
-MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
-
-* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
-* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
-
-## Parameters
-
-The MinMax scaler implementation can be controlled by the following two parameters:
-
- <table class="table table-bordered">
-  <thead>
-    <tr>
-      <th class="text-left" style="width: 20%">Parameters</th>
-      <th class="text-center">Description</th>
-    </tr>
-  </thead>
-
-  <tbody>
-    <tr>
-      <td><strong>Min</strong></td>
-      <td>
-        <p>
-          The minimum value of the range for the scaled data set. (Default value: <strong>0.0</strong>)
-        </p>
-      </td>
-    </tr>
-    <tr>
-      <td><strong>Max</strong></td>
-      <td>
-        <p>
-          The maximum value of the range for the scaled data set. (Default value: <strong>1.0</strong>)
-        </p>
-      </td>
-    </tr>
-  </tbody>
-</table>
-
-## Examples
-
-{% highlight scala %}
-// Create MinMax scaler transformer
-val minMaxscaler = MinMaxScaler()
-  .setMin(-1.0)
-
-// Obtain data set to be scaled
-val dataSet: DataSet[Vector] = ...
-
-// Learn the minimum and maximum values of the training data
-minMaxscaler.fit(dataSet)
-
-// Scale the provided data set to have min=-1.0 and max=1.0
-val scaledDS = minMaxscaler.transform(dataSet)
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/dev/libs/ml/multiple_linear_regression.md b/docs/dev/libs/ml/multiple_linear_regression.md
deleted file mode 100644
index c6b7ed6..0000000
--- a/docs/dev/libs/ml/multiple_linear_regression.md
+++ /dev/null
@@ -1,154 +0,0 @@
----
-mathjax: include
-title: Multiple Linear Regression
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
- Multiple linear regression tries to find a linear function which best fits the provided input data.
- Given a set of input data points with their values $(\mathbf{x_i}, y_i)$, for $i = 1, \dots, n$, multiple linear
- regression finds a vector $\mathbf{w}$ such that the sum of the squared residuals is minimized:
-
- $$ S(\mathbf{w}) = \sum_{i=1}^n \left(y_i - \mathbf{w}^T\mathbf{x_i} \right)^2$$
-
- Written in matrix notation, we obtain the following formulation:
-
- $$\mathbf{w}^* = \arg \min_{\mathbf{w}} \left\| \mathbf{y} - X\mathbf{w} \right\|_2^2$$
-
- This problem has a closed form solution which is given by:
-
-  $$\mathbf{w}^* = \left(X^TX\right)^{-1}X^T\mathbf{y}$$
-
-  However, in cases where the input data set is so large that a complete pass over the whole data
-  set is prohibitive, one can apply stochastic gradient descent (SGD) to approximate the solution.
-  SGD first calculates the gradients for a random subset of the input data set. The gradient
-  for a given point $\mathbf{x}_i$ is given by:
-
-  $$\nabla_{\mathbf{w}} S(\mathbf{w}, \mathbf{x_i}) = 2\left(\mathbf{w}^T\mathbf{x_i} -
-    y\right)\mathbf{x_i}$$
-
-  The gradients are averaged and scaled. The scaling is defined by $\gamma = \frac{s}{\sqrt{j}}$
-  with $s$ being the initial step size and $j$ being the current iteration number. The resulting gradient is subtracted from the
-  current weight vector giving the new weight vector for the next iteration:
-
-  $$\mathbf{w}_{t+1} = \mathbf{w}_t - \gamma \frac{1}{n}\sum_{i=1}^n \nabla_{\mathbf{w}} S(\mathbf{w}, \mathbf{x_i})$$
-
-  The multiple linear regression algorithm computes either a fixed number of SGD iterations or terminates based on a dynamic convergence criterion.
-  The convergence criterion is the relative change in the sum of squared residuals:
-
-  $$\frac{S_{k-1} - S_k}{S_{k-1}} < \rho$$
-
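-As an illustration of the SGD update rule above, here is a minimal sketch of one mini-batch step on plain Scala collections (this is not the FlinkML API; all names are made up for this sketch):
-
-{% highlight scala %}
-// One SGD step: average the per-point gradients 2 * (w^T x_i - y_i) * x_i over
-// the mini-batch and subtract them from w, scaled by the step size gamma.
-def sgdStep(w: Array[Double],
-            batch: Seq[(Array[Double], Double)],
-            gamma: Double): Array[Double] = {
-  val n = batch.size.toDouble
-  val gradient = Array.fill(w.length)(0.0)
-  for ((x, y) <- batch) {
-    val residual = w.zip(x).map { case (wj, xj) => wj * xj }.sum - y
-    for (j <- w.indices) gradient(j) += 2.0 * residual * x(j) / n
-  }
-  w.zip(gradient).map { case (wj, gj) => wj - gamma * gj }
-}
-{% endhighlight %}
-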
-## Operations
-
-`MultipleLinearRegression` is a `Predictor`.
-As such, it supports the `fit` and `predict` operation.
-
-### Fit
-
-MultipleLinearRegression is trained on a set of `LabeledVector`:
-
-* `fit: DataSet[LabeledVector] => Unit`
-
-### Predict
-
-MultipleLinearRegression predicts for all subtypes of `Vector` the corresponding regression value:
-
-* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Double)]`
-
-## Parameters
-
-  The multiple linear regression implementation can be controlled by the following parameters:
-
-   <table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Parameters</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-
-    <tbody>
-      <tr>
-        <td><strong>Iterations</strong></td>
-        <td>
-          <p>
-            The maximum number of iterations. (Default value: <strong>10</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Stepsize</strong></td>
-        <td>
-          <p>
-            Initial step size for the gradient descent method.
-            This value controls how far the gradient descent method moves in the opposite direction of the gradient.
-            Tuning this parameter might be crucial to make it stable and to obtain a better performance.
-            (Default value: <strong>0.1</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>ConvergenceThreshold</strong></td>
-        <td>
-          <p>
-            Threshold for relative change of the sum of squared residuals until the iteration is stopped.
-            (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>LearningRateMethod</strong></td>
-        <td>
-            <p>
-                Learning rate method used to calculate the effective learning rate for each iteration.
-                See the list of supported <a href="optimization.html">learning rate methods</a>.
-                (Default value: <strong>LearningRateMethod.Default</strong>)
-            </p>
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
-## Examples
-
-{% highlight scala %}
-// Create multiple linear regression learner
-val mlr = MultipleLinearRegression()
-  .setIterations(10)
-  .setStepsize(0.5)
-  .setConvergenceThreshold(0.001)
-
-// Obtain training and testing data set
-val trainingDS: DataSet[LabeledVector] = ...
-val testingDS: DataSet[Vector] = ...
-
-// Fit the linear model to the provided data
-mlr.fit(trainingDS)
-
-// Calculate the predictions for the test data
-val predictions = mlr.predict(testingDS)
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/dev/libs/ml/multiple_linear_regression.zh.md b/docs/dev/libs/ml/multiple_linear_regression.zh.md
deleted file mode 100644
index c6b7ed6..0000000
--- a/docs/dev/libs/ml/multiple_linear_regression.zh.md
+++ /dev/null
@@ -1,154 +0,0 @@
----
-mathjax: include
-title: Multiple Linear Regression
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
- Multiple linear regression tries to find a linear function which best fits the provided input data.
- Given a set of input data points with their values $(\mathbf{x_i}, y_i)$, for $i = 1, \dots, n$, multiple linear
- regression finds a vector $\mathbf{w}$ such that the sum of the squared residuals is minimized:
-
- $$ S(\mathbf{w}) = \sum_{i=1}^n \left(y_i - \mathbf{w}^T\mathbf{x_i} \right)^2$$
-
- Written in matrix notation, we obtain the following formulation:
-
- $$\mathbf{w}^* = \arg \min_{\mathbf{w}} \left\| \mathbf{y} - X\mathbf{w} \right\|_2^2$$
-
- This problem has a closed form solution which is given by:
-
-  $$\mathbf{w}^* = \left(X^TX\right)^{-1}X^T\mathbf{y}$$
-
-  However, in cases where the input data set is so large that a complete pass over the whole data
-  set is prohibitive, one can apply stochastic gradient descent (SGD) to approximate the solution.
-  SGD first calculates the gradients for a random subset of the input data set. The gradient
-  for a given point $\mathbf{x}_i$ is given by:
-
-  $$\nabla_{\mathbf{w}} S(\mathbf{w}, \mathbf{x_i}) = 2\left(\mathbf{w}^T\mathbf{x_i} -
-    y\right)\mathbf{x_i}$$
-
-  The gradients are averaged and scaled. The scaling is defined by $\gamma = \frac{s}{\sqrt{j}}$
-  with $s$ being the initial step size and $j$ being the current iteration number. The resulting gradient is subtracted from the
-  current weight vector giving the new weight vector for the next iteration:
-
-  $$\mathbf{w}_{t+1} = \mathbf{w}_t - \gamma \frac{1}{n}\sum_{i=1}^n \nabla_{\mathbf{w}} S(\mathbf{w}, \mathbf{x_i})$$
-
-  The multiple linear regression algorithm computes either a fixed number of SGD iterations or terminates based on a dynamic convergence criterion.
-  The convergence criterion is the relative change in the sum of squared residuals:
-
-  $$\frac{S_{k-1} - S_k}{S_{k-1}} < \rho$$
-
-## Operations
-
-`MultipleLinearRegression` is a `Predictor`.
-As such, it supports the `fit` and `predict` operation.
-
-### Fit
-
-MultipleLinearRegression is trained on a set of `LabeledVector`:
-
-* `fit: DataSet[LabeledVector] => Unit`
-
-### Predict
-
-MultipleLinearRegression predicts for all subtypes of `Vector` the corresponding regression value:
-
-* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Double)]`
-
-## Parameters
-
-  The multiple linear regression implementation can be controlled by the following parameters:
-
-   <table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Parameters</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-
-    <tbody>
-      <tr>
-        <td><strong>Iterations</strong></td>
-        <td>
-          <p>
-            The maximum number of iterations. (Default value: <strong>10</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Stepsize</strong></td>
-        <td>
-          <p>
-            Initial step size for the gradient descent method.
-            This value controls how far the gradient descent method moves in the opposite direction of the gradient.
-            Tuning this parameter might be crucial to make it stable and to obtain a better performance.
-            (Default value: <strong>0.1</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>ConvergenceThreshold</strong></td>
-        <td>
-          <p>
-            Threshold for relative change of the sum of squared residuals until the iteration is stopped.
-            (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>LearningRateMethod</strong></td>
-        <td>
-            <p>
-                Learning rate method used to calculate the effective learning rate for each iteration.
-                See the list of supported <a href="optimization.html">learning rate methods</a>.
-                (Default value: <strong>LearningRateMethod.Default</strong>)
-            </p>
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
-## Examples
-
-{% highlight scala %}
-// Create multiple linear regression learner
-val mlr = MultipleLinearRegression()
-  .setIterations(10)
-  .setStepsize(0.5)
-  .setConvergenceThreshold(0.001)
-
-// Obtain training and testing data set
-val trainingDS: DataSet[LabeledVector] = ...
-val testingDS: DataSet[Vector] = ...
-
-// Fit the linear model to the provided data
-mlr.fit(trainingDS)
-
-// Calculate the predictions for the test data
-val predictions = mlr.predict(testingDS)
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/dev/libs/ml/optimization.md b/docs/dev/libs/ml/optimization.md
deleted file mode 100644
index 5ccde25..0000000
--- a/docs/dev/libs/ml/optimization.md
+++ /dev/null
@@ -1,421 +0,0 @@
----
-mathjax: include
-title: Optimization
-# Sub navigation
-sub-nav-group: batch
-sub-nav-parent: flinkml
-sub-nav-title: Optimization
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* Table of contents
-{:toc}
-
-## Mathematical Formulation
-
-The optimization framework in FlinkML is a developer-oriented package that can be used to solve
-[optimization](https://en.wikipedia.org/wiki/Mathematical_optimization)
-problems common in Machine Learning (ML) tasks. In the supervised learning context, this usually
-involves finding a model, as defined by a set of parameters $\wv$, that minimizes a function $f(\wv)$
-given a set of $(\x, y)$ examples,
-where $\x$ is a feature vector and $y$ is a real number, which can represent either a real value in
-the regression case, or a class label in the classification case. In supervised learning, the
-function to be minimized is usually of the form:
-
-
-\begin{equation} \label{eq:objectiveFunc}
-    f(\wv) :=
-    \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i) +
-    \lambda\, R(\wv)
-    \ .
-\end{equation}
-
-
-where $L$ is the loss function and $R(\wv)$ the regularization penalty. We use $L$ to measure how
-well the model fits the observed data, and we use $R$ in order to impose a complexity cost on the
-model, with $\lambda > 0$ being the regularization parameter.
-
-### Loss Functions
-
-In supervised learning, we use loss functions in order to measure the model fit, by
-penalizing errors in the predictions $p$ made by the model compared to the true $y$ for each
-example. Different loss functions can be used for regression (e.g. Squared Loss) and classification
-(e.g. Hinge Loss) tasks.
-
-Some common loss functions are:
-
-* Squared Loss: $ \frac{1}{2} \left(\wv^T \cdot \x - y\right)^2, \quad y \in \R $
-* Hinge Loss: $ \max \left(0, 1 - y ~ \wv^T \cdot \x\right), \quad y \in \{-1, +1\} $
-* Logistic Loss: $ \log\left(1+\exp\left( -y ~ \wv^T \cdot \x\right)\right), \quad y \in \{-1, +1\}$
-
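-As a minimal sketch (plain Scala helpers, not the FlinkML API), these three losses can be computed for a single example with prediction $\wv^T \cdot \x$ and label $y$ as follows:
-
-{% highlight scala %}
-import scala.math.{exp, log, max}
-
-// Per-example losses from the list above; `wTx` is the value of w^T x for the example.
-def squaredLoss(wTx: Double, y: Double): Double  = 0.5 * (wTx - y) * (wTx - y)
-def hingeLoss(wTx: Double, y: Double): Double    = max(0.0, 1.0 - y * wTx)
-def logisticLoss(wTx: Double, y: Double): Double = log(1.0 + exp(-y * wTx))
-{% endhighlight %}
-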
-### Regularization Types
-
-[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) in machine learning
-imposes penalties on the estimated models, in order to reduce overfitting. The most common penalties
-are the $L_1$ and $L_2$ penalties, defined as:
-
-* $L_1$: $R(\wv) = \norm{\wv}_1$
-* $L_2$: $R(\wv) = \frac{1}{2}\norm{\wv}_2^2$
-
-The $L_2$ penalty penalizes large weights, favoring solutions with more small weights rather than
-few large ones.
-The $L_1$ penalty can be used to drive a number of the solution coefficients to 0, thereby
-producing sparse solutions.
-The regularization constant $\lambda$ in $\eqref{eq:objectiveFunc}$ determines the amount of regularization applied to the model,
-and is usually determined through model cross-validation.
-A good comparison of regularization types can be found in [this](http://www.robotics.stanford.edu/~ang/papers/icml04-l1l2.pdf) paper by Andrew Ng.
-Which regularization types are supported depends on the optimization algorithm that is used.
-
-## Stochastic Gradient Descent
-
-In order to find a (local) minimum of a function, Gradient Descent methods take steps in the
-direction opposite to the gradient of the function $\eqref{eq:objectiveFunc}$ taken with
-respect to the current parameters (weights).
-In order to compute the exact gradient we need to perform one pass through all the points in
-a dataset, making the process computationally expensive.
-An alternative is Stochastic Gradient Descent (SGD) where at each iteration we sample one point
-from the complete dataset and update the parameters for each point, in an online manner.
-
-In mini-batch SGD we instead sample random subsets of the dataset, and compute the gradient
-over each batch. At each iteration of the algorithm we update the weights once, based on
-the average of the gradients computed from each mini-batch.
-
-An important parameter is the learning rate $\eta$, or step size, which can be determined by one of five methods, listed below. The setting of the initial step size can significantly affect the performance of the
-algorithm. For some practical tips on tuning SGD see Léon Bottou's
-"[Stochastic Gradient Descent Tricks](http://research.microsoft.com/pubs/192769/tricks-2012.pdf)".
-
-The current implementation of SGD  uses the whole partition, making it
-effectively a batch gradient descent. Once a sampling operator has been introduced in Flink, true
-mini-batch SGD will be performed.
-
-
-### Parameters
-
-  The stochastic gradient descent implementation can be controlled by the following parameters:
-
-   <table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Parameter</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-    <tbody>
-      <tr>
-        <td><strong>RegularizationPenalty</strong></td>
-        <td>
-          <p>
-            The regularization function to apply. (Default value: <strong>NoRegularization</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>RegularizationConstant</strong></td>
-        <td>
-          <p>
-            The amount of regularization to apply. (Default value: <strong>0.1</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>LossFunction</strong></td>
-        <td>
-          <p>
-            The loss function to be optimized. (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Iterations</strong></td>
-        <td>
-          <p>
-            The maximum number of iterations. (Default value: <strong>10</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>LearningRate</strong></td>
-        <td>
-          <p>
-            Initial learning rate for the gradient descent method.
-            This value controls how far the gradient descent method moves in the opposite direction
-            of the gradient.
-            (Default value: <strong>0.1</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>ConvergenceThreshold</strong></td>
-        <td>
-          <p>
-            When set, iterations stop if the relative change in the value of the objective function $\eqref{eq:objectiveFunc}$ is less than the provided threshold, $\tau$.
-            The convergence criterion is defined as follows: $\left| \frac{f(\wv)_{i-1} - f(\wv)_i}{f(\wv)_{i-1}}\right| < \tau$.
-            (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>LearningRateMethod</strong></td>
-        <td>
-          <p>
-            The learning rate method used to calculate the effective learning rate for each iteration.
-            See the list of supported learning rate methods below.
-            (Default value: <strong>LearningRateMethod.Default</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Decay</strong></td>
-        <td>
-          <p>
-            (Default value: <strong>0.0</strong>)
-          </p>
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
-### Regularization
-
-FlinkML supports Stochastic Gradient Descent with L1, L2 and no regularization. The regularization type has to implement the `RegularizationPenalty` interface,
-which calculates the new weights based on the gradient and regularization type.
-The following list contains the supported regularization functions.
-
-<table class="table table-bordered">
-  <thead>
-    <tr>
-      <th class="text-left" style="width: 20%">Class Name</th>
-      <th class="text-center">Regularization function $R(\wv)$</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td><strong>NoRegularization</strong></td>
-      <td>$R(\wv) = 0$</td>
-    </tr>
-    <tr>
-      <td><strong>L1Regularization</strong></td>
-      <td>$R(\wv) = \norm{\wv}_1$</td>
-    </tr>
-    <tr>
-      <td><strong>L2Regularization</strong></td>
-      <td>$R(\wv) = \frac{1}{2}\norm{\wv}_2^2$</td>
-    </tr>
-  </tbody>
-</table>
-
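-As an illustration of how a regularization penalty enters the weight update, here is a sketch assuming a plain L2 penalty on Scala arrays (this is not the actual `RegularizationPenalty` interface; all names are made up):
-
-{% highlight scala %}
-// Gradient step with an L2 penalty: the regularization term lambda * w_j is
-// added to the loss gradient before the step of size gamma is taken.
-def l2Step(w: Array[Double], lossGradient: Array[Double],
-           gamma: Double, lambda: Double): Array[Double] =
-  w.zip(lossGradient).map { case (wj, gj) => wj - gamma * (gj + lambda * wj) }
-{% endhighlight %}
-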
-### Loss Function
-
-The loss function which is minimized has to implement the `LossFunction` interface, which defines methods to compute the loss and the gradient of it.
-Either one defines one's own `LossFunction`, or one uses the `GenericLossFunction` class, which constructs the loss function from an outer loss function and a prediction function.
-An example can be seen here:
-
-{% highlight scala %}
-val lossFunction = GenericLossFunction(SquaredLoss, LinearPrediction)
-{% endhighlight %}
-
-The full list of supported outer loss functions can be found [here](#partial-loss-function-values).
-The full list of supported prediction functions can be found [here](#prediction-function-values).
-
-#### Partial Loss Function Values ##
-
-  <table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Function Name</th>
-        <th class="text-center">Description</th>
-        <th class="text-center">Loss</th>
-        <th class="text-center">Loss Derivative</th>
-      </tr>
-    </thead>
-    <tbody>
-      <tr>
-        <td><strong>SquaredLoss</strong></td>
-        <td>
-          <p>
-            Loss function most commonly used for regression tasks.
-          </p>
-        </td>
-        <td class="text-center">$\frac{1}{2} (\wv^T \cdot \x - y)^2$</td>
-        <td class="text-center">$\wv^T \cdot \x - y$</td>
-      </tr>
-      <tr>
-        <td><strong>LogisticLoss</strong></td>
-        <td>
-          <p>
-            Loss function used for classification tasks.
-          </p>
-        </td>
-        <td class="text-center">$\log\left(1+\exp\left( -y ~ \wv^T \cdot \x\right)\right), \quad y \in \{-1, +1\}$</td>
-        <td class="text-center">$\frac{-y}{1+\exp\left(y ~ \wv^T \cdot \x\right)}$</td>
-      </tr>
-      <tr>
-        <td><strong>HingeLoss</strong></td>
-        <td>
-          <p>
-            Loss function used for classification tasks.
-          </p>
-        </td>
-        <td class="text-center">$\max \left(0, 1 - y ~ \wv^T \cdot \x\right), \quad y \in \{-1, +1\}$</td>
-        <td class="text-center">$\begin{cases}
-                                 -y&\text{if } y ~ \wv^T \cdot \x \leq 1 \\
-                                 0&\text{if } y ~ \wv^T \cdot \x > 1
-                                 \end{cases}$</td>
-      </tr>
-    </tbody>
-  </table>
-
-#### Prediction Function Values ##
-
-  <table class="table table-bordered">
-      <thead>
-        <tr>
-          <th class="text-left" style="width: 20%">Function Name</th>
-          <th class="text-center">Description</th>
-          <th class="text-center">Prediction</th>
-          <th class="text-center">Prediction Gradient</th>
-        </tr>
-      </thead>
-      <tbody>
-        <tr>
-          <td><strong>LinearPrediction</strong></td>
-          <td>
-            <p>
-              The function most commonly used for linear models, such as linear regression and
-              linear classifiers.
-            </p>
-          </td>
-          <td class="text-center">$\x^T \cdot \wv$</td>
-          <td class="text-center">$\x$</td>
-        </tr>
-      </tbody>
-    </table>
-
-#### Effective Learning Rate ##
-
-The effective learning rate for each iteration is determined by one of the methods listed in the table below, where:
-
-- $j$ is the iteration number
-
-- $\eta_j$ is the step size on step $j$
-
-- $\eta_0$ is the initial step size
-
-- $\lambda$ is the regularization constant
-
-- $\tau$ is the decay constant, which causes the learning rate to be a decreasing function of $j$, that is to say as iterations increase, learning rate decreases. The exact rate of decay is function specific, see **Inverse Scaling** and **Wei Xu's Method** (which is an extension of the **Inverse Scaling** method).
-
-<table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Function Name</th>
-        <th class="text-center">Description</th>
-        <th class="text-center">Function</th>
-        <th class="text-center">Called As</th>
-      </tr>
-    </thead>
-    <tbody>
-      <tr>
-        <td><strong>Default</strong></td>
-        <td>
-          <p>
-            The default method used for determining the step size. This is equivalent to the inverse scaling method for $\tau$ = 0.5. This special case is kept as the default to maintain backwards compatibility.
-          </p>
-        </td>
-        <td class="text-center">$\eta_j = \eta_0/\sqrt{j}$</td>
-        <td class="text-center"><code>LearningRateMethod.Default</code></td>
-      </tr>
-      <tr>
-        <td><strong>Constant</strong></td>
-        <td>
-          <p>
-            The step size is constant throughout the learning task.
-          </p>
-        </td>
-        <td class="text-center">$\eta_j = \eta_0$</td>
-        <td class="text-center"><code>LearningRateMethod.Constant</code></td>
-      </tr>
-      <tr>
-        <td><strong>Leon Bottou's Method</strong></td>
-        <td>
-          <p>
-            This is the <code>'optimal'</code> method of sklearn.
-            The optimal initial value $t_0$ has to be provided.
-            Sklearn uses the following heuristic: $t_0 = \max(1.0, L^\prime(-\beta, 1.0) / (\alpha \cdot \beta))$
-            with $\beta = \sqrt{\frac{1}{\sqrt{\alpha}}}$ and $L^\prime(prediction, truth)$ being the derivative of the loss function.
-          </p>
-        </td>
-        <td class="text-center">$\eta_j = 1 / (\lambda \cdot (t_0 + j -1)) $</td>
-        <td class="text-center"><code>LearningRateMethod.Bottou</code></td>
-      </tr>
-      <tr>
-        <td><strong>Inverse Scaling</strong></td>
-        <td>
-          <p>
-            A very common method for determining the step size.
-          </p>
-        </td>
-        <td class="text-center">$\eta_j = \eta_0 / j^{\tau}$</td>
-        <td class="text-center"><code>LearningRateMethod.InvScaling</code></td>
-      </tr>
-      <tr>
-        <td><strong>Wei Xu's Method</strong></td>
-        <td>
-          <p>
-            Method proposed by Wei Xu in <a href="http://arxiv.org/pdf/1107.2490.pdf">Towards Optimal One Pass Large Scale Learning with
-            Averaged Stochastic Gradient Descent</a>
-          </p>
-        </td>
-        <td class="text-center">$\eta_j = \eta_0 \cdot (1+ \lambda \cdot \eta_0 \cdot j)^{-\tau} $</td>
-        <td class="text-center"><code>LearningRateMethod.Xu</code></td>
-      </tr>
-    </tbody>
-  </table>
-
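-As a minimal sketch (plain Scala, names made up for illustration), some of the step size formulas above translate directly into code:
-
-{% highlight scala %}
-import scala.math.{pow, sqrt}
-
-// Effective learning rate at iteration j (starting at 1), following the table above.
-def defaultRate(eta0: Double, j: Int): Double = eta0 / sqrt(j.toDouble)
-def invScalingRate(eta0: Double, j: Int, tau: Double): Double = eta0 / pow(j.toDouble, tau)
-def xuRate(eta0: Double, j: Int, lambda: Double, tau: Double): Double =
-  eta0 * pow(1.0 + lambda * eta0 * j, -tau)
-{% endhighlight %}
-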
-### Examples
-
-In the Flink implementation of SGD, given a set of examples in a `DataSet[LabeledVector]` and
-optionally some initial weights, we can use `GradientDescent.optimize()` in order to optimize
-the weights for the given data.
-
-The user can provide an initial `DataSet[WeightVector]`,
-which contains one `WeightVector` element, or use the default weights which are all set to 0.
-A `WeightVector` is a container class for the weights, which separates the intercept from the
-weight vector. This allows us to avoid applying regularization to the intercept.
-
-
-
-{% highlight scala %}
-// Create stochastic gradient descent solver
-val sgd = GradientDescent()
-  .setLossFunction(SquaredLoss())
-  .setRegularizationPenalty(L1Regularization)
-  .setRegularizationConstant(0.2)
-  .setIterations(100)
-  .setLearningRate(0.01)
-  .setLearningRateMethod(LearningRateMethod.Xu(-0.75))
-
-
-// Obtain data
-val trainingDS: DataSet[LabeledVector] = ...
-
-// Optimize the weights, according to the provided data
-val weightDS = sgd.optimize(trainingDS)
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/dev/libs/ml/optimization.zh.md b/docs/dev/libs/ml/optimization.zh.md
deleted file mode 100644
index 5ccde25..0000000
--- a/docs/dev/libs/ml/optimization.zh.md
+++ /dev/null
@@ -1,421 +0,0 @@
----
-mathjax: include
-title: Optimization
-# Sub navigation
-sub-nav-group: batch
-sub-nav-parent: flinkml
-sub-nav-title: Optimization
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* Table of contents
-{:toc}
-
-## Mathematical Formulation
-
-The optimization framework in FlinkML is a developer-oriented package that can be used to solve
-[optimization](https://en.wikipedia.org/wiki/Mathematical_optimization)
-problems common in Machine Learning (ML) tasks. In the supervised learning context, this usually
-involves finding a model, as defined by a set of parameters $\wv$, that minimizes a function $f(\wv)$
-given a set of $(\x, y)$ examples,
-where $\x$ is a feature vector and $y$ is a real number, which can represent either a real value in
-the regression case, or a class label in the classification case. In supervised learning, the
-function to be minimized is usually of the form:
-
-
-\begin{equation} \label{eq:objectiveFunc}
-    f(\wv) :=
-    \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i) +
-    \lambda\, R(\wv)
-    \ .
-\end{equation}
-
-
-where $L$ is the loss function and $R(\wv)$ the regularization penalty. We use $L$ to measure how
-well the model fits the observed data, and we use $R$ in order to impose a complexity cost on the
-model, with $\lambda > 0$ being the regularization parameter.
-
-### Loss Functions
-
-In supervised learning, we use loss functions in order to measure the model fit, by
-penalizing errors in the predictions $p$ made by the model compared to the true $y$ for each
-example. Different loss functions can be used for regression (e.g. Squared Loss) and classification
-(e.g. Hinge Loss) tasks.
-
-Some common loss functions are:
-
-* Squared Loss: $ \frac{1}{2} \left(\wv^T \cdot \x - y\right)^2, \quad y \in \R $
-* Hinge Loss: $ \max \left(0, 1 - y ~ \wv^T \cdot \x\right), \quad y \in \{-1, +1\} $
-* Logistic Loss: $ \log\left(1+\exp\left( -y ~ \wv^T \cdot \x\right)\right), \quad y \in \{-1, +1\}$
-
-### Regularization Types
-
-[Regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) in machine learning
-imposes penalties on the estimated models in order to reduce overfitting. The most common penalties
-are the $L_1$ and $L_2$ penalties, defined as:
-
-* $L_1$: $R(\wv) = \norm{\wv}_1$
-* $L_2$: $R(\wv) = \frac{1}{2}\norm{\wv}_2^2$
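-
-For example, for $\wv = (3, -4)^T$ these evaluate to $\norm{\wv}_1 = 3 + 4 = 7$ and
-$\frac{1}{2}\norm{\wv}_2^2 = \frac{1}{2}(9 + 16) = 12.5$.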
-
-The $L_2$ penalty penalizes large weights, favoring solutions with many small weights rather than
-a few large ones.
-The $L_1$ penalty can be used to drive a number of the solution coefficients to 0, thereby
-producing sparse solutions.
-The regularization constant $\lambda$ in $\eqref{eq:objectiveFunc}$ determines the amount of regularization applied to the model,
-and is usually determined through model cross-validation.
-A good comparison of regularization types can be found in [this](http://www.robotics.stanford.edu/~ang/papers/icml04-l1l2.pdf) paper by Andrew Ng.
-Which regularization types are supported depends on the optimization algorithm that is used.
-
-## Stochastic Gradient Descent
-
-In order to find a (local) minimum of a function, Gradient Descent methods take steps in the
-direction opposite to the gradient of the function $\eqref{eq:objectiveFunc}$ taken with
-respect to the current parameters (weights).
-In order to compute the exact gradient we need to perform one pass through all the points in
-a dataset, making the process computationally expensive.
-An alternative is Stochastic Gradient Descent (SGD) where at each iteration we sample one point
-from the complete dataset and update the parameters for each point, in an online manner.
-
-In mini-batch SGD we instead sample random subsets of the dataset, and compute the gradient
-over each batch. At each iteration of the algorithm we update the weights once, based on
-the average of the gradients computed from each mini-batch.
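-
-Conceptually, and leaving aside how the configured regularization penalty handles non-smooth terms
-such as $L_1$, each iteration $j$ updates the weights in the direction of the negative gradient of
-the regularized objective computed on the sampled batch $B_j$:
-
-$$\wv_{j+1} = \wv_j - \eta_j \left( \frac{1}{|B_j|} \sum_{i \in B_j} \nabla_{\wv} L(\wv_j;\x_i,y_i) + \lambda\, \nabla_{\wv} R(\wv_j) \right)$$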
-
-An important parameter is the learning rate $\eta$, or step size, which can be determined by one of the five methods listed below. The setting of the initial step size can significantly affect the performance of the
-algorithm. For some practical tips on tuning SGD see Leon Bottou's
-"[Stochastic Gradient Descent Tricks](http://research.microsoft.com/pubs/192769/tricks-2012.pdf)".
-
-The current implementation of SGD  uses the whole partition, making it
-effectively a batch gradient descent. Once a sampling operator has been introduced in Flink, true
-mini-batch SGD will be performed.
-
-
-### Parameters
-
-  The stochastic gradient descent implementation can be controlled by the following parameters:
-
-   <table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Parameter</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-    <tbody>
-      <tr>
-        <td><strong>RegularizationPenalty</strong></td>
-        <td>
-          <p>
-            The regularization function to apply. (Default value: <strong>NoRegularization</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>RegularizationConstant</strong></td>
-        <td>
-          <p>
-            The amount of regularization to apply. (Default value: <strong>0.1</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>LossFunction</strong></td>
-        <td>
-          <p>
-            The loss function to be optimized. (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Iterations</strong></td>
-        <td>
-          <p>
-            The maximum number of iterations. (Default value: <strong>10</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>LearningRate</strong></td>
-        <td>
-          <p>
-            Initial learning rate for the gradient descent method.
-            This value controls how far the gradient descent method moves in the opposite direction
-            of the gradient.
-            (Default value: <strong>0.1</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>ConvergenceThreshold</strong></td>
-        <td>
-          <p>
-            When set, iterations stop if the relative change in the value of the objective function $\eqref{eq:objectiveFunc}$ is less than the provided threshold, $\tau$.
-            The convergence criterion is defined as follows: $\left| \frac{f(\wv)_{i-1} - f(\wv)_i}{f(\wv)_{i-1}}\right| < \tau$.
-            (Default value: <strong>None</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>LearningRateMethod</strong></td>
-        <td>
-          <p>
-            The method used to calculate the effective learning rate in each iteration
-            (see the Effective Learning Rate section below).
-            (Default value: <strong>LearningRateMethod.Default</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>Decay</strong></td>
-        <td>
-          <p>
-            (Default value: <strong>0.0</strong>)
-          </p>
-        </td>
-      </tr>
-    </tbody>
-  </table>
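-
-As a minimal sketch, and assuming the usual `set<ParameterName>` setter convention used by
-FlinkML components, a solver with an explicit stopping criterion could be configured as follows:
-
-{% highlight scala %}
-val sgd = GradientDescent()
-  .setLossFunction(GenericLossFunction(SquaredLoss, LinearPrediction))
-  .setRegularizationPenalty(L2Regularization)
-  .setRegularizationConstant(0.01)
-  .setIterations(200)
-  .setLearningRate(0.1)
-  // stop early once the relative change of the objective falls below the threshold
-  .setConvergenceThreshold(0.001)
-{% endhighlight %}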
-
-### Regularization
-
-FlinkML supports Stochastic Gradient Descent with L1, L2 and no regularization. The regularization type has to implement the `RegularizationPenalty` interface,
-which calculates the new weights based on the gradient and regularization type.
-The following list contains the supported regularization functions.
-
-<table class="table table-bordered">
-  <thead>
-    <tr>
-      <th class="text-left" style="width: 20%">Class Name</th>
-      <th class="text-center">Regularization function $R(\wv)$</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td><strong>NoRegularization</strong></td>
-      <td>$R(\wv) = 0$</td>
-    </tr>
-    <tr>
-      <td><strong>L1Regularization</strong></td>
-      <td>$R(\wv) = \norm{\wv}_1$</td>
-    </tr>
-    <tr>
-      <td><strong>L2Regularization</strong></td>
-      <td>$R(\wv) = \frac{1}{2}\norm{\wv}_2^2$</td>
-    </tr>
-  </tbody>
-</table>
-
-### Loss Function
-
-The loss function which is minimized has to implement the `LossFunction` interface, which defines methods to compute the loss and its gradient.
-One can either define one's own `LossFunction` or use the `GenericLossFunction` class, which constructs the loss function from an outer loss function and a prediction function.
-An example can be seen here:
-
-{% highlight scala %}
-val lossFunction = GenericLossFunction(SquaredLoss, LinearPrediction)
-{% endhighlight %}
-
-The full list of supported outer loss functions can be found [here](#partial-loss-function-values).
-The full list of supported prediction functions can be found [here](#prediction-function-values).
-
-#### Partial Loss Function Values ##
-
-  <table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Function Name</th>
-        <th class="text-center">Description</th>
-        <th class="text-center">Loss</th>
-        <th class="text-center">Loss Derivative</th>
-      </tr>
-    </thead>
-    <tbody>
-      <tr>
-        <td><strong>SquaredLoss</strong></td>
-        <td>
-          <p>
-            Loss function most commonly used for regression tasks.
-          </p>
-        </td>
-        <td class="text-center">$\frac{1}{2} (\wv^T \cdot \x - y)^2$</td>
-        <td class="text-center">$\wv^T \cdot \x - y$</td>
-      </tr>
-      <tr>
-        <td><strong>LogisticLoss</strong></td>
-        <td>
-          <p>
-            Loss function used for classification tasks.
-          </p>
-        </td>
-        <td class="text-center">$\log\left(1+\exp\left( -y ~ \wv^T \cdot \x\right)\right), \quad y \in \{-1, +1\}$</td>
-        <td class="text-center">$\frac{-y}{1+\exp\left(y ~ \wv^T \cdot \x\right)}$</td>
-      </tr>
-      <tr>
-        <td><strong>HingeLoss</strong></td>
-        <td>
-          <p>
-            Loss function used for classification tasks.
-          </p>
-        </td>
-        <td class="text-center">$\max \left(0, 1 - y ~ \wv^T \cdot \x\right), \quad y \in \{-1, +1\}$</td>
-        <td class="text-center">$\begin{cases}
-                                 -y&\text{if } y ~ \wv^T <= 1 \\
-                                 0&\text{if } y ~ \wv^T > 1
-                                 \end{cases}$</td>
-      </tr>
-    </tbody>
-  </table>
-
-#### Prediction Function Values ##
-
-  <table class="table table-bordered">
-      <thead>
-        <tr>
-          <th class="text-left" style="width: 20%">Function Name</th>
-          <th class="text-center">Description</th>
-          <th class="text-center">Prediction</th>
-          <th class="text-center">Prediction Gradient</th>
-        </tr>
-      </thead>
-      <tbody>
-        <tr>
-          <td><strong>LinearPrediction</strong></td>
-          <td>
-            <p>
-              The function most commonly used for linear models, such as linear regression and
-              linear classifiers.
-            </p>
-          </td>
-          <td class="text-center">$\x^T \cdot \wv$</td>
-          <td class="text-center">$\x$</td>
-        </tr>
-      </tbody>
-    </table>
-
-#### Effective Learning Rate ##
-
-The effective learning rate $\eta_j$ used in iteration $j$ is computed according to one of the
-methods listed in the table below, where:
-
-- $j$ is the iteration number
-
-- $\eta_j$ is the step size on step $j$
-
-- $\eta_0$ is the initial step size
-
-- $\lambda$ is the regularization constant
-
-- $\tau$ is the decay constant, which causes the learning rate to be a decreasing function of $j$, that is to say as iterations increase, learning rate decreases. The exact rate of decay is function specific, see **Inverse Scaling** and **Wei Xu's Method** (which is an extension of the **Inverse Scaling** method).
-
-<table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Function Name</th>
-        <th class="text-center">Description</th>
-        <th class="text-center">Function</th>
-        <th class="text-center">Called As</th>
-      </tr>
-    </thead>
-    <tbody>
-      <tr>
-        <td><strong>Default</strong></td>
-        <td>
-          <p>
-            The default method used for determining the step size. This is equivalent to the inverse scaling method for $\tau$ = 0.5. This special case is kept as the default to maintain backwards compatibility.
-          </p>
-        </td>
-        <td class="text-center">$\eta_j = \eta_0/\sqrt{j}$</td>
-        <td class="text-center"><code>LearningRateMethod.Default</code></td>
-      </tr>
-      <tr>
-        <td><strong>Constant</strong></td>
-        <td>
-          <p>
-            The step size is constant throughout the learning task.
-          </p>
-        </td>
-        <td class="text-center">$\eta_j = \eta_0$</td>
-        <td class="text-center"><code>LearningRateMethod.Constant</code></td>
-      </tr>
-      <tr>
-        <td><strong>Leon Bottou's Method</strong></td>
-        <td>
-          <p>
-            This is the <code>'optimal'</code> method of sklearn.
-            The optimal initial value $t_0$ has to be provided.
-            Sklearn uses the following heuristic: $t_0 = \max(1.0, L^\prime(-\beta, 1.0) / (\alpha \cdot \beta))$
-            with $\beta = \sqrt{\frac{1}{\sqrt{\alpha}}}$ and $L^\prime(prediction, truth)$ being the derivative of the loss function.
-          </p>
-        </td>
-        <td class="text-center">$\eta_j = 1 / (\lambda \cdot (t_0 + j -1)) $</td>
-        <td class="text-center"><code>LearningRateMethod.Bottou</code></td>
-      </tr>
-      <tr>
-        <td><strong>Inverse Scaling</strong></td>
-        <td>
-          <p>
-            A very common method for determining the step size.
-          </p>
-        </td>
-        <td class="text-center">$\eta_j = \eta_0 / j^{\tau}$</td>
-        <td class="text-center"><code>LearningRateMethod.InvScaling</code></td>
-      </tr>
-      <tr>
-        <td><strong>Wei Xu's Method</strong></td>
-        <td>
-          <p>
-            Method proposed by Wei Xu in <a href="http://arxiv.org/pdf/1107.2490.pdf">Towards Optimal One Pass Large Scale Learning with
-            Averaged Stochastic Gradient Descent</a>
-          </p>
-        </td>
-        <td class="text-center">$\eta_j = \eta_0 \cdot (1+ \lambda \cdot \eta_0 \cdot j)^{-\tau} $</td>
-        <td class="text-center"><code>LearningRateMethod.Xu</code></td>
-      </tr>
-    </tbody>
-  </table>
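-
-For example, with the **Default** method and an initial learning rate of $\eta_0 = 0.1$, the step
-size in iteration $j = 4$ is $\eta_4 = 0.1 / \sqrt{4} = 0.05$.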
-
-### Examples
-
-In the Flink implementation of SGD, given a set of examples in a `DataSet[LabeledVector]` and
-optionally some initial weights, we can use `GradientDescent.optimize()` in order to optimize
-the weights for the given data.
-
-The user can provide an initial `DataSet[WeightVector]`,
-which contains one `WeightVector` element, or use the default weights which are all set to 0.
-A `WeightVector` is a container class for the weights, which separates the intercept from the
-weight vector. This allows us to avoid applying regularization to the intercept.
-
-
-
-{% highlight scala %}
-// Create stochastic gradient descent solver
-val sgd = GradientDescent()
-  .setLossFunction(GenericLossFunction(SquaredLoss, LinearPrediction))
-  .setRegularizationPenalty(L1Regularization)
-  .setRegularizationConstant(0.2)
-  .setIterations(100)
-  .setLearningRate(0.01)
-  .setLearningRateMethod(LearningRateMethod.Xu(-0.75))
-
-
-// Obtain data
-val trainingDS: DataSet[LabeledVector] = ...
-
-// Optimize the weights, according to the provided data
-val weightDS = sgd.optimize(trainingDS)
-{% endhighlight %}
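-
-The result of `optimize` is a `DataSet[WeightVector]` containing a single element. As a minimal
-sketch (assuming the result is small enough to be collected to the client program), the learned
-weights can be retrieved as follows:
-
-{% highlight scala %}
-// Collect the single WeightVector produced by the solver
-val weightVector: WeightVector = weightDS.collect().head
-
-// WeightVector separates the weights from the intercept
-println(s"weights: ${weightVector.weights}, intercept: ${weightVector.intercept}")
-{% endhighlight %}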
-
-{% top %}
diff --git a/docs/dev/libs/ml/pipelines.md b/docs/dev/libs/ml/pipelines.md
deleted file mode 100644
index 514d557..0000000
--- a/docs/dev/libs/ml/pipelines.md
+++ /dev/null
@@ -1,443 +0,0 @@
----
-mathjax: include
-title: Looking under the hood of pipelines
-nav-title: Pipelines
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Introduction
-
-The ability to chain together different transformers and predictors is an important feature for
-any Machine Learning (ML) library. In FlinkML we wanted to provide an intuitive API,
-and at the same
-time utilize the capabilities of the Scala language to provide
-type-safe implementations of our pipelines. What we hope to achieve then is an easy-to-use API
-that protects users from type errors at pre-flight time (before the job is launched), thereby
-eliminating cases where long-running
-jobs are submitted to the cluster only to see them fail due to some
-error in the series of data transformations that commonly happen in an ML pipeline.
-
-In this guide then we will describe the choices we made during the implementation of chainable
-transformers and predictors in FlinkML, and provide guidelines on how developers can create their
-own algorithms that make use of these capabilities.
-
-## The what and the why
-
-So what do we mean by "ML pipelines"? Pipelines in the ML context can be thought of as chains of
-operations that have some data as input, perform a number of transformations to that data,
-and
-then output the transformed data, either to be used as the input (features) of a predictor
-function, such as a learning model, or just output the transformed data themselves, to be used in
-some other task. The end learner can of course be a part of the pipeline as well.
-ML pipelines can often be complicated sets of operations ([in-depth explanation](http://research.google.com/pubs/pub43146.html)) and
-can become sources of errors for end-to-end learning systems.
-
-The purpose of ML pipelines is then to create a
-framework that can be used to manage the complexity introduced by these chains of operations.
-Pipelines should make it easy for developers to define chained transformations that can be
-applied to the
-training data, in order to create the end features that will be used to train a
-learning model, and then perform the same set of transformations just as easily to unlabeled
-(test) data. Pipelines should also simplify cross-validation and model selection on
-these chains of operations.
-
-Finally, by ensuring that the consecutive links in the pipeline chain "fit together" we also
-avoid costly type errors. Since each step in a pipeline can be a computationally-heavy operation,
-we want to avoid running a pipelined job, unless we are sure that all the input/output pairs in a
-pipeline "fit".
-
-## Pipelines in FlinkML
-
-The building blocks for pipelines in FlinkML can be found in the `ml.pipeline` package.
-FlinkML follows an API inspired by [sklearn](http://scikit-learn.org) which means that we have
-`Estimator`, `Transformer` and `Predictor` interfaces. For an in-depth look at the design of the
-sklearn API the interested reader is referred to [this](http://arxiv.org/abs/1309.0238) paper.
-In short, the `Estimator` is the base class from which `Transformer` and `Predictor` inherit.
-`Estimator` defines a `fit` method; in addition, `Transformer` defines a `transform` method and
-`Predictor` defines a `predict` method.
-
-The `fit` method of the `Estimator` performs the actual training of the model, for example
-finding the correct weights in a linear regression task, or the mean and standard deviation of
-the data in a feature scaler.
-As is evident from the naming, classes that implement
-`Transformer` are transform operations like [scaling the input](standard_scaler.html) and
-`Predictor` implementations are learning algorithms such as [Multiple Linear Regression]({{site.baseurl}}/dev/libs/ml/multiple_linear_regression.html).
-Pipelines can be created by chaining together a number of Transformers, and the final link in a pipeline can be a Predictor or another Transformer.
-Pipelines that end with a Predictor cannot be chained any further.
-Below is an example of how a pipeline can be formed:
-
-{% highlight scala %}
-// Training data
-val input: DataSet[LabeledVector] = ...
-// Test data
-val unlabeled: DataSet[Vector] = ...
-
-val scaler = StandardScaler()
-val polyFeatures = PolynomialFeatures()
-val mlr = MultipleLinearRegression()
-
-// Construct the pipeline
-val pipeline = scaler
-  .chainTransformer(polyFeatures)
-  .chainPredictor(mlr)
-
-// Train the pipeline (scaler and multiple linear regression)
-pipeline.fit(input)
-
-// Calculate predictions for the testing data
-val predictions: DataSet[LabeledVector] = pipeline.predict(unlabeled)
-
-{% endhighlight %}
-
-As we mentioned, FlinkML pipelines are type-safe.
-If we tried to chain a transformer with output of type `A` to another with input of type `B` we
-would get an error at pre-flight time if `A` != `B`. FlinkML achieves this kind of type-safety
-through the use of Scala's implicits.
-
-### Scala implicits
-
-If you are not familiar with Scala's implicits we can recommend [this excerpt](https://www.artima.com/pins1ed/implicit-conversions-and-parameters.html)
-from Martin Odersky's "Programming in Scala". In short, implicit conversions allow for ad-hoc
-polymorphism in Scala by providing conversions from one type to another, and implicit values
-provide the compiler with default values that can be supplied to function calls through implicit parameters.
-The combination of implicit conversions and implicit parameters is what allows us to chain transform
-and predict operations together in a type-safe manner.
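-
-As a toy illustration of the mechanism, unrelated to the FlinkML classes discussed below, an
-implicit value defined in the companion object of a type is picked up automatically for a matching
-implicit parameter:
-
-{% highlight scala %}
-trait Printer[T] {
-  def format(value: T): String
-}
-
-object Printer {
-  // Found automatically because it lives in the companion object of Printer
-  implicit val intPrinter: Printer[Int] = new Printer[Int] {
-    override def format(value: Int): String = s"Int($value)"
-  }
-}
-
-// The compiler supplies the implicit Printer[T] for us
-def describe[T](value: T)(implicit printer: Printer[T]): String = printer.format(value)
-
-describe(42)       // compiles and returns "Int(42)"
-// describe("foo") // would not compile: no implicit Printer[String] is in scope
-{% endhighlight %}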
-
-### Operations
-
-As we mentioned, the trait (abstract class) `Estimator` defines a `fit` method. The method has two
-parameter lists
-(i.e. is a [curried function](http://docs.scala-lang.org/tutorials/tour/currying.html)). The
-first parameter list
-takes the input (training) `DataSet` and the parameters for the estimator. The second parameter
-list takes one `implicit` parameter, of type `FitOperation`. `FitOperation` is a class that also
-defines a `fit` method, and this is where the actual logic of training the concrete Estimators
-should be implemented. The `fit` method of `Estimator` is essentially a wrapper around the `fit`
-method of `FitOperation`. The `predict` method of `Predictor` and the `transform` method of
-`Transformer` are designed in a similar manner, each with a respective operation class.
-
-In these methods the operation object is provided as an implicit parameter.
-Scala will [look for implicits](http://docs.scala-lang.org/tutorials/FAQ/finding-implicits.html)
-in the companion object of a type, so classes that implement these interfaces should provide these
-objects as implicit objects inside the companion object.
-
-As an example we can look at the `StandardScaler` class. `StandardScaler` extends `Transformer`, so it has access to its `fit` and `transform` functions.
-These two functions expect objects of `FitOperation` and `TransformOperation` as implicit parameters,
-for the `fit` and `transform` methods respectively, which `StandardScaler` provides in its companion
-object, through `transformVectors` and `fitVectorStandardScaler`:
-
-{% highlight scala %}
-class StandardScaler extends Transformer[StandardScaler] {
-  ...
-}
-
-object StandardScaler {
-
-  ...
-
-  implicit def fitVectorStandardScaler[T <: Vector] = new FitOperation[StandardScaler, T] {
-    override def fit(instance: StandardScaler, fitParameters: ParameterMap, input: DataSet[T])
-      : Unit = {
-        ...
-      }
-  }
-
-  implicit def transformVectors[T <: Vector: VectorConverter: TypeInformation: ClassTag] = {
-    new TransformOperation[StandardScaler, T, T] {
-      override def transform(
-          instance: StandardScaler,
-          transformParameters: ParameterMap,
-          input: DataSet[T])
-        : DataSet[T] = {
-        ...
-      }
-    }
-  }
-}
-{% endhighlight %}
-
-Note that `StandardScaler` does **not** override the `fit` method of `Estimator` or the `transform`
-method of `Transformer`. Rather, its implementations of `FitOperation` and `TransformOperation`
-override their respective `fit` and `transform` methods, which are then called by the `fit` and
-`transform` methods of `Estimator` and `Transformer`.  Similarly, a class that implements
-`Predictor` should define an implicit `PredictOperation` object inside its companion object.
-
-#### Types and type safety
-
-Apart from the `fit` and `transform` operations that we listed above, the `StandardScaler` also
-provides `fit` and `transform` operations for input of type `LabeledVector`.
-This allows us to use the  algorithm for input that is labeled or unlabeled, and this happens
-automatically, depending on  the type of the input that we give to the fit and transform
-operations. The correct implicit operation is chosen by the compiler, depending on the input type.
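-
-As a small sketch of this behavior (the `...` stand in for actual data sources):
-
-{% highlight scala %}
-val labeledData: DataSet[LabeledVector] = ...
-val vectorData: DataSet[DenseVector] = ...
-
-val scaler = StandardScaler()
-
-// selects the FitOperation for LabeledVector input
-scaler.fit(labeledData)
-
-// selects the TransformOperation for LabeledVector input
-val scaledLabeled = scaler.transform(labeledData)
-
-// selects the vector-based TransformOperation for DenseVector input
-val scaledVectors = scaler.transform(vectorData)
-{% endhighlight %}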
-
-If we try to call the `fit` or `transform` methods with types that are not supported we will get a
-runtime error before the job is launched.
-While it would be possible to catch these kinds of errors at compile time as well, the error
-messages that we are able to provide the user would be much less informative, which is why we chose
-to throw runtime exceptions instead.
-
-### Chaining
-
-Chaining is achieved by calling `chainTransformer` or `chainPredictor` on an object
-of a class that implements `Transformer`. These methods return a `ChainedTransformer` or
-`ChainedPredictor` object respectively. As we mentioned, `ChainedTransformer` objects can be
-chained further, while `ChainedPredictor` objects cannot. These classes take care of applying
-fit, transform, and predict operations for a pair of successive transformers or
-a transformer and a predictor. They also act recursively if the length of the
-chain is larger than two, since every `ChainedTransformer` defines a `transform` and `fit`
-operation that can be further chained with more transformers or a predictor.
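-
-For example, chaining two transformers and then a predictor yields the following (sketched) chain:
-
-{% highlight scala %}
-val scaler = StandardScaler()
-val polyFeatures = PolynomialFeatures()
-val mlr = MultipleLinearRegression()
-
-// a ChainedTransformer, which can be chained further
-val chainedTransformers = scaler.chainTransformer(polyFeatures)
-
-// a ChainedPredictor, which cannot be chained any further
-val pipeline = chainedTransformers.chainPredictor(mlr)
-{% endhighlight %}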
-
-It is important to note that developers and users do not need to worry about chaining when
-implementing their algorithms; all of this is handled automatically by FlinkML.
-
-### How to Implement a Pipeline Operator
-
-In order to support FlinkML's pipelining, algorithms have to adhere to a certain design pattern, which we will describe in this section.
-Let's assume that we want to implement a pipeline operator which changes the mean of your data.
-Since centering data is a common pre-processing step in many analysis pipelines, we will implement it as a `Transformer`.
-Therefore, we first create a `MeanTransformer` class which inherits from `Transformer`:
-
-{% highlight scala %}
-class MeanTransformer extends Transformer[MeanTransformer] {}
-{% endhighlight %}
-
-Since we want to be able to configure the mean of the resulting data, we have to add a configuration parameter.
-
-{% highlight scala %}
-class MeanTransformer extends Transformer[MeanTransformer] {
-  def setMean(mean: Double): this.type = {
-    parameters.add(MeanTransformer.Mean, mean)
-    this
-  }
-}
-
-object MeanTransformer {
-  case object Mean extends Parameter[Double] {
-    override val defaultValue: Option[Double] = Some(0.0)
-  }
-
-  def apply(): MeanTransformer = new MeanTransformer
-}
-{% endhighlight %}
-
-Parameters are defined in the companion object of the transformer class and extend the `Parameter` class.
-Since the parameter instances are supposed to act as immutable keys for a parameter map, they should be implemented as `case objects`.
-The default value will be used if no other value has been set by the user of this component.
-If no default value has been specified, meaning that `defaultValue = None`, then the algorithm has to handle this situation accordingly.
-
-We can now instantiate a `MeanTransformer` object and set the mean value of the transformed data.
-But we still have to implement how the transformation works.
-The workflow can be separated into two phases.
-Within the first phase, the transformer learns the mean of the given training data.
-This knowledge can then be used in the second phase to transform the provided data with respect to the configured resulting mean value.
-
-The learning of the mean can be implemented within the `fit` operation of our `Transformer`, which it inherited from `Estimator`.
-Within the `fit` operation, a pipeline component is trained with respect to the given training data.
-The algorithm is, however, **not** implemented by overriding the `fit` method but by providing an implementation of a corresponding `FitOperation` for the correct type.
-Taking a look at the definition of the `fit` method in `Estimator`, which is the parent class of `Transformer`, reveals why this is the case.
-
-{% highlight scala %}
-trait Estimator[Self] extends WithParameters with Serializable {
-  that: Self =>
-
-  def fit[Training](
-      training: DataSet[Training],
-      fitParameters: ParameterMap = ParameterMap.Empty)
-      (implicit fitOperation: FitOperation[Self, Training]): Unit = {
-    FlinkMLTools.registerFlinkMLTypes(training.getExecutionEnvironment)
-    fitOperation.fit(this, fitParameters, training)
-  }
-}
-{% endhighlight %}
-
-We see that the `fit` method is called with an input data set of type `Training`, an optional parameter map and, in the second parameter list, an implicit parameter of type `FitOperation`.
-Within the body of the function, first some machine learning types are registered and then the `fit` method of the `FitOperation` parameter is called.
-The instance passes itself, the parameter map and the training data set as parameters to the method.
-Thus, all the program logic takes place within the `FitOperation`.
-
-The `FitOperation` has two type parameters.
-The first defines the pipeline operator type for which this `FitOperation` shall work and the second type parameter defines the type of the data set elements.
-If we first wanted to implement the `MeanTransformer` to work on `DenseVector`, we would, thus, have to provide an implementation for `FitOperation[MeanTransformer, DenseVector]`.
-
-{% highlight scala %}
-val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] {
-  override def fit(instance: MeanTransformer, fitParameters: ParameterMap, input: DataSet[DenseVector]) : Unit = {
-    import org.apache.flink.ml.math.Breeze._
-    val meanTrainingData: DataSet[DenseVector] = input
-      .map{ x => (x.asBreeze, 1) }
-      .reduce{
-        (left, right) =>
-          (left._1 + right._1, left._2 + right._2)
-      }
-      .map{ p => (p._1/p._2).fromBreeze }
-  }
-}
-{% endhighlight %}
-
-A `FitOperation[T, I]` has a `fit` method which is called with an instance of type `T`, a parameter map and an input `DataSet[I]`.
-In our case `T=MeanTransformer` and `I=DenseVector`.
-The parameter map is necessary if our fit step depends on some parameter values which were not given directly at creation time of the `Transformer`.
-The `FitOperation` of the `MeanTransformer` sums up the `DenseVector` instances of the given input data set and divides the result by the total number of vectors.
-That way, we obtain a `DataSet[DenseVector]` with a single element which is the mean value.
-
-But if we look closely at the implementation, we see that the result of the mean computation is never stored anywhere.
-If we want to use this knowledge in a later step to adjust the mean of some other input, we have to keep it around.
-And here is where the parameter of type `MeanTransformer` which is given to the `fit` method comes into play.
-We can use this instance to store state, which is used by a subsequent `transform` operation which works on the same object.
-But first we have to extend `MeanTransformer` by a member field and then adjust the `FitOperation` implementation.
-
-{% highlight scala %}
-class MeanTransformer extends Transformer[MeanTransformer] {
-  var meanOption: Option[DataSet[DenseVector]] = None
-
-  def setMean(mean: Double): this.type = {
-    parameters.add(MeanTransformer.Mean, mean)
-    this
-  }
-}
-
-val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] {
-  override def fit(instance: MeanTransformer, fitParameters: ParameterMap, input: DataSet[DenseVector]) : Unit = {
-    import org.apache.flink.ml.math.Breeze._
-
-    instance.meanOption = Some(input
-      .map{ x => (x.asBreeze, 1) }
-      .reduce{
-        (left, right) =>
-          (left._1 + right._1, left._2 + right._2)
-      }
-      .map{ p => (p._1/p._2).fromBreeze })
-  }
-}
-{% endhighlight %}
-
-If we look at the `transform` method in `Transformer`, we will see that we also need an implementation of `TransformOperation`.
-A possible mean transforming implementation could look like the following.
-
-{% highlight scala %}
-
-val denseVectorMeanTransformOperation = new TransformOperation[MeanTransformer, DenseVector, DenseVector] {
-  override def transform(
-      instance: MeanTransformer,
-      transformParameters: ParameterMap,
-      input: DataSet[DenseVector])
-    : DataSet[DenseVector] = {
-    val resultingParameters = instance.parameters ++ transformParameters
-
-    val resultingMean = resultingParameters(MeanTransformer.Mean)
-
-    instance.meanOption match {
-      case Some(trainingMean) => {
-        input.map{ new MeanTransformMapper(resultingMean) }.withBroadcastSet(trainingMean, "trainingMean")
-      }
-      case None => throw new RuntimeException("MeanTransformer has not been fitted to data.")
-    }
-  }
-}
-
-class MeanTransformMapper(resultingMean: Double) extends RichMapFunction[DenseVector, DenseVector] {
-  var trainingMean: DenseVector = null
-
-  override def open(parameters: Configuration): Unit = {
-    trainingMean = getRuntimeContext().getBroadcastVariable[DenseVector]("trainingMean").get(0)
-  }
-
-  override def map(vector: DenseVector): DenseVector = {
-    import org.apache.flink.ml.math.Breeze._
-
-    val result = vector.asBreeze - trainingMean.asBreeze + resultingMean
-
-    result.fromBreeze
-  }
-}
-{% endhighlight %}
-
-Now we have everything implemented to fit our `MeanTransformer` to a training data set of `DenseVector` instances and to transform them.
-However, when we execute the `fit` operation
-
-{% highlight scala %}
-val trainingData: DataSet[DenseVector] = ...
-val meanTransformer = MeanTransformer()
-
-meanTransformer.fit(trainingData)
-{% endhighlight %}
-
-we receive the following error at runtime: `"There is no FitOperation defined for class MeanTransformer which trains on a DataSet[org.apache.flink.ml.math.DenseVector]"`.
-The reason is that the Scala compiler could not find a fitting `FitOperation` value with the right type parameters for the implicit parameter of the `fit` method.
-Therefore, it chose a fallback implicit value which gives you this error message at runtime.
-In order to make the compiler aware of our implementation, we have to define it as an implicit value and put it in the scope of the `MeanTransformer`'s companion object.
-
-{% highlight scala %}
-object MeanTransformer{
-  implicit val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] ...
-
-  implicit val denseVectorMeanTransformOperation = new TransformOperation[MeanTransformer, DenseVector, DenseVector] ...
-}
-{% endhighlight %}
-
-Now we can call `fit` and `transform` of our `MeanTransformer` with `DataSet[DenseVector]` as input.
-Furthermore, we can now use this transformer as part of an analysis pipeline where we have a `DenseVector` as input and expected output.
-
-{% highlight scala %}
-val trainingData: DataSet[DenseVector] = ...
-
-val mean = MeanTransformer().setMean(1.0)
-val polyFeatures = PolynomialFeatures().setDegree(3)
-
-val pipeline = mean.chainTransformer(polyFeatures)
-
-pipeline.fit(trainingData)
-{% endhighlight %}
-
-It is noteworthy that there is no additional code needed to enable chaining.
-The system automatically constructs the pipeline logic using the operations of the individual components.
-
-So far everything works fine with `DenseVector`.
-But what happens if we call our transformer with `LabeledVector` instead?
-{% highlight scala %}
-val trainingData: DataSet[LabeledVector] = ...
-
-val mean = MeanTransformer()
-
-mean.fit(trainingData)
-{% endhighlight %}
-
-As before we see the following exception upon execution of the program: `"There is no FitOperation defined for class MeanTransformer which trains on a DataSet[org.apache.flink.ml.common.LabeledVector]"`.
-It is noteworthy that this exception is thrown in the pre-flight phase, which means that the job has not been submitted to the runtime system.
-This has the advantage that you won't see a job which runs for a couple of days and then fails because of an incompatible pipeline component.
-Type compatibility is, thus, checked at the very beginning for the complete job.
-
-In order to make the `MeanTransformer` work on `LabeledVector` as well, we have to provide the corresponding operations.
-Consequently, we have to define a `FitOperation[MeanTransformer, LabeledVector]` and `TransformOperation[MeanTransformer, LabeledVector, LabeledVector]` as implicit values in the scope of `MeanTransformer`'s companion object.
-
-{% highlight scala %}
-object MeanTransformer {
-  implicit val labeledVectorFitOperation = new FitOperation[MeanTransformer, LabeledVector] ...
-
-  implicit val labeledVectorTransformOperation = new TransformOperation[MeanTransformer, LabeledVector, LabeledVector] ...
-}
-{% endhighlight %}
-
-If we wanted to implement a `Predictor` instead of a `Transformer`, then we would have to provide a `FitOperation`, too.
-Moreover, a `Predictor` requires a `PredictOperation` which implements how predictions are calculated from testing data.
-
-{% top %}
diff --git a/docs/dev/libs/ml/pipelines.zh.md b/docs/dev/libs/ml/pipelines.zh.md
deleted file mode 100644
index 514d557..0000000
--- a/docs/dev/libs/ml/pipelines.zh.md
+++ /dev/null
@@ -1,443 +0,0 @@
----
-mathjax: include
-title: Looking under the hood of pipelines
-nav-title: Pipelines
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Introduction
-
-The ability to chain together different transformers and predictors is an important feature for
-any Machine Learning (ML) library. In FlinkML we wanted to provide an intuitive API,
-and at the same
-time utilize the capabilities of the Scala language to provide
-type-safe implementations of our pipelines. What we hope to achieve then is an easy-to-use API
-that protects users from type errors at pre-flight time (before the job is launched), thereby
-eliminating cases where long-running
-jobs are submitted to the cluster only to see them fail due to some
-error in the series of data transformations that commonly happen in an ML pipeline.
-
-In this guide then we will describe the choices we made during the implementation of chainable
-transformers and predictors in FlinkML, and provide guidelines on how developers can create their
-own algorithms that make use of these capabilities.
-
-## The what and the why
-
-So what do we mean by "ML pipelines"? Pipelines in the ML context can be thought of as chains of
-operations that have some data as input, perform a number of transformations to that data,
-and
-then output the transformed data, either to be used as the input (features) of a predictor
-function, such as a learning model, or just output the transformed data themselves, to be used in
-some other task. The end learner can of course be a part of the pipeline as well.
-ML pipelines can often be complicated sets of operations ([in-depth explanation](http://research.google.com/pubs/pub43146.html)) and
-can become sources of errors for end-to-end learning systems.
-
-The purpose of ML pipelines is then to create a
-framework that can be used to manage the complexity introduced by these chains of operations.
-Pipelines should make it easy for developers to define chained transformations that can be
-applied to the
-training data, in order to create the end features that will be used to train a
-learning model, and then perform the same set of transformations just as easily to unlabeled
-(test) data. Pipelines should also simplify cross-validation and model selection on
-these chains of operations.
-
-Finally, by ensuring that the consecutive links in the pipeline chain "fit together" we also
-avoid costly type errors. Since each step in a pipeline can be a computationally-heavy operation,
-we want to avoid running a pipelined job, unless we are sure that all the input/output pairs in a
-pipeline "fit".
-
-## Pipelines in FlinkML
-
-The building blocks for pipelines in FlinkML can be found in the `ml.pipeline` package.
-FlinkML follows an API inspired by [sklearn](http://scikit-learn.org) which means that we have
-`Estimator`, `Transformer` and `Predictor` interfaces. For an in-depth look at the design of the
-sklearn API the interested reader is referred to [this](http://arxiv.org/abs/1309.0238) paper.
-In short, the `Estimator` is the base class from which `Transformer` and `Predictor` inherit.
-`Estimator` defines a `fit` method; in addition, `Transformer` defines a `transform` method and
-`Predictor` defines a `predict` method.
-
-The `fit` method of the `Estimator` performs the actual training of the model, for example
-finding the correct weights in a linear regression task, or the mean and standard deviation of
-the data in a feature scaler.
-As is evident from the naming, classes that implement
-`Transformer` are transform operations like [scaling the input](standard_scaler.html) and
-`Predictor` implementations are learning algorithms such as [Multiple Linear Regression]({{site.baseurl}}/dev/libs/ml/multiple_linear_regression.html).
-Pipelines can be created by chaining together a number of Transformers, and the final link in a pipeline can be a Predictor or another Transformer.
-Pipelines that end with a Predictor cannot be chained any further.
-Below is an example of how a pipeline can be formed:
-
-{% highlight scala %}
-// Training data
-val input: DataSet[LabeledVector] = ...
-// Test data
-val unlabeled: DataSet[Vector] = ...
-
-val scaler = StandardScaler()
-val polyFeatures = PolynomialFeatures()
-val mlr = MultipleLinearRegression()
-
-// Construct the pipeline
-val pipeline = scaler
-  .chainTransformer(polyFeatures)
-  .chainPredictor(mlr)
-
-// Train the pipeline (scaler and multiple linear regression)
-pipeline.fit(input)
-
-// Calculate predictions for the testing data
-val predictions: DataSet[LabeledVector] = pipeline.predict(unlabeled)
-
-{% endhighlight %}
-
-As we mentioned, FlinkML pipelines are type-safe.
-If we tried to chain a transformer with output of type `A` to another with input of type `B` we
-would get an error at pre-flight time if `A` != `B`. FlinkML achieves this kind of type-safety
-through the use of Scala's implicits.
-
-### Scala implicits
-
-If you are not familiar with Scala's implicits we can recommend [this excerpt](https://www.artima.com/pins1ed/implicit-conversions-and-parameters.html)
-from Martin Odersky's "Programming in Scala". In short, implicit conversions allow for ad-hoc
-polymorphism in Scala by providing conversions from one type to another, and implicit values
-provide the compiler with default values that can be supplied to function calls through implicit parameters.
-The combination of implicit conversions and implicit parameters is what allows us to chain transform
-and predict operations together in a type-safe manner.
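-
-As a toy illustration of the mechanism, unrelated to the FlinkML classes discussed below, an
-implicit value defined in the companion object of a type is picked up automatically for a matching
-implicit parameter:
-
-{% highlight scala %}
-trait Printer[T] {
-  def format(value: T): String
-}
-
-object Printer {
-  // Found automatically because it lives in the companion object of Printer
-  implicit val intPrinter: Printer[Int] = new Printer[Int] {
-    override def format(value: Int): String = s"Int($value)"
-  }
-}
-
-// The compiler supplies the implicit Printer[T] for us
-def describe[T](value: T)(implicit printer: Printer[T]): String = printer.format(value)
-
-describe(42)       // compiles and returns "Int(42)"
-// describe("foo") // would not compile: no implicit Printer[String] is in scope
-{% endhighlight %}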
-
-### Operations
-
-As we mentioned, the trait (abstract class) `Estimator` defines a `fit` method. The method has two
-parameter lists
-(i.e. is a [curried function](http://docs.scala-lang.org/tutorials/tour/currying.html)). The
-first parameter list
-takes the input (training) `DataSet` and the parameters for the estimator. The second parameter
-list takes one `implicit` parameter, of type `FitOperation`. `FitOperation` is a class that also
-defines a `fit` method, and this is where the actual logic of training the concrete Estimators
-should be implemented. The `fit` method of `Estimator` is essentially a wrapper around the `fit`
-method of `FitOperation`. The `predict` method of `Predictor` and the `transform` method of
-`Transformer` are designed in a similar manner, each with a respective operation class.
-
-In these methods the operation object is provided as an implicit parameter.
-Scala will [look for implicits](http://docs.scala-lang.org/tutorials/FAQ/finding-implicits.html)
-in the companion object of a type, so classes that implement these interfaces should provide these
-objects as implicit objects inside the companion object.
-
-As an example we can look at the `StandardScaler` class. `StandardScaler` extends `Transformer`, so it has access to its `fit` and `transform` functions.
-These two functions expect objects of `FitOperation` and `TransformOperation` as implicit parameters,
-for the `fit` and `transform` methods respectively, which `StandardScaler` provides in its companion
-object, through `transformVectors` and `fitVectorStandardScaler`:
-
-{% highlight scala %}
-class StandardScaler extends Transformer[StandardScaler] {
-  ...
-}
-
-object StandardScaler {
-
-  ...
-
-  implicit def fitVectorStandardScaler[T <: Vector] = new FitOperation[StandardScaler, T] {
-    override def fit(instance: StandardScaler, fitParameters: ParameterMap, input: DataSet[T])
-      : Unit = {
-        ...
-      }
-  }
-
-  implicit def transformVectors[T <: Vector: VectorConverter: TypeInformation: ClassTag] = {
-    new TransformOperation[StandardScaler, T, T] {
-      override def transform(
-          instance: StandardScaler,
-          transformParameters: ParameterMap,
-          input: DataSet[T])
-        : DataSet[T] = {
-        ...
-      }
-    }
-  }
-}
-{% endhighlight %}
-
-Note that `StandardScaler` does **not** override the `fit` method of `Estimator` or the `transform`
-method of `Transformer`. Rather, its implementations of `FitOperation` and `TransformOperation`
-override their respective `fit` and `transform` methods, which are then called by the `fit` and
-`transform` methods of `Estimator` and `Transformer`.  Similarly, a class that implements
-`Predictor` should define an implicit `PredictOperation` object inside its companion object.
-
-#### Types and type safety
-
-Apart from the `fit` and `transform` operations that we listed above, the `StandardScaler` also
-provides `fit` and `transform` operations for input of type `LabeledVector`.
-This allows us to use the  algorithm for input that is labeled or unlabeled, and this happens
-automatically, depending on  the type of the input that we give to the fit and transform
-operations. The correct implicit operation is chosen by the compiler, depending on the input type.
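-
-As a small sketch of this behavior (the `...` stand in for actual data sources):
-
-{% highlight scala %}
-val labeledData: DataSet[LabeledVector] = ...
-val vectorData: DataSet[DenseVector] = ...
-
-val scaler = StandardScaler()
-
-// selects the FitOperation for LabeledVector input
-scaler.fit(labeledData)
-
-// selects the TransformOperation for LabeledVector input
-val scaledLabeled = scaler.transform(labeledData)
-
-// selects the vector-based TransformOperation for DenseVector input
-val scaledVectors = scaler.transform(vectorData)
-{% endhighlight %}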
-
-If we try to call the `fit` or `transform` methods with types that are not supported we will get a
-runtime error before the job is launched.
-While it would be possible to catch these kinds of errors at compile time as well, the error
-messages that we are able to provide the user would be much less informative, which is why we chose
-to throw runtime exceptions instead.
-
-### Chaining
-
-Chaining is achieved by calling `chainTransformer` or `chainPredictor` on an object
-of a class that implements `Transformer`. These methods return a `ChainedTransformer` or
-`ChainedPredictor` object respectively. As we mentioned, `ChainedTransformer` objects can be
-chained further, while `ChainedPredictor` objects cannot. These classes take care of applying
-fit, transform, and predict operations for a pair of successive transformers or
-a transformer and a predictor. They also act recursively if the length of the
-chain is larger than two, since every `ChainedTransformer` defines a `transform` and `fit`
-operation that can be further chained with more transformers or a predictor.
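-
-For example, chaining two transformers and then a predictor yields the following (sketched) chain:
-
-{% highlight scala %}
-val scaler = StandardScaler()
-val polyFeatures = PolynomialFeatures()
-val mlr = MultipleLinearRegression()
-
-// a ChainedTransformer, which can be chained further
-val chainedTransformers = scaler.chainTransformer(polyFeatures)
-
-// a ChainedPredictor, which cannot be chained any further
-val pipeline = chainedTransformers.chainPredictor(mlr)
-{% endhighlight %}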
-
-It is important to note that developers and users do not need to worry about chaining when
-implementing their algorithms; all of this is handled automatically by FlinkML.
-
-### How to Implement a Pipeline Operator
-
-In order to support FlinkML's pipelining, algorithms have to adhere to a certain design pattern, which we will describe in this section.
-Let's assume that we want to implement a pipeline operator which changes the mean of your data.
-Since centering data is a common pre-processing step in many analysis pipelines, we will implement it as a `Transformer`.
-Therefore, we first create a `MeanTransformer` class which inherits from `Transformer`:
-
-{% highlight scala %}
-class MeanTransformer extends Transformer[MeanTransformer] {}
-{% endhighlight %}
-
-Since we want to be able to configure the mean of the resulting data, we have to add a configuration parameter.
-
-{% highlight scala %}
-class MeanTransformer extends Transformer[MeanTransformer] {
-  def setMean(mean: Double): this.type = {
-    parameters.add(MeanTransformer.Mean, mean)
-    this
-  }
-}
-
-object MeanTransformer {
-  case object Mean extends Parameter[Double] {
-    override val defaultValue: Option[Double] = Some(0.0)
-  }
-
-  def apply(): MeanTransformer = new MeanTransformer
-}
-{% endhighlight %}
-
-Parameters are defined in the companion object of the transformer class and extend the `Parameter` class.
-Since the parameter instances are supposed to act as immutable keys for a parameter map, they should be implemented as `case objects`.
-The default value will be used if no other value has been set by the user of this component.
-If no default value has been specified, meaning that `defaultValue = None`, then the algorithm has to handle this situation accordingly.
-
-We can now instantiate a `MeanTransformer` object and set the mean value of the transformed data.
-But we still have to implement how the transformation works.
-The workflow can be separated into two phases.
-Within the first phase, the transformer learns the mean of the given training data.
-This knowledge can then be used in the second phase to transform the provided data with respect to the configured resulting mean value.
-
-The learning of the mean can be implemented within the `fit` operation of our `Transformer`, which it inherited from `Estimator`.
-Within the `fit` operation, a pipeline component is trained with respect to the given training data.
-The algorithm is, however, **not** implemented by overriding the `fit` method but by providing an implementation of a corresponding `FitOperation` for the correct type.
-Taking a look at the definition of the `fit` method in `Estimator`, which is the parent class of `Transformer`, reveals why this is the case.
-
-{% highlight scala %}
-trait Estimator[Self] extends WithParameters with Serializable {
-  that: Self =>
-
-  def fit[Training](
-      training: DataSet[Training],
-      fitParameters: ParameterMap = ParameterMap.Empty)
-      (implicit fitOperation: FitOperation[Self, Training]): Unit = {
-    FlinkMLTools.registerFlinkMLTypes(training.getExecutionEnvironment)
-    fitOperation.fit(this, fitParameters, training)
-  }
-}
-{% endhighlight %}
-
-We see that the `fit` method is called with an input data set of type `Training`, an optional parameter map and, in the second parameter list, an implicit parameter of type `FitOperation`.
-Within the body of the function, first some machine learning types are registered and then the `fit` method of the `FitOperation` parameter is called.
-The instance passes itself, the parameter map and the training data set as parameters to the method.
-Thus, all the program logic takes place within the `FitOperation`.
-
-The `FitOperation` has two type parameters.
-The first defines the pipeline operator type for which this `FitOperation` shall work and the second type parameter defines the type of the data set elements.
-If we first wanted to implement the `MeanTransformer` to work on `DenseVector`, we would, thus, have to provide an implementation for `FitOperation[MeanTransformer, DenseVector]`.
-
-{% highlight scala %}
-val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] {
-  override def fit(instance: MeanTransformer, fitParameters: ParameterMap, input: DataSet[DenseVector]) : Unit = {
-    import org.apache.flink.ml.math.Breeze._
-    val meanTrainingData: DataSet[DenseVector] = input
-      .map{ x => (x.asBreeze, 1) }
-      .reduce{
-        (left, right) =>
-          (left._1 + right._1, left._2 + right._2)
-      }
-      .map{ p => (p._1/p._2).fromBreeze }
-  }
-}
-{% endhighlight %}
-
-A `FitOperation[T, I]` has a `fit` method which is called with an instance of type `T`, a parameter map and an input `DataSet[I]`.
-In our case `T=MeanTransformer` and `I=DenseVector`.
-The parameter map is necessary if our fit step depends on some parameter values which were not given directly at creation time of the `Transformer`.
-The `FitOperation` of the `MeanTransformer` sums up the `DenseVector` instances of the given input data set and divides the result by the total number of vectors.
-That way, we obtain a `DataSet[DenseVector]` with a single element which is the mean value.
-
-But if we look closely at the implementation, we see that the result of the mean computation is never stored anywhere.
-If we want to use this knowledge in a later step to adjust the mean of some other input, we have to keep it around.
-And here is where the parameter of type `MeanTransformer` which is given to the `fit` method comes into play.
-We can use this instance to store state, which is used by a subsequent `transform` operation which works on the same object.
-But first we have to extend `MeanTransformer` by a member field and then adjust the `FitOperation` implementation.
-
-{% highlight scala %}
-class MeanTransformer extends Transformer[MeanTransformer] {
-  var meanOption: Option[DataSet[DenseVector]] = None
-
-  def setMean(mean: Double): this.type = {
-    parameters.add(MeanTransformer.Mean, mean)
-    this
-  }
-}
-
-val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] {
-  override def fit(instance: MeanTransformer, fitParameters: ParameterMap, input: DataSet[DenseVector]) : Unit = {
-    import org.apache.flink.ml.math.Breeze._
-
-    instance.meanOption = Some(input
-      .map{ x => (x.asBreeze, 1) }
-      .reduce{
-        (left, right) =>
-          (left._1 + right._1, left._2 + right._2)
-      }
-      .map{ p => (p._1/p._2).fromBreeze })
-  }
-}
-{% endhighlight %}
-
-If we look at the `transform` method in `Transformer`, we will see that we also need an implementation of `TransformOperation`.
-A possible mean transforming implementation could look like the following.
-
-{% highlight scala %}
-
-val denseVectorMeanTransformOperation = new TransformOperation[MeanTransformer, DenseVector, DenseVector] {
-  override def transform(
-      instance: MeanTransformer,
-      transformParameters: ParameterMap,
-      input: DataSet[DenseVector])
-    : DataSet[DenseVector] = {
-    val resultingParameters = instance.parameters ++ transformParameters
-
-    val resultingMean = resultingParameters(MeanTransformer.Mean)
-
-    instance.meanOption match {
-      case Some(trainingMean) => {
-        input.map{ new MeanTransformMapper(resultingMean) }.withBroadcastSet(trainingMean, "trainingMean")
-      }
-      case None => throw new RuntimeException("MeanTransformer has not been fitted to data.")
-    }
-  }
-}
-
-class MeanTransformMapper(resultingMean: Double) extends RichMapFunction[DenseVector, DenseVector] {
-  var trainingMean: DenseVector = null
-
-  override def open(parameters: Configuration): Unit = {
-    trainingMean = getRuntimeContext().getBroadcastVariable[DenseVector]("trainingMean").get(0)
-  }
-
-  override def map(vector: DenseVector): DenseVector = {
-    import org.apache.flink.ml.math.Breeze._
-
-    val result = vector.asBreeze - trainingMean.asBreeze + resultingMean
-
-    result.fromBreeze
-  }
-}
-{% endhighlight %}
-
-Now we have everything implemented to fit our `MeanTransformer` to a training data set of `DenseVector` instances and to transform them.
-However, when we execute the `fit` operation
-
-{% highlight scala %}
-val trainingData: DataSet[DenseVector] = ...
-val meanTransformer = MeanTransformer()
-
-meanTransformer.fit(trainingData)
-{% endhighlight %}
-
-we receive the following error at runtime: `"There is no FitOperation defined for class MeanTransformer which trains on a DataSet[org.apache.flink.ml.math.DenseVector]"`.
-The reason is that the Scala compiler could not find a fitting `FitOperation` value with the right type parameters for the implicit parameter of the `fit` method.
-Therefore, it chose a fallback implicit value which gives you this error message at runtime.
-In order to make the compiler aware of our implementation, we have to define it as an implicit value and put it in the scope of the `MeanTransformer`'s companion object.
-
-{% highlight scala %}
-object MeanTransformer{
-  implicit val denseVectorMeanFitOperation = new FitOperation[MeanTransformer, DenseVector] ...
-
-  implicit val denseVectorMeanTransformOperation = new TransformOperation[MeanTransformer, DenseVector, DenseVector] ...
-}
-{% endhighlight %}
-
-Now we can call `fit` and `transform` of our `MeanTransformer` with `DataSet[DenseVector]` as input.
-Furthermore, we can now use this transformer as part of an analysis pipeline where we have a `DenseVector` as input and expected output.
-
-{% highlight scala %}
-val trainingData: DataSet[DenseVector] = ...
-
-val mean = MeanTransformer().setMean(1.0)
-val polyFeatures = PolynomialFeatures().setDegree(3)
-
-val pipeline = mean.chainTransformer(polyFeatures)
-
-pipeline.fit(trainingData)
-{% endhighlight %}
-
-It is noteworthy that there is no additional code needed to enable chaining.
-The system automatically constructs the pipeline logic using the operations of the individual components.
-
-So far everything works fine with `DenseVector`.
-But what happens if we call our transformer with `LabeledVector` instead?
-{% highlight scala %}
-val trainingData: DataSet[LabeledVector] = ...
-
-val mean = MeanTransformer()
-
-mean.fit(trainingData)
-{% endhighlight %}
-
-As before, we see the following exception upon execution of the program: `"There is no FitOperation defined for class MeanTransformer which trains on a DataSet[org.apache.flink.ml.common.LabeledVector]"`.
-It is noteworthy that this exception is thrown during the pre-flight phase, which means that the job has not yet been submitted to the runtime system.
-This has the advantage that you won't see a job which runs for a couple of days and then fails because of an incompatible pipeline component.
-Type compatibility is, thus, checked at the very beginning for the complete job.
-
-In order to make the `MeanTransformer` work on `LabeledVector` as well, we have to provide the corresponding operations.
-Consequently, we have to define a `FitOperation[MeanTransformer, LabeledVector]` and `TransformOperation[MeanTransformer, LabeledVector, LabeledVector]` as implicit values in the scope of `MeanTransformer`'s companion object.
-
-{% highlight scala %}
-object MeanTransformer {
-  implicit val labeledVectorFitOperation = new FitOperation[MeanTransformer, LabeledVector] ...
-
-  implicit val labeledVectorTransformOperation = new TransformOperation[MeanTransformer, LabeledVector, LabeledVector] ...
-}
-{% endhighlight %}
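-
-To illustrate the idea, the `labeledVectorFitOperation` placeholder above could, for example, be filled in with a sketch that mirrors the `DenseVector` implementation and simply ignores the labels; how the labels should be handled is, of course, up to the implementer.
-
-{% highlight scala %}
-implicit val labeledVectorFitOperation = new FitOperation[MeanTransformer, LabeledVector] {
-  override def fit(
-      instance: MeanTransformer,
-      fitParameters: ParameterMap,
-      input: DataSet[LabeledVector])
-    : Unit = {
-    import org.apache.flink.ml.math.Breeze._
-
-    // Compute the mean of the feature vectors while ignoring the labels (sketch)
-    instance.meanOption = Some(input
-      .map{ lv => (lv.vector.asBreeze, 1) }
-      .reduce{
-        (left, right) =>
-          (left._1 + right._1, left._2 + right._2)
-      }
-      .map{ p => (p._1/p._2).fromBreeze })
-  }
-}
-{% endhighlight %}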
-
-If we wanted to implement a `Predictor` instead of a `Transformer`, then we would have to provide a `FitOperation`, too.
-Moreover, a `Predictor` requires a `PredictOperation` which implements how predictions are calculated from testing data.
-
-{% top %}
diff --git a/docs/dev/libs/ml/polynomial_features.md b/docs/dev/libs/ml/polynomial_features.md
deleted file mode 100644
index 5654ec7..0000000
--- a/docs/dev/libs/ml/polynomial_features.md
+++ /dev/null
@@ -1,110 +0,0 @@
----
-mathjax: include
-title: Polynomial Features
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
-The polynomial features transformer maps a vector into the polynomial feature space of degree $d$.
-The dimension of the input vector determines the number of polynomial factors whose values are the respective vector entries.
-Given a vector $(x, y, z, \ldots)^T$ the resulting feature vector looks like:
-
-$$\left(x, y, z, x^2, xy, y^2, yz, z^2, x^3, x^2y, x^2z, xy^2, xyz, xz^2, y^3, \ldots\right)^T$$
-
-Flink's implementation orders the polynomials in decreasing order of their degree.
-
-Given the vector $\left(3,2\right)^T$, the polynomial features vector of degree 3 would look like
-
- $$\left(3^3, 3^2\cdot2, 3\cdot2^2, 2^3, 3^2, 3\cdot2, 2^2, 3, 2\right)^T$$
-
-This transformer can be prepended to all `Transformer` and `Predictor` implementations which expect an input of type `LabeledVector` or any sub-type of `Vector`.
-
-## Operations
-
-`PolynomialFeatures` is a `Transformer`.
-As such, it supports the `fit` and `transform` operation.
-
-### Fit
-
-PolynomialFeatures is not trained on data and, thus, supports all types of input data.
-
-### Transform
-
-PolynomialFeatures transforms all subtypes of `Vector` and `LabeledVector` into their respective types:
-
-* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
-* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
-
-## Parameters
-
-The polynomial features transformer can be controlled by the following parameters:
-
-<table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Parameters</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-
-    <tbody>
-      <tr>
-        <td><strong>Degree</strong></td>
-        <td>
-          <p>
-            The maximum polynomial degree.
-            (Default value: <strong>10</strong>)
-          </p>
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
-## Examples
-
-{% highlight scala %}
-// Obtain the training data set
-val trainingDS: DataSet[LabeledVector] = ...
-
-// Setup polynomial feature transformer of degree 3
-val polyFeatures = PolynomialFeatures()
-  .setDegree(3)
-
-// Setup the multiple linear regression learner
-val mlr = MultipleLinearRegression()
-
-// Control the learner via the parameter map
-val parameters = ParameterMap()
-  .add(MultipleLinearRegression.Iterations, 20)
-  .add(MultipleLinearRegression.Stepsize, 0.5)
-
-// Create pipeline PolynomialFeatures -> MultipleLinearRegression
-val pipeline = polyFeatures.chainPredictor(mlr)
-
-// train the model
-pipeline.fit(trainingDS)
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/dev/libs/ml/polynomial_features.zh.md b/docs/dev/libs/ml/polynomial_features.zh.md
deleted file mode 100644
index 5654ec7..0000000
--- a/docs/dev/libs/ml/polynomial_features.zh.md
+++ /dev/null
@@ -1,110 +0,0 @@
----
-mathjax: include
-title: Polynomial Features
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
-The polynomial features transformer maps a vector into the polynomial feature space of degree $d$.
-The dimension of the input vector determines the number of polynomial factors whose values are the respective vector entries.
-Given a vector $(x, y, z, \ldots)^T$ the resulting feature vector looks like:
-
-$$\left(x, y, z, x^2, xy, y^2, yz, z^2, x^3, x^2y, x^2z, xy^2, xyz, xz^2, y^3, \ldots\right)^T$$
-
-Flink's implementation orders the polynomials in decreasing order of their degree.
-
-Given the vector $\left(3,2\right)^T$, the polynomial features vector of degree 3 would look like
-
- $$\left(3^3, 3^2\cdot2, 3\cdot2^2, 2^3, 3^2, 3\cdot2, 2^2, 3, 2\right)^T$$
-
-This transformer can be prepended to all `Transformer` and `Predictor` implementations which expect an input of type `LabeledVector` or any sub-type of `Vector`.
-
-## Operations
-
-`PolynomialFeatures` is a `Transformer`.
-As such, it supports the `fit` and `transform` operation.
-
-### Fit
-
-PolynomialFeatures is not trained on data and, thus, supports all types of input data.
-
-### Transform
-
-PolynomialFeatures transforms all subtypes of `Vector` and `LabeledVector` into their respective types:
-
-* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
-* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
-
-## Parameters
-
-The polynomial features transformer can be controlled by the following parameters:
-
-<table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Parameters</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-
-    <tbody>
-      <tr>
-        <td><strong>Degree</strong></td>
-        <td>
-          <p>
-            The maximum polynomial degree.
-            (Default value: <strong>10</strong>)
-          </p>
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
-## Examples
-
-{% highlight scala %}
-// Obtain the training data set
-val trainingDS: DataSet[LabeledVector] = ...
-
-// Setup polynomial feature transformer of degree 3
-val polyFeatures = PolynomialFeatures()
-  .setDegree(3)
-
-// Setup the multiple linear regression learner
-val mlr = MultipleLinearRegression()
-
-// Control the learner via the parameter map
-val parameters = ParameterMap()
-  .add(MultipleLinearRegression.Iterations, 20)
-  .add(MultipleLinearRegression.Stepsize, 0.5)
-
-// Create pipeline PolynomialFeatures -> MultipleLinearRegression
-val pipeline = polyFeatures.chainPredictor(mlr)
-
-// train the model
-pipeline.fit(trainingDS)
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/dev/libs/ml/quickstart.md b/docs/dev/libs/ml/quickstart.md
deleted file mode 100644
index 3f4d980..0000000
--- a/docs/dev/libs/ml/quickstart.md
+++ /dev/null
@@ -1,262 +0,0 @@
----
-mathjax: include
-title: Quickstart Guide
-nav-title: Quickstart
-nav-parent_id: ml
-nav-pos: 0
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Introduction
-
-FlinkML is designed to make learning from your data a straight-forward process, abstracting away
-the complexities that usually come with big data learning tasks. In this
-quick-start guide we will show just how easy it is to solve a simple supervised learning problem
-using FlinkML. But first, some basics; feel free to skip the next few lines if you're already
-familiar with Machine Learning (ML).
-
-As defined by Murphy [[1]](#murphy) ML deals with detecting patterns in data, and using those
-learned patterns to make predictions about the future. We can categorize most ML algorithms into
-two major categories: Supervised and Unsupervised Learning.
-
-* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
-(features) to a set of outputs. The learning is done using a *training set* of (input,
-output) pairs that we use to approximate the mapping function. Supervised learning problems are
-further divided into classification and regression problems. In classification problems we try to
-predict the *class* that an example belongs to, for example whether a user is going to click on
-an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
-values, often called the dependent variable, for example what the temperature will be tomorrow.
-
-* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
-of this would be *clustering*, where we try to discover groupings of the data from the
-descriptive features. Unsupervised learning can also be used for feature selection, for example
-through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
-
-## Linking with FlinkML
-
-In order to use FlinkML in your project, first you have to
-[set up a Flink program]({{ site.baseurl }}/dev/projectsetup/dependencies.html).
-Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
-
-{% highlight xml %}
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-ml{{ site.scala_version_suffix }}</artifactId>
-  <version>{{site.version }}</version>
-</dependency>
-{% endhighlight %}
-
-## Loading data
-
-To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
-functions for formatted data, such as the LibSVM format. For supervised learning problems it is
-common to use the `LabeledVector` class to represent the `(label, features)` examples. A `LabeledVector`
-object will have a FlinkML `Vector` member representing the features of the example and a `Double`
-member which represents the label, which could be the class in a classification problem, or the dependent
-variable for a regression problem.
-
-As an example, we can use Haberman's Survival Data Set, which you can
-[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data).
-This dataset *"contains cases from a study conducted on the survival of patients who had undergone
-surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
-are the features and the 4th column is the class label, indicating whether the patient
-survived 5 years or longer (label 1) or died within 5 years (label 2). You can check the [UCI
-page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data.
-
-We can load the data as a `DataSet[String]` first:
-
-{% highlight scala %}
-
-import org.apache.flink.api.scala._
-
-val env = ExecutionEnvironment.getExecutionEnvironment
-
-val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
-
-{% endhighlight %}
-
-We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
-dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
-is the class label, and the rest are features, so we can build `LabeledVector` elements like this:
-
-{% highlight scala %}
-
-import org.apache.flink.ml.common.LabeledVector
-import org.apache.flink.ml.math.DenseVector
-
-val survivalLV = survival
-  .map{tuple =>
-    val list = tuple.productIterator.toList
-    val numList = list.map(_.asInstanceOf[String].toDouble)
-    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
-  }
-
-{% endhighlight %}
-
-We can then use this data to train a learner. We will however use another dataset to exemplify
-building a learner; that will allow us to show how we can import other dataset formats.
-
-**LibSVM files**
-
-A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
-found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
-datasets using the LibSVM format through the `readLibSVM` function available through the `MLUtils`
-object.
-You can also save datasets in the LibSVM format using the `writeLibSVM` function.
-Let's import the svmguide1 dataset. You can download the
-[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1)
-and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t).
-This is an astroparticle binary classification dataset, used by Hsu et al. [[3]](#hsu) in their
-practical Support Vector Machine (SVM) guide. It contains 4 numerical features, and the class label.
-
-We can simply import the dataset using:
-
-{% highlight scala %}
-
-import org.apache.flink.ml.MLUtils
-
-val astroTrainLibSVM: DataSet[LabeledVector] = MLUtils.readLibSVM(env, "/path/to/svmguide1")
-val astroTestLibSVM: DataSet[LabeledVector] = MLUtils.readLibSVM(env, "/path/to/svmguide1.t")
-
-{% endhighlight %}
-
-This gives us two `DataSet` objects that we will use in the following section to
-create a classifier.
-
-## Classification
-
-After importing the training and test datasets, we need to prepare them for classification.
-Since the Flink SVM implementation only supports binary labels with the values `+1.0` and `-1.0`, a conversion is
-needed after loading the LibSVM dataset, because it is labelled using `1`s and `0`s.
-
-A conversion can be done using a simple normalizer mapping function:
- 
-{% highlight scala %}
-
-import org.apache.flink.ml.math.Vector
-
-def normalizer : LabeledVector => LabeledVector = { 
-    lv => LabeledVector(if (lv.label > 0.0) 1.0 else -1.0, lv.vector)
-}
-val astroTrain: DataSet[LabeledVector] = astroTrainLibSVM.map(normalizer)
-val astroTest: DataSet[(Vector, Double)] = astroTestLibSVM.map(normalizer).map(x => (x.vector, x.label))
-
-{% endhighlight %}
-
-Once we have converted the dataset we can train a `Predictor` such as a linear SVM classifier.
-We can set a number of parameters for the classifier. Here we set the `Blocks` parameter,
-which determines how the input is split into blocks for the underlying CoCoA algorithm [[2]](#jaggi). The
-regularization parameter determines the amount of $l_2$ regularization applied, which is used
-to avoid overfitting. The step size determines the contribution of the weight vector updates to
-the next weight vector value. This parameter sets the initial step size.
-
-{% highlight scala %}
-
-import org.apache.flink.ml.classification.SVM
-
-val svm = SVM()
-  .setBlocks(env.getParallelism)
-  .setIterations(100)
-  .setRegularization(0.001)
-  .setStepsize(0.1)
-  .setSeed(42)
-
-svm.fit(astroTrain)
-
-{% endhighlight %}
-
-We can now make predictions on the test set, and use the `evaluate` function to create (truth, prediction) pairs.
-
-{% highlight scala %}
-
-val evaluationPairs: DataSet[(Double, Double)] = svm.evaluate(astroTest)
-
-{% endhighlight %}
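-
-From these pairs we can, for example, compute the fraction of misclassified examples.
-The following snippet is only a sketch of such a computation using the regular DataSet API; it is not a FlinkML utility.
-
-{% highlight scala %}
-
-// Count the misclassified examples and the total number of examples,
-// then divide the two counts to obtain the error rate
-val errorRate: DataSet[Double] = evaluationPairs
-  .map { pair => (if (pair._1 == pair._2) 0.0 else 1.0, 1.0) }
-  .reduce { (left, right) => (left._1 + right._1, left._2 + right._2) }
-  .map { sums => sums._1 / sums._2 }
-
-errorRate.print()
-
-{% endhighlight %}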
-
-Next we will see how we can pre-process our data, and use the ML pipelines capabilities of FlinkML.
-
-## Data pre-processing and pipelines
-
-A pre-processing step that is often encouraged [[3]](#hsu) when using SVM classification is scaling
-the input features to the [0, 1] range, in order to avoid features with extreme values
-dominating the rest.
-FlinkML has a number of `Transformers` such as `MinMaxScaler` that are used to pre-process data,
-and a key feature is the ability to chain `Transformers` and `Predictors` together. This allows
-us to run the same pipeline of transformations and make predictions on the train and test data in
-a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML
-[in the pipelines documentation](pipelines.html).
-
-Let us first create a normalizing transformer for the features in our dataset, and chain it to a
-new SVM classifier.
-
-{% highlight scala %}
-
-import org.apache.flink.ml.preprocessing.MinMaxScaler
-
-val scaler = MinMaxScaler()
-
-val scaledSVM = scaler.chainPredictor(svm)
-
-{% endhighlight %}
-
-We can now use our newly created pipeline to make predictions on the test set.
-First we call fit again, to train the scaler and the SVM classifier.
-The data of the test set will then be automatically scaled before being passed on to the SVM to
-make predictions.
-
-{% highlight scala %}
-
-scaledSVM.fit(astroTrain)
-
-val evaluationPairsScaled: DataSet[(Double, Double)] = scaledSVM.evaluate(astroTest)
-
-{% endhighlight %}
-
-The scaled inputs should give us better prediction performance.
-
-## Where to go from here
-
-This quickstart guide can act as an introduction to the basic concepts of FlinkML, but there's a lot
-more you can do.
-We recommend going through the [FlinkML documentation]({{ site.baseurl }}/dev/libs/ml/index.html), and trying out the different
-algorithms.
-A very good way to get started is to play around with interesting datasets from the UCI ML
-repository and the LibSVM datasets.
-Tackling an interesting problem from a website like [Kaggle](https://www.kaggle.com) or
-[DrivenData](http://www.drivendata.org/) is also a great way to learn by competing with other
-data scientists.
-If you would like to contribute some new algorithms take a look at our
-[contribution guide](contribution_guide.html).
-
-**References**
-
-<a name="murphy"></a>[1] Murphy, Kevin P. *Machine learning: a probabilistic perspective.* MIT
-press, 2012.
-
-<a name="jaggi"></a>[2] Jaggi, Martin, et al. *Communication-efficient distributed dual
-coordinate ascent.* Advances in Neural Information Processing Systems. 2014.
-
-<a name="hsu"></a>[3] Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin.
- *A practical guide to support vector classification.* 2003.
-
-{% top %}
diff --git a/docs/dev/libs/ml/quickstart.zh.md b/docs/dev/libs/ml/quickstart.zh.md
deleted file mode 100644
index 3f4d980..0000000
--- a/docs/dev/libs/ml/quickstart.zh.md
+++ /dev/null
@@ -1,262 +0,0 @@
----
-mathjax: include
-title: Quickstart Guide
-nav-title: Quickstart
-nav-parent_id: ml
-nav-pos: 0
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Introduction
-
-FlinkML is designed to make learning from your data a straight-forward process, abstracting away
-the complexities that usually come with big data learning tasks. In this
-quick-start guide we will show just how easy it is to solve a simple supervised learning problem
-using FlinkML. But first, some basics; feel free to skip the next few lines if you're already
-familiar with Machine Learning (ML).
-
-As defined by Murphy [[1]](#murphy) ML deals with detecting patterns in data, and using those
-learned patterns to make predictions about the future. We can categorize most ML algorithms into
-two major categories: Supervised and Unsupervised Learning.
-
-* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
-(features) to a set of outputs. The learning is done using a *training set* of (input,
-output) pairs that we use to approximate the mapping function. Supervised learning problems are
-further divided into classification and regression problems. In classification problems we try to
-predict the *class* that an example belongs to, for example whether a user is going to click on
-an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
-values, often called the dependent variable, for example what the temperature will be tomorrow.
-
-* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
-of this would be *clustering*, where we try to discover groupings of the data from the
-descriptive features. Unsupervised learning can also be used for feature selection, for example
-through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
-
-## Linking with FlinkML
-
-In order to use FlinkML in your project, first you have to
-[set up a Flink program]({{ site.baseurl }}/dev/projectsetup/dependencies.html).
-Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
-
-{% highlight xml %}
-<dependency>
-  <groupId>org.apache.flink</groupId>
-  <artifactId>flink-ml{{ site.scala_version_suffix }}</artifactId>
-  <version>{{site.version }}</version>
-</dependency>
-{% endhighlight %}
-
-## Loading data
-
-To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
-functions for formatted data, such as the LibSVM format. For supervised learning problems it is
-common to use the `LabeledVector` class to represent the `(label, features)` examples. A `LabeledVector`
-object will have a FlinkML `Vector` member representing the features of the example and a `Double`
-member which represents the label, which could be the class in a classification problem, or the dependent
-variable for a regression problem.
-
-As an example, we can use Haberman's Survival Data Set, which you can
-[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data).
-This dataset *"contains cases from a study conducted on the survival of patients who had undergone
-surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
-are the features and the 4th column is the class label, indicating whether the patient
-survived 5 years or longer (label 1) or died within 5 years (label 2). You can check the [UCI
-page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data.
-
-We can load the data as a `DataSet[String]` first:
-
-{% highlight scala %}
-
-import org.apache.flink.api.scala._
-
-val env = ExecutionEnvironment.getExecutionEnvironment
-
-val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
-
-{% endhighlight %}
-
-We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
-dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
-is the class label, and the rest are features, so we can build `LabeledVector` elements like this:
-
-{% highlight scala %}
-
-import org.apache.flink.ml.common.LabeledVector
-import org.apache.flink.ml.math.DenseVector
-
-val survivalLV = survival
-  .map{tuple =>
-    val list = tuple.productIterator.toList
-    val numList = list.map(_.asInstanceOf[String].toDouble)
-    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
-  }
-
-{% endhighlight %}
-
-We can then use this data to train a learner. We will however use another dataset to exemplify
-building a learner; that will allow us to show how we can import other dataset formats.
-
-**LibSVM files**
-
-A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
-found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
-datasets using the LibSVM format through the `readLibSVM` function available through the `MLUtils`
-object.
-You can also save datasets in the LibSVM format using the `writeLibSVM` function.
-Let's import the svmguide1 dataset. You can download the
-[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1)
-and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t).
-This is an astroparticle binary classification dataset, used by Hsu et al. [[3]](#hsu) in their
-practical Support Vector Machine (SVM) guide. It contains 4 numerical features, and the class label.
-
-We can simply import the dataset using:
-
-{% highlight scala %}
-
-import org.apache.flink.ml.MLUtils
-
-val astroTrainLibSVM: DataSet[LabeledVector] = MLUtils.readLibSVM(env, "/path/to/svmguide1")
-val astroTestLibSVM: DataSet[LabeledVector] = MLUtils.readLibSVM(env, "/path/to/svmguide1.t")
-
-{% endhighlight %}
-
-This gives us two `DataSet` objects that we will use in the following section to
-create a classifier.
-
-## Classification
-
-After importing the training and test datasets, we need to prepare them for classification.
-Since the Flink SVM implementation only supports binary labels with the values `+1.0` and `-1.0`, a conversion is
-needed after loading the LibSVM dataset, because it is labelled using `1`s and `0`s.
-
-A conversion can be done using a simple normalizer mapping function:
- 
-{% highlight scala %}
-
-import org.apache.flink.ml.math.Vector
-
-def normalizer : LabeledVector => LabeledVector = { 
-    lv => LabeledVector(if (lv.label > 0.0) 1.0 else -1.0, lv.vector)
-}
-val astroTrain: DataSet[LabeledVector] = astroTrainLibSVM.map(normalizer)
-val astroTest: DataSet[(Vector, Double)] = astroTestLibSVM.map(normalizer).map(x => (x.vector, x.label))
-
-{% endhighlight %}
-
-Once we have converted the dataset we can train a `Predictor` such as a linear SVM classifier.
-We can set a number of parameters for the classifier. Here we set the `Blocks` parameter,
-which determines how the input is split into blocks for the underlying CoCoA algorithm [[2]](#jaggi). The
-regularization parameter determines the amount of $l_2$ regularization applied, which is used
-to avoid overfitting. The step size determines the contribution of the weight vector updates to
-the next weight vector value. This parameter sets the initial step size.
-
-{% highlight scala %}
-
-import org.apache.flink.ml.classification.SVM
-
-val svm = SVM()
-  .setBlocks(env.getParallelism)
-  .setIterations(100)
-  .setRegularization(0.001)
-  .setStepsize(0.1)
-  .setSeed(42)
-
-svm.fit(astroTrain)
-
-{% endhighlight %}
-
-We can now make predictions on the test set, and use the `evaluate` function to create (truth, prediction) pairs.
-
-{% highlight scala %}
-
-val evaluationPairs: DataSet[(Double, Double)] = svm.evaluate(astroTest)
-
-{% endhighlight %}
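-
-From these pairs we can, for example, compute the fraction of misclassified examples.
-The following snippet is only a sketch of such a computation using the regular DataSet API; it is not a FlinkML utility.
-
-{% highlight scala %}
-
-// Count the misclassified examples and the total number of examples,
-// then divide the two counts to obtain the error rate
-val errorRate: DataSet[Double] = evaluationPairs
-  .map { pair => (if (pair._1 == pair._2) 0.0 else 1.0, 1.0) }
-  .reduce { (left, right) => (left._1 + right._1, left._2 + right._2) }
-  .map { sums => sums._1 / sums._2 }
-
-errorRate.print()
-
-{% endhighlight %}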
-
-Next we will see how we can pre-process our data, and use the ML pipelines capabilities of FlinkML.
-
-## Data pre-processing and pipelines
-
-A pre-processing step that is often encouraged [[3]](#hsu) when using SVM classification is scaling
-the input features to the [0, 1] range, in order to avoid features with extreme values
-dominating the rest.
-FlinkML has a number of `Transformers` such as `MinMaxScaler` that are used to pre-process data,
-and a key feature is the ability to chain `Transformers` and `Predictors` together. This allows
-us to run the same pipeline of transformations and make predictions on the train and test data in
-a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML
-[in the pipelines documentation](pipelines.html).
-
-Let us first create a normalizing transformer for the features in our dataset, and chain it to a
-new SVM classifier.
-
-{% highlight scala %}
-
-import org.apache.flink.ml.preprocessing.MinMaxScaler
-
-val scaler = MinMaxScaler()
-
-val scaledSVM = scaler.chainPredictor(svm)
-
-{% endhighlight %}
-
-We can now use our newly created pipeline to make predictions on the test set.
-First we call fit again, to train the scaler and the SVM classifier.
-The data of the test set will then be automatically scaled before being passed on to the SVM to
-make predictions.
-
-{% highlight scala %}
-
-scaledSVM.fit(astroTrain)
-
-val evaluationPairsScaled: DataSet[(Double, Double)] = scaledSVM.evaluate(astroTest)
-
-{% endhighlight %}
-
-The scaled inputs should give us better prediction performance.
-
-## Where to go from here
-
-This quickstart guide can act as an introduction to the basic concepts of FlinkML, but there's a lot
-more you can do.
-We recommend going through the [FlinkML documentation]({{ site.baseurl }}/dev/libs/ml/index.html), and trying out the different
-algorithms.
-A very good way to get started is to play around with interesting datasets from the UCI ML
-repository and the LibSVM datasets.
-Tackling an interesting problem from a website like [Kaggle](https://www.kaggle.com) or
-[DrivenData](http://www.drivendata.org/) is also a great way to learn by competing with other
-data scientists.
-If you would like to contribute some new algorithms take a look at our
-[contribution guide](contribution_guide.html).
-
-**References**
-
-<a name="murphy"></a>[1] Murphy, Kevin P. *Machine learning: a probabilistic perspective.* MIT
-press, 2012.
-
-<a name="jaggi"></a>[2] Jaggi, Martin, et al. *Communication-efficient distributed dual
-coordinate ascent.* Advances in Neural Information Processing Systems. 2014.
-
-<a name="hsu"></a>[3] Hsu, Chih-Wei, Chih-Chung Chang, and Chih-Jen Lin.
- *A practical guide to support vector classification.* 2003.
-
-{% top %}
diff --git a/docs/dev/libs/ml/sos.md b/docs/dev/libs/ml/sos.md
deleted file mode 100644
index 6f117e0..0000000
--- a/docs/dev/libs/ml/sos.md
+++ /dev/null
@@ -1,122 +0,0 @@
----
-mathjax: include
-title: Stochastic Outlier Selection
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-
-## Description
-
-An outlier is an observation (or a group of observations) that deviates quantitatively from the majority of the data set and may be the subject of further investigation.
-Stochastic Outlier Selection (SOS), developed by Jeroen Janssens [[1]](#janssens), is an unsupervised outlier-selection algorithm that takes a set of
-vectors as input. The algorithm applies affinity-based outlier selection and outputs an outlier probability for each data point.
-Intuitively, a data point is considered to be an outlier when the other data points have insufficient affinity with it.
-
-Outlier detection has applications in a number of fields, for example log analysis, fraud detection, noise removal, novelty detection, quality control,
-sensor monitoring, etc. If a sensor turns faulty, it is likely to output values that deviate markedly from the majority.
- 
-For more information, please consult the [PhD thesis of Jeroen Janssens](https://github.com/jeroenjanssens/phd-thesis) on
-Outlier Selection and One-Class Classification, which introduces the algorithm.
-
-## Parameters
-
-The stochastic outlier selection algorithm implementation can be controlled by the following parameters:
-
-   <table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Parameters</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-
-    <tbody>
-      <tr>
-        <td><strong>Perplexity</strong></td>
-        <td>
-          <p>
-            Perplexity can be interpreted as the k in k-nearest neighbor algorithms. The difference is that in SOS
-            being a neighbor is not a binary property but a probabilistic one, and therefore it is a real number. It must be between 1 and n-1,
-            where n is the number of points. A good starting point can be obtained by using the square root of the number of observations.
-            (Default value: <strong>30</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>ErrorTolerance</strong></td>
-        <td>
-          <p>
-            The accepted error tolerance to reduce computational time when approximating the affinity. It will 
-            sacrifice accuracy in return for reduced computational time.
-            (Default value: <strong>1e-20</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>MaxIterations</strong></td>
-        <td>
-          <p>
-            The maximum number of iterations used to approximate the affinities.
-            (Default value: <strong>10</strong>)
-          </p>
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
-
-## Example
-
-{% highlight scala %}
-import org.apache.flink.api.scala._
-import org.apache.flink.ml.common.LabeledVector
-import org.apache.flink.ml.math.DenseVector
-import org.apache.flink.ml.outlier.StochasticOutlierSelection
-
-val env = ExecutionEnvironment.getExecutionEnvironment
-
-val data = env.fromCollection(List(
-  LabeledVector(0.0, DenseVector(1.0, 1.0)),
-  LabeledVector(1.0, DenseVector(2.0, 1.0)),
-  LabeledVector(2.0, DenseVector(1.0, 2.0)),
-  LabeledVector(3.0, DenseVector(2.0, 2.0)),
-  LabeledVector(4.0, DenseVector(5.0, 8.0)) // The outlier!
-))
-
-val sos = new StochasticOutlierSelection().setPerplexity(3)
-
-val outputVector = sos
-  .transform(data)
-  .collect()
-
-val expectedOutputVector = Map(
-  0 -> 0.2790094479202896,
-  1 -> 0.25775014551682535,
-  2 -> 0.22136130977995766,
-  3 -> 0.12707053787018444,
-  4 -> 0.9922779902453757 // The outlier!
-)
-
-outputVector.foreach(output => expectedOutputVector(output._1) should be(output._2))
-{% endhighlight %}
-
-**References**
-
-<a name="janssens"></a>[1]J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. 
-*Stochastic Outlier Selection*. Technical Report TiCC TR 2012-001, Tilburg University, Tilburg, the Netherlands, 2012.
-
-{% top %}
diff --git a/docs/dev/libs/ml/sos.zh.md b/docs/dev/libs/ml/sos.zh.md
deleted file mode 100644
index 6f117e0..0000000
--- a/docs/dev/libs/ml/sos.zh.md
+++ /dev/null
@@ -1,122 +0,0 @@
----
-mathjax: include
-title: Stochastic Outlier Selection
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-
-## Description
-
-An outlier is an observation (or a group of observations) that deviates quantitatively from the majority of the data set and may be the subject of further investigation.
-Stochastic Outlier Selection (SOS), developed by Jeroen Janssens [[1]](#janssens), is an unsupervised outlier-selection algorithm that takes a set of
-vectors as input. The algorithm applies affinity-based outlier selection and outputs an outlier probability for each data point.
-Intuitively, a data point is considered to be an outlier when the other data points have insufficient affinity with it.
-
-Outlier detection has applications in a number of fields, for example log analysis, fraud detection, noise removal, novelty detection, quality control,
-sensor monitoring, etc. If a sensor turns faulty, it is likely to output values that deviate markedly from the majority.
- 
-For more information, please consult the [PhD thesis of Jeroen Janssens](https://github.com/jeroenjanssens/phd-thesis) on
-Outlier Selection and One-Class Classification, which introduces the algorithm.
-
-## Parameters
-
-The stochastic outlier selection algorithm implementation can be controlled by the following parameters:
-
-   <table class="table table-bordered">
-    <thead>
-      <tr>
-        <th class="text-left" style="width: 20%">Parameters</th>
-        <th class="text-center">Description</th>
-      </tr>
-    </thead>
-
-    <tbody>
-      <tr>
-        <td><strong>Perplexity</strong></td>
-        <td>
-          <p>
-            Perplexity can be interpreted as the k in k-nearest neighbor algorithms. The difference is that in SOS
-            being a neighbor is not a binary property but a probabilistic one, and therefore it is a real number. It must be between 1 and n-1,
-            where n is the number of points. A good starting point can be obtained by using the square root of the number of observations.
-            (Default value: <strong>30</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>ErrorTolerance</strong></td>
-        <td>
-          <p>
-            The accepted error tolerance to reduce computational time when approximating the affinity. It will 
-            sacrifice accuracy in return for reduced computational time.
-            (Default value: <strong>1e-20</strong>)
-          </p>
-        </td>
-      </tr>
-      <tr>
-        <td><strong>MaxIterations</strong></td>
-        <td>
-          <p>
-            The maximum number of iterations used to approximate the affinities.
-            (Default value: <strong>10</strong>)
-          </p>
-        </td>
-      </tr>
-    </tbody>
-  </table>
-
-
-## Example
-
-{% highlight scala %}
-import org.apache.flink.api.scala._
-import org.apache.flink.ml.common.LabeledVector
-import org.apache.flink.ml.math.DenseVector
-import org.apache.flink.ml.outlier.StochasticOutlierSelection
-
-val env = ExecutionEnvironment.getExecutionEnvironment
-
-val data = env.fromCollection(List(
-  LabeledVector(0.0, DenseVector(1.0, 1.0)),
-  LabeledVector(1.0, DenseVector(2.0, 1.0)),
-  LabeledVector(2.0, DenseVector(1.0, 2.0)),
-  LabeledVector(3.0, DenseVector(2.0, 2.0)),
-  LabeledVector(4.0, DenseVector(5.0, 8.0)) // The outlier!
-))
-
-val sos = new StochasticOutlierSelection().setPerplexity(3)
-
-val outputVector = sos
-  .transform(data)
-  .collect()
-
-val expectedOutputVector = Map(
-  0 -> 0.2790094479202896,
-  1 -> 0.25775014551682535,
-  2 -> 0.22136130977995766,
-  3 -> 0.12707053787018444,
-  4 -> 0.9922779902453757 // The outlier!
-)
-
-outputVector.foreach(output => expectedOutputVector(output._1) should be(output._2))
-{% endhighlight %}
-
-**References**
-
-<a name="janssens"></a>[1]J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. 
-*Stochastic Outlier Selection*. Technical Report TiCC TR 2012-001, Tilburg University, Tilburg, the Netherlands, 2012.
-
-{% top %}
diff --git a/docs/dev/libs/ml/standard_scaler.md b/docs/dev/libs/ml/standard_scaler.md
deleted file mode 100644
index cdfc6a0..0000000
--- a/docs/dev/libs/ml/standard_scaler.md
+++ /dev/null
@@ -1,115 +0,0 @@
----
-mathjax: include
-title: Standard Scaler
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
- The standard scaler scales the given data set, so that all features will have a user-specified mean and standard deviation.
- In case the user does not provide specific values for them, the standard scaler transforms the features of the input data set to have a mean equal to 0 and a standard deviation equal to 1.
- Given a set of input data $x_1, x_2,... x_n$, with mean:
-
- $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}$$
-
- and standard deviation:
-
- $$\sigma_{x}=\sqrt{ \frac{1}{n} \sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}$$
-
-The scaled data set $z_1, z_2,...,z_n$ will be:
-
- $$z_{i}= std \left (\frac{x_{i} - \bar{x}  }{\sigma_{x}}\right ) + mean$$
-
-where $\textit{std}$ and $\textit{mean}$ are the user specified values for the standard deviation and mean.
-
-## Operations
-
-`StandardScaler` is a `Transformer`.
-As such, it supports the `fit` and `transform` operation.
-
-### Fit
-
-StandardScaler is trained on all subtypes of `Vector` or `LabeledVector`:
-
-* `fit[T <: Vector]: DataSet[T] => Unit`
-* `fit: DataSet[LabeledVector] => Unit`
-
-### Transform
-
-StandardScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
-
-* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
-* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
-
-## Parameters
-
-The standard scaler implementation can be controlled by the following two parameters:
-
- <table class="table table-bordered">
-  <thead>
-    <tr>
-      <th class="text-left" style="width: 20%">Parameters</th>
-      <th class="text-center">Description</th>
-    </tr>
-  </thead>
-
-  <tbody>
-    <tr>
-      <td><strong>Mean</strong></td>
-      <td>
-        <p>
-          The mean of the scaled data set. (Default value: <strong>0.0</strong>)
-        </p>
-      </td>
-    </tr>
-    <tr>
-      <td><strong>Std</strong></td>
-      <td>
-        <p>
-          The standard deviation of the scaled data set. (Default value: <strong>1.0</strong>)
-        </p>
-      </td>
-    </tr>
-  </tbody>
-</table>
-
-## Examples
-
-{% highlight scala %}
-// Create standard scaler transformer
-val scaler = StandardScaler()
-  .setMean(10.0)
-  .setStd(2.0)
-
-// Obtain data set to be scaled
-val dataSet: DataSet[Vector] = ...
-
-// Learn the mean and standard deviation of the training data
-scaler.fit(dataSet)
-
-// Scale the provided data set to have mean=10.0 and std=2.0
-val scaledDS = scaler.transform(dataSet)
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/dev/libs/ml/standard_scaler.zh.md b/docs/dev/libs/ml/standard_scaler.zh.md
deleted file mode 100644
index cdfc6a0..0000000
--- a/docs/dev/libs/ml/standard_scaler.zh.md
+++ /dev/null
@@ -1,115 +0,0 @@
----
-mathjax: include
-title: Standard Scaler
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
- The standard scaler scales the given data set, so that all features will have a user-specified mean and standard deviation.
- In case the user does not provide specific values for them, the standard scaler transforms the features of the input data set to have a mean equal to 0 and a standard deviation equal to 1.
- Given a set of input data $x_1, x_2,... x_n$, with mean:
-
- $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}$$
-
- and standard deviation:
-
- $$\sigma_{x}=\sqrt{ \frac{1}{n} \sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}$$
-
-The scaled data set $z_1, z_2,...,z_n$ will be:
-
- $$z_{i}= std \left (\frac{x_{i} - \bar{x}  }{\sigma_{x}}\right ) + mean$$
-
-where $\textit{std}$ and $\textit{mean}$ are the user specified values for the standard deviation and mean.
-
-## Operations
-
-`StandardScaler` is a `Transformer`.
-As such, it supports the `fit` and `transform` operation.
-
-### Fit
-
-StandardScaler is trained on all subtypes of `Vector` or `LabeledVector`:
-
-* `fit[T <: Vector]: DataSet[T] => Unit`
-* `fit: DataSet[LabeledVector] => Unit`
-
-### Transform
-
-StandardScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
-
-* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
-* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
-
-## Parameters
-
-The standard scaler implementation can be controlled by the following two parameters:
-
- <table class="table table-bordered">
-  <thead>
-    <tr>
-      <th class="text-left" style="width: 20%">Parameters</th>
-      <th class="text-center">Description</th>
-    </tr>
-  </thead>
-
-  <tbody>
-    <tr>
-      <td><strong>Mean</strong></td>
-      <td>
-        <p>
-          The mean of the scaled data set. (Default value: <strong>0.0</strong>)
-        </p>
-      </td>
-    </tr>
-    <tr>
-      <td><strong>Std</strong></td>
-      <td>
-        <p>
-          The standard deviation of the scaled data set. (Default value: <strong>1.0</strong>)
-        </p>
-      </td>
-    </tr>
-  </tbody>
-</table>
-
-## Examples
-
-{% highlight scala %}
-// Create standard scaler transformer
-val scaler = StandardScaler()
-  .setMean(10.0)
-  .setStd(2.0)
-
-// Obtain data set to be scaled
-val dataSet: DataSet[Vector] = ...
-
-// Learn the mean and standard deviation of the training data
-scaler.fit(dataSet)
-
-// Scale the provided data set to have mean=10.0 and std=2.0
-val scaledDS = scaler.transform(dataSet)
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/dev/libs/ml/svm.md b/docs/dev/libs/ml/svm.md
deleted file mode 100644
index 2fa9e0a..0000000
--- a/docs/dev/libs/ml/svm.md
+++ /dev/null
@@ -1,222 +0,0 @@
----
-mathjax: include
-title: SVM using CoCoA
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
-Implements an SVM with soft-margin using the communication-efficient distributed dual coordinate
-ascent algorithm with hinge-loss function.
-The algorithm solves the following minimization problem:
-
-$$\min_{\mathbf{w} \in \mathbb{R}^d} \frac{\lambda}{2} \left\lVert \mathbf{w} \right\rVert^2 + \frac{1}{n} \sum_{i=1}^n l_{i}\left(\mathbf{w}^T\mathbf{x}_i\right)$$
-
-with $\mathbf{w}$ being the weight vector, $\lambda$ being the regularization constant,
-$$\mathbf{x}_i \in \mathbb{R}^d$$ being the data points and $$l_{i}$$ being the convex loss
-functions, which can also depend on the labels $$y_{i} \in \mathbb{R}$$.
-In the current implementation the regularizer is the $\ell_2$-norm and the loss functions are the hinge-loss functions:
-
-  $$l_{i} = \max\left(0, 1 - y_{i} \mathbf{w}^T\mathbf{x}_i \right)$$
-
-With these choices, the problem definition is equivalent to an SVM with soft margin.
-Thus, the algorithm allows us to train a soft-margin SVM.
-
-The minimization problem is solved by applying stochastic dual coordinate ascent (SDCA).
-In order to make the algorithm efficient in a distributed setting, the CoCoA algorithm calculates
-several iterations of SDCA locally on a data block before merging the local updates into a
-valid global state.
-This state is redistributed to the different data partitions where the next round of local SDCA
-iterations is then executed.
-The number of outer iterations and local SDCA iterations control the overall network costs, because
-there is only network communication required for each outer iteration.
-The local SDCA iterations are embarrassingly parallel once the individual data partitions have been
-distributed across the cluster.
-
-The implementation of this algorithm is based on the work of
-[Jaggi et al.](http://arxiv.org/abs/1409.1458)
-
-## Operations
-
-`SVM` is a `Predictor`.
-As such, it supports the `fit` and `predict` operation.
-
-### Fit
-
-SVM is trained given a set of `LabeledVector`:
-
-* `fit: DataSet[LabeledVector] => Unit`
-
-### Predict
-
-SVM predicts for all subtypes of FlinkML's `Vector` the corresponding class label:
-
-* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Double)]`, where the `(T, Double)` tuple
-  corresponds to (original_features, predicted_label)
-
-If we call evaluate with a `DataSet[(Vector, Double)]`, we make a prediction on the class label
-for each example, and return a `DataSet[(Double, Double)]`. In each tuple the first element
-is the true value, as was provided from the input `DataSet[(Vector, Double)]` and the second element
-is the predicted value. You can then use these `(truth, prediction)` tuples to evaluate
-the algorithm's performance.
-
-* `predict: DataSet[(Vector, Double)] => DataSet[(Double, Double)]`
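-
-As an illustration, the following sketch computes the accuracy from such `(truth, prediction)` pairs.
-Here `testingWithLabels` is an assumed `DataSet[(Vector, Double)]` of (features, true label) pairs and `svm` is an already fitted SVM instance.
-
-{% highlight scala %}
-// Evaluate on labeled test data and compute the fraction of correct predictions
-val evaluationPairs: DataSet[(Double, Double)] = svm.evaluate(testingWithLabels)
-
-val accuracy: DataSet[Double] = evaluationPairs
-  .map { pair => (if (pair._1 == pair._2) 1.0 else 0.0, 1.0) }
-  .reduce { (left, right) => (left._1 + right._1, left._2 + right._2) }
-  .map { sums => sums._1 / sums._2 }
-{% endhighlight %}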
-
-## Parameters
-
-The SVM implementation can be controlled by the following parameters:
-
-<table class="table table-bordered">
-<thead>
-  <tr>
-    <th class="text-left" style="width: 20%">Parameters</th>
-    <th class="text-center">Description</th>
-  </tr>
-</thead>
-
-<tbody>
-  <tr>
-    <td><strong>Blocks</strong></td>
-    <td>
-      <p>
-        Sets the number of blocks into which the input data will be split.
-        On each block the local stochastic dual coordinate ascent method is executed.
-        This number should be set at least to the degree of parallelism.
-        If no value is specified, then the parallelism of the input DataSet is used as the number of blocks.
-        (Default value: <strong>None</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-    <td><strong>Iterations</strong></td>
-    <td>
-      <p>
-        Defines the maximum number of iterations of the outer loop method.
-        In other words, it defines how often the SDCA method is applied to the blocked data.
-        After each iteration, the locally computed weight vector updates have to be reduced to update the global weight vector value.
-        The new weight vector is broadcast to all SDCA tasks at the beginning of each iteration.
-        (Default value: <strong>10</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-    <td><strong>LocalIterations</strong></td>
-    <td>
-      <p>
-        Defines the maximum number of SDCA iterations.
-        In other words, it defines how many data points are drawn from each local data block to calculate the stochastic dual coordinate ascent.
-        (Default value: <strong>10</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-    <td><strong>Regularization</strong></td>
-    <td>
-      <p>
-        Defines the regularization constant of the SVM algorithm.
-        The higher the value, the smaller the 2-norm of the weight vector will be.
-        In case of an SVM with hinge loss this means that the SVM margin will be wider, even though it might contain some misclassified examples.
-        (Default value: <strong>1.0</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-    <td><strong>Stepsize</strong></td>
-    <td>
-      <p>
-        Defines the initial step size for the updates of the weight vector.
-        The larger the step size is, the larger will be the contribution of the weight vector updates to the next weight vector value.
-        The effective scaling of the updates is $\frac{stepsize}{blocks}$.
-        This value has to be tuned in case that the algorithm becomes unstable.
-        (Default value: <strong>1.0</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-    <td><strong>ThresholdValue</strong></td>
-    <td>
-      <p>
-        Defines the limiting value for the decision function above which examples are labeled as
-        positive (+1.0). Examples with a decision function value below this value are classified
-        as negative (-1.0). In order to get the raw decision function values you need to indicate it by
-        using the OutputDecisionFunction parameter.  (Default value: <strong>0.0</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-    <td><strong>OutputDecisionFunction</strong></td>
-    <td>
-      <p>
-        Determines whether the predict and evaluate functions of the SVM should return the distance
-        to the separating hyperplane, or binary class labels. Setting this to true will
-        return the raw distance to the hyperplane for each example. Setting it to false will
-        return the binary class label (+1.0, -1.0) (Default value: <strong>false</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-  <td><strong>Seed</strong></td>
-  <td>
-    <p>
-      Defines the seed to initialize the random number generator.
-      The seed directly controls which data points are chosen for the SDCA method.
-      (Default value: <strong>Random Long Integer</strong>)
-    </p>
-  </td>
-</tr>
-</tbody>
-</table>
-
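-The parameters above are set through the corresponding `set<ParameterName>` builder methods.
-The following is a minimal sketch, assuming the usual FlinkML setter naming; the chosen values are
-arbitrary and only illustrate the calls:
-
-{% highlight scala %}
-import org.apache.flink.ml.classification.SVM
-
-val svm = SVM()
-  .setBlocks(10)                    // number of data blocks for local SDCA
-  .setIterations(100)               // outer CoCoA iterations
-  .setLocalIterations(100)          // SDCA steps per block and outer iteration
-  .setRegularization(0.001)         // regularization constant lambda
-  .setStepsize(0.1)                 // scaling of the merged weight updates
-  .setThresholdValue(0.0)           // decision threshold for the class label
-  .setOutputDecisionFunction(false) // return labels instead of raw distances
-  .setSeed(42)
-{% endhighlight %}
-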
-## Examples
-
-{% highlight scala %}
-import org.apache.flink.api.scala._
-import org.apache.flink.ml.math.Vector
-import org.apache.flink.ml.common.LabeledVector
-import org.apache.flink.ml.classification.SVM
-import org.apache.flink.ml.RichExecutionEnvironment
-
-val pathToTrainingFile: String = ???
-val pathToTestingFile: String = ???
-val env = ExecutionEnvironment.getExecutionEnvironment
-
-// Read the training data set, from a LibSVM formatted file
-val trainingDS: DataSet[LabeledVector] = env.readLibSVM(pathToTrainingFile)
-
-// Create the SVM learner
-val svm = SVM()
-  .setBlocks(10)
-
-// Learn the SVM model
-svm.fit(trainingDS)
-
-// Read the testing data set
-val testingDS: DataSet[Vector] = env.readLibSVM(pathToTestingFile).map(_.vector)
-
-// Calculate the predictions for the testing data set
-val predictionDS: DataSet[(Vector, Double)] = svm.predict(testingDS)
-
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/dev/libs/ml/svm.zh.md b/docs/dev/libs/ml/svm.zh.md
deleted file mode 100644
index 2fa9e0a..0000000
--- a/docs/dev/libs/ml/svm.zh.md
+++ /dev/null
@@ -1,222 +0,0 @@
----
-mathjax: include
-title: SVM using CoCoA
-nav-parent_id: ml
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-* This will be replaced by the TOC
-{:toc}
-
-## Description
-
-Implements a soft-margin SVM using the communication-efficient distributed dual coordinate
-ascent algorithm (CoCoA) with a hinge-loss function.
-The algorithm solves the following minimization problem:
-
-$$\min_{\mathbf{w} \in \mathbb{R}^d} \frac{\lambda}{2} \left\lVert \mathbf{w} \right\rVert^2 + \frac{1}{n} \sum_{i=1}^n l_{i}\left(\mathbf{w}^T\mathbf{x}_i\right)$$
-
-with $\mathbf{w}$ being the weight vector, $\lambda$ being the regularization constant,
-$$\mathbf{x}_i \in \mathbb{R}^d$$ being the data points and $$l_{i}$$ being the convex loss
-functions, which can also depend on the labels $$y_{i} \in \mathbb{R}$$.
-In the current implementation the regularizer is the $\ell_2$-norm and the loss functions are the hinge-loss functions:
-
-  $$l_{i} = \max\left(0, 1 - y_{i} \mathbf{w}^T\mathbf{x}_i \right)$$
-
-With these choices, the problem definition is equivalent to an SVM with soft margin.
-Thus, the algorithm allows us to train a soft-margin SVM.
-
-The minimization problem is solved by applying stochastic dual coordinate ascent (SDCA).
-In order to make the algorithm efficient in a distributed setting, the CoCoA algorithm calculates
-several iterations of SDCA locally on a data block before merging the local updates into a
-valid global state.
-This state is redistributed to the different data partitions where the next round of local SDCA
-iterations is then executed.
-The number of outer iterations and local SDCA iterations control the overall network costs, because
-network communication is only required once per outer iteration.
-The local SDCA iterations are embarrassingly parallel once the individual data partitions have been
-distributed across the cluster.
-
-The implementation of this algorithm is based on the work of
-[Jaggi et al.](http://arxiv.org/abs/1409.1458).
-
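-The following is a minimal, single-machine sketch of this scheme, written with plain Scala
-collections instead of Flink DataSets. It only illustrates the structure described above
-(block-local SDCA passes on the hinge loss, followed by a scaled merge of the local updates);
-it is not the FlinkML implementation, and all names are illustrative:
-
-{% highlight scala %}
-import scala.util.Random
-
-// One labeled data point: label in {-1.0, +1.0} and a dense feature vector.
-case class Example(label: Double, features: Array[Double])
-
-def dot(a: Array[Double], b: Array[Double]): Double =
-  a.zip(b).map { case (u, v) => u * v }.sum
-
-def trainCoCoA(
-    data: IndexedSeq[Example],
-    blocks: Int,
-    iterations: Int,
-    localIterations: Int,
-    regularization: Double,
-    stepsize: Double,
-    seed: Long): Array[Double] = {
-
-  val n = data.length
-  val dim = data.head.features.length
-  val rng = new Random(seed)
-
-  // Split the input into blocks; FlinkML does this based on the parallelism of the DataSet.
-  val partitions = data.grouped(math.max(1, math.ceil(n.toDouble / blocks).toInt)).toIndexedSeq
-  val gamma = stepsize / partitions.length                     // scaling of the merged updates
-  val alphas = partitions.map(p => Array.fill(p.length)(0.0))  // dual variables per block
-  val w = Array.fill(dim)(0.0)                                 // global weight vector
-
-  for (_ <- 0 until iterations) {
-    // Local SDCA passes; in FlinkML these run in parallel, one per data block.
-    val localDeltas = partitions.indices.map { b =>
-      val block = partitions(b)
-      val alphaLoc = alphas(b).clone()
-      val wLoc = w.clone()
-      for (_ <- 0 until localIterations) {
-        val i = rng.nextInt(block.length)
-        val Example(y, x) = block(i)
-        val xNormSq = dot(x, x)
-        if (xNormSq > 0.0) {
-          // Closed-form SDCA step for the hinge loss.
-          val proj = math.max(0.0, math.min(1.0,
-            regularization * n * (1.0 - y * dot(x, wLoc)) / xNormSq + y * alphaLoc(i)))
-          val deltaAlpha = y * proj - alphaLoc(i)
-          alphaLoc(i) += deltaAlpha
-          for (j <- 0 until dim) wLoc(j) += deltaAlpha * x(j) / (regularization * n)
-        }
-      }
-      val deltaAlphas = Array.tabulate(block.length)(i => alphaLoc(i) - alphas(b)(i))
-      val deltaW = Array.tabulate(dim)(j => wLoc(j) - w(j))
-      (deltaAlphas, deltaW)
-    }
-    // Merge ("reduce") step: apply the scaled local updates to the global state.
-    for (b <- partitions.indices) {
-      val (dAlpha, dW) = localDeltas(b)
-      for (i <- dAlpha.indices) alphas(b)(i) += gamma * dAlpha(i)
-      for (j <- 0 until dim) w(j) += gamma * dW(j)
-    }
-  }
-  w
-}
-{% endhighlight %}
-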
-## Operations
-
-`SVM` is a `Predictor`.
-As such, it supports the `fit` and `predict` operation.
-
-### Fit
-
-SVM is trained given a set of `LabeledVector`:
-
-* `fit: DataSet[LabeledVector] => Unit`
-
-### Predict
-
-For all subtypes of FlinkML's `Vector`, SVM predicts the corresponding class label:
-
-* `predict[T <: Vector]: DataSet[T] => DataSet[(T, Double)]`, where the `(T, Double)` tuple
-  corresponds to (original_features, predicted_label)
-
-If we call evaluate with a `DataSet[(Vector, Double)]`, we make a prediction on the class label
-for each example, and return a `DataSet[(Double, Double)]`. In each tuple the first element
-is the true value, as provided by the input `DataSet[(Vector, Double)]`, and the second element
-is the predicted value. You can then use these `(truth, prediction)` tuples to evaluate
-the algorithm's performance.
-
-* `evaluate: DataSet[(Vector, Double)] => DataSet[(Double, Double)]`
-
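-As a small sketch (reusing `env`, `pathToTestingFile`, and the fitted `svm` from the Examples
-section below; the names `testSet`, `evaluationPairs`, `pairs`, and `accuracy` are illustrative),
-the `(truth, prediction)` tuples can be obtained and checked like this:
-
-{% highlight scala %}
-// Read the labeled test data and turn each LabeledVector into a (features, label) pair
-val testSet: DataSet[(Vector, Double)] = env.readLibSVM(pathToTestingFile)
-  .map(lv => (lv.vector, lv.label))
-
-// (truth, prediction) tuples for every test example
-val evaluationPairs: DataSet[(Double, Double)] = svm.evaluate(testSet)
-
-// A simple accuracy computation over the collected pairs (fine for small test sets;
-// assumes the default OutputDecisionFunction = false, so predictions are exactly +1.0 or -1.0)
-val pairs = evaluationPairs.collect()
-val accuracy = pairs.count { case (truth, prediction) => truth == prediction }.toDouble / pairs.size
-{% endhighlight %}
-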
-## Parameters
-
-The SVM implementation can be controlled by the following parameters:
-
-<table class="table table-bordered">
-<thead>
-  <tr>
-    <th class="text-left" style="width: 20%">Parameters</th>
-    <th class="text-center">Description</th>
-  </tr>
-</thead>
-
-<tbody>
-  <tr>
-    <td><strong>Blocks</strong></td>
-    <td>
-      <p>
-        Sets the number of blocks into which the input data will be split.
-        On each block the local stochastic dual coordinate ascent method is executed.
-        This number should be set to at least the degree of parallelism.
-        If no value is specified, then the parallelism of the input DataSet is used as the number of blocks.
-        (Default value: <strong>None</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-    <td><strong>Iterations</strong></td>
-    <td>
-      <p>
-        Defines the maximum number of iterations of the outer loop method.
-        In other words, it defines how often the SDCA method is applied to the blocked data.
-        After each iteration, the locally computed weight vector updates have to be reduced to update the global weight vector value.
-        The new weight vector is broadcast to all SDCA tasks at the beginning of each iteration.
-        (Default value: <strong>10</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-    <td><strong>LocalIterations</strong></td>
-    <td>
-      <p>
-        Defines the maximum number of SDCA iterations.
-        In other words, it defines how many data points are drawn from each local data block to calculate the stochastic dual coordinate ascent.
-        (Default value: <strong>10</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-    <td><strong>Regularization</strong></td>
-    <td>
-      <p>
-        Defines the regularization constant of the SVM algorithm.
-        The higher the value, the smaller the 2-norm of the weight vector will be.
-        In the case of an SVM with hinge loss, this means that the SVM margin will be wider even though it might contain some misclassified examples.
-        (Default value: <strong>1.0</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-    <td><strong>Stepsize</strong></td>
-    <td>
-      <p>
-        Defines the initial step size for the updates of the weight vector.
-        The larger the step size, the larger the contribution of the weight vector updates to the next weight vector value.
-        The effective scaling of the updates is $\frac{stepsize}{blocks}$.
-        This value has to be tuned if the algorithm becomes unstable.
-        (Default value: <strong>1.0</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-    <td><strong>ThresholdValue</strong></td>
-    <td>
-      <p>
-        Defines the limiting value for the decision function above which examples are labeled as
-        positive (+1.0). Examples with a decision function value below this value are classified
-        as negative (-1.0). To obtain the raw decision function values, set the
-        OutputDecisionFunction parameter to true. (Default value: <strong>0.0</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-    <td><strong>OutputDecisionFunction</strong></td>
-    <td>
-      <p>
-        Determines whether the predict and evaluate functions of the SVM should return the distance
-        to the separating hyperplane or binary class labels. Setting this to true will
-        return the raw distance to the hyperplane for each example. Setting it to false will
-        return the binary class label (+1.0, -1.0). (Default value: <strong>false</strong>)
-      </p>
-    </td>
-  </tr>
-  <tr>
-  <td><strong>Seed</strong></td>
-  <td>
-    <p>
-      Defines the seed to initialize the random number generator.
-      The seed directly controls which data points are chosen for the SDCA method.
-      (Default value: <strong>Random Long Integer</strong>)
-    </p>
-  </td>
-</tr>
-</tbody>
-</table>
-
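-The interplay of `ThresholdValue` and `OutputDecisionFunction` is sketched below (a minimal
-illustration, assuming the standard `set<ParameterName>` setter naming; the values are arbitrary):
-
-{% highlight scala %}
-import org.apache.flink.ml.classification.SVM
-
-// Shift the decision threshold used when emitting the +1.0 / -1.0 class labels
-val thresholdedSvm = SVM().setThresholdValue(0.25)
-
-// Or return the raw signed distance to the hyperplane and post-process it manually
-val rawSvm = SVM().setOutputDecisionFunction(true)
-{% endhighlight %}
-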
-## Examples
-
-{% highlight scala %}
-import org.apache.flink.api.scala._
-import org.apache.flink.ml.math.Vector
-import org.apache.flink.ml.common.LabeledVector
-import org.apache.flink.ml.classification.SVM
-import org.apache.flink.ml.RichExecutionEnvironment
-
-val pathToTrainingFile: String = ???
-val pathToTestingFile: String = ???
-val env = ExecutionEnvironment.getExecutionEnvironment
-
-// Read the training data set, from a LibSVM formatted file
-val trainingDS: DataSet[LabeledVector] = env.readLibSVM(pathToTrainingFile)
-
-// Create the SVM learner
-val svm = SVM()
-  .setBlocks(10)
-
-// Learn the SVM model
-svm.fit(trainingDS)
-
-// Read the testing data set
-val testingDS: DataSet[Vector] = env.readLibSVM(pathToTestingFile).map(_.vector)
-
-// Calculate the predictions for the testing data set
-val predictionDS: DataSet[(Vector, Double)] = svm.predict(testingDS)
-
-{% endhighlight %}
-
-{% top %}
diff --git a/docs/internals/components.md b/docs/internals/components.md
index c4e3270..4d74610 100644
--- a/docs/internals/components.md
+++ b/docs/internals/components.md
@@ -36,7 +36,7 @@ the DataStream API uses a stream builder.
 remote, YARN, etc)
 
 - Libraries and APIs that are bundled with Flink generate DataSet or DataStream API programs. These are
-Table for queries on logical tables, FlinkML for Machine Learning, and Gelly for graph processing.
+Table for queries on logical tables, CEP for complex event processing, and Gelly for graph processing.
 
 You can click on the components in the figure to learn more.
 
@@ -47,7 +47,6 @@ You can click on the components in the figure to learn more.
 <map name="overview-stack">
 <area id="lib-datastream-cep" title="CEP: Complex Event Processing" href="{{ site.baseurl }}/dev/libs/cep.html" shape="rect" coords="63,0,143,177" />
 <area id="lib-datastream-table" title="Table: Relational DataStreams" href="{{ site.baseurl }}/dev/table/index.html" shape="rect" coords="143,0,223,177" />
-<area id="lib-dataset-ml" title="FlinkML: Machine Learning" href="{{ site.baseurl }}/dev/libs/ml/index.html" shape="rect" coords="382,2,462,176" />
 <area id="lib-dataset-gelly" title="Gelly: Graph Processing" href="{{ site.baseurl }}/dev/libs/gelly/index.html" shape="rect" coords="461,0,541,177" />
 <area id="lib-dataset-table" title="Table API and SQL" href="{{ site.baseurl }}/dev/table/index.html" shape="rect" coords="544,0,624,177" />
 <area id="datastream" title="DataStream API" href="{{ site.baseurl }}/dev/datastream_api.html" shape="rect" coords="64,177,379,255" />
diff --git a/docs/internals/components.zh.md b/docs/internals/components.zh.md
index 15a0f2c..0df3065 100644
--- a/docs/internals/components.zh.md
+++ b/docs/internals/components.zh.md
@@ -36,7 +36,7 @@ the DataStream API uses a stream builder.
 remote, YARN, etc)
 
 - Libraries and APIs that are bundled with Flink generate DataSet or DataStream API programs. These are
-Table for queries on logical tables, FlinkML for Machine Learning, and Gelly for graph processing.
+Table for queries on logical tables, CEP for complex event processing, and Gelly for graph processing.
 
 You can click on the components in the figure to learn more.
 
@@ -47,7 +47,6 @@ You can click on the components in the figure to learn more.
 <map name="overview-stack">
 <area id="lib-datastream-cep" title="CEP: Complex Event Processing" href="{{ site.baseurl }}/dev/libs/cep.html" shape="rect" coords="63,0,143,177" />
 <area id="lib-datastream-table" title="Table: Relational DataStreams" href="{{ site.baseurl }}/dev/table/index.html" shape="rect" coords="143,0,223,177" />
-<area id="lib-dataset-ml" title="FlinkML: Machine Learning" href="{{ site.baseurl }}/dev/libs/ml/index.html" shape="rect" coords="382,2,462,176" />
 <area id="lib-dataset-gelly" title="Gelly: Graph Processing" href="{{ site.baseurl }}/dev/libs/gelly/index.html" shape="rect" coords="461,0,541,177" />
 <area id="lib-dataset-table" title="Table API and SQL" href="{{ site.baseurl }}/dev/table/index.html" shape="rect" coords="544,0,624,177" />
 <area id="datastream" title="DataStream API" href="{{ site.baseurl }}/dev/datastream_api.html" shape="rect" coords="64,177,379,255" />
diff --git a/docs/redirects/ml.md b/docs/redirects/als.md
similarity index 89%
copy from docs/redirects/ml.md
copy to docs/redirects/als.md
index 9e1bd85..65762bc 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/als.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: Alternating Least Squares
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/als.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/contribution_guide.md
similarity index 89%
copy from docs/redirects/ml.md
copy to docs/redirects/contribution_guide.md
index 9e1bd85..cb2f2bf 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/contribution_guide.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: How to Contribute
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/contribution_guide.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/cross_validation.md
similarity index 89%
copy from docs/redirects/ml.md
copy to docs/redirects/cross_validation.md
index 9e1bd85..62e4c6f 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/cross_validation.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: Cross Validation
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/cross_validation.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/distance_metrics.md
similarity index 89%
copy from docs/redirects/ml.md
copy to docs/redirects/distance_metrics.md
index 9e1bd85..5a72119 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/distance_metrics.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: Distance Metrics
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/distance_metrics.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/flinkml_quickstart.md
similarity index 89%
copy from docs/redirects/ml.md
copy to docs/redirects/flinkml_quickstart.md
index 9e1bd85..ad67dd2 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/flinkml_quickstart.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: Quickstart Guide
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/quickstart.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/knn.md
similarity index 89%
copy from docs/redirects/ml.md
copy to docs/redirects/knn.md
index 9e1bd85..fc7f2c5 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/knn.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: k-Nearest Neighbors Join
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/knn.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/min_max_scaler.md
similarity index 89%
copy from docs/redirects/ml.md
copy to docs/redirects/min_max_scaler.md
index 9e1bd85..9bd62ad 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/min_max_scaler.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: MinMax Scaler
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/min_max_scaler.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/ml.md
index 9e1bd85..ac38e2b 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/ml.md
@@ -1,8 +1,8 @@
 ---
 title: "ML"
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/index.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/multiple_linear_regression.md
similarity index 87%
copy from docs/redirects/ml.md
copy to docs/redirects/multiple_linear_regression.md
index 9e1bd85..8555d7b 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/multiple_linear_regression.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: Multiple Linear Regression
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/multiple_linear_regression.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/optimization.md
similarity index 89%
copy from docs/redirects/ml.md
copy to docs/redirects/optimization.md
index 9e1bd85..9bca323 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/optimization.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: Optimization
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/optimization.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/pipelines.md
similarity index 88%
copy from docs/redirects/ml.md
copy to docs/redirects/pipelines.md
index 9e1bd85..a8660ab 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/pipelines.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: Looking under the hood of pipelines
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/pipelines.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/polynomial_features.md
similarity index 88%
copy from docs/redirects/ml.md
copy to docs/redirects/polynomial_features.md
index 9e1bd85..e812dee 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/polynomial_features.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: Polynomial Features
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/polynomial_features.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/sos.md
similarity index 89%
copy from docs/redirects/ml.md
copy to docs/redirects/sos.md
index 9e1bd85..b839b73 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/sos.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: Stochastic Outlier Selection
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/sos.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/standard_scaler.md
similarity index 89%
copy from docs/redirects/ml.md
copy to docs/redirects/standard_scaler.md
index 9e1bd85..96db633 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/standard_scaler.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: Standard Scaler
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/standard_scaler.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/redirects/ml.md b/docs/redirects/svm.md
similarity index 89%
copy from docs/redirects/ml.md
copy to docs/redirects/svm.md
index 9e1bd85..d84d270 100644
--- a/docs/redirects/ml.md
+++ b/docs/redirects/svm.md
@@ -1,8 +1,8 @@
 ---
-title: "ML"
+title: SVM using CoCoA
 layout: redirect
-redirect: /dev/libs/ml/index.html
-permalink: /apis/batch/libs/ml/index.html
+redirect: /index.html
+permalink: /dev/libs/ml/svm.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one