
Posted to issues@flink.apache.org by thvasilo <gi...@git.apache.org> on 2015/06/05 11:14:27 UTC

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

GitHub user thvasilo opened a pull request:

    https://github.com/apache/flink/pull/792

    [FLINK-2072] [ml]  [docs] Add a quickstart guide for FlinkML

    This is an initial version of the quickstart guide. There are some issues that still need to be addressed such as the validity of standardizing the data, and whether the complete code example should be included in an examples package for FlinkML.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thvasilo/flink quickstart-ml

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/792.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #792
    
----
commit 27487ec6089adbea77266f194582ae476e50e928
Author: Theodore Vasiloudis <tv...@sics.se>
Date:   2015-06-05T09:09:11Z

    Initial version of quickstart guide

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896076
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    --- End diff --
    
    period missing



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197086

--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
* This will be replaced by the TOC
{:toc}

The members of `LabeledVector` are actually `(label, features)`.
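
To make the argument order concrete, here is a minimal sketch. It assumes the `LabeledVector` and `DenseVector` case classes from the `flink-ml` artifact referenced in the diff; the object name `LabeledVectorOrder` and the feature values are illustrative only.

```scala
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector

object LabeledVectorOrder {
  // The label comes first, the feature vector second.
  // The numbers are illustrative, not taken from a verified data row.
  val example: LabeledVector =
    LabeledVector(1.0, DenseVector(Array(30.0, 64.0, 1.0)))

  def main(args: Array[String]): Unit = {
    println(s"label = ${example.label}")
    println(s"features = ${example.vector}")
  }
}
```

This matches the way the quoted code builds its elements, with the label passed as the first constructor argument.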


[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197636

--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
* This will be replaced by the TOC
{:toc}

-Coming soon.
+## Introduction
+
+FlinkML is designed to make learning from your data a straight-forward process, abstracting away
+the complexities that usually come with having to deal with big data learning tasks. In this
+quick-start guide we will show just how easy it is to solve a simple supervised learning problem
+using FlinkML. But first some basics, feel free to skip the next few lines if you're already
+familiar with Machine Learning (ML).
+
+As defined by Murphy [1] ML deals with detecting patterns in data, and using those
+learned patterns to make predictions about the future. We can categorize most ML algorithms into
+two major categories: Supervised and Unsupervised Learning.
+
+* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
+(features) to a set of outputs. The learning is done using a *training set* of (input,
+output) pairs that we use to approximate the mapping function. Supervised learning problems are
+further divided into classification and regression problems. In classification problems we try to
+predict the *class* that an example belongs to, for example whether a user is going to click on
+an ad or not. Regression problems one the other hand, are about predicting (real) numerical
+values, often called the dependent variable, for example what the temperature will be tomorrow.
+
+* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
+of this would be *clustering*, where we try to discover groupings of the data from the
+descriptive features. Unsupervised learning can also be used for feature selection, for example
+through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
+
+## Linking with FlinkML
+
+In order to use FlinkML in you project, first you have to
+[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
+Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
+
+{% highlight xml %}
+<dependency>
+ <groupId>org.apache.flink</groupId>
+ <artifactId>flink-ml</artifactId>
+ <version>{{site.version }}</version>
+</dependency>
+{% endhighlight %}
+
+## Loading data
+
+To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
+functions for formatted data, such as the LibSVM format. For supervised learning problems it is
+common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
+object will have a FlinkML `Vector` member representing the features of the example and a `Double`
+member which represents the label, which could be the class in a classification problem, or the dependent
+variable for a regression problem.
+
+As an example, we can use Haberman's Survival Data Set , which you can
+[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
+This dataset *"contains cases from study conducted on the survival of patients who had undergone
--- End diff --

This was copied verbatim from the dataset description; I will change it to "a study".

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/792#issuecomment-111035938
  
    Great work @thvasilo. I really like the quickstart guide. I had only some minor comments. Once they are addressed, it's good to be merged :-)



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31897648
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    --- End diff --
    
    Yes, because SparseVectors caused some trouble when I tried them. I could test again and file a JIRA issue if the problem persists.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/792#issuecomment-109920114
  
    Great work @thvasilo. I like the quickstart guide a lot. There are only some minor comments I had. Then it's good to be merged :-)



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32197690
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [1] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
    +(features) to a set of outputs. The learning is done using a *training set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems one the other hand, are about predicting (real) numerical
    +values, often called the dependent variable, for example what the temperature will be tomorrow.
    +
    +* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Linking with FlinkML
    +
    +In order to use FlinkML in you project, first you have to
    +[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
    +
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +
    +## Loading data
    +
    +To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +As an example, we can use Haberman's Survival Data Set , which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
    +This dataset *"contains cases from study conducted on the survival of patients who had undergone
    +surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
    +are the features and last column is the class, and the 4th column indicates whether the patient
    +survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI
    +page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data.
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.api.scala.ExecutionEnvironment
    +
    +val env = ExecutionEnvironment.createLocalEnvironment(2)
    +
    +val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
    +
    +{% endhighlight %}
    +
    +We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
    +is the class label, and the rest are features, so we can build `LabeledVector` elements like this:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.DenseVector
    +
    +val survivalLV = survival
    +  .map{tuple =>
    +    val list = tuple.productIterator.toList
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
    +  }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner. We will however use another dataset to exemplify
    +building a learner; that will allow us to show how we can import other dataset formats.
    +
    +**LibSVM files**
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    --- End diff --
    
    :+1: 



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31925630
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    --- End diff --
    
    Hmm, for FlinkML it's probably OK to have some example programs that work on a recommended data set but can also be run with different data sets.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31909954
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    --- End diff --
    
    Would that be a generic example for SVMs? Meaning that you can give them any libSVM file?
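
As a rough sketch of what such a generic example could look like, the loader below takes the LibSVM file path as a program argument. It assumes the `MLUtils.readLibSVM(env, path)` form that takes the execution environment as its first argument; the object name `LibSVMLoader`, the helper `load`, and the fallback path are hypothetical. Depending on the Flink version, `print()` may or may not trigger execution on its own.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.MLUtils
import org.apache.flink.ml.common.LabeledVector

object LibSVMLoader {
  // Hypothetical helper: reads any LibSVM-formatted file into labeled vectors.
  def load(env: ExecutionEnvironment, path: String): DataSet[LabeledVector] =
    MLUtils.readLibSVM(env, path)

  def main(args: Array[String]): Unit = {
    // Assumed CLI contract: the first argument is the path to a LibSVM file.
    val path = args.headOption.getOrElse("/path/to/a8a")
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Show a handful of examples to confirm the file was parsed.
    load(env, path).first(5).print()
  }
}
```

Because the path is a parameter, the same program can run on the recommended data set or on any other file in the LibSVM format.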



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197679

--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
* This will be replaced by the TOC
{:toc}

Good idea, I will use that instead.

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31909835
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first, some basics: feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet` of `String` tuples first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    // 11 columns: sample ID (0), nine features (1-9), class label (10)
    +    LabeledVector(numList(10), DenseVector(numList.slice(1, 10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can then import the dataset using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM(env, "path/to/a8a")
    +val adultTest = MLUtils.readLibSVM(env, "path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error.
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    --- End diff --
    
    Fair enough.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32198348
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first, some basics: feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [1] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
    +(features) to a set of outputs. The learning is done using a *training set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
    +values, often called the dependent variable, for example what the temperature will be tomorrow.
    +
    +* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Linking with FlinkML
    +
    +In order to use FlinkML in your project, first you have to
    +[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
    +
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +
    +## Loading data
    +
    +To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +As an example, we can use Haberman's Survival Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data).
    +This dataset *"contains cases from study conducted on the survival of patients who had undergone
    +surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
    +are the features and the 4th column is the class label, which indicates whether the patient
    +survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI
    +page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data.
    +
    +We can load the data as a `DataSet` of `String` tuples first:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.api.scala._
    +
    +val env = ExecutionEnvironment.createLocalEnvironment(2)
    +
    +val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
    +
    +{% endhighlight %}
    +
    +We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
    +is the class label, and the rest are features, so we can build `LabeledVector` elements like this:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.DenseVector
    +
    +val survivalLV = survival
    +  .map{tuple =>
    +    val list = tuple.productIterator.toList
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
    +  }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner. We will however use another dataset to exemplify
    +building a learner; that will allow us to show how we can import other dataset formats.
    +
    +**LibSVM files**
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the svmguide1 dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t).
    +This is an astroparticle binary classification dataset, used by Hsu et al. [3] in their practical
    --- End diff --
    
    If you do something like this it should work: `[[1]](#[1])` to mark the anchor link and `<a name="[1]"></a>` for the anchor.




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32197589
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first, some basics: feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [1] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
    +(features) to a set of outputs. The learning is done using a *training set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
    +values, often called the dependent variable, for example what the temperature will be tomorrow.
    +
    +* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Linking with FlinkML
    +
    +In order to use FlinkML in your project, first you have to
    +[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
    +
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +
    +## Loading data
    +
    +To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +As an example, we can use Haberman's Survival Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data).
    +This dataset *"contains cases from study conducted on the survival of patients who had undergone
    +surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
    +are the features and the 4th column is the class label, which indicates whether the patient
    +survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI
    +page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data.
    +
    +We can load the data as a `DataSet` of `String` tuples first:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.api.scala._
    +
    +val env = ExecutionEnvironment.createLocalEnvironment(2)
    +
    +val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
    +
    +{% endhighlight %}
    +
    +We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
    +is the class label, and the rest are features, so we can build `LabeledVector` elements like this:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.DenseVector
    +
    +val survivalLV = survival
    +  .map{tuple =>
    +    val list = tuple.productIterator.toList
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
    +  }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner. We will however use another dataset to exemplify
    +building a learner; that will allow us to show how we can import other dataset formats.
    +
    +**LibSVM files**
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the svmguide1 dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t).
    +This is an astroparticle binary classification dataset, used by Hsu et al. [3] in their practical
    +Support Vector Machine (SVM) guide. It contains 4 numerical features, and the class label.
    +
    +We can then import the dataset using:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.MLUtils
    +
    +val astroTrain = MLUtils.readLibSVM(env, "/path/to/svmguide1")
    +val astroTest = MLUtils.readLibSVM(env, "/path/to/svmguide1.t")
    +
    +{% endhighlight %}
    +
    +This gives us two `DataSet[LabeledVector]` objects that we will use in the following section to
    +create a classifier.
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +We can set a number of parameters for the classifier. Here we set the `Blocks` parameter,
    +which the underlying CoCoA algorithm [2] uses to split the input. The regularization
    +parameter determines the amount of $l_2$ regularization applied, which is used
    +to avoid overfitting. The step size determines the contribution of the weight vector updates to
    +the next weight vector value. This parameter sets the initial step size.
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.classification.SVM
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(astroTrain)
    +
    +{% endhighlight %}
    +
    +We can now make predictions on the test set.
    +
    +{% highlight scala %}
    +
    +val predictionPairs = svm.predict(astroTest)
    +
    +{% endhighlight %}
    +
    +Next we will see how we can pre-process our data, and use the ML pipelines capabilities of FlinkML.
    +
    +## Data pre-processing and pipelines
    +
    +A pre-processing step that is often encouraged [3] when using SVM classification is scaling
    --- End diff --
    
    [3] link?



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/792



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31897731
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first, some basics: feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet` of `String` tuples first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    // 11 columns: sample ID (0), nine features (1-9), class label (10)
    +    LabeledVector(numList(10), DenseVector(numList.slice(1, 10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can then import the dataset using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM(env, "path/to/a8a")
    +val adultTest = MLUtils.readLibSVM(env, "path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error.
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    --- End diff --
    
    My thinking was actually to remove the error calculations for now, and include them at a later point when we have proper evaluation facilities.
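
For reference, the thresholding and error calculation under discussion can be sketched on a local collection in plain Scala, without any Flink dependency. `ThresholdedErrorSketch`, `threshold`, and `errorRate` are hypothetical names for illustration only, and the sample pairs are made up. With labels in {-1, 1}, every misclassified example contributes exactly 2 to the absolute error sum computed in the diff, so the misclassification rate below carries the same information in a normalized form.

```scala
// Hypothetical sketch, not a FlinkML evaluation facility.
object ThresholdedErrorSketch {
  // Maps a raw decision value to the {-1, 1} label scale, as in the diff.
  def threshold(rawPrediction: Double): Double =
    if (rawPrediction > 0.0) 1.0 else -1.0

  // Fraction of (truth, rawPrediction) pairs whose thresholded
  // prediction differs from the true label.
  def errorRate(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one (truth, prediction) pair")
    val misclassified = pairs.count { case (truth, raw) => threshold(raw) != truth }
    misclassified.toDouble / pairs.size
  }

  def main(args: Array[String]): Unit = {
    // Made-up (truth, raw prediction) pairs; two of the four are misclassified.
    val pairs = Seq((1.0, 0.7), (-1.0, 0.2), (-1.0, -1.3), (1.0, -0.1))
    println(errorRate(pairs))
  }
}
```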



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896684
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first, some basics: feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can first load the data as a `DataSet` of `String` tuples:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(10), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error.
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    --- End diff --
    
    Shouldn't we divide the error by two? Otherwise each error will add `2` to the absolute error sum.
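
To make the arithmetic concrete: with labels in {-1, 1}, every wrong prediction gives |truth - prediction| = 2, so halving the sum yields the number of misclassified examples. A minimal sketch on plain Scala collections, standing in for the collected `DataSet[(Double, Double)]`; `ErrorCount` and `misclassified` are hypothetical names used only for illustration.

```scala
object ErrorCount {
  // Each pair is (truth, thresholded prediction), both assumed to be -1.0 or 1.0.
  // A wrong prediction contributes |truth - prediction| = 2, so dividing the
  // absolute error sum by two gives the number of misclassified examples.
  def misclassified(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (truth, prediction) => math.abs(truth - prediction) }.sum / 2.0
}
```

In the quickstart this would correspond to dividing `absoluteErrorSum` by `2.0` before printing it.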



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31909877
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can first load the data as a `DataSet` of `String` tuples:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(10), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error.
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error: $absoluteErrorSum")
    +
    +{% endhighlight %}
    +
    +Next we will see if we can improve the performance by pre-processing our data.
    +
    +## Data pre-processing and pipelines
    +
    +A pre-processing step that is often encouraged when using SVM classification is scaling
    +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest.
    +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to
    +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions
    +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML,
    +[here](pipelines.html).
    +
    +Let's first create a scaling transformer for the features in our dataset, and chain it to a new SVM classifier.
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.preprocessing.StandardScaler
    +
    +val scaler = StandardScaler()
    +scaler.fit(adultTrain)
    +
    +val scaledSVM = scaler.chainPredictor(svm)
    +
    +{% endhighlight %}
    +
    +We can now use our newly created pipeline to make predictions on the test set. 
    +First we call fit again, to train the scaler and the SVM classifier.
    +The data of the test set will then be automatically scaled before being passed on to the SVM to 
    +make predictions.
    +
    +{% highlight scala %}
    +
    +scaledSVM.fit(adultTrain)
    +
    +val predictionPairsScaled = thresholdPredictions(scaledSVM.predict(adjustedTest))
    +
    +val absoluteErrorSumScaled = predictionPairsScaled.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error with scaled features: $absoluteErrorSumScaled")
    +
    +{% endhighlight %}
    +
    +The effect that the transformation has on the error for this dataset is a bit unpredictable.
    +In reality the scaling transformation does
    +not fit the dataset we are using, since the features are translated categorical features and as
    +such, operations like normalization and standard scaling do not make much sense.
    --- End diff --
    
    Ok, I'll try to push the `MinMaxScaler`.
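
For reference, min-max scaling maps each feature value x to (x - min) / (max - min), which is what rescaling to the [0, 1] range mentioned in the guide amounts to. A minimal sketch in plain Scala, not the FlinkML `MinMaxScaler` itself (its API is not shown in this thread); `MinMaxSketch` and `scale` are hypothetical names, and mapping constant columns to 0.0 is an assumption made here.

```scala
object MinMaxSketch {
  // Rescale every column of `rows` to the [0, 1] range.
  // Assumption: a constant column (max == min) is mapped to 0.0.
  def scale(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    require(rows.nonEmpty, "need at least one row")
    val dims = rows.head.length
    // Per-column minimum and maximum over all rows.
    val mins = Array.tabulate(dims)(j => rows.map(_(j)).min)
    val maxs = Array.tabulate(dims)(j => rows.map(_(j)).max)
    rows.map { row =>
      Array.tabulate(dims) { j =>
        val range = maxs(j) - mins(j)
        if (range == 0.0) 0.0 else (row(j) - mins(j)) / range
      }
    }
  }
}
```

The minimum and maximum should be computed on the training set only and then reused for the test set, which is the role a chained transformer plays in the pipeline.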



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32197502
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [1] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
    +(features) to a set of outputs. The learning is done using a *training set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
    +values, often called the dependent variable, for example what the temperature will be tomorrow.
    +
    +* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Linking with FlinkML
    +
    +In order to use FlinkML in your project, first you have to
    +[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
    +
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +
    +## Loading data
    +
    +To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +As an example, we can use Haberman's Survival Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data).
    +This dataset *"contains cases from study conducted on the survival of patients who had undergone
    +surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
    +are the features and the last (4th) column is the class, which indicates whether the patient
    +survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI
    +page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data.
    +
    +We can first load the data as a `DataSet` of `String` tuples:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.api.scala.ExecutionEnvironment
    +
    +val env = ExecutionEnvironment.createLocalEnvironment(2)
    +
    +val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
    +
    +{% endhighlight %}
    +
    +We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
    +is the class label, and the rest are features, so we can build `LabeledVector` elements like this:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.DenseVector
    +
    +val survivalLV = survival
    +  .map{tuple =>
    +    val list = tuple.productIterator.toList
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
    +  }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner. We will however use another dataset to exemplify
    +building a learner; that will allow us to show how we can import other dataset formats.
    +
    +**LibSVM files**
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the svmguide1 dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t).
    +This is an astroparticle binary classification dataset, used by Hsu et al. [3] in their practical
    --- End diff --
    
    Maybe we can directly link to the paper or at least to the references at the bottom of the page.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32198054
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [1] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
    +(features) to a set of outputs. The learning is done using a *training set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
    +values, often called the dependent variable, for example what the temperature will be tomorrow.
    +
    +* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Linking with FlinkML
    +
    +In order to use FlinkML in your project, first you have to
    +[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
    +
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +
    +## Loading data
    +
    +To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +As an example, we can use Haberman's Survival Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data).
    +This dataset *"contains cases from study conducted on the survival of patients who had undergone
    +surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
    +are the features and the last (4th) column is the class, which indicates whether the patient
    +survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI
    +page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data.
    +
    +We can first load the data as a `DataSet` of `String` tuples:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.api.scala.ExecutionEnvironment
    +
    +val env = ExecutionEnvironment.createLocalEnvironment(2)
    +
    +val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
    +
    +{% endhighlight %}
    +
    +We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
    +is the class label, and the rest are features, so we can build `LabeledVector` elements like this:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.DenseVector
    +
    +val survivalLV = survival
    +  .map{tuple =>
    +    val list = tuple.productIterator.toList
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
    +  }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner. We will however use another dataset to exemplify
    +building a learner; that will allow us to show how we can import other dataset formats.
    +
    +**LibSVM files**
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the svmguide1 dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t).
    +This is an astroparticle binary classification dataset, used by Hsu et al. [3] in their practical
    --- End diff --
    
    That can be done with anchor links, which I tried for the other docs and they didn't seem to work properly. I can try this again.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197018

--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
* This will be replaced by the TOC
{:toc}

Nicely done with the site version :+1:

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897426

--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
* This will be replaced by the TOC
{:toc}

It's more of a statistics terminology, see [synonyms](http://en.wikipedia.org/wiki/Dependent_and_independent_variables#Statistics_synonyms). In ML features is more common so I will change it to that.

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32008548
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    --- End diff --
    
    This is already shown in the "Getting Started" section of the landing page, from which this guide is linked.
    Do you think it's worth putting it here as well?



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31911936
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    --- End diff --
    
    That's a good question. By making it generic we would have to support a few more things to ensure that the provided dataset can be used.
    
    Do you think writing an example for a specific dataset is a bad idea?



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32197203

    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}

    "from a study" or "from studies"?

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896308

    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}

    Isn't the TODO fixed?

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31902170
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error: $absoluteErrorSum")
    +
    +{% endhighlight %}
    +
    +Next we will see if we can improve the performance by pre-processing our data.
    +
    +## Data pre-processing and pipelines
    +
    +A pre-processing step that is often encouraged when using SVM classification is scaling
    +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest.
    +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to
    +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions
    +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML,
    +[here](pipelines.html).
    +
    +Let first create a scaling transformer for the features in our dataset, and chain it to a new SVM classifier.
    --- End diff --
    
    Will fix.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31897350
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    --- End diff --
    
    Maybe we should also explain again how to set up a project using FlinkML, e.g. by stating which dependencies users have to add to their `pom.xml`.
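
    As a rough sketch of what such an entry could look like (not taken from the guide itself): this assumes FlinkML is published under the `org.apache.flink` group with the artifact ID `flink-ml`, and `flink.version` is a hypothetical property standing in for whatever Flink version the project builds against. The exact coordinates should be checked against the landing page.

    ```xml
    <!-- Sketch only: assumes the FlinkML artifact is org.apache.flink:flink-ml -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-ml</artifactId>
      <!-- flink.version is a hypothetical property defined elsewhere in the pom -->
      <version>${flink.version}</version>
    </dependency>
    ```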



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31897818
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    --- End diff --
    
    Will add.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32018559
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    --- End diff --
    
    Yes I think so. Just for the sake of completeness.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32198150

    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}

    Hmm I was not aware of this ;-)

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32197505

    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}

    Good catch

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31897278
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    --- End diff --
    
    Maybe we should also give the imports the user has to make in order to use these things.
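
As an illustration of what such imports might look like, here is a minimal sketch. It is not taken from the pull request. The package names are assumed from the flink-ml module layout of that period (the `SVM.scala` path under `org/apache/flink/ml/classification` appears in the QA log later in this thread), and the two-argument form of `readLibSVM` that takes the `ExecutionEnvironment` first is likewise an assumption. The object name `QuickstartImports` is hypothetical.

```scala
// Sketch only: assumed imports for the quickstart snippets quoted above.
import org.apache.flink.api.scala._               // ExecutionEnvironment, DataSet, implicit type information
import org.apache.flink.ml.MLUtils                // readLibSVM / writeLibSVM (assumed location)
import org.apache.flink.ml.common.LabeledVector   // (label, features) examples (assumed location)
import org.apache.flink.ml.classification.SVM     // linear SVM predictor

object QuickstartImports {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Assumption: readLibSVM takes the environment as its first argument.
    val adultTrain: DataSet[LabeledVector] = MLUtils.readLibSVM(env, "path/to/a8a")
    val adultTest: DataSet[LabeledVector] = MLUtils.readLibSVM(env, "path/to/a8a.t")

    // Same configuration calls as in the quoted diff, shortened.
    val svm = SVM()
      .setBlocks(env.getParallelism)
      .setIterations(100)

    // fit only adds the training step to the plan; nothing is executed
    // until a sink or an action such as collect() is triggered.
    svm.fit(adultTrain)
  }
}
```

Running this requires the flink-scala and flink-ml artifacts on the classpath and the two dataset files at the placeholder paths.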



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896095
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    --- End diff --
    
    missing link?



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by flinkqa <gi...@git.apache.org>.

Github user flinkqa commented on the pull request:

    https://github.com/apache/flink/pull/792#issuecomment-109311974
  
    Tested pull request. Result:
    From http://git-wip-us.apache.org/repos/asf/flink
       92d3251..9b88184  master     -> origin/master
    Updating 11643c0..9b88184
    Fast-forward
     docs/README.md                                     |   4 +-
     docs/apis/iterations.md                            |   2 +-
     docs/apis/streaming_guide.md                       |  19 +-
     docs/apis/web_client.md                            |   2 +-
     docs/scala_shell.md                                |   2 +-
     .../common/operators/base/JoinOperatorBase.java    |   2 +-
     .../flink/api/common/typeinfo/TypeInformation.java |   3 -
     flink-dist/pom.xml                                 |  10 +
     flink-dist/src/main/flink-bin/bin/flink            |   4 +-
     .../PlanUnwrappingSortedReduceGroupOperator.java   |   8 +-
     .../runtime/io/network/netty/NettyMessage.java     |  20 +
     .../io/network/netty/PartitionRequestClient.java   |  15 +-
     .../io/network/netty/PartitionRequestQueue.java    |   6 +
     .../netty/PartitionRequestServerHandler.java       |   4 +
     .../jobgraph/tasks/CheckpointedOperator.java       |  16 +-
     .../runtime/operators/hash/MutableHashTable.java   |   6 +-
     .../netty/NettyMessageSerializationTest.java       |   7 +
     .../runtime/operators/hash/HashTableITCase.java    |  76 ++++
     flink-staging/flink-hbase/pom.xml                  |   6 +
     .../hbase/example/HBaseWriteStreamExample.java     | 119 ++++++
     .../org/apache/flink/ml/classification/SVM.scala   |   8 +-
     .../org/apache/flink/ml/pipeline/Estimator.scala   |   2 +-
     .../org/apache/flink/ml/pipeline/Predictor.scala   |   2 +-
     .../ml/regression/MultipleLinearRegression.scala   | 362 ++---------------
     .../apache/flink/ml/pipeline/PipelineITSuite.scala |   6 +-
     .../MultipleLinearRegressionITSuite.scala          |  18 +-
     .../flink/ml/regression/RegressionData.scala       |   4 +-
     .../apache/flink/api/scala/ScalaShellITSuite.scala |   8 +-
     .../connectors/kafka/KafkaProducerExample.java     |  65 ++-
     .../connectors/kafka/api/KafkaSource.java          |  91 +++--
     .../api/persistent/PersistentKafkaSource.java      | 105 ++---
     .../streaming/connectors/kafka/KafkaITCase.java    | 252 +++++++-----
     .../streaming/connectors/rabbitmq/RMQSource.java   |  49 +--
     .../connectors/twitter/TwitterSource.java          |  47 +--
     .../connectors/twitter/TwitterStreaming.java       |  32 +-
     .../connectors/twitter/TwitterTopology.java        |  42 +-
     .../flink/streaming/api/datastream/DataStream.java |  56 +--
     .../api/functions/source/ConnectorSource.java      |   4 +-
     .../functions/source/FileMonitoringFunction.java   |  90 ++---
     .../api/functions/source/FileSourceFunction.java   |  43 +-
     .../api/functions/source/FromElementsFunction.java |  26 +-
     .../api/functions/source/FromIteratorFunction.java |  14 +-
     .../source/FromSplittableIteratorFunction.java     |  18 +-
     .../functions/source/ParallelSourceFunction.java   |   2 +-
     .../source/RichParallelSourceFunction.java         |  11 +-
     .../functions/source/SocketTextStreamFunction.java | 149 ++++---
     .../api/functions/source/SourceFunction.java       | 126 +++---
     .../api/operators/AbstractStreamOperator.java      |   2 +-
     .../api/operators/AbstractUdfStreamOperator.java   |   9 +-
     .../streaming/api/operators/StreamSource.java      |  30 +-
     .../streaming/runtime/tasks/SourceStreamTask.java  |  20 +-
     .../flink/streaming/runtime/tasks/StreamTask.java  |  12 +-
     .../streaming/api/ChainedRuntimeContextTest.java   |   6 +-
     .../apache/flink/streaming/api/TypeFillTest.java   |  11 +-
     .../api/complex/ComplexIntegrationTest.java        |  79 ++--
     .../operators/windowing/WindowIntegrationTest.java |  28 +-
     .../streaming/api/streamtask/StreamVertexTest.java |  15 +-
     .../runtime/tasks/SourceStreamTaskTest.java        | 248 ++++++++++++
     .../runtime/tasks/StreamMockEnvironment.java       | 277 +++++++++++++
     .../runtime/tasks/StreamTaskTestBase.java          | 109 +++++
     .../apache/flink/streaming/util/MockSource.java    |  18 +-
     .../streaming/util/SocketProgramITCaseBase.java}   |  15 +-
     .../flink-streaming-examples/pom.xml               |  23 --
     .../examples/iteration/IterateExample.java         |  22 +-
     .../flink/streaming/examples/join/WindowJoin.java  |  45 ++-
     .../examples/ml/IncrementalLearningSkeleton.java   |  77 ++--
     .../examples/windowing/SessionWindowing.java       |  55 +--
     .../streaming/examples/windowing/StockPrices.java  | 438 ---------------------
     .../windowing/TopSpeedWindowingExample.java        | 104 ++---
     .../util/TopSpeedWindowingExampleData.java         |  43 +-
     .../streaming/scala/examples/join/WindowJoin.scala | 124 ++++--
     .../socket/SocketTextStreamWordCount.scala         |   2 +-
     .../scala/examples/windowing/StockPrices.scala     | 228 -----------
     .../examples/windowing/TopSpeedWindowing.scala     |  67 +++-
     .../iteration/IterateExampleITCase.java            |   2 +-
     .../join/WindowJoinITCase.java                     |   2 +-
     .../ml/IncrementalLearningSkeletonITCase.java      |   2 +-
     .../socket/SocketTextStreamWordCountITCase.java    |  30 ++
     .../twitter/TwitterStreamITCase.java               |   2 +-
     .../windowing/SessionWindowingITCase.java          |   2 +-
     .../windowing/TopSpeedWindowingExampleITCase.java  |   2 +-
     .../windowing/WindowWordCountITCase.java           |   2 +-
     .../wordcount/PojoExampleITCase.java               |   2 +-
     .../wordcount/WordCountITCase.java                 |   2 +-
     .../join/WindowJoinITCase.java                     |  50 +++
     .../socket/SocketTextStreamWordCountITCase.java    |  30 ++
     .../windowing/TopSpeedWindowingExampleITCase.java  |  45 +++
     .../flink/streaming/api/scala/DataStream.scala     |  23 +-
     .../api/scala/StreamExecutionEnvironment.scala     |  17 +-
     .../flink/examples/java/JavaTableExample.java      |   4 +-
     .../flink/api/java/table/TableEnvironment.scala    |  22 +-
     .../api/java/table/test/AggregationsITCase.java    |  25 +-
     .../apache/flink/api/java/table/test/AsITCase.java |  25 +-
     .../flink/api/java/table/test/CastingITCase.java   |  13 +-
     .../api/java/table/test/ExpressionsITCase.java     |  25 +-
     .../flink/api/java/table/test/FilterITCase.java    |  21 +-
     .../java/table/test/GroupedAggregationsITCase.java |  13 +-
     .../flink/api/java/table/test/JoinITCase.java      |  43 +-
     .../flink/api/java/table/test/SelectITCase.java    |  25 +-
     .../java/table/test/StringExpressionsITCase.java   |  17 +-
     .../checkpointing/StreamCheckpointingITCase.java   | 147 +++++--
     .../flink/test/javaApiOperators/FirstNITCase.java  |  29 ++
     .../AbstractProcessFailureRecoveryTest.java        |  13 +-
     .../ProcessFailureStreamingRecoveryITCase.java     | 180 ++++-----
     pom.xml                                            |   6 +-
     105 files changed, 2641 insertions(+), 2250 deletions(-)
     create mode 100644 flink-staging/flink-hbase/src/test/java/org/apache/flink/addons/hbase/example/HBaseWriteStreamExample.java
     create mode 100644 flink-staging/flink-streaming/flink-streaming-core/src/test/java/org/apache/flink/streaming/runtime/tasks/SourceStreamTaskTest.java
     create mode 100644 flink-staging/flink-streaming/flink-streaming-core/src/test/java/org/apache/flink/streaming/runtime/tasks/StreamMockEnvironment.java
     create mode 100644 flink-staging/flink-streaming/flink-streaming-core/src/test/java/org/apache/flink/streaming/runtime/tasks/StreamTaskTestBase.java
     rename flink-staging/flink-streaming/{flink-streaming-examples/src/test/java/org/apache/flink/streaming/examples/test/socket/SocketTextStreamWordCountITCase.java => flink-streaming-core/src/test/java/org/apache/flink/streaming/util/SocketProgramITCaseBase.java} (82%)
     delete mode 100644 flink-staging/flink-streaming/flink-streaming-examples/src/main/java/org/apache/flink/streaming/examples/windowing/StockPrices.java
     delete mode 100644 flink-staging/flink-streaming/flink-streaming-examples/src/main/scala/org/apache/flink/streaming/scala/examples/windowing/StockPrices.scala
     rename flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/{examples/test => test/exampleJavaPrograms}/iteration/IterateExampleITCase.java (95%)
     rename flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/{examples/test => test/exampleJavaPrograms}/join/WindowJoinITCase.java (96%)
     rename flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/{examples/test => test/exampleJavaPrograms}/ml/IncrementalLearningSkeletonITCase.java (95%)
     create mode 100644 flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/test/exampleJavaPrograms/socket/SocketTextStreamWordCountITCase.java
     rename flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/{examples/test => test/exampleJavaPrograms}/twitter/TwitterStreamITCase.java (95%)
     rename flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/{examples/test => test/exampleJavaPrograms}/windowing/SessionWindowingITCase.java (95%)
     rename flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/{examples/test => test/exampleJavaPrograms}/windowing/TopSpeedWindowingExampleITCase.java (96%)
     rename flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/{examples/test => test/exampleJavaPrograms}/windowing/WindowWordCountITCase.java (96%)
     rename flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/{examples/test => test/exampleJavaPrograms}/wordcount/PojoExampleITCase.java (95%)
     rename flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/{examples/test => test/exampleJavaPrograms}/wordcount/WordCountITCase.java (95%)
     create mode 100644 flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/test/exampleScalaPrograms/join/WindowJoinITCase.java
     create mode 100644 flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/test/exampleScalaPrograms/socket/SocketTextStreamWordCountITCase.java
     create mode 100644 flink-staging/flink-streaming/flink-streaming-examples/src/test/java/org/apache/flink/streaming/test/exampleScalaPrograms/windowing/TopSpeedWindowingExampleITCase.java
    Running ./tools/qa-check.sh
    Computing Flink QA-Check results (please be patient).
    :-1: The change increases the number of javadoc errors from 402 to 548
    :-1: The change increases the number of compiler warnings from 702 to 850
    ```diff
    First 100 warnings:
    1,144c1,285
    < [WARNING] bootstrap class path not set in conjunction with -source 1.6
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/util/OperatorUtil.java:[31,45] org.apache.flink.api.common.functions.GenericCollectorMap in org.apache.flink.api.common.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/base/CollectorMapOperatorBase.java:[24,45] org.apache.flink.api.common.functions.GenericCollectorMap in org.apache.flink.api.common.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/ExecutionConfig.java:[579,25] found raw type: org.apache.flink.api.common.ExecutionConfig.Entry
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/ExecutionConfig.java:[613,23] serializable class org.apache.flink.api.common.ExecutionConfig.GlobalJobParameters has no definition of serialVersionUID
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeutils/record/RecordPairComparator.java:[50,40] found raw type: org.apache.flink.types.Key
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeutils/record/RecordPairComparator.java:[51,40] found raw type: org.apache.flink.types.Key
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/types/Record.java:[1782,46] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/Operator.java:[231,81] found raw type: org.apache.flink.api.common.operators.Operator
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/AbstractUdfOperator.java:[142,40] found raw type: java.lang.Class
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/AbstractUdfOperator.java:[154,40] found raw type: java.lang.Class
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/DualInputOperator.java:[229,91] found raw type: org.apache.flink.api.common.operators.Operator
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/DualInputOperator.java:[241,91] found raw type: org.apache.flink.api.common.operators.Operator
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/Ordering.java:[116,47] found raw type: java.lang.Class
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/core/memory/MemorySegment.java:[968,38] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/SingleInputOperator.java:[130,83] found raw type: org.apache.flink.api.common.operators.Operator
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/SingleInputOperator.java:[153,89] found raw type: org.apache.flink.api.common.operators.Operator
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/base/GroupReduceOperatorBase.java:[157,50] unchecked cast
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeinfo/BasicTypeInfo.java:[56,8] serializable class org.apache.flink.api.common.typeinfo.BasicTypeInfo has no definition of serialVersionUID
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeinfo/NumericTypeInfo.java:[27,8] serializable class org.apache.flink.api.common.typeinfo.NumericTypeInfo has no definition of serialVersionUID
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeinfo/FractionalTypeInfo.java:[27,8] serializable class org.apache.flink.api.common.typeinfo.FractionalTypeInfo has no definition of serialVersionUID
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/distributions/SimpleDistribution.java:[38,34] found raw type: org.apache.flink.types.Key
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/distributions/SimpleDistribution.java:[56,34] found raw type: org.apache.flink.types.Key
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/distributions/SimpleDistribution.java:[62,45] found raw type: org.apache.flink.types.Key
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/distributions/SimpleDistribution.java:[78,55] found raw type: java.lang.Class
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/distributions/SimpleDistribution.java:[154,34] found raw type: org.apache.flink.types.Key
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/distributions/SimpleDistribution.java:[157,55] found raw type: java.lang.Class
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/distributions/SimpleDistribution.java:[170,47] found raw type: org.apache.flink.types.Key
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/GenericDataSourceBase.java:[49,17] found raw type: org.apache.flink.api.common.operators.GenericDataSourceBase.SplitDataProperties
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/GenericDataSourceBase.java:[184,28] unchecked conversion
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/GenericDataSinkBase.java:[154,97] found raw type: org.apache.flink.api.common.operators.Operator
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeinfo/IntegerTypeInfo.java:[27,8] serializable class org.apache.flink.api.common.typeinfo.IntegerTypeInfo has no definition of serialVersionUID
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeinfo/PrimitiveArrayTypeInfo.java:[124,25] found raw type: org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeinfo/PrimitiveArrayTypeInfo.java:[42,8] Class org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo overrides equals, but neither it nor any superclass overrides hashCode method
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java:[50,59] found raw type: java.lang.Class
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java:[243,76] found raw type: java.lang.Class
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java:[277,76] found raw type: java.lang.Class
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java:[309,76] found raw type: java.lang.Class
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/io/GenericCsvInputFormat.java:[323,48] found raw type: org.apache.flink.types.parser.FieldParser
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeutils/record/RecordComparatorFactory.java:[116,61] found raw type: java.lang.Class
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeutils/record/RecordComparator.java:[99,39] found raw type: org.apache.flink.types.Key
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeutils/record/RecordComparator.java:[100,48] found raw type: org.apache.flink.types.Key
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeutils/record/RecordComparator.java:[163,39] found raw type: org.apache.flink.types.Key
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeutils/record/RecordComparator.java:[164,48] found raw type: org.apache.flink.types.Key
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeutils/record/RecordComparator.java:[370,64] found raw type: java.lang.Class
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeutils/record/RecordComparator.java:[379,51] found raw type: org.apache.flink.types.Key
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/base/GroupCombineOperatorBase.java:[85,50] unchecked cast
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/core/memory/MemoryUtils.java:[33,37] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/core/memory/MemoryUtils.java:[42,32] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/core/memory/MemoryUtils.java:[44,53] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/core/memory/MemoryUtils.java:[46,41] sun.misc.Unsafe is internal proprietary API and may be removed in a future release
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/operators/base/SortPartitionOperatorBase.java:[74,83] unchecked conversion
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/EnumSerializer.java:[30,14] Class org.apache.flink.api.common.typeutils.base.EnumSerializer overrides equals, but neither it nor any superclass overrides hashCode method
    < [WARNING] bootstrap class path not set in conjunction with -source 1.6
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/operators/ReduceOperator.java:[44,50] org.apache.flink.api.java.record.functions.FunctionAnnotation in org.apache.flink.api.java.record.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/operators/CrossOperator.java:[34,50] org.apache.flink.api.java.record.functions.FunctionAnnotation in org.apache.flink.api.java.record.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/operators/CrossWithLargeOperator.java:[26,50] org.apache.flink.api.java.record.functions.CrossFunction in org.apache.flink.api.java.record.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/io/CsvInputFormat.java:[28,50] org.apache.flink.api.java.record.operators.FileDataSource in org.apache.flink.api.java.record.operators has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/operators/CoGroupOperator.java:[35,50] org.apache.flink.api.java.record.functions.CoGroupFunction in org.apache.flink.api.java.record.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/operators/CoGroupOperator.java:[36,50] org.apache.flink.api.java.record.functions.FunctionAnnotation in org.apache.flink.api.java.record.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/operators/JoinOperator.java:[33,50] org.apache.flink.api.java.record.functions.FunctionAnnotation in org.apache.flink.api.java.record.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/operators/JoinOperator.java:[34,50] org.apache.flink.api.java.record.functions.JoinFunction in org.apache.flink.api.java.record.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/functions/MapFunction.java:[22,45] org.apache.flink.api.common.functions.GenericCollectorMap in org.apache.flink.api.common.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/operators/CrossWithSmallOperator.java:[26,50] org.apache.flink.api.java.record.functions.CrossFunction in org.apache.flink.api.java.record.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/operators/MapOperator.java:[29,50] org.apache.flink.api.common.operators.base.CollectorMapOperatorBase in org.apache.flink.api.common.operators.base has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/operators/MapOperator.java:[33,50] org.apache.flink.api.java.record.functions.FunctionAnnotation in org.apache.flink.api.java.record.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/operators/MapOperator.java:[34,50] org.apache.flink.api.java.record.functions.MapFunction in org.apache.flink.api.java.record.functions has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/io/DelimitedOutputFormat.java:[25,50] org.apache.flink.api.java.record.operators.FileDataSink in org.apache.flink.api.java.record.operators has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/io/CsvOutputFormat.java:[28,50] org.apache.flink.api.java.record.operators.FileDataSink in org.apache.flink.api.java.record.operators has been deprecated
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple9Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple9
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/AvroSerializer.java:[122,50] unchecked cast
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple4Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple4
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple3Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple3
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple2Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple2
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/typeutils/AvroTypeInfo.java:[49,17] found raw type: org.apache.flink.api.common.typeinfo.TypeInformation
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/typeutils/AvroTypeInfo.java:[54,17] found raw type: org.apache.flink.api.java.typeutils.PojoTypeInfo
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/typeutils/AvroTypeInfo.java:[59,25] found raw type: org.apache.flink.api.common.typeinfo.TypeInformation
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/typeutils/AvroTypeInfo.java:[64,55] found raw type: org.apache.flink.api.java.typeutils.GenericTypeInfo
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/typeutils/AvroTypeInfo.java:[64,51] unchecked call to GenericTypeInfo(java.lang.Class<T>) as a member of the raw type org.apache.flink.api.java.typeutils.GenericTypeInfo
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/typeutils/AvroTypeInfo.java:[42,8] serializable class org.apache.flink.api.java.typeutils.AvroTypeInfo has no definition of serialVersionUID
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/typeutils/PojoField.java:[29,8] serializable class org.apache.flink.api.java.typeutils.PojoField has no definition of serialVersionUID
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple1Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple1
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/operators/JoinOperator.java:[1769,63] found raw type: org.apache.flink.api.common.typeinfo.TypeInformation
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/operators/ProjectOperator.java:[605,63] found raw type: org.apache.flink.api.common.typeinfo.TypeInformation
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/operators/AggregateOperator.java:[166,66] found raw type: org.apache.flink.api.java.aggregation.AggregationFunction
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/operators/CrossOperator.java:[1053,63] found raw type: org.apache.flink.api.common.typeinfo.TypeInformation
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/operators/translation/PlanUnwrappingSortedReduceGroupOperator.java:[75,77] unchecked call to combine(java.lang.Iterable<IN>,org.apache.flink.util.Collector<OUT>) as a member of the raw type org.apache.flink.api.common.functions.GroupCombineFunction
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/io/SplitDataProperties.java:[440,23] serializable class org.apache.flink.api.java.io.SplitDataProperties.SourcePartitionerMarker has no definition of serialVersionUID
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/io/SplitDataProperties.java:[440,23] Class org.apache.flink.api.java.io.SplitDataProperties.SourcePartitionerMarker overrides equals, but neither it nor any superclass overrides hashCode method
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/record/operators/ReduceOperator.java:[268,55] found raw type: java.lang.Class
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple12Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple12
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple13Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple13
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple14Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple14
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple15Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple15
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple17Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple17
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple16Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple16
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple19Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple19
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple18Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple18
    < [WARNING] /Users/max/Dev/flink-qa-bot/flink/tools/_qa_workdir/flink/flink-java/src/main/java/org/apache/flink/api/java/tuple/builder/Tuple21Builder.java:[44,43] found raw type: org.apache.flink.api.java.tuple.Tuple21
    ```
    :+1: The number of files in the lib/ folder was 2 before the change and is now 2
    :-1: The change contains @author tags
    QA-Check finished.
    Overall result: :-1:. Some tests failed. Please check the messages above.




[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896535
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straightforward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first, some basics; feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can first load the data as a `DataSet` of 11-field `String` tuples:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    // Columns: 0 = sample ID, 1-9 = features, 10 = class label
    +    LabeledVector(numList(10), DenseVector(numList.slice(1, 10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can then import the dataset using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM(env, "path/to/a8a")
    +val adultTest = MLUtils.readLibSVM(env, "path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    --- End diff --
    
    This will create a dense vector. Thus also `fvec` will be dense. Do we want that?
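
One way to avoid densifying is to change only the declared dimension of a sparse vector and leave its stored entries alone. The following is a minimal sketch of that idea in plain Scala. `SparseVec` and `padSparse` are hypothetical stand-ins introduced for illustration, not FlinkML classes; whether FlinkML's own sparse vector can be rebuilt with a new size in the same way is an assumption to check against the library.

```scala
// Hypothetical stand-in for a sparse vector type; not FlinkML's own class.
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) {
  require(indices.length == values.length, "indices and values must align")
  require(indices.forall(i => i >= 0 && i < size), "index out of bounds")

  /** Returns the value at position i, or 0.0 if it is not stored. */
  def apply(i: Int): Double = {
    val pos = indices.indexOf(i)
    if (pos >= 0) values(pos) else 0.0
  }
}

object SparsePadding {
  /** Widens a sparse vector to `targetSize` without materialising the zeros. */
  def padSparse(v: SparseVec, targetSize: Int): SparseVec = {
    require(targetSize >= v.size, "cannot shrink a vector by padding")
    // Only the declared size changes; the stored entries are reused as-is.
    v.copy(size = targetSize)
  }

  def main(args: Array[String]): Unit = {
    val short = SparseVec(120, Array(3, 57, 118), Array(1.0, 1.0, 1.0))
    val padded = padSparse(short, 123)
    println(s"size=${padded.size}, stored=${padded.indices.length}")
  }
}
```

With this approach a test example that stores three non-zero entries still stores three after padding to 123 dimensions, whereas a dense vector would hold all 123 values.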



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896733
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straightforward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first, some basics; feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values, often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can first load the data as a `DataSet` of 11-field `String` tuples:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    // Columns: 0 = sample ID, 1-9 = features, 10 = class label
    +    LabeledVector(numList(10), DenseVector(numList.slice(1, 10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can then import the dataset using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM(env, "path/to/a8a")
    +val adultTest = MLUtils.readLibSVM(env, "path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    --- End diff --
    
    Maybe we have to quickly explain the individual parameters we're setting here.
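
As a rough illustration of what most of these parameters control, here is a single-machine sketch in plain Scala. It deliberately swaps in a different technique, stochastic subgradient descent on the L2-regularised hinge loss, rather than the distributed solver behind FlinkML's `SVM`, so it only shows the role each setting plays. `HingeSgdSketch` and its methods are hypothetical names. `setBlocks` has no counterpart here: as far as the FlinkML documentation describes it, it sets how many blocks the input is split into for local optimisation.

```scala
import scala.util.Random

object HingeSgdSketch {
  /** Trains a linear classifier with stochastic subgradient steps on the
    * L2-regularised hinge loss. Labels must be +1.0 or -1.0. */
  def train(
      data: IndexedSeq[(Array[Double], Double)],
      iterations: Int,
      regularization: Double,
      stepsize: Double,
      seed: Long): Array[Double] = {
    val rnd = new Random(seed)            // seed: makes the sampling reproducible
    val w = new Array[Double](data.head._1.length)
    for (t <- 1 to iterations) {          // iterations: number of outer rounds
      val eta = stepsize / math.sqrt(t)   // stepsize: initial step, decayed over time
      for (_ <- data.indices) {
        val (x, y) = data(rnd.nextInt(data.length))
        val margin = y * dot(w, x)
        for (j <- w.indices) {
          // regularization: pulls the weights toward zero on every step
          val grad = regularization * w(j) - (if (margin < 1.0) y * x(j) else 0.0)
          w(j) -= eta * grad
        }
      }
    }
    w
  }

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0.0) 1.0 else -1.0

  def main(args: Array[String]): Unit = {
    // Tiny linearly separable toy set, invented for this sketch.
    val data = IndexedSeq(
      (Array(2.0, 1.0), 1.0), (Array(1.5, 2.0), 1.0),
      (Array(-1.0, -1.5), -1.0), (Array(-2.0, -0.5), -1.0))
    val w = train(data, iterations = 100, regularization = 0.001, stepsize = 0.1, seed = 42L)
    data.foreach { case (x, y) => println(s"label=$y predicted=${predict(w, x)}") }
  }
}
```

A larger regularization value keeps the weights smaller at the cost of a looser fit, and a larger step size speeds up early progress but can make later updates overshoot. A short note along these lines next to the `SVM()` call would probably be enough for the quickstart.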



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896948
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straightforward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first, some basics; feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error: $absoluteErrorSum")
    +
    +{% endhighlight %}
    +
    +Next we will see if we can improve the performance by pre-processing our data.
    +
    +## Data pre-processing and pipelines
    +
    +A pre-processing step that is often encouraged when using SVM classification is scaling
    +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest.
    +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to
    +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions
    +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML,
    +[here](pipelines.html).
    +
    +Let first create a scaling transformer for the features in our dataset, and chain it to a new SVM classifier.
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.preprocessing.StandardScaler
    +
    +val scaler = StandardScaler()
    +scaler.fit(adultTrain)
    --- End diff --
    
    We call `fit` later, so this is not needed here.
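
A minimal sketch of the point being made, assuming the chaining API (`chainPredictor`) that the guide goes on to introduce: fitting the chained pipeline trains the scaler and then the SVM, so a separate `scaler.fit` call is redundant. The two-row training set is a hypothetical stand-in for `adultTrain`.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.classification.SVM
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.preprocessing.StandardScaler

val env = ExecutionEnvironment.getExecutionEnvironment

// Hypothetical in-memory stand-in for adultTrain.
val train: DataSet[LabeledVector] = env.fromElements(
  LabeledVector(1.0, DenseVector(1.0, 200.0)),
  LabeledVector(-1.0, DenseVector(3.0, 100.0)))

val scaler = StandardScaler()
val svm = SVM().setBlocks(env.getParallelism)

// Chaining yields a pipeline. A single fit call first fits the scaler and
// then trains the SVM on the scaled data, so no scaler.fit(train) is needed.
val scaledSVM = scaler.chainPredictor(svm)
scaledSVM.fit(train)
```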



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32197381
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [1] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
    +(features) to a set of outputs. The learning is done using a *training set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems one the other hand, are about predicting (real) numerical
    +values, often called the dependent variable, for example what the temperature will be tomorrow.
    +
    +* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Linking with FlinkML
    +
    +In order to use FlinkML in you project, first you have to
    +[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
    +
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +
    +## Loading data
    +
    +To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +As an example, we can use Haberman's Survival Data Set , which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
    +This dataset *"contains cases from study conducted on the survival of patients who had undergone
    +surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
    +are the features and last column is the class, and the 4th column indicates whether the patient
    +survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI
    +page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data.
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.api.scala.ExecutionEnvironment
    +
    +val env = ExecutionEnvironment.createLocalEnvironment(2)
    +
    +val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
    +
    +{% endhighlight %}
    +
    +We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
    +is the class label, and the rest are features, so we can build `LabeledVector` elements like this:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.DenseVector
    +
    +val survivalLV = survival
    +  .map{tuple =>
    +    val list = tuple.productIterator.toList
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
    +  }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner. We will however use another dataset to exemplify
    +building a learner; that will allow us to show how we can import other dataset formats.
    +
    +**LibSVM files**
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    --- End diff --
    
    Maybe put MLUtils in backticks: `MLUtils`



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32197563
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
    --- End diff --
    
    Good catch, will change.

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31898968
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error: $absoluteErrorSum")
    +
    +{% endhighlight %}
    +
    +Next we will see if we can improve the performance by pre-processing our data.
    +
    +## Data pre-processing and pipelines
    +
    +A pre-processing step that is often encouraged when using SVM classification is scaling
    +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest.
    +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to
    +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions
    +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML,
    --- End diff --
    
    Will change wording
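
For context on why the wording matters: the quoted passage speaks of scaling features to the [0, 1] range but names `StandardScaler`, which, as far as its documentation goes, standardizes features to a target mean and standard deviation (0 and 1 by default). The two transformations differ, as this plain-Scala sketch shows. It does not use FlinkML; the helper names are hypothetical, and it assumes non-constant input and a population standard deviation, which may not match the library's choice.

```scala
object ScalingSketch {
  // Min-max scaling: maps the smallest value to 0 and the largest to 1.
  def minMax(xs: Seq[Double]): Seq[Double] = {
    val lo = xs.min
    val hi = xs.max
    xs.map(x => (x - lo) / (hi - lo))
  }

  // Standardization: shifts to mean 0 and rescales to unit standard deviation.
  // The results are not confined to [0, 1].
  def standardize(xs: Seq[Double]): Seq[Double] = {
    val mean = xs.sum / xs.size
    val std = math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
    xs.map(x => (x - mean) / std)
  }
}
```

For the input `Seq(1.0, 2.0, 3.0)`, `minMax` stays within [0, 1], while `standardize` yields values symmetric around zero that fall outside that range.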



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31909808
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    --- End diff --
    
    Hmm yes that would be great. Hope that it also works with `SparseVector`.
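
On the `SparseVector` point: whatever form the fix takes, padding a sparse vector with zeros should not need any new stored entries, only a larger declared size. A minimal plain-Scala sketch of that idea follows; `SparseVec` is a hypothetical stand-in and not FlinkML's actual `SparseVector` class.

```scala
object SparsePadSketch {
  // Hypothetical stand-in: `size` is the dimensionality, `indices` and `data`
  // hold the positions and values of the non-zero entries.
  final case class SparseVec(size: Int, indices: Vector[Int], data: Vector[Double])

  // Padding with zeros only raises the declared size; stored entries are untouched.
  def padTo(v: SparseVec, dim: Int): SparseVec =
    if (v.size >= dim) v else v.copy(size = dim)

  // Expand to a dense sequence, useful for checking the result.
  def toDense(v: SparseVec): Vector[Double] = {
    val lookup = v.indices.zip(v.data).toMap
    Vector.tabulate(v.size)(i => lookup.getOrElse(i, 0.0))
  }
}
```

Compared with the dense `padTo(123, 0.0)` in the quoted diff, this keeps the memory footprint proportional to the number of non-zero features.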



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896845
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error.
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error: $absoluteErrorSum")
    +
    +{% endhighlight %}
    +
    +Next we will see if we can improve the performance by pre-processing our data.
    +
    +## Data pre-processing and pipelines
    +
    +A pre-processing step that is often encouraged when using SVM classification is scaling
    +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest.
    +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to
    +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions
    +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML,
    --- End diff --
    
    Does one separate the `here` from the rest with a comma?



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32198235

--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
* This will be replaced by the TOC
{:toc}

Good catch

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/792#issuecomment-111068643
  
    Perfect, thanks. Will merge it now.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31897535
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    --- End diff --
    
    Good catch, will add.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197542

--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
* This will be replaced by the TOC
{:toc}

I am indeed very good at copy-pasting your code :P

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896965
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can first load the data as a `DataSet` of tuples of `String`s:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(10), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error.
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error: $absoluteErrorSum")
    +
    +{% endhighlight %}
    +
    +Next we will see if we can improve the performance by pre-processing our data.
    +
    +## Data pre-processing and pipelines
    +
    +A pre-processing step that is often encouraged when using SVM classification is scaling
    +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest.
    +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to
    +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions
    +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML,
    +[here](pipelines.html).
    +
    +Let's first create a scaling transformer for the features in our dataset, and chain it to a new SVM classifier.
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.preprocessing.StandardScaler
    +
    +val scaler = StandardScaler()
    +scaler.fit(adultTrain)
    +
    +val scaledSVM = scaler.chainPredictor(svm)
    +
    +{% endhighlight %}
    +
    +We can now use our newly created pipeline to make predictions on the test set. 
    +First we call fit again, to train the scaler and the SVM classifier.
    +The data of the test set will then be automatically scaled before being passed on to the SVM to 
    +make predictions.
    +
    +{% highlight scala %}
    +
    +scaledSVM.fit(adultTrain)
    +
    +val predictionPairsScaled = thresholdPredictions(scaledSVM.predict(adjustedTest))
    +
    +val absoluteErrorSumScaled = predictionPairsScaled.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error with scaled features: $absoluteErrorSumScaled")
    +
    +{% endhighlight %}
    +
    +The effect that the transformation has on the rror for this dataset is a bit unpredictable.
    --- End diff --
    
    rror = error
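
For readers following the evaluation step in the diff above, here is a minimal plain-Scala sketch of the same arithmetic, without any Flink dependency. The object and method names (`AbsoluteError`, `threshold`, `absoluteErrorSum`) and the sample scores are made up for illustration; only the thresholding rule and the absolute-error sum come from the quoted guide.

```scala
// Illustrative only: plain collections stand in for DataSet[(Double, Double)].
object AbsoluteError {
  // Map a raw decision value onto the {-1, 1} labels used by the dataset.
  def threshold(score: Double): Double = if (score > 0.0) 1.0 else -1.0

  // Sum of |truth - thresholded prediction| over all (truth, score) pairs.
  def absoluteErrorSum(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (truth, score) => math.abs(truth - threshold(score)) }.sum

  def main(args: Array[String]): Unit = {
    // Hypothetical (truth, raw score) pairs; the 2nd and 4th are misclassified.
    val pairs = Seq((1.0, 0.7), (-1.0, 0.2), (-1.0, -1.3), (1.0, -0.1))
    println(s"Absolute error: ${absoluteErrorSum(pairs)}")
  }
}
```

Because the labels are -1 and 1, every misclassified example contributes exactly 2 to the sum, so dividing the result by 2 gives the number of misclassified examples.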



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32197497
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    --- End diff --
    
    Good idea, will rephrase.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31899035
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can first load the data as a `DataSet` of tuples of `String`s:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(10), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error.
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error: $absoluteErrorSum")
    +
    +{% endhighlight %}
    +
    +Next we will see if we can improve the performance by pre-processing our data.
    +
    +## Data pre-processing and pipelines
    +
    +A pre-processing step that is often encouraged when using SVM classification is scaling
    +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest.
    +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to
    +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions
    +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML,
    +[here](pipelines.html).
    +
    +Let's first create a scaling transformer for the features in our dataset, and chain it to a new SVM classifier.
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.preprocessing.StandardScaler
    +
    +val scaler = StandardScaler()
    +scaler.fit(adultTrain)
    --- End diff --
    
    Will remove.
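
As background on why a separate `scaler.fit` call is redundant once the scaler is chained: fitting the pipeline learns the scaling statistics from the training data, and the same statistics are then applied to the test data. The plain-Scala sketch below illustrates that idea for zero-mean, unit-variance standardization. It is not FlinkML code; `Standardize`, `Stats`, `fit` and `transform` are hypothetical names, and mapping a constant feature to 0.0 is a choice made only for this example.

```scala
// Illustrative only: standardization with statistics learned on the training set.
object Standardize {
  final case class Stats(mean: Array[Double], std: Array[Double])

  // Learn the per-feature mean and (population) standard deviation.
  def fit(rows: Seq[Array[Double]]): Stats = {
    require(rows.nonEmpty, "need at least one training row")
    val n = rows.length.toDouble
    val dim = rows.head.length
    val mean = Array.tabulate(dim)(j => rows.map(_(j)).sum / n)
    val std = Array.tabulate(dim) { j =>
      math.sqrt(rows.map(r => math.pow(r(j) - mean(j), 2)).sum / n)
    }
    Stats(mean, std)
  }

  // Apply the training statistics to any row, including test rows.
  def transform(stats: Stats, row: Array[Double]): Array[Double] =
    row.indices.map { j =>
      // A constant feature has zero spread; map it to 0.0 instead of dividing by zero.
      if (stats.std(j) == 0.0) 0.0 else (row(j) - stats.mean(j)) / stats.std(j)
    }.toArray

  def main(args: Array[String]): Unit = {
    val train = Seq(Array(1.0, 10.0), Array(3.0, 10.0))
    val stats = fit(train)
    println(transform(stats, Array(3.0, 12.0)).mkString(", "))
  }
}
```

Note that standardization of this kind centers and rescales each feature; it does not confine values to a fixed range such as [0, 1].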



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31896248

--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
* This will be replaced by the TOC
{:toc}

Are the inputs called predictors?

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31897308
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can first load the data as a `DataSet` of tuples of `String`s:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(10), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    --- End diff --
    
    What do I have to import in order to use `.asBreeze`?
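
For what it's worth, the `asBreeze`/`fromBreeze` converters are, as far as I know, implicit conversions provided by the `org.apache.flink.ml.math.Breeze` object; please treat that package name as an assumption to verify against the FlinkML sources. The padding step itself does not need Breeze at all. A minimal plain-Scala sketch of the idea, using only the standard library (`PadFeatures` and `padFeatures` are names invented for this example):

```scala
// Illustrative only: pad a dense feature array with zeros to a fixed dimensionality.
object PadFeatures {
  def padFeatures(features: Array[Double], dim: Int): Array[Double] = {
    // Padding can only extend a vector; refuse inputs that are already too long.
    require(features.length <= dim,
      s"vector has ${features.length} entries, more than the target $dim")
    features.padTo(dim, 0.0)
  }

  def main(args: Array[String]): Unit = {
    val padded = padFeatures(Array(1.0, 0.0, 1.0), 5)
    println(padded.mkString(", "))
  }
}
```

In the quickstart the target dimensionality would be 123, matching the training set.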



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31897514
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    --- End diff --
    
    Will add reference



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31902243
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    --- End diff --
    
    My thoughts were that we would provide the whole thing as an example program, somewhere in the Flink examples. But I can add the imports needed here as well, by section.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32196879

--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
* This will be replaced by the TOC
{:toc}

your

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896996
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error: $absoluteErrorSum")
    +
    +{% endhighlight %}
    +
    +Next we will see if we can improve the performance by pre-processing our data.
    +
    +## Data pre-processing and pipelines
    +
    +A pre-processing step that is often encouraged when using SVM classification is scaling
    +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest.
    +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to
    +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions
    +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML,
    +[here](pipelines.html).
    +
    +Let first create a scaling transformer for the features in our dataset, and chain it to a new SVM classifier.
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.preprocessing.StandardScaler
    +
    +val scaler = StandardScaler()
    +scaler.fit(adultTrain)
    +
    +val scaledSVM = scaler.chainPredictor(svm)
    +
    +{% endhighlight %}
    +
    +We can now use our newly created pipeline to make predictions on the test set. 
    +First we call fit again, to train the scaler and the SVM classifier.
    +The data of the test set will then be automatically scaled before being passed on to the SVM to 
    +make predictions.
    +
    +{% highlight scala %}
    +
    +scaledSVM.fit(adultTrain)
    +
    +val predictionPairsScaled= thresholdPredictions(scaledSVM.predict(predictionsScaled))
    +
    +val absoluteErrorSumScaled = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error with scaled features: $absoluteErrorSumScaled")
    +
    +{% endhighlight %}
    +
    +The effect that the transformation has on the rror for this dataset is a bit unpredictable.
    +In reality the scaling transformation does
    +not fit the dataset we are using, since the features are translated categorical features and as
    +such, operations like normalization and standard scaling do not make much sense.
    --- End diff --
    
    Hmm, maybe we should try to find a data set where scaling actually helps us. Otherwise users might ask themselves why they should use this feature at all.
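
To make the concern concrete, here is a minimal plain-Scala sketch of what standard scaling does to a single feature column (subtract the mean, divide by the standard deviation). It does not use FlinkML's `StandardScaler`; the `standardize` helper and the sample values are made up for illustration, and the column is assumed to be non-empty.

```scala
object ScalingSketch {
  // Standardizes a column to zero mean and unit (population) standard deviation.
  // A constant column is only centered, to avoid dividing by zero.
  def standardize(column: Seq[Double]): Seq[Double] = {
    val n = column.size.toDouble
    val mean = column.sum / n
    val variance = column.map(x => (x - mean) * (x - mean)).sum / n
    val std = math.sqrt(variance)
    if (std == 0.0) column.map(_ - mean) else column.map(x => (x - mean) / std)
  }

  def main(args: Array[String]): Unit = {
    val age = Seq(25.0, 38.0, 52.0, 61.0)     // hypothetical numeric feature
    val isMarried = Seq(0.0, 1.0, 1.0, 0.0)   // hypothetical 0/1 indicator feature
    println(standardize(age))
    println(standardize(isMarried))
  }
}
```

The indicator column still takes only two distinct values after scaling, which is why a dataset with genuinely numeric features would probably show the benefit of the scaler more clearly.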



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31902179
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error: $absoluteErrorSum")
    +
    +{% endhighlight %}
    +
    +Next we will see if we can improve the performance by pre-processing our data.
    +
    +## Data pre-processing and pipelines
    +
    +A pre-processing step that is often encouraged when using SVM classification is scaling
    +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest.
    +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to
    +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions
    +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML,
    +[here](pipelines.html).
    +
    +Let first create a scaling transformer for the features in our dataset, and chain it to a new SVM classifier.
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.preprocessing.StandardScaler
    +
    +val scaler = StandardScaler()
    +scaler.fit(adultTrain)
    +
    +val scaledSVM = scaler.chainPredictor(svm)
    +
    +{% endhighlight %}
    +
    +We can now use our newly created pipeline to make predictions on the test set. 
    +First we call fit again, to train the scaler and the SVM classifier.
    +The data of the test set will then be automatically scaled before being passed on to the SVM to 
    +make predictions.
    +
    +{% highlight scala %}
    +
    +scaledSVM.fit(adultTrain)
    +
    +val predictionPairsScaled= thresholdPredictions(scaledSVM.predict(predictionsScaled))
    +
    +val absoluteErrorSumScaled = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error with scaled features: $absoluteErrorSumScaled")
    +
    +{% endhighlight %}
    +
    +The effect that the transformation has on the rror for this dataset is a bit unpredictable.
    --- End diff --
    
    Good catch.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r31897546

--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +24,198 @@ under the License.
* This will be replaced by the TOC
{:toc}

Yup, will remove

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896876
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(11), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a9a) dataset. You can download the 
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error.
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error: $absoluteErrorSum")
    +
    +{% endhighlight %}
    +
    +Next we will see if we can improve the performance by pre-processing our data.
    +
    +## Data pre-processing and pipelines
    +
    +A pre-processing step that is often encouraged when using SVM classification is scaling
    +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest.
    +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to
    +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions
    +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML,
    +[here](pipelines.html).
    +
    +Let first create a scaling transformer for the features in our dataset, and chain it to a new SVM classifier.
    --- End diff --
    
    Let us?



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31896759
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(10), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error.
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    --- End diff --
    
    Maybe we can also print the results/error here.
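
A minimal sketch of what printing the error could look like, using plain Scala collections in place of the `DataSet` returned by `svm.predict` so that it runs without Flink. The `pairs` values and the helper names (`threshold`, `absoluteErrorSum`, `errorRate`) are made up for illustration and are not output from the actual model. Because the labels are in {-1, 1}, each misclassified example adds 2 to the absolute error, so the sum can also be reported as an error rate.

```scala
object AbsoluteErrorSketch {
  // Map a raw decision value to the {-1, 1} label scale (same rule as thresholdPredictions).
  def threshold(rawPrediction: Double): Double =
    if (rawPrediction > 0.0) 1.0 else -1.0

  // Sum of |truth - thresholded prediction| over all (truth, rawPrediction) pairs.
  def absoluteErrorSum(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (truth, raw) => math.abs(truth - threshold(raw)) }.sum

  // Each mistake contributes 2.0, so the misclassification rate is sum / (2 * n).
  def errorRate(pairs: Seq[(Double, Double)]): Double =
    if (pairs.isEmpty) 0.0 else absoluteErrorSum(pairs) / (2.0 * pairs.size)

  def main(args: Array[String]): Unit = {
    // Hypothetical (truth, raw prediction) pairs, not real model output.
    val pairs = Seq((1.0, 0.7), (-1.0, 0.2), (-1.0, -1.3), (1.0, -0.1))
    println(s"Absolute error: ${absoluteErrorSum(pairs)}")
    println(s"Error rate: ${errorRate(pairs)}")
  }
}
```

In the guide itself the same `println` calls would follow the `collect()` on the prediction pairs.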



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197169

--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
* This will be replaced by the TOC
{:toc}

Missing closing parenthesis of the link.

[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r31900406
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +24,198 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML)
    +
    +As defined by Murphy [cite ML-APP] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* Supervised Learning deals with learning a function (mapping) from a set of inputs
    +(predictors) to a set of outputs. The learning is done using a __training set__ of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the __class__ that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems are about predicting (real) numerical values,  often called the dependent
    +variable, for example what the temperature will be tomorrow.
    +
    +* Unsupervised learning deals with discovering patterns and regularities in the data. An example
    +of this would be __clustering__, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Loading data
    +
    +For loading data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +# TODO: Get dataset that has separate train and test sets
    +As an example, we can use the Breast Cancer Wisconsin (Diagnostic) Data Set, which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data).
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
    +
    +{% endhighlight %}
    +
    +The dataset has some missing values indicated by `?`. We can filter those rows out and
    +then transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms.
    +
    +{% highlight scala %}
    +
    +val cancerLV = cancer
    +  .map(_.productIterator.toList)
    +  .filter(!_.contains("?"))
    +  .map{list =>
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(10), DenseVector(numList.take(10).toArray))
    +    }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner.
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the Adult (a8a) dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/a8a.t).
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +val adultTrain = MLUtils.readLibSVM("path/to/a8a")
    +val adultTest = MLUtils.readLibSVM("path/to/a8a.t")
    +
    +{% endhighlight %}
    +
    +This gives us a `DataSet[LabeledVector]` that we will use in the following section to create a classifier.
    +
    +Due to an error in the test dataset we have to adjust the test data using the following code, to 
    +ensure that the dimensionality of all test examples is 123, as with the training set:
    +
    +{% highlight scala %}
    +
    +val adjustedTest = adultTest.map{lv =>
    +      val vec = lv.vector.asBreeze
    +      val padded = vec.padTo(123, 0.0).toDenseVector
    +      val fvec = padded.fromBreeze
    +      LabeledVector(lv.label, fvec)
    +    }
    +
    +{% endhighlight %}
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +
    +{% highlight scala %}
    +
    +val svm = SVM()
    +  .setBlocks(env.getParallelism)
    +  .setIterations(100)
    +  .setRegularization(0.001)
    +  .setStepsize(0.1)
    +  .setSeed(42)
    +
    +svm.fit(adultTrain)
    +
    +{% endhighlight %}
    +
    +Let's now make predictions on the test set and see how well we do in terms of absolute error.
    +We will also create a function that thresholds the predictions to the {-1, 1} scale that the
    +dataset uses.
    +
    +{% highlight scala %}
    +
    +def thresholdPredictions(predictions: DataSet[(Double, Double)])
    +: DataSet[(Double, Double)] = {
    +  predictions.map {
    +    truthPrediction =>
    +      val truth = truthPrediction._1
    +      val prediction = truthPrediction._2
    +      val thresholdedPrediction = if (prediction > 0.0) 1.0 else -1.0
    +      (truth, thresholdedPrediction)
    +  }
    +}
    +
    +val predictionPairs = thresholdPredictions(svm.predict(adjustedTest))
    +
    +val absoluteErrorSum = predictionPairs.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error: $absoluteErrorSum")
    +
    +{% endhighlight %}
    +
    +Next we will see if we can improve the performance by pre-processing our data.
    +
    +## Data pre-processing and pipelines
    +
    +A pre-processing step that is often encouraged when using SVM classification is scaling
    +the input features to the [0, 1] range, in order to avoid features with extreme values dominating the rest.
    +FlinkML has a number of `Transformers` such as `StandardScaler` that are used to pre-process data, and a key feature is the ability to
    +chain `Transformers` and `Predictors` together. This allows us to run the same pipeline of transformations and make predictions
    +on the train and test data in a straight-forward and type-safe manner. You can read more on the pipeline system of FlinkML,
    +[here](pipelines.html).
    +
    +Let first create a scaling transformer for the features in our dataset, and chain it to a new SVM classifier.
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.preprocessing.StandardScaler
    +
    +val scaler = StandardScaler()
    +scaler.fit(adultTrain)
    +
    +val scaledSVM = scaler.chainPredictor(svm)
    +
    +{% endhighlight %}
    +
    +We can now use our newly created pipeline to make predictions on the test set. 
    +First we call fit again, to train the scaler and the SVM classifier.
    +The data of the test set will then be automatically scaled before being passed on to the SVM to 
    +make predictions.
    +
    +{% highlight scala %}
    +
    +scaledSVM.fit(adultTrain)
    +
    +val predictionPairsScaled = thresholdPredictions(scaledSVM.predict(adjustedTest))
    +
    +val absoluteErrorSumScaled = predictionPairsScaled.collect().map{
    +  case (truth, prediction) => Math.abs(truth - prediction)}.sum
    +
    +println(s"Absolute error with scaled features: $absoluteErrorSumScaled")
    +
    +{% endhighlight %}
    +
    +The effect that the transformation has on the error for this dataset is a bit unpredictable.
    +In reality the scaling transformation does
    +not fit the dataset we are using, since the features are translated categorical features and as
    +such, operations like normalization and standard scaling do not make much sense.
    --- End diff --
    
    My thoughts were on using the astroparticle dataset from the "Practical guide to SVM classification". If we merge the MinMaxScaler soon I can use that one.
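
For reference, a rough sketch of the two scalings being discussed, written on plain Scala sequences rather than FlinkML types. `minMaxScale` and `standardize` are hypothetical helper names used only here; they are not the API of `MinMaxScaler` or `StandardScaler`, and the input values are made up. Min-max scaling maps a non-empty feature column to [0, 1], which is what the guide text describes, while standardization gives zero mean and unit variance and is not bounded.

```scala
object ScalingSketch {
  // Rescale one feature column linearly so that min -> 0.0 and max -> 1.0.
  // A constant column is mapped to all zeros to avoid dividing by zero.
  def minMaxScale(column: Seq[Double]): Seq[Double] = {
    val lo = column.min
    val hi = column.max
    if (hi == lo) column.map(_ => 0.0)
    else column.map(x => (x - lo) / (hi - lo))
  }

  // Shift and scale one feature column to zero mean and unit (population) standard deviation.
  def standardize(column: Seq[Double]): Seq[Double] = {
    val mean = column.sum / column.size
    val variance = column.map(x => (x - mean) * (x - mean)).sum / column.size
    val std = math.sqrt(variance)
    if (std == 0.0) column.map(_ => 0.0)
    else column.map(x => (x - mean) / std)
  }

  def main(args: Array[String]): Unit = {
    val feature = Seq(2.0, 4.0, 6.0, 10.0) // made-up values
    println(minMaxScale(feature))
    println(standardize(feature))
  }
}
```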



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32197549
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    +quick-start guide we will show just how easy it is to solve a simple supervised learning problem
    +using FlinkML. But first some basics, feel free to skip the next few lines if you're already
    +familiar with Machine Learning (ML).
    +
    +As defined by Murphy [1] ML deals with detecting patterns in data, and using those
    +learned patterns to make predictions about the future. We can categorize most ML algorithms into
    +two major categories: Supervised and Unsupervised Learning.
    +
    +* **Supervised Learning** deals with learning a function (mapping) from a set of inputs
    +(features) to a set of outputs. The learning is done using a *training set* of (input,
    +output) pairs that we use to approximate the mapping function. Supervised learning problems are
    +further divided into classification and regression problems. In classification problems we try to
    +predict the *class* that an example belongs to, for example whether a user is going to click on
    +an ad or not. Regression problems, on the other hand, are about predicting (real) numerical
    +values, often called the dependent variable, for example what the temperature will be tomorrow.
    +
    +* **Unsupervised Learning** deals with discovering patterns and regularities in the data. An example
    +of this would be *clustering*, where we try to discover groupings of the data from the
    +descriptive features. Unsupervised learning can also be used for feature selection, for example
    +through [principal components analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
    +
    +## Linking with FlinkML
    +
    +In order to use FlinkML in your project, first you have to
    +[set up a Flink program](http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#linking-with-flink).
    +Next, you have to add the FlinkML dependency to the `pom.xml` of your project:
    +
    +{% highlight xml %}
    +<dependency>
    +  <groupId>org.apache.flink</groupId>
    +  <artifactId>flink-ml</artifactId>
    +  <version>{{site.version }}</version>
    +</dependency>
    +{% endhighlight %}
    +
    +## Loading data
    +
    +To load data to be used with FlinkML we can use the ETL capabilities of Flink, or specialized
    +functions for formatted data, such as the LibSVM format. For supervised learning problems it is
    +common to use the `LabeledVector` class to represent the `(features, label)` examples. A `LabeledVector`
    +object will have a FlinkML `Vector` member representing the features of the example and a `Double`
    +member which represents the label, which could be the class in a classification problem, or the dependent
    +variable for a regression problem.
    +
    +As an example, we can use Haberman's Survival Data Set , which you can
    +[download from the UCI ML repository](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data.
    +This dataset *"contains cases from study conducted on the survival of patients who had undergone
    +surgery for breast cancer"*. The data comes in a comma-separated file, where the first 3 columns
    +are the features and last column is the class, and the 4th column indicates whether the patient
    +survived 5 years or longer (label 1), or died within 5 years (label 2). You can check the [UCI
    +page](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival) for more information on the data.
    +
    +We can load the data as a `DataSet[String]` first:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.api.scala.ExecutionEnvironment
    +
    +val env = ExecutionEnvironment.createLocalEnvironment(2)
    +
    +val survival = env.readCsvFile[(String, String, String, String)]("/path/to/haberman.data")
    +
    +{% endhighlight %}
    +
    +We can now transform the data into a `DataSet[LabeledVector]`. This will allow us to use the
    +dataset with the FlinkML classification algorithms. We know that the 4th element of the dataset
    +is the class label, and the rest are features, so we can build `LabeledVector` elements like this:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.DenseVector
    +
    +val survivalLV = survival
    +  .map{tuple =>
    +    val list = tuple.productIterator.toList
    +    val numList = list.map(_.asInstanceOf[String].toDouble)
    +    LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
    +  }
    +
    +{% endhighlight %}
    +
    +We can then use this data to train a learner. We will however use another dataset to exemplify
    +building a learner; that will allow us to show how we can import other dataset formats.
    +
    +**LibSVM files**
    +
    +A common format for ML datasets is the LibSVM format and a number of datasets using that format can be
    +found [in the LibSVM datasets website](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). FlinkML provides utilities for loading
    +datasets using the LibSVM format through the `readLibSVM` function available through the MLUtils object.
    +You can also save datasets in the LibSVM format using the `writeLibSVM` function.
    +Let's import the svmguide1 dataset. You can download the
    +[training set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1)
    +and the [test set here](http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/svmguide1.t).
    +This is an astroparticle binary classification dataset, used by Hsu et al. [3] in their practical
    +Support Vector Machine (SVM) guide. It contains 4 numerical features, and the class label.
    +
    +We can simply import the dataset then using:
    +
    +{% highlight scala %}
    +
    +import org.apache.flink.ml.MLUtils
    +
    +val astroTrain = MLUtils.readLibSVM("/path/to/svmguide1")
    +val astroTest = MLUtils.readLibSVM("/path/to/svmguide1.t")
    +
    +{% endhighlight %}
    +
    +This gives us two `DataSet[LabeledVector]` objects that we will use in the following section to
    +create a classifier.
    +
    +## Classification
    +
    +Once we have imported the dataset we can train a `Predictor` such as a linear SVM classifier.
    +We can set a number of parameters for the classifier. Here we set the `Blocks` parameter,
    +which is used to split the input by the underlying CoCoA algorithm [2] uses. The regularization
    --- End diff --
    
    Same here with [2]



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/792#discussion_r32196821
  
    --- Diff: docs/libs/ml/quickstart.md ---
    @@ -24,4 +25,214 @@ under the License.
     * This will be replaced by the TOC
     {:toc}
     
    -Coming soon.
    +## Introduction
    +
    +FlinkML is designed to make learning from your data a straight-forward process, abstracting away
    +the complexities that usually come with having to deal with big data learning tasks. In this
    --- End diff --
    
    Maybe only "come with big data learning tasks"?



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by thvasilo <gi...@git.apache.org>.

Github user thvasilo commented on the pull request:

    https://github.com/apache/flink/pull/792#issuecomment-111043422
  
    Addressed the last PR comments.



[GitHub] flink pull request: [FLINK-2072] [ml] [docs] Add a quickstart guid...

Posted by tillrohrmann <gi...@git.apache.org>.

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/792#discussion_r32197272

--- Diff: docs/libs/ml/quickstart.md ---
@@ -24,4 +25,214 @@ under the License.
* This will be replaced by the TOC
{:toc}

Why creating a local environment? Why not using `ExecutionEnvironment.getExecutionEnvironment`?