Posted to issues@flink.apache.org by fobeligi <gi...@git.apache.org> on 2015/06/06 00:25:59 UTC

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

GitHub user fobeligi opened a pull request:

    https://github.com/apache/flink/pull/798

    [Flink 1844] Add Normaliser to ML library

    Adds a MinMaxScaler to the ML preprocessing package. MinMax scaler scales the values to a user-specified range.
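
As a rough illustration of what scaling to a user-specified range means, the sketch below applies the formula from this pull request's documentation, z = (x - x_min) / (x_max - x_min) * (max - min) + min, to a plain Scala sequence. It uses neither Flink nor Breeze. The names `MinMaxSketch` and `scale` are hypothetical, and mapping a constant input to the lower bound is an assumption of this sketch, not something the pull request states.

```scala
object MinMaxSketch {

  /** Scales every value of `xs` into the closed range [lo, hi].
    *
    * Hypothetical helper for illustration only; it is not part of the PR.
    */
  def scale(xs: Seq[Double], lo: Double = 0.0, hi: Double = 1.0): Seq[Double] = {
    require(xs.nonEmpty, "input must not be empty")
    val xMin = xs.min
    val xMax = xs.max
    val span = xMax - xMin
    // Assumption of this sketch: a constant input (span == 0) maps to `lo`
    // so that we never divide by zero.
    xs.map(x => if (span == 0.0) lo else (x - xMin) / span * (hi - lo) + lo)
  }
}
```

With the default range, `MinMaxSketch.scale(Seq(1.0, 2.0, 3.0))` yields `Seq(0.0, 0.5, 1.0)`; with `lo = -1.0` and `hi = 1.0` the same input yields `Seq(-1.0, 0.0, 1.0)`.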

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/fobeligi/incubator-flink FLINK-1844

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/798.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #798
    
----
commit 802b9da07a2c3f7c055b4c024aaecbbe647db1cd
Author: fobeligi <fa...@gmail.com>
Date:   2015-06-05T21:12:43Z

    [FLINK-1844] Add MinMaxScaler implementation in the preprocessing package, tests for the corresponding functionality, and documentation.

commit e639185108f9bda253e296bae4c6c4269a30d1d0
Author: fobeligi <fa...@gmail.com>
Date:   2015-06-05T22:12:33Z

    [FLINK-1844] Change second test to use LabeledVectors instead of Vectors

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31895095
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,255 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = (0,1).
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    +
    +  /** Sets the minimum for the range of the transformed data
    +    *
    +    * @param min the user-specified minimum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMin(min: Double): MinMaxScaler = {
    +    parameters.add(Min, min)
    +    this
    +  }
    +
    +  /** Sets the maximum for the range of the transformed data
    +    *
    +    * @param max the user-specified maximum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMax(max: Double): MinMaxScaler = {
    +    parameters.add(Max, max)
    +    this
    +  }
    +}
    +
    +object MinMaxScaler {
    +
    +  // ====================================== Parameters =============================================
    +
    +  case object Min extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(0.0)
    +  }
    +
    +  case object Max extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(1.0)
    +  }
    +
    +  // ==================================== Factory methods ==========================================
    +
    +  def apply(): MinMaxScaler = {
    +    new MinMaxScaler()
    +  }
    +
    +  // ====================================== Operations =============================================
    +
    +  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and
    +    * maximum of each feature of the training data. These values are used in the transform step
    +    * to transform the given input data.
    +    *
    +    * @tparam T Input data type which is a subtype of [[Vector]]
    +    * @return
    +    */
    +  implicit def fitVectorMinMaxScaler[T <: Vector] = new FitOperation[MinMaxScaler, T] {
    +    override def fit(instance: MinMaxScaler, fitParameters: ParameterMap, input: DataSet[T])
    +    : Unit = {
    +      val metrics = extractFeatureMinMaxVectors(input)
    +
    +      instance.metricsOption = Some(metrics)
    +    }
    +  }
    +
    +  /** Trains the [[MinMaxScaler]] by learning the minimum and maximum of the features of the
    +    * training data which is of type [[LabeledVector]]. The minimum and maximum are used to
    +    * transform the given input data.
    +    *
    +    */
    +  implicit val fitLabeledVectorMinMaxScaler = {
    +    new FitOperation[MinMaxScaler, LabeledVector] {
    +      override def fit(
    +        instance: MinMaxScaler,
    +        fitParameters: ParameterMap,
    +        input: DataSet[LabeledVector])
    +      : Unit = {
    +        val vectorDS = input.map(_.vector)
    +        val metrics = extractFeatureMinMaxVectors(vectorDS)
    +
    +        instance.metricsOption = Some(metrics)
    +      }
    +    }
    +  }
    +
    +  /** Calculates in one pass over the data the features' minimum and maximum values.
    +    *
    +    *
    +    * @param dataSet The data set for which we want to calculate the minimum and maximum values.
    +    * @return  DataSet containing a single tuple of two vectors (minVector, maxVector).
    +    *          The first vector represents the minimum values vector and the second is the maximum
    +    *          values vector.
    +    */
    +  private def extractFeatureMinMaxVectors[T <: Vector](dataSet: DataSet[T])
    +  : DataSet[(linalg.Vector[Double], linalg.Vector[Double])] = {
    +
    +    val minMax = dataSet.map {
    +      v => (v.asBreeze, v.asBreeze)
    +    }.reduce {
    +      (minMax1, minMax2) => {
    +
    +        val tempMinimum = linalg.Vector.zeros[Double](minMax1._1.length)
    +
    +        for (i <- 0 until minMax1._1.length) {
    +          tempMinimum(i) = if (minMax1._1(i) < minMax2._1(i)) {
    +            minMax1._1(i)
    +          } else {
    +            minMax2._1(i)
    +          }
    +        }
    --- End diff --
    
    Using Breeze you can simply write `tempMin = min(minMax1, minMax2)`
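
The suggestion above can be sketched as follows. The snippet in the comment is shorthand; presumably the call is meant for the first components of the two tuples, with `max` used the same way for the second components. This is a minimal sketch, assuming Breeze is on the classpath and that `breeze.linalg.min` and `breeze.linalg.max` accept two vectors and work element-wise. `MinMaxReduceSketch`, `merge` and `featureMinMax` are hypothetical names, and a local `Seq` stands in for the Flink `DataSet`.

```scala
import breeze.linalg.{DenseVector, max, min}

object MinMaxReduceSketch {

  // A running (minimum, maximum) pair, one entry per feature.
  type MinMax = (DenseVector[Double], DenseVector[Double])

  /** Merges two (minimum, maximum) pairs feature by feature.
    * The element-wise `min`/`max` calls replace the explicit index loop.
    */
  def merge(a: MinMax, b: MinMax): MinMax =
    (min(a._1, b._1), max(a._2, b._2))

  /** Reduces a non-empty collection of vectors to per-feature (min, max).
    * Each vector starts out as both its own minimum and its own maximum.
    */
  def featureMinMax(vectors: Seq[DenseVector[Double]]): MinMax =
    vectors.map(v => (v, v)).reduce(merge)
}
```

For the vectors `(1.0, 5.0)`, `(3.0, 2.0)` and `(0.0, 4.0)`, `featureMinMax` returns the pair `(0.0, 2.0)` and `(3.0, 5.0)`.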



[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31894783
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,255 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = (0,1).
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    +
    +  /** Sets the minimum for the range of the transformed data
    +    *
    +    * @param min the user-specified minimum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMin(min: Double): MinMaxScaler = {
    +    parameters.add(Min, min)
    +    this
    +  }
    +
    +  /** Sets the maximum for the range of the transformed data
    +    *
    +    * @param max the user-specified maximum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMax(max: Double): MinMaxScaler = {
    +    parameters.add(Max, max)
    +    this
    +  }
    +}
    +
    +object MinMaxScaler {
    +
    +  // ====================================== Parameters =============================================
    +
    +  case object Min extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(0.0)
    +  }
    +
    +  case object Max extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(1.0)
    +  }
    +
    +  // ==================================== Factory methods ==========================================
    +
    +  def apply(): MinMaxScaler = {
    +    new MinMaxScaler()
    +  }
    +
    +  // ====================================== Operations =============================================
    +
    +  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and
    +    * maximum of each feature of the training data. These values are used in the transform step
    +    * to transform the given input data.
    +    *
    +    * @tparam T Input data type which is a subtype of [[Vector]]
    +    * @return
    +    */
    +  implicit def fitVectorMinMaxScaler[T <: Vector] = new FitOperation[MinMaxScaler, T] {
    +    override def fit(instance: MinMaxScaler, fitParameters: ParameterMap, input: DataSet[T])
    +    : Unit = {
    +      val metrics = extractFeatureMinMaxVectors(input)
    +
    +      instance.metricsOption = Some(metrics)
    +    }
    +  }
    +
    +  /** Trains the [[MinMaxScaler]] by learning the minimum and maximum of the features of the
    +    * training data which is of type [[LabeledVector]]. The minimum and maximum are used to
    +    * transform the given input data.
    +    *
    +    */
    +  implicit val fitLabeledVectorMinMaxScaler = {
    +    new FitOperation[MinMaxScaler, LabeledVector] {
    +      override def fit(
    +        instance: MinMaxScaler,
    +        fitParameters: ParameterMap,
    +        input: DataSet[LabeledVector])
    +      : Unit = {
    +        val vectorDS = input.map(_.vector)
    +        val metrics = extractFeatureMinMaxVectors(vectorDS)
    +
    +        instance.metricsOption = Some(metrics)
    +      }
    +    }
    +  }
    +
    +  /** Calculates in one pass over the data the features' minimum and maximum values.
    +    *
    +    *
    --- End diff --
    
    two line breaks.



[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31921435
  
    --- Diff: docs/libs/ml/minMax_scaler.md ---
    @@ -0,0 +1,112 @@
    +---
    +mathjax: include
    +htmlTitle: FlinkML - MinMax Scaler
    +title: <a href="../ml">FlinkML</a> - MinMax Scaler
    +---
    +<!--
    +Licensed to the Apache Software Foundation (ASF) under one
    +or more contributor license agreements.  See the NOTICE file
    +distributed with this work for additional information
    +regarding copyright ownership.  The ASF licenses this file
    +to you under the Apache License, Version 2.0 (the
    +"License"); you may not use this file except in compliance
    +with the License.  You may obtain a copy of the License at
    +
    +  http://www.apache.org/licenses/LICENSE-2.0
    +
    +Unless required by applicable law or agreed to in writing,
    +software distributed under the License is distributed on an
    +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +KIND, either express or implied.  See the License for the
    +specific language governing permissions and limitations
    +under the License.
    +-->
    +
    +* This will be replaced by the TOC
    +{:toc}
    +
    +## Description
    +
    + The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max].
    + In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval.
    + Given a set of input data $x_1, x_2,... x_n$, with minimum value:
    +
    + $$x_{min} = min({x_1, x_2,..., x_n})$$
    +
    + and maximum value:
    +
    + $$x_{max} = max({x_1, x_2,..., x_n})$$
    +
    +The scaled data set $z_1, z_2,...,z_n$ will be:
    +
    + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$
    +
    +where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale.
    +
    +## Operations
    +
    +`MinMaxScaler` is a `Transformer`.
    +As such, it supports the `fit` and `transform` operation.
    +
    +### Fit
    +
    +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:
    +
    +* `fit[T <: Vector]: DataSet[T] => Unit`
    +* `fit: DataSet[LabeledVector] => Unit`
    +
    +### Transform
    +
    +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
    +
    +* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
    +* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
    +
    +## Parameters
    +
    +The MinMax scaler implementation can be controlled by the following two parameters:
    +
    + <table class="table table-bordered">
    +  <thead>
    +    <tr>
    +      <th class="text-left" style="width: 20%">Parameters</th>
    +      <th class="text-center">Description</th>
    +    </tr>
    +  </thead>
    +
    +  <tbody>
    +    <tr>
    +      <td><strong>Min</strong></td>
    +      <td>
    +        <p>
    +          The minimum value of the range for the scaled data set. (Default value: <strong>0.0</strong>)
    +        </p>
    +      </td>
    +    </tr>
    +    <tr>
    +      <td><strong>Max</strong></td>
    +      <td>
    +        <p>
    +          The maximum value of the range for the scaled data set. (Default value: <strong>1.0</strong>)
    +        </p>
    +      </td>
    +    </tr>
    +  </tbody>
    +</table>
    +
    +## Examples
    +
    +{% highlight scala %}
    +// Create MinMax scaler transformer
    +val minMaxscaler = MinMaxScaler()
    +.setMin(-1.0)
    --- End diff --
    
    Will address this when merging.



[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by fobeligi <gi...@git.apache.org>.
Github user fobeligi commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31913634
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    --- End diff --
    
    I use the metricsOption vectors internally in the transformer for element-wise subtraction and division, so instead of converting between Breeze and flink.ml.math.Vector I keep them as breeze.linalg.Vector.
    Can I perform the same operations with flink.ml.math.Vector, or do you think it would be better to do the conversions (to/from Breeze vectors) inside the functions?



[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31921716
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    +
    +  /** Sets the minimum for the range of the transformed data
    +    *
    +    * @param min the user-specified minimum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMin(min: Double): MinMaxScaler = {
    +    parameters.add(Min, min)
    +    this
    +  }
    +
    +  /** Sets the maximum for the range of the transformed data
    +    *
    +    * @param max the user-specified maximum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMax(max: Double): MinMaxScaler = {
    +    parameters.add(Max, max)
    +    this
    +  }
    +}
    +
    +object MinMaxScaler {
    +
    +  // ====================================== Parameters =============================================
    +
    +  case object Min extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(0.0)
    +  }
    +
    +  case object Max extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(1.0)
    +  }
    +
    +  // ==================================== Factory methods ==========================================
    +
    +  def apply(): MinMaxScaler = {
    +    new MinMaxScaler()
    +  }
    +
    +  // ====================================== Operations =============================================
    +
    +  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and
    +    * maximum of each feature of the training data. These values are used in the transform step
    +    * to transform the given input data.
    +    *
    +    * @tparam T Input data type which is a subtype of [[Vector]]
    +    * @return
    --- End diff --
    
    Well, the function's return type is a `FitOperation`, so we should add a description to the `@return` tag. Will do it when merging.



[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31895797
  
    --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala ---
    @@ -0,0 +1,180 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{DenseVector, Vector}
    +import org.apache.flink.test.util.FlinkTestBase
    +import org.scalatest.{FlatSpec, Matchers}
    +
    +
    +class MinMaxScalerITSuite
    +  extends FlatSpec
    +  with Matchers
    +  with FlinkTestBase {
    +
    +  behavior of "Flink's MinMax Scaler"
    +
    +  import MinMaxScalerData._
    +
    +  it should "scale the vectors' values to be restricted in the (0.0,1.0) range" in {
    +
    +    val env = ExecutionEnvironment.getExecutionEnvironment
    +
    +    val dataSet = env.fromCollection(data)
    +    val minMaxScaler = MinMaxScaler()
    +    minMaxScaler.fit(dataSet)
    +    val scaledVectors = minMaxScaler.transform(dataSet).collect
    +
    +    scaledVectors.length should equal(data.length)
    +
    +    for (vector <- scaledVectors) {
    +      val test = vector.asBreeze.forall(fv => {
    +        fv >= 0.0 && fv <= 1.0
    +      })
    +      test shouldEqual (true)
    +    }
    +
    +  }
    +
    +  it should "scale vectors' values in the (-1.0,1.0) range" in {
    +
    +    val env = ExecutionEnvironment.getExecutionEnvironment
    +
    +    val dataSet = env.fromCollection(data2)
    +    val minMaxScaler = MinMaxScaler().setMin(-1.0).setMax(1.0)
    +    minMaxScaler.fit(dataSet)
    +    val scaledVectors = minMaxScaler.transform(dataSet).collect
    +
    +    scaledVectors.length should equal(data2.length)
    +
    +    for (labeledVector <- scaledVectors) {
    +      val test = labeledVector.vector.asBreeze.forall(lv => {
    +        lv >= -1.0 && lv <= 1.0
    --- End diff --
    
    The same here.



[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31894690
  
    --- Diff: docs/libs/ml/minMax_scaler.md ---
    @@ -0,0 +1,113 @@
    +---
    +mathjax: include
    +htmlTitle: FlinkML - MinMax Scaler
    +title: <a href="../ml">FlinkML</a> - MinMax Scaler
    +---
    +<!--
    +Licensed to the Apache Software Foundation (ASF) under one
    +or more contributor license agreements.  See the NOTICE file
    +distributed with this work for additional information
    +regarding copyright ownership.  The ASF licenses this file
    +to you under the Apache License, Version 2.0 (the
    +"License"); you may not use this file except in compliance
    +with the License.  You may obtain a copy of the License at
    +
    +  http://www.apache.org/licenses/LICENSE-2.0
    +
    +Unless required by applicable law or agreed to in writing,
    +software distributed under the License is distributed on an
    +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +KIND, either express or implied.  See the License for the
    +specific language governing permissions and limitations
    +under the License.
    +-->
    +
    +* This will be replaced by the TOC
    +{:toc}
    +
    +## Description
    +
    + The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max].
    + In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval.
    + Given a set of input data $x_1, x_2,... x_n$, with minimum value:
    +
    + $$x_{min} = min({x_1, x_2,..., x_n})$$
    +
    + and maximum value:
    +
    + $$x_{max} = max({x_1, x_2,..., x_n})$$
    +
    +The scaled data set $z_1, z_2,...,z_n$ will be:
    +
    + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$
    +
    +where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale.
    +
    +## Operations
    +
    +`MinMaxScaler` is a `Transformer`.
    +As such, it supports the `fit` and `transform` operation.
    +
    +### Fit
    +
    +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:
    +
    +* `fit[T <: Vector]: DataSet[T] => Unit`
    +* `fit: DataSet[LabeledVector] => Unit`
    +
    +### Transform
    +
    +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
    +
    +* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
    +* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
    +
    +## Parameters
    +
    +The MinMax scaler implementation can be controlled by the following two parameters:
    +
    + <table class="table table-bordered">
    +  <thead>
    +    <tr>
    +      <th class="text-left" style="width: 20%">Parameters</th>
    +      <th class="text-center">Description</th>
    +    </tr>
    +  </thead>
    +
    +  <tbody>
    +    <tr>
    +      <td><strong>Min</strong></td>
    +      <td>
    +        <p>
    +          The minimum value of the range for the scaled data set. (Default value: <strong>0.0</strong>)
    +        </p>
    +      </td>
    +    </tr>
    +    <tr>
    +      <td><strong>Std</strong></td>
    --- End diff --
    
    Shouldn't this be called `Max`?



[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31911162
  
    --- Diff: docs/libs/ml/minMax_scaler.md ---
    @@ -0,0 +1,112 @@
    +---
    +mathjax: include
    +htmlTitle: FlinkML - MinMax Scaler
    +title: <a href="../ml">FlinkML</a> - MinMax Scaler
    +---
    +<!--
    +Licensed to the Apache Software Foundation (ASF) under one
    +or more contributor license agreements.  See the NOTICE file
    +distributed with this work for additional information
    +regarding copyright ownership.  The ASF licenses this file
    +to you under the Apache License, Version 2.0 (the
    +"License"); you may not use this file except in compliance
    +with the License.  You may obtain a copy of the License at
    +
    +  http://www.apache.org/licenses/LICENSE-2.0
    +
    +Unless required by applicable law or agreed to in writing,
    +software distributed under the License is distributed on an
    +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +KIND, either express or implied.  See the License for the
    +specific language governing permissions and limitations
    +under the License.
    +-->
    +
    +* This will be replaced by the TOC
    +{:toc}
    +
    +## Description
    +
    + The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max].
    + In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval.
    + Given a set of input data $x_1, x_2,... x_n$, with minimum value:
    +
    + $$x_{min} = min({x_1, x_2,..., x_n})$$
    +
    + and maximum value:
    +
    + $$x_{max} = max({x_1, x_2,..., x_n})$$
    +
    +The scaled data set $z_1, z_2,...,z_n$ will be:
    +
    + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$
    +
    +where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale.
    +
    +## Operations
    +
    +`MinMaxScaler` is a `Transformer`.
    +As such, it supports the `fit` and `transform` operation.
    +
    +### Fit
    +
    +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:
    +
    +* `fit[T <: Vector]: DataSet[T] => Unit`
    +* `fit: DataSet[LabeledVector] => Unit`
    +
    +### Transform
    +
    +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
    +
    +* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
    +* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
    +
    +## Parameters
    +
    +The MinMax scaler implementation can be controlled by the following two parameters:
    +
    + <table class="table table-bordered">
    +  <thead>
    +    <tr>
    +      <th class="text-left" style="width: 20%">Parameters</th>
    +      <th class="text-center">Description</th>
    +    </tr>
    +  </thead>
    +
    +  <tbody>
    +    <tr>
    +      <td><strong>Min</strong></td>
    +      <td>
    +        <p>
    +          The minimum value of the range for the scaled data set. (Default value: <strong>0.0</strong>)
    +        </p>
    +      </td>
    +    </tr>
    +    <tr>
    +      <td><strong>Max</strong></td>
    +      <td>
    +        <p>
    +          The maximum value of the range for the scaled data set. (Default value: <strong>1.0</strong>)
    +        </p>
    +      </td>
    +    </tr>
    +  </tbody>
    +</table>
    +
    +## Examples
    +
    +{% highlight scala %}
    +// Create MinMax scaler transformer
    +val minMaxscaler = MinMaxScaler()
    +.setMin(-1.0)
    --- End diff --
    
    Indent 2 spaces



[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by fobeligi <gi...@git.apache.org>.
Github user fobeligi commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31924947
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    --- End diff --
    
    Hey, if the `metricsOption` field is package-private, then my tests will fail, because in `MinMaxScalerITSuite` I also test whether the min and max of each feature have been calculated correctly.



[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

Posted by fobeligi <gi...@git.apache.org>.
Github user fobeligi commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31895883
  
    --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala ---
    @@ -0,0 +1,180 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{DenseVector, Vector}
    +import org.apache.flink.test.util.FlinkTestBase
    +import org.scalatest.{FlatSpec, Matchers}
    +
    +
    +class MinMaxScalerITSuite
    +  extends FlatSpec
    +  with Matchers
    +  with FlinkTestBase {
    +
    +  behavior of "Flink's MinMax Scaler"
    +
    +  import MinMaxScalerData._
    +
    +  it should "scale the vectors' values to be restricted in the (0.0,1.0) range" in {
    +
    +    val env = ExecutionEnvironment.getExecutionEnvironment
    +
    +    val dataSet = env.fromCollection(data)
    +    val minMaxScaler = MinMaxScaler()
    +    minMaxScaler.fit(dataSet)
    +    val scaledVectors = minMaxScaler.transform(dataSet).collect
    +
    +    scaledVectors.length should equal(data.length)
    +
    +    for (vector <- scaledVectors) {
    +      val test = vector.asBreeze.forall(fv => {
    +        fv >= 0.0 && fv <= 1.0
    --- End diff --
    
    In this case I will use the same method as in the implementation of the transformer:
    calculate the min and max of each feature and then apply the formula that I explain in the documentation. Is that OK?
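
A minimal sketch of the check described above, in plain Scala without Flink or Breeze. It applies the documented formula z_i = (x_i - x_min) / (x_max - x_min) * (max - min) + min to a single feature column. The names `MinMaxFormulaSketch` and `scaleFeature` are hypothetical and not part of the pull request, and the handling of a constant feature is an assumption made only to keep the sketch total.

```scala
object MinMaxFormulaSketch {
  // Hypothetical helper: scales one non-empty feature column into [min, max]
  // using z_i = (x_i - x_min) / (x_max - x_min) * (max - min) + min.
  def scaleFeature(xs: Seq[Double], min: Double = 0.0, max: Double = 1.0): Seq[Double] = {
    val xMin = xs.min
    val xMax = xs.max
    val range = xMax - xMin
    // Assumption: a constant feature (range == 0) is mapped to `min`
    // so that the sketch never divides by zero.
    xs.map(x => if (range == 0.0) min else (x - xMin) / range * (max - min) + min)
  }
}
```

A test could then compare each scaled value produced by the transformer against the value this helper returns for the same feature, rather than only checking that it falls inside the target range.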



[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31895117
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,255 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = (0,1).
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    +
    +  /** Sets the minimum for the range of the transformed data
    +    *
    +    * @param min the user-specified minimum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMin(min: Double): MinMaxScaler = {
    +    parameters.add(Min, min)
    +    this
    +  }
    +
    +  /** Sets the maximum for the range of the transformed data
    +    *
    +    * @param max the user-specified maximum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMax(max: Double): MinMaxScaler = {
    +    parameters.add(Max, max)
    +    this
    +  }
    +}
    +
    +object MinMaxScaler {
    +
    +  // ====================================== Parameters =============================================
    +
    +  case object Min extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(0.0)
    +  }
    +
    +  case object Max extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(1.0)
    +  }
    +
    +  // ==================================== Factory methods ==========================================
    +
    +  def apply(): MinMaxScaler = {
    +    new MinMaxScaler()
    +  }
    +
    +  // ====================================== Operations =============================================
    +
    +  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and
    +    * maximum of each feature of the training data. These values are used in the transform step
    +    * to transform the given input data.
    +    *
    +    * @tparam T Input data type which is a subtype of [[Vector]]
    +    * @return
    +    */
    +  implicit def fitVectorMinMaxScaler[T <: Vector] = new FitOperation[MinMaxScaler, T] {
    +    override def fit(instance: MinMaxScaler, fitParameters: ParameterMap, input: DataSet[T])
    +    : Unit = {
    +      val metrics = extractFeatureMinMaxVectors(input)
    +
    +      instance.metricsOption = Some(metrics)
    +    }
    +  }
    +
    +  /** Trains the [[MinMaxScaler]] by learning the minimum and maximum of the features of the
    +    * training data which is of type [[LabeledVector]]. The minimum and maximum are used to
    +    * transform the given input data.
    +    *
    +    */
    +  implicit val fitLabeledVectorMinMaxScaler = {
    +    new FitOperation[MinMaxScaler, LabeledVector] {
    +      override def fit(
    +        instance: MinMaxScaler,
    +        fitParameters: ParameterMap,
    +        input: DataSet[LabeledVector])
    +      : Unit = {
    +        val vectorDS = input.map(_.vector)
    +        val metrics = extractFeatureMinMaxVectors(vectorDS)
    +
    +        instance.metricsOption = Some(metrics)
    +      }
    +    }
    +  }
    +
    +  /** Calculates in one pass over the data the features' minimum and maximum values.
    +    *
    +    *
    +    * @param dataSet The data set for which we want to calculate the minimum and maximum values.
    +    * @return  DataSet containing a single tuple of two vectors (minVector, maxVector).
    +    *          The first vector represents the minimum values vector and the second is the maximum
    +    *          values vector.
    +    */
    +  private def extractFeatureMinMaxVectors[T <: Vector](dataSet: DataSet[T])
    +  : DataSet[(linalg.Vector[Double], linalg.Vector[Double])] = {
    +
    +    val minMax = dataSet.map {
    +      v => (v.asBreeze, v.asBreeze)
    +    }.reduce {
    +      (minMax1, minMax2) => {
    +
    +        val tempMinimum = linalg.Vector.zeros[Double](minMax1._1.length)
    +
    +        for (i <- 0 until minMax1._1.length) {
    +          tempMinimum(i) = if (minMax1._1(i) < minMax2._1(i)) {
    +            minMax1._1(i)
    +          } else {
    +            minMax2._1(i)
    +          }
    +        }
    +
    +        val tempMaximum = linalg.Vector.zeros[Double](minMax1._2.length)
    +
    +        for (i <- 0 until minMax1._2.length) {
    +          tempMaximum(i) = if (minMax1._2(i) > minMax2._2(i)) {
    +            minMax1._2(i)
    +          } else {
    +            minMax2._2(i)
    +          }
    +        }
    --- End diff --
    
    Using Breeze, you can simply write `tempMaximum = max(minMax1._2, minMax2._2)`.
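
A minimal sketch of this suggestion, assuming Breeze is on the classpath and that `breeze.linalg.min` and `breeze.linalg.max` are applied element-wise to two vectors of equal length. The names `BreezeMinMaxSketch`, `MinMax` and `combine` are hypothetical, and `DenseVector` is used here for concreteness instead of the `linalg.Vector[Double]` type in the diff.

```scala
import breeze.linalg.{DenseVector, max, min}

object BreezeMinMaxSketch {
  // A (minVector, maxVector) pair, mirroring the tuple reduced in the diff.
  type MinMax = (DenseVector[Double], DenseVector[Double])

  // Combines two pairs: element-wise minimum of the first components,
  // element-wise maximum of the second components.
  def combine(a: MinMax, b: MinMax): MinMax =
    (min(a._1, b._1), max(a._2, b._2))
}
```

Used as the function passed to `reduce`, `combine` would replace both index-based loops in the diff.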



[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31911306
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    --- End diff --
    
    Are these of type `breeze.linalg.Vector`? If so, why not use `flink.ml.math.Vector`?



[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31921419
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    --- End diff --
    
    You're right. Will add it when I merge it.



[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31895445
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,255 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = (0,1).
    +  *
    +  * This transformer takes a subtype of [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies within a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    +
    +  /** Sets the minimum for the range of the transformed data
    +    *
    +    * @param min the user-specified minimum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMin(min: Double): MinMaxScaler = {
    +    parameters.add(Min, min)
    +    this
    +  }
    +
    +  /** Sets the maximum for the range of the transformed data
    +    *
    +    * @param max the user-specified maximum value.
    +    * @return the MinMaxScaler instance with its maximum value set to the user-specified value.
    +    */
    +  def setMax(max: Double): MinMaxScaler = {
    +    parameters.add(Max, max)
    +    this
    +  }
    +}
    +
    +object MinMaxScaler {
    +
    +  // ====================================== Parameters =============================================
    +
    +  case object Min extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(0.0)
    +  }
    +
    +  case object Max extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(1.0)
    +  }
    +
    +  // ==================================== Factory methods ==========================================
    +
    +  def apply(): MinMaxScaler = {
    +    new MinMaxScaler()
    +  }
    +
    +  // ====================================== Operations =============================================
    +
    +  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and
    +    * maximum of each feature of the training data. These values are used in the transform step
    +    * to transform the given input data.
    +    *
    +    * @tparam T Input data type which is a subtype of [[Vector]]
    +    * @return
    +    */
    +  implicit def fitVectorMinMaxScaler[T <: Vector] = new FitOperation[MinMaxScaler, T] {
    +    override def fit(instance: MinMaxScaler, fitParameters: ParameterMap, input: DataSet[T])
    +    : Unit = {
    +      val metrics = extractFeatureMinMaxVectors(input)
    +
    +      instance.metricsOption = Some(metrics)
    +    }
    +  }
    +
    +  /** Trains the [[MinMaxScaler]] by learning the minimum and maximum of the features of the
    +    * training data which is of type [[LabeledVector]]. The minimum and maximum are used to
    +    * transform the given input data.
    +    *
    +    */
    +  implicit val fitLabeledVectorMinMaxScaler = {
    +    new FitOperation[MinMaxScaler, LabeledVector] {
    +      override def fit(
    +        instance: MinMaxScaler,
    +        fitParameters: ParameterMap,
    +        input: DataSet[LabeledVector])
    +      : Unit = {
    +        val vectorDS = input.map(_.vector)
    +        val metrics = extractFeatureMinMaxVectors(vectorDS)
    +
    +        instance.metricsOption = Some(metrics)
    +      }
    +    }
    +  }
    +
    +  /** Calculates in one pass over the data the features' minimum and maximum values.
    +    *
    +    *
    +    * @param dataSet The data set for which we want to calculate the minimum and maximum values.
    +    * @return  DataSet containing a single tuple of two vectors (minVector, maxVector).
    +    *          The first vector represents the minimum values vector and the second is the maximum
    +    *          values vector.
    +    */
    +  private def extractFeatureMinMaxVectors[T <: Vector](dataSet: DataSet[T])
    +  : DataSet[(linalg.Vector[Double], linalg.Vector[Double])] = {
    +
    +    val minMax = dataSet.map {
    +      v => (v.asBreeze, v.asBreeze)
    +    }.reduce {
    +      (minMax1, minMax2) => {
    +
    +        val tempMinimum = linalg.Vector.zeros[Double](minMax1._1.length)
    +
    +        for (i <- 0 until minMax1._1.length) {
    +          tempMinimum(i) = if (minMax1._1(i) < minMax2._1(i)) {
    +            minMax1._1(i)
    +          } else {
    +            minMax2._1(i)
    +          }
    +        }
    +
    +        val tempMaximum = linalg.Vector.zeros[Double](minMax1._2.length)
    +
    +        for (i <- 0 until minMax1._2.length) {
    +          tempMaximum(i) = if (minMax1._2(i) > minMax2._2(i)) {
    +            minMax1._2(i)
    +          } else {
    +            minMax2._2(i)
    +          }
    +        }
    +        (tempMinimum, tempMaximum)
    +      }
    +    }
    +    minMax
    +  }
    +
    +  /** [[TransformOperation]] which scales input data of subtype of [[Vector]] with respect to
    +    * the calculated minimum and maximum of the training data. The minimum and maximum
    +    * values of the resulting data are configurable.
    +    *
    +    * @tparam T Type of the input and output data which has to be a subtype of [[Vector]]
    +    * @return
    +    */
    +  implicit def transformVectors[T <: Vector : BreezeVectorConverter : TypeInformation : ClassTag]
    +  = {
    +    new TransformOperation[MinMaxScaler, T, T] {
    +      override def transform(
    +        instance: MinMaxScaler,
    +        transformParameters: ParameterMap,
    +        input: DataSet[T])
    +      : DataSet[T] = {
    +
    +        val resultingParameters = instance.parameters ++ transformParameters
    +        val min = resultingParameters(Min)
    +        val max = resultingParameters(Max)
    +
    +        instance.metricsOption match {
    +          case Some(metrics) => {
    +            input.mapWithBcVariable(metrics) {
    +              (vector, metrics) => {
    +                val (broadcastMin, broadcastMax) = metrics
    +                var myVector = vector.asBreeze
    +
    +                myVector -= broadcastMin
    +                myVector :/= (broadcastMax - broadcastMin)
    +                myVector = (myVector :* (max - min)) + min
    +                myVector.fromBreeze
    +              }
    +            }
    +          }
    +
    +          case None =>
    +            throw new RuntimeException("The MinMaxScaler has not been fitted to the data. " +
    +              "This is necessary to estimate the minimum and maximum of the data.")
    +        }
    +      }
    +    }
    +  }
    +
    +  implicit val transformLabeledVectors = {
    +    new TransformOperation[MinMaxScaler, LabeledVector, LabeledVector] {
    +      override def transform(instance: MinMaxScaler,
    +        transformParameters: ParameterMap,
    +        input: DataSet[LabeledVector]): DataSet[LabeledVector] = {
    +        val resultingParameters = instance.parameters ++ transformParameters
    +        val min = resultingParameters(Min)
    +        val max = resultingParameters(Max)
    +
    +        instance.metricsOption match {
    +          case Some(metrics) => {
    +            input.mapWithBcVariable(metrics) {
    +              (labeledVector, metrics) => {
    +                val (broadcastMin, broadcastMax) = metrics
    +                val LabeledVector(label, vector) = labeledVector
    +                var breezeVector = vector.asBreeze
    +
    +                breezeVector -= broadcastMin
    +                breezeVector :/= (broadcastMax - broadcastMin)
    +                breezeVector = (breezeVector :* (max - min)) + min
    +                LabeledVector(label, breezeVector.fromBreeze)
    +
    --- End diff --
    
    line break.
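
One case the quoted transform does not obviously handle is a feature whose training minimum and maximum coincide. The range is then zero, and the element-wise division evaluates `0.0 / 0.0`, which is NaN for IEEE 754 doubles. Below is a minimal sketch of one possible guard, written in plain Scala on a single value rather than on Breeze vectors. The names `SafeMinMax` and `scaleValue` are hypothetical, and mapping a constant feature to the lower bound is an assumption made here, not behaviour taken from the pull request.

```scala
object SafeMinMax {
  /** Scales one value given the feature's training minimum and maximum.
    * Hypothetical helper: a zero range is mapped to the lower bound `min`
    * instead of producing NaN.
    */
  def scaleValue(x: Double, xMin: Double, xMax: Double, min: Double, max: Double): Double = {
    val range = xMax - xMin
    if (range == 0.0) min
    else (x - xMin) / range * (max - min) + min
  }
}
```

With this guard, a constant feature ends up at the lower bound of the target range, while every other feature is scaled exactly as in the unguarded formula.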



[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31922171
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies within a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    --- End diff --
    
    As private state, the developer should be able to choose any type, so a `BreezeVector` should be fine here. I was just wondering whether a `DenseVector` would make more sense. Is it safe to assume that every feature has at least 2 non-zero values?
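
The density question can be made concrete with a small sketch. It mirrors the shape of the map/reduce in the quoted diff, but uses plain Scala collections instead of Breeze vectors and a Flink `DataSet`. The names `MinMaxReduce` and `featureMinMax` are hypothetical, and the input is assumed to be non-empty with rows of equal length.

```scala
object MinMaxReduce {
  /** Element-wise (min, max) over equally sized feature rows.
    * Hypothetical helper for illustration; assumes `rows` is non-empty
    * and that all rows have the same length.
    */
  def featureMinMax(
      rows: Seq[IndexedSeq[Double]]): (IndexedSeq[Double], IndexedSeq[Double]) =
    rows.map(r => (r, r)).reduce { (a, b) =>
      // Combine the running minima and maxima position by position.
      val mins = a._1.zip(b._1).map { case (x, y) => math.min(x, y) }
      val maxs = a._2.zip(b._2).map { case (x, y) => math.max(x, y) }
      (mins, maxs)
    }
}
```

For the rows `(0, 3, 0)`, `(2, 0, 0)` and `(0, -1, 0)`, the minimum vector is `(0, -1, 0)` and the maximum vector is `(2, 3, 0)`. An entry of either result is zero only when zero really is the extreme value of that feature, so the statistics can be much denser than the individual input rows.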



[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31921747
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies within a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    +
    +  /** Sets the minimum for the range of the transformed data
    +    *
    +    * @param min the user-specified minimum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMin(min: Double): MinMaxScaler = {
    +    parameters.add(Min, min)
    +    this
    +  }
    +
    +  /** Sets the maximum for the range of the transformed data
    +    *
    +    * @param max the user-specified maximum value.
    +    * @return the MinMaxScaler instance with its maximum value set to the user-specified value.
    +    */
    +  def setMax(max: Double): MinMaxScaler = {
    +    parameters.add(Max, max)
    +    this
    +  }
    +}
    +
    +object MinMaxScaler {
    +
    +  // ====================================== Parameters =============================================
    +
    +  case object Min extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(0.0)
    +  }
    +
    +  case object Max extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(1.0)
    +  }
    +
    +  // ==================================== Factory methods ==========================================
    +
    +  def apply(): MinMaxScaler = {
    +    new MinMaxScaler()
    +  }
    +
    +  // ====================================== Operations =============================================
    +
    +  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and
    +    * maximum of each feature of the training data. These values are used in the transform step
    +    * to transform the given input data.
    +    *
    +    * @tparam T Input data type which is a subtype of [[Vector]]
    +    * @return
    +    */
    +  implicit def fitVectorMinMaxScaler[T <: Vector] = new FitOperation[MinMaxScaler, T] {
    +    override def fit(instance: MinMaxScaler, fitParameters: ParameterMap, input: DataSet[T])
    +    : Unit = {
    +      val metrics = extractFeatureMinMaxVectors(input)
    +
    +      instance.metricsOption = Some(metrics)
    +    }
    +  }
    +
    +  /** Trains the [[MinMaxScaler]] by learning the minimum and maximum of the features of the
    +    * training data which is of type [[LabeledVector]]. The minimum and maximum are used to
    +    * transform the given input data.
    +    *
    +    */
    +  implicit val fitLabeledVectorMinMaxScaler = {
    +    new FitOperation[MinMaxScaler, LabeledVector] {
    +      override def fit(
    +        instance: MinMaxScaler,
    +        fitParameters: ParameterMap,
    +        input: DataSet[LabeledVector])
    +      : Unit = {
    +        val vectorDS = input.map(_.vector)
    +        val metrics = extractFeatureMinMaxVectors(vectorDS)
    +
    +        instance.metricsOption = Some(metrics)
    +      }
    +    }
    +  }
    +
    +  /** Calculates in one pass over the data the features' minimum and maximum values.
    +    *
    +    * @param dataSet The data set for which we want to calculate the minimum and maximum values.
    +    * @return  DataSet containing a single tuple of two vectors (minVector, maxVector).
    +    *          The first vector represents the minimum values vector and the second is the maximum
    +    *          values vector.
    +    */
    +  private def extractFeatureMinMaxVectors[T <: Vector](dataSet: DataSet[T])
    +  : DataSet[(linalg.Vector[Double], linalg.Vector[Double])] = {
    +
    +    val minMax = dataSet.map {
    +      v => (v.asBreeze, v.asBreeze)
    +    }.reduce {
    +      (minMax1, minMax2) => {
    +
    +        val tempMinimum = min(minMax1._1, minMax2._1)
    +        val tempMaximum = max(minMax1._2, minMax2._2)
    +
    +        (tempMinimum, tempMaximum)
    +      }
    +    }
    +    minMax
    +  }
    +
    +  /** [[TransformOperation]] which scales input data of subtype of [[Vector]] with respect to
    +    * the calculated minimum and maximum of the training data. The minimum and maximum
    +    * values of the resulting data are configurable.
    +    *
    +    * @tparam T Type of the input and output data which has to be a subtype of [[Vector]]
    +    * @return
    +    */
    +  implicit def transformVectors[T <: Vector : BreezeVectorConverter : TypeInformation : ClassTag]
    +  = {
    +    new TransformOperation[MinMaxScaler, T, T] {
    +      override def transform(
    +        instance: MinMaxScaler,
    +        transformParameters: ParameterMap,
    +        input: DataSet[T])
    +      : DataSet[T] = {
    +
    +        val resultingParameters = instance.parameters ++ transformParameters
    +        val min = resultingParameters(Min)
    +        val max = resultingParameters(Max)
    +
    +        instance.metricsOption match {
    +          case Some(metrics) => {
    +            input.mapWithBcVariable(metrics) {
    +              (vector, metrics) => {
    +                val (broadcastMin, broadcastMax) = metrics
    +                var myVector = vector.asBreeze
    --- End diff --
    
    I agree. Will make the changes when merging.



[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on the pull request:

    https://github.com/apache/flink/pull/798#issuecomment-109983940
  
    The documentation change must also update index.html (the FlinkML landing page) so that the new page is linked from somewhere.



[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

Posted by fobeligi <gi...@git.apache.org>.
Github user fobeligi commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31894794
  
    --- Diff: docs/libs/ml/minMax_scaler.md ---
    @@ -0,0 +1,113 @@
    +---
    +mathjax: include
    +htmlTitle: FlinkML - MinMax Scaler
    +title: <a href="../ml">FlinkML</a> - MinMax Scaler
    +---
    +<!--
    +Licensed to the Apache Software Foundation (ASF) under one
    +or more contributor license agreements.  See the NOTICE file
    +distributed with this work for additional information
    +regarding copyright ownership.  The ASF licenses this file
    +to you under the Apache License, Version 2.0 (the
    +"License"); you may not use this file except in compliance
    +with the License.  You may obtain a copy of the License at
    +
    +  http://www.apache.org/licenses/LICENSE-2.0
    +
    +Unless required by applicable law or agreed to in writing,
    +software distributed under the License is distributed on an
    +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +KIND, either express or implied.  See the License for the
    +specific language governing permissions and limitations
    +under the License.
    +-->
    +
    +* This will be replaced by the TOC
    +{:toc}
    +
    +## Description
    +
    + The MinMax scaler scales the given data set so that all values lie within a user-specified range [min,max].
    + If the user does not provide a minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval.
    + Given a set of input data $x_1, x_2, \ldots, x_n$, with minimum value:
    +
    + $$x_{min} = \min(\{x_1, x_2, \ldots, x_n\})$$
    +
    + and maximum value:
    +
    + $$x_{max} = \max(\{x_1, x_2, \ldots, x_n\})$$
    +
    +The scaled data set $z_1, z_2, \ldots, z_n$ will be:
    +
    + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$
    +
    +where $\textit{min}$ and $\textit{max}$ are the user-specified minimum and maximum values of the range to scale.
    +
    +## Operations
    +
    +`MinMaxScaler` is a `Transformer`.
    +As such, it supports the `fit` and `transform` operation.
    +
    +### Fit
    +
    +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:
    +
    +* `fit[T <: Vector]: DataSet[T] => Unit`
    +* `fit: DataSet[LabeledVector] => Unit`
    +
    +### Transform
    +
    +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
    +
    +* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
    +* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
    +
    +## Parameters
    +
    +The MinMax scaler implementation can be controlled by the following two parameters:
    +
    + <table class="table table-bordered">
    +  <thead>
    +    <tr>
    +      <th class="text-left" style="width: 20%">Parameters</th>
    +      <th class="text-center">Description</th>
    +    </tr>
    +  </thead>
    +
    +  <tbody>
    +    <tr>
    +      <td><strong>Min</strong></td>
    +      <td>
    +        <p>
    +          The minimum value of the range for the scaled data set. (Default value: <strong>0.0</strong>)
    +        </p>
    +      </td>
    +    </tr>
    +    <tr>
    +      <td><strong>Std</strong></td>
    --- End diff --
    
    Yes, you are right!
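
For reference, the scaling formula in the quoted documentation can be checked with a small plain-Scala sketch. It works on one feature column held in a `Seq[Double]`, not on a Flink `DataSet`. The names `MinMaxFormula` and `scale` are made up for illustration and are not part of FlinkML, and the column is assumed to be non-empty and not constant.

```scala
object MinMaxFormula {
  /** Applies z_i = (x_i - xMin) / (xMax - xMin) * (max - min) + min
    * to one feature column.
    * Hypothetical helper; assumes `xs` is non-empty and xMax != xMin.
    */
  def scale(xs: Seq[Double], min: Double = 0.0, max: Double = 1.0): Seq[Double] = {
    val xMin = xs.min
    val xMax = xs.max
    xs.map(x => (x - xMin) / (xMax - xMin) * (max - min) + min)
  }
}
```

For example, `MinMaxFormula.scale(Seq(1.0, 2.0, 3.0), -1.0, 1.0)` maps the smallest input to -1.0, the midpoint to 0.0 and the largest input to 1.0. With the default range the same column becomes 0.0, 0.5 and 1.0.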



[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31914466
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies within a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    --- End diff --
    
    Not right now, so these can remain. I was mostly concerned that this parameter was user-facing, meaning the user had to provide Breeze vectors as parameters, but that is not the case.



[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31911384
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    +
    +  /** Sets the minimum for the range of the transformed data
    +    *
    +    * @param min the user-specified minimum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMin(min: Double): MinMaxScaler = {
    +    parameters.add(Min, min)
    +    this
    +  }
    +
    +  /** Sets the maximum for the range of the transformed data
    +    *
    +    * @param max the user-specified maximum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMax(max: Double): MinMaxScaler = {
    +    parameters.add(Max, max)
    +    this
    +  }
    +}
    +
    +object MinMaxScaler {
    +
    +  // ====================================== Parameters =============================================
    +
    +  case object Min extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(0.0)
    +  }
    +
    +  case object Max extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(1.0)
    +  }
    +
    +  // ==================================== Factory methods ==========================================
    +
    +  def apply(): MinMaxScaler = {
    +    new MinMaxScaler()
    +  }
    +
    +  // ====================================== Operations =============================================
    +
    +  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and
    +    * maximum of each feature of the training data. These values are used in the transform step
    +    * to transform the given input data.
    +    *
    +    * @tparam T Input data type which is a subtype of [[Vector]]
    +    * @return
    --- End diff --
    
    Missing return type


---

[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/798#issuecomment-109982291
  
    LGTM. Will merge once Travis gives green light.


---

[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31913806
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    +
    +  /** Sets the minimum for the range of the transformed data
    +    *
    +    * @param min the user-specified minimum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMin(min: Double): MinMaxScaler = {
    +    parameters.add(Min, min)
    +    this
    +  }
    +
    +  /** Sets the maximum for the range of the transformed data
    +    *
    +    * @param max the user-specified maximum value.
    +    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
    +    */
    +  def setMax(max: Double): MinMaxScaler = {
    +    parameters.add(Max, max)
    +    this
    +  }
    +}
    +
    +object MinMaxScaler {
    +
    +  // ====================================== Parameters =============================================
    +
    +  case object Min extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(0.0)
    +  }
    +
    +  case object Max extends Parameter[Double] {
    +    override val defaultValue: Option[Double] = Some(1.0)
    +  }
    +
    +  // ==================================== Factory methods ==========================================
    +
    +  def apply(): MinMaxScaler = {
    +    new MinMaxScaler()
    +  }
    +
    +  // ====================================== Operations =============================================
    +
    +  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and
    +    * maximum of each feature of the training data. These values are used in the transform step
    +    * to transform the given input data.
    +    *
    +    * @tparam T Input data type which is a subtype of [[Vector]]
    +    * @return
    +    */
    +  implicit def fitVectorMinMaxScaler[T <: Vector] = new FitOperation[MinMaxScaler, T] {
    +    override def fit(instance: MinMaxScaler, fitParameters: ParameterMap, input: DataSet[T])
    +    : Unit = {
    +      val metrics = extractFeatureMinMaxVectors(input)
    +
    +      instance.metricsOption = Some(metrics)
    +    }
    +  }
    +
    +  /** Trains the [[MinMaxScaler]] by learning the minimum and maximum of the features of the
    +    * training data which is of type [[LabeledVector]]. The minimum and maximum are used to
    +    * transform the given input data.
    +    *
    +    */
    +  implicit val fitLabeledVectorMinMaxScaler = {
    +    new FitOperation[MinMaxScaler, LabeledVector] {
    +      override def fit(
    +        instance: MinMaxScaler,
    +        fitParameters: ParameterMap,
    +        input: DataSet[LabeledVector])
    +      : Unit = {
    +        val vectorDS = input.map(_.vector)
    +        val metrics = extractFeatureMinMaxVectors(vectorDS)
    +
    +        instance.metricsOption = Some(metrics)
    +      }
    +    }
    +  }
    +
    +  /** Calculates in one pass over the data the features' minimum and maximum values.
    +    *
    +    * @param dataSet The data set for which we want to calculate the minimum and maximum values.
    +    * @return  DataSet containing a single tuple of two vectors (minVector, maxVector).
    +    *          The first vector represents the minimum values vector and the second is the maximum
    +    *          values vector.
    +    */
    +  private def extractFeatureMinMaxVectors[T <: Vector](dataSet: DataSet[T])
    +  : DataSet[(linalg.Vector[Double], linalg.Vector[Double])] = {
    +
    +    val minMax = dataSet.map {
    +      v => (v.asBreeze, v.asBreeze)
    +    }.reduce {
    +      (minMax1, minMax2) => {
    +
    +        val tempMinimum = min(minMax1._1, minMax2._1)
    +        val tempMaximum = max(minMax1._2, minMax2._2)
    +
    +        (tempMinimum, tempMaximum)
    +      }
    +    }
    +    minMax
    +  }
    +
    +  /** [[TransformOperation]] which scales input data of subtype of [[Vector]] with respect to
    +    * the calculated minimum and maximum of the training data. The minimum and maximum
    +    * values of the resulting data is configurable.
    +    *
    +    * @tparam T Type of the input and output data which has to be a subtype of [[Vector]]
    +    * @return
    +    */
    +  implicit def transformVectors[T <: Vector : BreezeVectorConverter : TypeInformation : ClassTag]
    +  = {
    +    new TransformOperation[MinMaxScaler, T, T] {
    +      override def transform(
    +        instance: MinMaxScaler,
    +        transformParameters: ParameterMap,
    +        input: DataSet[T])
    +      : DataSet[T] = {
    +
    +        val resultingParameters = instance.parameters ++ transformParameters
    +        val min = resultingParameters(Min)
    +        val max = resultingParameters(Max)
    +
    +        instance.metricsOption match {
    +          case Some(metrics) => {
    +            input.mapWithBcVariable(metrics) {
    +              (vector, metrics) => {
    +                val (broadcastMin, broadcastMax) = metrics
    +                var myVector = vector.asBreeze
    --- End diff --
    
    We can avoid some code duplication here by adding a transformVector(vector: flink.ml.Vector, metrics: (bMin, bMax)): flink.ml.Vector function, which can then be called inside the map for both the Vector and LabeledVector cases.
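
A minimal sketch of the kind of shared helper suggested in this comment. It is not the code from the pull request: plain `Array[Double]` values stand in for Flink's `Vector` and the Breeze types, the names `MinMaxScaling`, `scaleVector`, `scaleLabeled` and `Labeled` are hypothetical, and the formula is the standard min-max rescaling, assumed here because the quoted diff stops before the transform logic. Mapping a constant feature to the target minimum is likewise an assumption of the sketch.

```scala
// Hypothetical helper; plain arrays stand in for Flink's Vector and Breeze types.
object MinMaxScaling {

  /** Rescales one feature vector into [targetMin, targetMax].
    *
    * featureMin and featureMax hold the per-feature minimum and maximum
    * learned during fit. A feature that is constant in the training data
    * (max == min) is mapped to targetMin; that choice is an assumption.
    */
  def scaleVector(
      vector: Array[Double],
      featureMin: Array[Double],
      featureMax: Array[Double],
      targetMin: Double,
      targetMax: Double): Array[Double] = {
    require(
      vector.length == featureMin.length && vector.length == featureMax.length,
      "vector and metrics must have the same dimension")

    Array.tabulate(vector.length) { i =>
      val range = featureMax(i) - featureMin(i)
      if (range == 0.0) targetMin
      else (vector(i) - featureMin(i)) / range * (targetMax - targetMin) + targetMin
    }
  }

  // Stand-in for a labeled example: a label plus its feature vector.
  final case class Labeled(label: Double, features: Array[Double])

  // The labeled case delegates to the same helper and leaves the label untouched.
  def scaleLabeled(
      example: Labeled,
      featureMin: Array[Double],
      featureMax: Array[Double],
      targetMin: Double,
      targetMax: Double): Labeled =
    example.copy(
      features = scaleVector(example.features, featureMin, featureMax, targetMin, targetMax))
}
```

With a helper of this shape, both transform operations would only differ in how they unwrap and rewrap their input, so a later change to the scaling rule would have to be made in one place.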


---

[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31911138
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    --- End diff --
    
    Doesn't LabeledVector apply here as well?


---

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on the pull request:

    https://github.com/apache/flink/pull/798#issuecomment-109913485
  
    Really good work @fobeligi. The code is really well structured and documented. I had only some minor comments. When you have them addressed, I think it's good to be merged :-)


---

[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by fobeligi <gi...@git.apache.org>.
Github user fobeligi commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31927083
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    --- End diff --
    
    Yes ^^


---

[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31926077
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    --- End diff --
    
    Package private should be ok, since the test is in the same package, right?
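
A small sketch of the visibility change being discussed, under stated assumptions: an enclosing object named `preprocessing` emulates the package so the snippet is self-contained, and `Scaler` and `ScalerSuite` are hypothetical stand-ins for the transformer and its test suite. In the real file the qualifier would refer to the actual package instead.

```scala
// The enclosing object plays the role of the package in this sketch.
object preprocessing {

  class Scaler {
    // Accessible anywhere inside `preprocessing`, hidden from callers outside it.
    private[preprocessing] var metricsOption: Option[(Array[Double], Array[Double])] = None
  }

  // Stands in for a test suite that lives in the same package.
  object ScalerSuite {
    def metricsAfterFit(): Option[(Array[Double], Array[Double])] = {
      val scaler = new Scaler
      scaler.metricsOption = Some((Array(0.0), Array(1.0)))
      scaler.metricsOption
    }
  }
}
```

Code outside `preprocessing` that tries to read or assign `metricsOption` would fail to compile, while the suite inside it can still inspect the fitted state.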


---

[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/798


---

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

Posted by thvasilo <gi...@git.apache.org>.
Github user thvasilo commented on the pull request:

    https://github.com/apache/flink/pull/798#issuecomment-109937398
  
    Note: you might want to rename this to *[FLINK-1844] [ml] - Add Normaliser to ML library* so that JIRA picks up on the issue.


---

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31895785
  
    --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala ---
    @@ -0,0 +1,180 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{DenseVector, Vector}
    +import org.apache.flink.test.util.FlinkTestBase
    +import org.scalatest.{FlatSpec, Matchers}
    +
    +
    +class MinMaxScalerITSuite
    +  extends FlatSpec
    +  with Matchers
    +  with FlinkTestBase {
    +
    +  behavior of "Flink's MinMax Scaler"
    +
    +  import MinMaxScalerData._
    +
    +  it should "scale the vectors' values to be restricted in the (0.0,1.0) range" in {
    +
    +    val env = ExecutionEnvironment.getExecutionEnvironment
    +
    +    val dataSet = env.fromCollection(data)
    +    val minMaxScaler = MinMaxScaler()
    +    minMaxScaler.fit(dataSet)
    +    val scaledVectors = minMaxScaler.transform(dataSet).collect
    +
    +    scaledVectors.length should equal(data.length)
    +
    +    for (vector <- scaledVectors) {
    +      val test = vector.asBreeze.forall(fv => {
    +        fv >= 0.0 && fv <= 1.0
    --- End diff --
    
    Maybe we could check not only whether the data lies between `0` and `1` but also whether the vectors have been correctly scaled.
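
One way such a stricter check might look, as a sketch rather than the test that was eventually written. It uses plain arrays instead of Flink or Breeze types, assumes the standard min-max formula, assumes input and output rows are in the same order, and assumes no feature is constant. The names `ScalingCheck`, `featureMinMax`, `expected` and `correctlyScaled` are hypothetical.

```scala
object ScalingCheck {

  // Per-feature minimum and maximum over all rows of the input.
  def featureMinMax(data: Seq[Array[Double]]): (Array[Double], Array[Double]) = {
    val dim = data.head.length
    val mins = Array.tabulate(dim)(i => data.map(_(i)).min)
    val maxs = Array.tabulate(dim)(i => data.map(_(i)).max)
    (mins, maxs)
  }

  // Expected value under standard min-max scaling into [lo, hi].
  def expected(x: Double, fMin: Double, fMax: Double, lo: Double, hi: Double): Double =
    (x - fMin) / (fMax - fMin) * (hi - lo) + lo

  // True when every entry of `scaled` matches its expected value within `eps`.
  def correctlyScaled(
      data: Seq[Array[Double]],
      scaled: Seq[Array[Double]],
      lo: Double,
      hi: Double,
      eps: Double = 1e-9): Boolean = {
    val (mins, maxs) = featureMinMax(data)
    data.length == scaled.length &&
      data.zip(scaled).forall { case (row, out) =>
        row.indices.forall { i =>
          math.abs(out(i) - expected(row(i), mins(i), maxs(i), lo, hi)) <= eps
        }
      }
  }
}
```

A range-only assertion would accept any output whose entries fall inside the target interval, including wrongly scaled ones; comparing against the expected values catches those cases too.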


---

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31899145
  
    --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala ---
    @@ -0,0 +1,180 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{DenseVector, Vector}
    +import org.apache.flink.test.util.FlinkTestBase
    +import org.scalatest.{FlatSpec, Matchers}
    +
    +
    +class MinMaxScalerITSuite
    +  extends FlatSpec
    +  with Matchers
    +  with FlinkTestBase {
    +
    +  behavior of "Flink's MinMax Scaler"
    +
    +  import MinMaxScalerData._
    +
    +  it should "scale the vectors' values to be restricted in the (0.0,1.0) range" in {
    +
    +    val env = ExecutionEnvironment.getExecutionEnvironment
    +
    +    val dataSet = env.fromCollection(data)
    +    val minMaxScaler = MinMaxScaler()
    +    minMaxScaler.fit(dataSet)
    +    val scaledVectors = minMaxScaler.transform(dataSet).collect
    +
    +    scaledVectors.length should equal(data.length)
    +
    +    for (vector <- scaledVectors) {
    +      val test = vector.asBreeze.forall(fv => {
    +        fv >= 0.0 && fv <= 1.0
    --- End diff --
    
    Yes, it is. This ensures that anyone who changes the `Transformer` logic will see an error.


---

[GitHub] flink pull request: [FLINK-1844] [ml] Add Normaliser to ML library

Posted by tillrohrmann <gi...@git.apache.org>.
Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31922838
  
    --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
    @@ -0,0 +1,254 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.ml.preprocessing
    +
    +import breeze.linalg
    +import breeze.linalg.{max, min}
    +import org.apache.flink.api.common.typeinfo.TypeInformation
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml._
    +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
    +import org.apache.flink.ml.math.Breeze._
    +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
    +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
    +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
    +
    +import scala.reflect.ClassTag
    +
    +/** Scales observations, so that all features are in a user-specified range.
    +  * By default for [[MinMaxScaler]] transformer range = [0,1].
    +  *
    +  * This transformer takes a subtype of  [[Vector]] of values and maps it to a
    +  * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
    +  *
    +  * This transformer can be prepended to all [[Transformer]] and
    +  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
    +  * of [[Vector]].
    +  *
    +  * @example
    +  * {{{
    +  *               val trainingDS: DataSet[Vector] = env.fromCollection(data)
    +  *               val transformer = MinMaxScaler().setMin(-1.0)
    +  *
    +  *               transformer.fit(trainingDS)
    +  *               val transformedDS = transformer.transform(trainingDS)
    +  * }}}
    +  *
    +  * =Parameters=
    +  *
    +  * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
    +  * - [[Max]]: The maximum value of the range of the transformed data set; by default
    +  * equal to 1
    +  */
    +class MinMaxScaler extends Transformer[MinMaxScaler] {
    +
    +  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
    --- End diff --
    
    Will make the field package private.
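
In Scala, package-private visibility is written with a qualified modifier, `private[scope]`. The sketch below only illustrates that mechanism and is not the change made in the pull request: the enclosing object `preprocessing` stands in for the real package so the snippet compiles as a single unit, and `ScalerSketch` and `FitSketch` are hypothetical names.

```scala
// Hypothetical sketch: the enclosing object `preprocessing` stands in for the
// real package so that the snippet compiles as a single unit.
object preprocessing {

  class ScalerSketch {
    // Visible anywhere inside `preprocessing`, hidden from code outside it.
    private[preprocessing] var metricsOption: Option[(Vector[Double], Vector[Double])] = None
  }

  // Lives in the same scope, so it may read and write the field,
  // much like a fit operation defined next to the transformer.
  object FitSketch {
    def fit(scaler: ScalerSketch, data: Seq[Vector[Double]]): Unit = {
      require(data.nonEmpty, "data must not be empty")
      val dims = data.head.length
      // Per-feature minimum and maximum, stored as the fitted state.
      val mins = Vector.tabulate(dims)(i => data.map(_(i)).min)
      val maxs = Vector.tabulate(dims)(i => data.map(_(i)).max)
      scaler.metricsOption = Some((mins, maxs))
    }

    def isFitted(scaler: ScalerSketch): Boolean = scaler.metricsOption.isDefined
  }
}
```

Code outside `preprocessing` can still create a `ScalerSketch` and call `FitSketch.fit`, but it cannot read or reassign `metricsOption` directly, which keeps the fitted state an internal detail.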

